At this point, we have all the basic regexp concepts covered, so let's give a more involved
example of a regular expression. We will build a regexp that matches numbers.
The first task in building a regexp is to decide what we want to match and what we want to
exclude. In our case, we want to match both integers and floating point numbers and we want to
reject any string that isn't a number.
The next task is to break the problem down into smaller problems that are easily converted
into a regexp.
The simplest case is integers. These consist of a sequence of digits, with an optional sign
in front. The digits we can represent with \d+ and the sign can be matched with [+-].
Thus the integer regexp is
/[+-]?\d+/; # matches integers
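As a quick check of the integer pattern, here is a minimal sketch. The helper name find_integer and the sample strings are assumptions made for illustration, not part of the original text. Because the pattern is not anchored, it finds an integer anywhere inside a string.

```perl
use strict;
use warnings;

# Integer pattern from the text: an optional sign, then one or more digits.
my $integer = qr/[+-]?\d+/;

# Hypothetical helper: return the first integer found in $str, or undef.
sub find_integer {
    my ($str) = @_;
    return $str =~ /($integer)/ ? $1 : undef;
}

# Illustrative inputs, chosen for this sketch.
for my $str ('42', '-7', '+300', 'abc', 'x12y') {
    my $found = find_integer($str);
    print defined($found)
        ? "'$str' contains the integer '$found'\n"
        : "'$str' contains no integer\n";
}
```

Note that 'x12y' is reported as containing '12'. Rejecting such strings requires anchoring the pattern, which the full regexp later in this section does with ^ and $.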
A floating point number potentially has a sign, an integral part, a decimal point, a
fractional part, and an exponent. One or more of these parts is optional, so we need to check
out the different possibilities. Floating point numbers which are in proper form include 123.,
0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out front is completely optional and
can be matched by [+-]?. We can see that if there is no exponent, floating point
numbers must have a decimal point, otherwise they are integers. We might be tempted to model
these with \d*\.\d*, but this would also match just a single decimal point, which
is not a number. So the three cases of floating point number sans exponent are
/[+-]?\d+\./; # 1., 321., etc.
/[+-]?\.\d+/; # .1, .234, etc.
/[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
These can be combined into a single regexp with a three-way alternation:
/[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent
In this alternation, it is important to put '\d+\.\d+' before '\d+\.'.
If '\d+\.' were first, the regexp would happily match that and ignore the
fractional part of the number.
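The effect of the ordering can be seen directly with a small sketch. The helper matched_text and the sample string are assumptions made for illustration. Both patterns are unanchored, as in the text above.

```perl
use strict;
use warnings;

# The three alternatives in the order used in the text...
my $recommended = qr/[+-]?(\d+\.\d+|\d+\.|\.\d+)/;
# ...and the same alternatives with '\d+\.' tried first.
my $reordered   = qr/[+-]?(\d+\.|\d+\.\d+|\.\d+)/;

# Hypothetical helper: return the text matched by $re in $str, or undef.
sub matched_text {
    my ($re, $str) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

my $str = '30.56';    # illustrative input
print "recommended order matched '", matched_text($recommended, $str), "'\n";
print "reordered pattern matched '", matched_text($reordered,   $str), "'\n";
```

With the recommended order the whole of '30.56' is matched. With '\d+\.' first, the match stops at '30.', because the first alternative that succeeds wins. Once a pattern is anchored with ^ and $, a failed alternative causes the others to be tried, so the ordering matters most for unanchored matches like these.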
Now consider floating point numbers with exponents. The key observation here is that both
integers and numbers with decimal points are allowed in front of an exponent. Exponents,
like the overall sign, are therefore independent of whether we are matching numbers with or
without decimal points, and can be 'decoupled' from the mantissa. The overall form of the regexp now
becomes clear:
/^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
The exponent is an e or E, followed by an integer. So the exponent
regexp is
/[eE][+-]?\d+/; # exponent
Putting all the parts together, we get a regexp that matches numbers:
/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da!
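To see the combined regexp at work, here is a minimal sketch that runs it over a few strings. The helper is_number and the rejected samples are assumptions made for illustration; the accepted samples are the ones listed earlier in this section.

```perl
use strict;
use warnings;

# The combined regexp from the text, stored with qr// so it can be reused.
my $number = qr/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;

# Hypothetical helper: true if the whole string looks like a number.
sub is_number {
    my ($str) = @_;
    return $str =~ $number ? 1 : 0;
}

# Proper forms from the text, plus an integer and some strings to reject.
for my $str ('123.', '0.345', '.34', '-1e6', '25.4E-72', '42',
             '.', 'e5', '1.2.3', '12abc') {
    printf "%-10s %s\n", $str, is_number($str) ? 'number' : 'not a number';
}
```

The first six strings are accepted and the last four are rejected. A lone '.' fails because every alternative requires at least one digit, and '12abc' fails because of the $ anchor.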
Long regexps like this may impress your friends, but can be hard to decipher. In complex
situations like this, the //x modifier for a match is invaluable. It allows one to
put nearly arbitrary whitespace and comments into a regexp without affecting its meaning.
Using it, we can rewrite our 'extended' regexp in the more pleasing form
/^
   [+-]?         # first, match an optional sign
   (             # then match integers or f.p. mantissas:
       \d+\.\d+  # mantissa of the form a.b
      |\d+\.     # mantissa of the form a.
      |\.\d+     # mantissa of the form .b
      |\d+       # integer of the form a
   )
   ([eE][+-]?\d+)?  # finally, optionally match an exponent
$/x;
If whitespace is mostly irrelevant, how does one include space characters in an extended
regexp? The answer is to backslash it, '\ ', or put it in a character
class, [ ]. The same goes for pound signs: use \# or [#].
For instance, Perl allows a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:
/^
   [+-]?\ *      # first, match an optional sign *and space*
   (             # then match integers or f.p. mantissas:
       \d+\.\d+  # mantissa of the form a.b
      |\d+\.     # mantissa of the form a.
      |\.\d+     # mantissa of the form .b
      |\d+       # integer of the form a
   )
   ([eE][+-]?\d+)?  # finally, optionally match an exponent
$/x;
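The whitespace rules under //x can be checked with a small sketch. The variable names and sample strings below are assumptions made for illustration.

```perl
use strict;
use warnings;

# Under /x the bare space in the pattern is ignored: this is just /^ab$/.
my $ignored = qr/^a b$/x;

# A backslashed space, or a space in a character class, is matched literally.
my $escaped = qr/^a\ b$/x;
my $classed = qr/^a[ ]b$/x;

# The same goes for '#', which would otherwise start a comment.
my $hash_escaped = qr/^a\#b$/x;
my $hash_classed = qr/^a[#]b$/x;

print "bare space:    ", ('ab'  =~ $ignored ? "matches 'ab'"  : "no match"), "\n";
print "escaped space: ", ('a b' =~ $escaped ? "matches 'a b'" : "no match"), "\n";
print "class space:   ", ('a b' =~ $classed ? "matches 'a b'" : "no match"), "\n";
print "escaped hash:  ", ('a#b' =~ $hash_escaped ? "matches 'a#b'" : "no match"), "\n";
print "class hash:    ", ('a#b' =~ $hash_classed ? "matches 'a#b'" : "no match"), "\n";

# The sign-and-space regexp from the text, in compact form.
my $number = qr/^[+-]?\ *(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/x;
print "'- 12.5' ", ('- 12.5' =~ $number ? "is" : "is not"), " accepted\n";
print "'12 .5'  ", ('12 .5'  =~ $number ? "is" : "is not"), " accepted\n";
```

All five small patterns match their sample strings, '- 12.5' is accepted, and '12 .5' is rejected because a space is only allowed directly after the sign position.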
In this form, it is easier to see a way to simplify the alternation. Alternatives 1, 2, and 4
all start with \d+, so it could be factored out:
/^
   [+-]?\ *      # first, match an optional sign and space
   (             # then match integers or f.p. mantissas:
       \d+       # start out with a ...
       (
           \.\d* # mantissa of the form a.b or a.
       )?        # ? takes care of integers of the form a
      |\.\d+     # mantissa of the form .b
   )
   ([eE][+-]?\d+)?  # finally, optionally match an exponent
$/x;
or written in the compact form,
/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
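When a regexp is refactored like this, it is worth checking that the new form still accepts and rejects the same strings. The following is a minimal sketch of such a check; the helper disagreements and the sample list are assumptions made for illustration, and agreement on a handful of samples is evidence, not proof.

```perl
use strict;
use warnings;

# Four-way alternation (with the optional space) and its factored form.
my $long  = qr/^[+-]?\ *(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;
my $short = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# Hypothetical helper: return the strings on which the two forms disagree.
sub disagreements {
    return grep { ($_ =~ $long ? 1 : 0) != ($_ =~ $short ? 1 : 0) } @_;
}

# Illustrative samples: valid numbers, near misses, and the empty string.
my @samples = ('123.', '0.345', '.34', '-1e6', '25.4E-72', '+ 7', '42',
               '.', '-', '1.2.3', '1e', 'e5', '');

my @diff = disagreements(@samples);
print scalar(@diff)
    ? "forms disagree on: @diff\n"
    : "both forms agree on all samples\n";

# Factoring adds a group, so the exponent moves from $2 to $3.
print "exponent in long form:  $2\n" if '25.4E-72' =~ $long;
print "exponent in short form: $3\n" if '25.4E-72' =~ $short;
```

One side effect of the factoring is visible in the last two lines: the extra pair of parentheses shifts the exponent from the second capture group to the third, which matters if code elsewhere relies on the group numbers.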
This is our final regexp. To recap, we built a regexp by
- specifying the task in detail,
- breaking down the problem into smaller parts,
- translating the small parts into regexps,
- combining the regexps,
- and optimizing the final combined regexp.
These are also the typical steps involved in writing a computer program. This makes perfect
sense, because regular expressions are essentially programs written in a little computer language
that specifies patterns.
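The recap can be mirrored in code by building the regexp from named parts. This is only a sketch of one possible approach: it assumes the qr// operator and non-capturing groups (?:...), and the variable names and the helper is_number are hypothetical.

```perl
use strict;
use warnings;

# Small parts, each translated into its own regexp.
my $sign     = qr/[+-]?/;                # optional sign
my $mantissa = qr/\d+(?:\.\d*)?|\.\d+/;  # a, a., a.b, or .b
my $exponent = qr/[eE][+-]?\d+/;         # e or E followed by an integer

# Combine the parts; (?:...) groups without capturing.
my $number = qr/^$sign\ *(?:$mantissa)(?:$exponent)?$/;

# Hypothetical helper: true if the whole string looks like a number.
sub is_number {
    my ($str) = @_;
    return $str =~ $number ? 1 : 0;
}

# Illustrative inputs, chosen for this sketch.
for my $str ('25.4E-72', '- .5', '7', '.', '1e') {
    printf "%-10s %s\n", $str, is_number($str) ? 'number' : 'not a number';
}
```

Naming the parts keeps each piece small enough to test on its own, at the cost of a few more lines than the compact form.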