Website hosting service by Active-Venture.com
  

 Back to Index

Building a regexp

At this point, we have all the basic regexp concepts covered, so let's give a more involved example of a regular expression. We will build a regexp that matches numbers.

The first task in building a regexp is to decide what we want to match and what we want to exclude. In our case, we want to match both integers and floating point numbers and we want to reject any string that isn't a number.

The next task is to break the problem down into smaller problems that are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits, with an optional sign in front. The digits we can represent with \d+ and the sign can be matched with [+-]. Thus the integer regexp is

 
    /[+-]?\d+/;  # matches integers  

A floating point number potentially has a sign, an integral part, a decimal point, a fractional part, and an exponent. One or more of these parts is optional, so we need to check out the different possibilities. Floating point numbers which are in proper form include 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out front is completely optional and can be matched by [+-]?. We can see that if there is no exponent, floating point numbers must have a decimal point, otherwise they are integers. We might be tempted to model these with \d*\.\d*, but this would also match just a single decimal point, which is not a number. So the three cases of floating point number sans exponent are

 
   /[+-]?\d+\./;  # 1., 321., etc.
   /[+-]?\.\d+/;  # .1, .234, etc.
   /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.  

These can be combined into a single regexp with a three-way alternation:

 
   /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent  

In this alternation, it is important to put '\d+\.\d+' before '\d+\.'. If '\d+\.' were first, the regexp would happily match that and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key observation here is that both integers and numbers with decimal points are allowed in front of an exponent. Then exponents, like the overall sign, are independent of whether we are matching numbers with or without decimal points, and can be 'decoupled' from the mantissa. The overall form of the regexp now becomes clear:

 
    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;  

The exponent is an e or E, followed by an integer. So the exponent regexp is

 
   /[eE][+-]?\d+/;  # exponent  

Putting all the parts together, we get a regexp that matches numbers:

 
   /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!  

Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the //x modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form

 
   /^
      [+-]?         # first, match an optional sign
      (             # then match integers or f.p. mantissas:
          \d+\.\d+  # mantissa of the form a.b
         |\d+\.     # mantissa of the form a.
         |\.\d+     # mantissa of the form .b
         |\d+       # integer of the form a
      )
      ([eE][+-]?\d+)?  # finally, optionally match an exponent
   $/x;  

If whitespace is mostly irrelevant, how does one include space characters in an extended regexp? The answer is to backslash it '\ '  or put it in a character class [ ] . The same thing goes for pound signs, use \# or [#]. For instance, Perl allows a space between the sign and the mantissa/integer, and we could add this to our regexp as follows:

 
   /^
      [+-]?\ *      # first, match an optional sign *and space*
      (             # then match integers or f.p. mantissas:
          \d+\.\d+  # mantissa of the form a.b
         |\d+\.     # mantissa of the form a.
         |\.\d+     # mantissa of the form .b
         |\d+       # integer of the form a
      )
      ([eE][+-]?\d+)?  # finally, optionally match an exponent
   $/x;  

In this form, it is easier to see a way to simplify the alternation. Alternatives 1, 2, and 4 all start with \d+, so it could be factored out:

 
   /^
      [+-]?\ *      # first, match an optional sign
      (             # then match integers or f.p. mantissas:
          \d+       # start out with a ...
          (
              \.\d* # mantissa of the form a.b or a.
          )?        # ? takes care of integers of the form a
         |\.\d+     # mantissa of the form .b
      )
      ([eE][+-]?\d+)?  # finally, optionally match an exponent
   $/x;  

or written in the compact form,

 
    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;  

This is our final regexp. To recap, we built a regexp by

  • specifying the task in detail,
  • breaking down the problem into smaller parts,
  • translating the small parts into regexps,
  • combining the regexps,
  • and optimizing the final combined regexp.

These are also the typical steps involved in writing a computer program. This makes perfect sense, because regular expressions are essentially programs written a little computer language that specifies patterns.

 

 

 

Domain name registration service & domain search - 
Register cheap domain name from $7.95 and enjoy free domain services 
 

Cheap domain name search service -
Domain name services at just
$8.95/year only
 

Register domain name -
Buy domain name registration and cheap domain transfer at low, affordable price.

© 2002-2004 Active-Venture.com Web Site Hosting Service

 

[ The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.   ]

 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com