Website hosting service by Active-Venture.com
  

 Back to Index

Matching repetitions

The examples in the previous section display an annoying weakness. We were only matching 3-letter words, or syllables of 4 letters or less. We'd like to be able to match words or syllables of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.

This is exactly the problem the quantifier metacharacters ?, *, +, and {} were created for. They allow us to determine the number of repeats of a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:

  • a? = match 'a' 1 or 0 times
  • a* = match 'a' 0 or more times, i.e., any number of times
  • a+ = match 'a' 1 or more times, i.e., at least once
  • a{n,m} = match at least n times, but not more than m times.
  • a{n,} = match at least n or more times
  • a{n} = match exactly n times

Here are some examples:

 
    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa  

For all of these quantifiers, perl will try to match as much of the string as possible, while still allowing the regexp to succeed. Thus with /a?.../, perl will first try to match the regexp with the a present; if that fails, perl will try to match the regexp without the a present. For the quantifier *, we get the following:

 
    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'  

Which is what we might expect, the match finds the only cat in the string and locks onto it. Consider, however, this regexp:

 
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)  

One might initially guess that perl would find the at in cat and stop there, but that wouldn't give the longest possible string to the first quantifier .*. Instead, the first quantifier .* grabs as much of the string as possible while still having the regexp match. In this example, that means having the at sequence with the final at in the string. The other important principle illustrated here is that when there are two or more elements in a regexp, the leftmost quantifier, if there is one, gets to grab as much the string as possible, leaving the rest of the regexp to fight over scraps. Thus in our example, the first quantifier .* grabs most of the string, while the second quantifier .* gets the empty string. Quantifiers that grab as much of the string as possible are called maximal match or greedy quantifiers.

When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:

  • Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.
  • Principle 1: In an alternation a|b|c..., the leftmost alternative that allows a match for the whole regexp will be the one used.
  • Principle 2: The maximal matching quantifiers ?, *, + and {n,m} will in general match as much of the string as possible while still allowing the whole regexp to match.
  • Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.

As we have seen above, Principle 0 overrides the others - the regexp will be matched as early as possible, with the other principles determining how the regexp matches at that earliest character position.

Here is an example of these principles in action:

 
    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'  

This regexp matches at the earliest string position, 'T'. One might think that e, being leftmost in the alternation, would be matched, but r produces the longest string in the first quantifier.

 
    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'  

Here, The earliest possible match is at the first 'm' in programming. m{1,2} is the first quantifier, so it gets to match a maximal mm.

 
    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'  

Here, the regexp matches at the start of the string. The first quantifier .* grabs as much as possible, leaving just a single 'm' for the second quantifier m{1,2}.

 
    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'  

Here, .? eats its maximal one character at the earliest possible position in the string, 'a' in programming, leaving m{1,2} the opportunity to match both m's. Finally,

 
    "aXXXb" =~ /(X*)/; # matches with $1 = ''  

because it can match zero copies of 'X' at the beginning of the string. If you definitely want to match at least one 'X', use X+, not X*.

Sometimes greed is not good. At times, we would like quantifiers to match a minimal piece of string, rather than a maximal piece. For this purpose, Larry Wall created the minimal match  or non-greedy quantifiers ??,*?, +?, and {}?. These are the usual quantifiers with a ? appended to them. They have the following meanings:

  • a?? = match 'a' 0 or 1 times. Try 0 first, then 1.
  • a*? = match 'a' 0 or more times, i.e., any number of times, but as few times as possible
  • a+? = match 'a' 1 or more times, i.e., at least once, but as few times as possible
  • a{n,m}? = match at least n times, not more than m times, as few times as possible
  • a{n,}? = match at least n times, but as few times as possible
  • a{n}? = match exactly n times. Because we match exactly n times, a{n}? is equivalent to a{n} and is just there for notational consistency.

Let's look at the example above, but with minimal quantifiers:

 
    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'  

The minimal string that will allow both the start of the string ^ and the alternation to match is Th, with the alternation e|r matching e. The second quantifier .* is free to gobble up the rest of the string.

 
    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'  

The first string position that this regexp can match is at the first 'm' in programming. At this position, the minimal m{1,2}? matches just one 'm'. Although the second quantifier .*? would prefer to match no characters, it is constrained by the end-of-string anchor $ to match the rest of the string.

 
    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'  

In this regexp, you might expect the first minimal quantifier .*? to match the empty string, because it is not constrained by a ^ anchor to match the beginning of the word. Principle 0 applies here, however. Because it is possible for the whole regexp to match at the start of the string, it will match at the start of the string. Thus the first quantifier has to match everything up to the first m. The second minimal quantifier matches just one m and the third quantifier matches the rest of the string.

 
    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'  

Just as in the previous regexp, the first quantifier .?? can match earliest at position 'a', so it does. The second quantifier is greedy, so it matches mm, and the third matches the rest of the string.

We can modify principle 3 above to take into account non-greedy quantifiers:

  • Principle 3: If there are two or more elements in a regexp, the leftmost greedy (non-greedy) quantifier, if any, will match as much (little) of the string as possible while still allowing the whole regexp to match. The next leftmost greedy (non-greedy) quantifier, if any, will try to match as much (little) of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.

Just like alternation, quantifiers are also susceptible to backtracking. Here is a step-by-step analysis of the example

 
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)  

  1. Start with the first letter in the string 't'.
  2. The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'.
  3. 'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one character.
  4. 'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so backtrack one more character.
  5. Now we can match the 'a' and the 't'.
  6. Move on to the third element '.*'. Since we are at the end of the string and '.*' can match 0 times, assign it the empty string.
  7. We are done!

Most of the time, all this moving forward and backtracking happens quickly and searching is fast. There are some pathological regexps, however, whose execution time exponentially grows with the size of the string. A typical structure that blows up in your face is of the form

 
    /(a|b+)*/;  

The problem is the nested indeterminate quantifiers. There are many different ways of partitioning a string of length n between the + and *: one repetition with b+ of length n, two repetitions with the first b+ length k and the second with length n-k, m repetitions whose bits add up to length n, etc. In fact there are an exponential number of ways to partition a string as a function of length. A regexp may get lucky and match early in the process, but if there is no match, perl will try every possibility before giving up. So be careful with nested *'s, {n,m}'s, and +'s. The book Mastering regular expressions by Jeffrey Friedl gives a wonderful discussion of this and other efficiency issues.

 

 

Domain name registration service & domain search - 
Register cheap domain name from $7.95 and enjoy free domain services 
 

Cheap domain name search service -
Domain name services at just
$8.95/year only
 

Register domain name -
Buy domain name registration and cheap domain transfer at low, affordable price.

© 2002-2004 Active-Venture.com Web Site Hosting Service

 

[ A language that doesn't have everything is actually easier to program in than some that do.   ]

 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com