|
The examples in the previous section display an annoying weakness. We were only matching
3-letter words, or syllables of 4 letters or less. We'd like to be able to match words or
syllables of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.
This is exactly the problem the quantifier metacharacters ?, *,
+, and {} were created for. They allow us to determine the number of
repeats of a portion of a regexp we consider to be a match. Quantifiers are put immediately
after the character, character class, or grouping that we want to specify. They have the
following meanings:
a? = match 'a' 1 or 0 times
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least n times, but not more than m
times.
a{n,} = match at least n or more times
a{n} = match exactly n times
Here are some examples:
/[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
# any number of digits
/(\w+)\s+\1/; # match doubled words of arbitrary length
/y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
$year =~ /\d{2,4}/; # make sure year is at least 2 but not more
# than 4 digits
$year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
$year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
# this produces $1 and the other does not.
% simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
beriberi
booboo
coco
mama
murmur
papa
|
|
For all of these quantifiers, perl will try to match as much of the string as possible, while
still allowing the regexp to succeed. Thus with /a?.../, perl will first try to
match the regexp with the a present; if that fails, perl will try to match the
regexp without the a present. For the quantifier *, we get the
following:
$x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/; # matches,
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'
|
|
Which is what we might expect, the match finds the only cat in the string and
locks onto it. Consider, however, this regexp:
$x =~ /^(.*)(at)(.*)$/; # matches,
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = '' (0 matches)
|
|
One might initially guess that perl would find the at in cat and
stop there, but that wouldn't give the longest possible string to the first quantifier .*.
Instead, the first quantifier .* grabs as much of the string as possible while
still having the regexp match. In this example, that means having the at sequence
with the final at in the string. The other important principle illustrated here is
that when there are two or more elements in a regexp, the leftmost quantifier, if there
is one, gets to grab as much the string as possible, leaving the rest of the regexp to fight
over scraps. Thus in our example, the first quantifier .* grabs most of the string,
while the second quantifier .* gets the empty string. Quantifiers that grab as much
of the string as possible are called maximal match or greedy quantifiers.
When a regexp can match a string in several different ways, we can use the principles above
to predict which way the regexp will match:
- Principle 0: Taken as a whole, any regexp will be matched at the earliest possible
position in the string.
- Principle 1: In an alternation
a|b|c..., the leftmost alternative that allows
a match for the whole regexp will be the one used.
- Principle 2: The maximal matching quantifiers
?, *, +
and {n,m} will in general match as much of the string as possible while still
allowing the whole regexp to match.
- Principle 3: If there are two or more elements in a regexp, the leftmost greedy
quantifier, if any, will match as much of the string as possible while still allowing the
whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as
much of the string remaining available to it as possible, while still allowing the whole
regexp to match. And so on, until all the regexp elements are satisfied.
As we have seen above, Principle 0 overrides the others - the regexp will be matched as early
as possible, with the other principles determining how the regexp matches at that earliest
character position.
Here is an example of these principles in action:
$x = "The programming republic of Perl";
$x =~ /^(.+)(e|r)(.*)$/; # matches,
# $1 = 'The programming republic of Pe'
# $2 = 'r'
# $3 = 'l'
|
|
This regexp matches at the earliest string position, 'T'. One might think that e,
being leftmost in the alternation, would be matched, but r produces the longest
string in the first quantifier.
$x =~ /(m{1,2})(.*)$/; # matches,
# $1 = 'mm'
# $2 = 'ing republic of Perl'
|
|
Here, The earliest possible match is at the first 'm' in programming.
m{1,2} is the first quantifier, so it gets to match a maximal mm.
$x =~ /.*(m{1,2})(.*)$/; # matches,
# $1 = 'm'
# $2 = 'ing republic of Perl'
|
|
Here, the regexp matches at the start of the string. The first quantifier .*
grabs as much as possible, leaving just a single 'm' for the second quantifier m{1,2}.
$x =~ /(.?)(m{1,2})(.*)$/; # matches,
# $1 = 'a'
# $2 = 'mm'
# $3 = 'ing republic of Perl'
|
|
Here, .? eats its maximal one character at the earliest possible position in the
string, 'a' in programming, leaving m{1,2} the
opportunity to match both m's. Finally,
"aXXXb" =~ /(X*)/; # matches with $1 = ''
|
|
because it can match zero copies of 'X' at the beginning of the string. If you
definitely want to match at least one 'X', use X+, not X*.
Sometimes greed is not good. At times, we would like quantifiers to match a minimal
piece of string, rather than a maximal piece. For this purpose, Larry Wall created the minimal match
or non-greedy quantifiers ??,*?, +?, and {}?.
These are the usual quantifiers with a ? appended to them. They have the following
meanings:
a?? = match 'a' 0 or 1 times. Try 0 first, then 1.
a*? = match 'a' 0 or more times, i.e., any number of times, but as few times
as possible
a+? = match 'a' 1 or more times, i.e., at least once, but as few times as
possible
a{n,m}? = match at least n times, not more than m
times, as few times as possible
a{n,}? = match at least n times, but as few times as possible
a{n}? = match exactly n times. Because we match exactly n
times, a{n}? is equivalent to a{n} and is just there for
notational consistency.
Let's look at the example above, but with minimal quantifiers:
$x = "The programming republic of Perl";
$x =~ /^(.+?)(e|r)(.*)$/; # matches,
# $1 = 'Th'
# $2 = 'e'
# $3 = ' programming republic of Perl'
|
|
The minimal string that will allow both the start of the string ^ and the
alternation to match is Th, with the alternation e|r matching e.
The second quantifier .* is free to gobble up the rest of the string.
$x =~ /(m{1,2}?)(.*?)$/; # matches,
# $1 = 'm'
# $2 = 'ming republic of Perl'
|
|
The first string position that this regexp can match is at the first 'm' in programming.
At this position, the minimal m{1,2}? matches just one 'm'. Although
the second quantifier .*? would prefer to match no characters, it is constrained by
the end-of-string anchor $ to match the rest of the string.
$x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
# $1 = 'The progra'
# $2 = 'm'
# $3 = 'ming republic of Perl'
|
|
In this regexp, you might expect the first minimal quantifier .*? to match the
empty string, because it is not constrained by a ^ anchor to match the beginning of
the word. Principle 0 applies here, however. Because it is possible for the whole regexp to
match at the start of the string, it will match at the start of the string. Thus the
first quantifier has to match everything up to the first m. The second minimal
quantifier matches just one m and the third quantifier matches the rest of the
string.
$x =~ /(.??)(m{1,2})(.*)$/; # matches,
# $1 = 'a'
# $2 = 'mm'
# $3 = 'ing republic of Perl'
|
|
Just as in the previous regexp, the first quantifier .?? can match earliest at
position 'a', so it does. The second quantifier is greedy, so it matches mm,
and the third matches the rest of the string.
We can modify principle 3 above to take into account non-greedy quantifiers:
- Principle 3: If there are two or more elements in a regexp, the leftmost greedy
(non-greedy) quantifier, if any, will match as much (little) of the string as possible while
still allowing the whole regexp to match. The next leftmost greedy (non-greedy) quantifier,
if any, will try to match as much (little) of the string remaining available to it as
possible, while still allowing the whole regexp to match. And so on, until all the regexp
elements are satisfied.
Just like alternation, quantifiers are also susceptible to backtracking. Here is a
step-by-step analysis of the example
$x = "the cat in the hat";
$x =~ /^(.*)(at)(.*)$/; # matches,
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = '' (0 matches)
|
|
- Start with the first letter in the string 't'.
- The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'.
- 'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one
character.
- 'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so
backtrack one more character.
- Now we can match the 'a' and the 't'.
- Move on to the third element '.*'. Since we are at the end of the string and '.*' can
match 0 times, assign it the empty string.
- We are done!
Most of the time, all this moving forward and backtracking happens quickly and searching is
fast. There are some pathological regexps, however, whose execution time exponentially grows
with the size of the string. A typical structure that blows up in your face is of the form
The problem is the nested indeterminate quantifiers. There are many different ways of
partitioning a string of length n between the + and *: one repetition
with b+ of length n, two repetitions with the first b+ length k and
the second with length n-k, m repetitions whose bits add up to length n, etc. In fact there are
an exponential number of ways to partition a string as a function of length. A regexp may get
lucky and match early in the process, but if there is no match, perl will try every
possibility before giving up. So be careful with nested *'s, {n,m}'s,
and +'s. The book Mastering regular expressions by Jeffrey Friedl gives a
wonderful discussion of this and other efficiency issues.
|
|