|
The last few extended patterns in this tutorial are experimental as of 5.6.0. Play with them,
use them in some code, but don't rely on them just yet for production code.
Independent subexpressions are regular expressions, in the context of a
larger regular expression, that function independently of the larger regular expression. That
is, they consume as much or as little of the string as they wish without regard for the ability
of the larger regexp to match. Independent subexpressions are represented by (?>regexp).
We can illustrate their behavior by first considering an ordinary regexp:
$x = "ab";
$x =~ /a*ab/; # matches
|
|
This obviously matches, but in the process of matching, the subexpression a*
first grabbed the a. Doing so, however, wouldn't allow the whole regexp to match,
so after backtracking, a* eventually gave back the a and matched the
empty string. Here, what a* matched was dependent on what the rest of the
regexp matched.
Contrast that with an independent subexpression:
$x =~ /(?>a*)ab/; # doesn't match!
|
|
The independent subexpression (?>a*) doesn't care about the rest of the
regexp, so it sees an a and grabs it. Then the rest of the regexp ab
cannot match. Because (?>a*) is independent, there is no backtracking and the
independent subexpression does not give up its a. Thus the match of the regexp as a
whole fails. A similar behavior occurs with completely independent regexps:
$x = "ab";
$x =~ /a*/g; # matches, eats an 'a'
$x =~ /\Gab/g; # doesn't match, no 'a' available
|
|
Here //g and \G create a 'tag team' handoff of the string from one
regexp to the other. Regexps with an independent subexpression are much like this, with a
handoff of the string to the independent subexpression, and a handoff of the string back to the
enclosing regexp.
The ability of an independent subexpression to prevent backtracking can be quite useful.
Suppose we want to match a non-empty string enclosed in parentheses up to two levels deep. Then
the following regexp matches:
$x = "abc(de(fg)h"; # unbalanced parentheses
$x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
|
|
The regexp matches an open parenthesis, one or more copies of an alternation, and a close
parenthesis. The alternation is two-way, with the first alternative [^()]+ matching
a substring with no parentheses and the second alternative \([^()]*\) matching a
substring delimited by parentheses. The problem with this regexp is that it is pathological: it
has nested indeterminate quantifiers of the form (a+|b)+. We discussed in Part 1
how nested quantifiers like this could take an exponentially long time to execute if there was
no match possible. To prevent the exponential blowup, we need to prevent useless backtracking at
some point. This can be done by enclosing the inner quantifier as an independent subexpression:
$x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
|
|
Here, (?>[^()]+) breaks the degeneracy of string partitioning by gobbling up
as much of the string as possible and keeping it. Then match failures fail much more quickly.
|