Website hosting service by Active-Venture.com
  

 Back to Index

Using independent subexpressions to prevent backtracking

The last few extended patterns in this tutorial are experimental as of 5.6.0. Play with them, use them in some code, but don't rely on them just yet for production code.

Independent subexpressions  are regular expressions, in the context of a larger regular expression, that function independently of the larger regular expression. That is, they consume as much or as little of the string as they wish without regard for the ability of the larger regexp to match. Independent subexpressions are represented by (?>regexp). We can illustrate their behavior by first considering an ordinary regexp:

 
    $x = "ab";
    $x =~ /a*ab/;  # matches  

This obviously matches, but in the process of matching, the subexpression a* first grabbed the a. Doing so, however, wouldn't allow the whole regexp to match, so after backtracking, a* eventually gave back the a and matched the empty string. Here, what a* matched was dependent on what the rest of the regexp matched.

Contrast that with an independent subexpression:

 
    $x =~ /(?>a*)ab/;  # doesn't match!  

The independent subexpression (?>a*) doesn't care about the rest of the regexp, so it sees an a and grabs it. Then the rest of the regexp ab cannot match. Because (?>a*) is independent, there is no backtracking and the independent subexpression does not give up its a. Thus the match of the regexp as a whole fails. A similar behavior occurs with completely independent regexps:

 
    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available  

Here //g and \G create a 'tag team' handoff of the string from one regexp to the other. Regexps with an independent subexpression are much like this, with a handoff of the string to the independent subexpression, and a handoff of the string back to the enclosing regexp.

The ability of an independent subexpression to prevent backtracking can be quite useful. Suppose we want to match a non-empty string enclosed in parentheses up to two levels deep. Then the following regexp matches:

 
    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;  

The regexp matches an open parenthesis, one or more copies of an alternation, and a close parenthesis. The alternation is two-way, with the first alternative [^()]+ matching a substring with no parentheses and the second alternative \([^()]*\) matching a substring delimited by parentheses. The problem with this regexp is that it is pathological: it has nested indeterminate quantifiers of the form (a+|b)+. We discussed in Part 1 how nested quantifiers like this could take an exponentially long time to execute if there was no match possible. To prevent the exponential blowup, we need to prevent useless backtracking at some point. This can be done by enclosing the inner quantifier as an independent subexpression:

 
    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;  

Here, (?>[^()]+) breaks the degeneracy of string partitioning by gobbling up as much of the string as possible and keeping it. Then match failures fail much more quickly.

 

 

 

Domain name registration service & domain search - 
Register cheap domain name from $7.95 and enjoy free domain services 
 

Cheap domain name search service -
Domain name services at just
$8.95/year only
 

Register domain name -
Buy domain name registration and cheap domain transfer at low, affordable price.

© 2002-2004 Active-Venture.com Web Site Hosting Service

 

[ It can be shown that for any nutty theory, beyond-the-fringe political view or strange religion there exists a proponent on the Net. The proof is left as an exercise for your kill-file.   ]

 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com