A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions. Code evaluation expressions turn that around by allowing arbitrary Perl code to be a part of a regexp. A code evaluation expression is denoted (?{code}), where code is a string of Perl statements.

Code expressions are zero-width assertions, and the value they return depends on their environment. There are two possibilities: either the code expression is used as a conditional in a conditional expression (?(condition)...), or it is not. If the code expression is a conditional, the code is evaluated and the result (i.e., the result of the last statement) is used to determine truth or falsehood. If the code expression is not used as a conditional, the assertion always evaluates true and the result is put into the special variable $^R. The variable $^R can then be used in code expressions later in the regexp. Here are some silly examples:

 
    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'  

Pay careful attention to the next example:

 
    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?  

At first glance, you'd think that it shouldn't print, because obviously the ddd isn't going to match the target string. But look at this example:

 
    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
                                           # but _does_ print  

Hmm. What happened here? If you've been following along, you know that the above pattern should be effectively the same as the last one -- enclosing the d in a character class isn't going to change what it matches. So why does the first not print while the second one does?

The answer lies in the optimizations the regexp engine makes. In the first case, all the engine sees are plain old characters (aside from the (?{}) construct). It's smart enough to realize that the string 'ddd' doesn't occur in our target string before it actually runs the pattern. But in the second case, we've tricked it into thinking that our pattern is more complicated than it is. It takes a look, sees our character class, and decides that it will have to actually run the pattern to determine whether or not it matches. In the process of running it, the engine hits the print statement before it discovers that we don't have a match.

To take a closer look at how the engine does optimizations, see the section "Pragmas and debugging" below.

More fun with ?{}:

 
    $x =~ /(?{print "Hi Mom!";})/;       # matches,
                                         # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'  

The bit of magic mentioned in the section title occurs when the regexp backtracks in the process of searching for a match. If the regexp backtracks over a code expression and if the variables used within are localized using local, the changes in the variables produced by the code expression are undone! Thus, if we wanted to count how many times a character got matched inside a group, we could use, e.g.,

 
    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";  

This prints

 
    'a' count is 2, $c variable is 'bob'  

If we replace the (?{local $c = $c + 1;}) with (?{$c = $c + 1;}), the variable changes are not undone during backtracking, and we get

 
    'a' count is 4, $c variable is 'bob'  

Note that only localized variable changes are undone. Other side effects of code expression execution are permanent. Thus

 
    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;  

produces

 
   Yow
   Yow
   Yow
   Yow  

The result $^R is automatically localized, so that it will behave properly in the presence of backtracking.
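As a rough illustration of that last point, the counting example above can be sketched with $^R as the running tally instead of a localized variable. The variable name $count is reused from the earlier example, and the layout is only one possible way to write it.

```perl
use strict;
use warnings;

our $count;          # receives the final tally (name reused from above)
my $x = "aaaa";

$x =~ /(?{ 0 })                 # seed the running result with 0
       (?: a (?{ $^R + 1 }) )*  # each 'a' adds one to the previous result
       aa                       # force the engine to give back two 'a's
       (?{ $count = $^R })      # save whatever result survived backtracking
      /x;

print "'a' count is $count\n";  # prints: 'a' count is 2
```

Because the engine restores $^R when it backtracks over the last two iterations, the saved value is 2 rather than 4, and no explicit local is needed.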

This example uses a code expression in a conditional to match the article 'the' in either English or German:

 
    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (die|das|der)     # else, match 'die|das|der'
                     )
                    /xi;  

Note that the syntax here is (?(?{...})yes-regexp|no-regexp), not (?((?{...}))yes-regexp|no-regexp). In other words, in the case of a code expression, we don't need the extra parentheses around the conditional.
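To see both branches of the conditional at work, here is a small sketch that wraps the pattern in a helper. The function name is_article and the anchors ^ and $ are additions for illustration; they are not part of the example above.

```perl
use strict;
use warnings;

our $lang;   # package variable read by the code expression

# Hypothetical helper: true if $text is a definite article in $language.
sub is_article {
    my ($language, $text) = @_;
    local $lang = $language;      # set the language for this call only
    return $text =~ /^(?(?{ $lang eq 'EN' })   # is English selected
                        the                    # yes: match 'the'
                      | (?:die|das|der)        # no: match a German article
                      )$/xi ? 1 : 0;
}

for my $case (['EN', 'the'], ['EN', 'das'], ['DE', 'das'], ['DE', 'the']) {
    my ($language, $text) = @$case;
    printf "%s %-3s %s\n", $language, $text,
        is_article($language, $text) ? "matched" : "no match";
}
```

Only the branch selected by the code expression is ever tried, so 'das' is rejected when the language is 'EN' and 'the' is rejected when it is 'DE'.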

If you try to use code expressions with interpolating variables, perl may surprise you:

 
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compile error!
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok  

If a regexp has (1) code expressions and interpolating variables, or (2) a variable that interpolates a code expression, perl treats the regexp as an error. If the code expression is precompiled into a variable, however, interpolating is ok. The question is, why is this an error?

The reason is that variable interpolation and code expressions together pose a security risk. The combination is dangerous because many programmers who write search engines often take user input and plug it directly into a regexp:

 
    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp  

If the $regexp variable contains a code expression, the user could then execute arbitrary Perl code. For instance, some joker could search for (?{ system('rm -rf *'); }) to erase your files. In this sense, the combination of interpolation and code expressions taints your regexp. So by default, using both interpolation and code expressions in the same regexp is not allowed. If you're not concerned about malicious users, it is possible to bypass this security check by invoking use re 'eval':

 
    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ 1 })$bar/;   # compiles ok
    /foo${pat}bar/;      # compiles ok  
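Returning to the user-supplied search term, the following sketch shows the default check in action and one common precaution. The variable contents are made up for illustration, and quoting with \Q...\E only applies when the input is meant to be a literal string rather than a regexp.

```perl
use strict;
use warnings;

my $text   = "some text to search";
my $regexp = '(?{ print "arbitrary code ran\n" })';  # hostile "search term"

# Without "use re 'eval'" in scope, perl refuses to build this pattern,
# so the match dies instead of running the embedded code.
my $ok    = eval { $text =~ /$regexp/; 1 };
my $error = $ok ? '' : $@;
print "rejected: $error" unless $ok;

# Treating the input as a literal string: \Q...\E escapes every
# metacharacter, so no code expression is ever seen by the engine.
my $found = ($text =~ /\Q$regexp\E/) ? 1 : 0;
print "literal search found it: $found\n";
```

The first match fails with an "Eval-group not allowed at runtime" error, and the quoted search simply reports that the literal text is not present.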

Another form of code expression is the pattern code expression, denoted (??{code}). The pattern code expression is like a regular code expression, except that the result of the code evaluation is treated as a regular expression and matched immediately. A simple example is

 
    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
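Because the result is matched as a pattern, a pattern code expression can also refer back to the regexp that contains it. The following is a minimal sketch of that idea for matching balanced parentheses. The variable name $paren is hypothetical, and the (?>...) group is there only to keep backtracking cheap.

```perl
use strict;
use warnings;

our $paren;   # hypothetical name; declared first so the pattern can refer to it
$paren = qr{
    \(                     # an opening parenthesis
    (?:
        (?> [^()]+ )       # a run of non-parenthesis characters
      |
        (??{ $paren })     # or a nested group, matched by the same pattern
    )*
    \)                     # the matching closing parenthesis
}x;

for my $s ('(a(b)c)', '((()))', '(a(b c)', '(()') {
    printf "%-8s %s\n", $s, $s =~ /^$paren$/ ? "balanced" : "not balanced";
}
```

The first two strings are reported as balanced and the last two are not, since each nested group has to find its own closing parenthesis.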
  

This final example contains both ordinary and pattern code expressions. It detects if a binary string 1101010010001... has a Fibonacci spacing 0,1,1,2,3,5,... of the 1's:

 
    $s0 = 0; $s1 = 1; # initial conditions
    $x = "1101010010001000001";
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
                    (
                       (??{'0' x $s0}) # match $s0 of '0'
                       1               # and then a '1'
                       (?{
                          $largest = $s0;   # largest seq so far
                          $s2 = $s1 + $s0;  # compute next term
                          $s0 = $s1;        # in Fibonacci sequence
                          $s1 = $s2;
                         })
                    )+   # repeat as needed
                  $      # that is all there is
                 /x;
    print "Largest sequence matched was $largest\n";  

This prints

 
    It is a Fibonacci sequence
    Largest sequence matched was 5  

Ha! Try that with your garden variety regexp package...

Note that the variables $s0 and $s1 are not substituted when the regexp is compiled, as happens for ordinary variables outside a code expression. Rather, the code expressions are evaluated when perl encounters them during the search for a match.
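A small sketch makes the timing visible: the same precompiled regexp gives different answers once the variable it reads has changed. The names $n, $re, $two_now, and $three_now are made up for this illustration.

```perl
use strict;
use warnings;

our $n = 2;
my $re = qr/^(??{ 'a' x $n })$/;        # compiled once, while $n is 2

my $two_now   = 'aaa' =~ $re ? 1 : 0;   # needs exactly 2 'a's: no match
$n = 3;                                 # change $n after compilation
my $three_now = 'aaa' =~ $re ? 1 : 0;   # the same $re now needs 3: match

print "before: $two_now, after: $three_now\n";  # prints: before: 0, after: 1
```

Had $n been interpolated outside the code expression, its value would have been fixed when qr// built the pattern.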

The regexp without the //x modifier is

 
    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;

and is a great start on an Obfuscated Perl entry :-) When working with code and conditional expressions, the extended form of regexps is almost necessary in creating and debugging regexps.

 

 

 


Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com