Normally, regexps are a part of Perl expressions. Code evaluation
expressions turn that around by allowing arbitrary Perl code to be a part of a regexp. A code
evaluation expression is denoted (?{code}), with code a string of Perl
statements.
Code expressions are zero-width assertions, and the value they return depends on their
environment. There are two possibilities: either the code expression is used as a conditional in
a conditional expression (?(condition)...), or it is not. If the code expression is
a conditional, the code is evaluated and the result (i.e., the result of the last statement) is
used to determine truth or falsehood. If the code expression is not used as a conditional, the
assertion always evaluates true and the result is put into the special variable $^R.
The variable $^R can then be used in code expressions later in the regexp. Here are
some silly examples:
$x = "abcdef";
$x =~ /abc(?{print "Hi Mom!";})def/; # matches,
# prints 'Hi Mom!'
$x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
# no 'Hi Mom!'
Pay careful attention to the next example:
$x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
# no 'Hi Mom!'
# but why not?
At first glance, you'd think that it shouldn't print, because obviously the ddd
isn't going to match the target string. But look at this example:
$x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
# but _does_ print
Hmm. What happened here? If you've been following along, you know that the above pattern
should be effectively the same as the last one -- enclosing the d in a character class isn't
going to change what it matches. So why does the first not print while the second one does?
The answer lies in the optimizations the regexp engine makes. In the first case, all the engine
sees are plain old characters (aside from the ?{} construct). It's smart enough to
realize that the string 'ddd' doesn't occur in our target string before actually running the
pattern through. But in the second case, we've tricked it into thinking that our pattern is more
complicated than it is. It takes a look, sees our character class, and decides that it will have
to actually run the pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't have a match.
To take a closer look at how the engine does optimizations, see the section "Pragmas and debugging" below.
More fun with ?{}:
$x =~ /(?{print "Hi Mom!";})/; # matches,
# prints 'Hi Mom!'
$x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
# prints '1'
$x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
# prints '1'
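To make the role of $^R concrete, here is a minimal sketch that copies the value out of the match instead of printing it. The names $seen and last_code_result are made up for this illustration and are not part of the tutorial's examples.

```perl
use strict;
use warnings;

our $seen;    # package variable, visible inside the code block

# The first block leaves 42 in $^R; the second block copies it out.
sub last_code_result {
    $seen = undef;
    "abcdef" =~ /abc(?{ 42 })def(?{ $seen = $^R })/;
    return $seen;
}

print last_code_result(), "\n";
```

The first code expression is not a conditional, so its result lands in $^R, where the second code expression can read it.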
The bit of magic mentioned in the section title occurs when the regexp backtracks in the
process of searching for a match. If the regexp backtracks over a code expression and if the
variables used within are localized using local, the changes in the variables
produced by the code expression are undone! Thus, if we wanted to count how many times a
character got matched inside a group, we could use, e.g.,
$x = "aaaa";
$count = 0; # initialize 'a' count
$c = "bob"; # test if $c gets clobbered
$x =~ /(?{local $c = 0;}) # initialize count
( a # match 'a'
(?{local $c = $c + 1;}) # increment count
)* # do this any number of times,
aa # but match 'aa' at the end
(?{$count = $c;}) # copy local $c var into $count
/x;
print "'a' count is $count, \$c variable is '$c'\n";
This prints
'a' count is 2, $c variable is 'bob'
If we replace the (?{local $c = $c + 1;})
with (?{$c = $c + 1;}), the variable changes are not
undone during backtracking, and we get
'a' count is 4, $c variable is 'bob'
Note that only localized variable changes are undone. Other side effects of code expression
execution are permanent. Thus
$x = "aaaa";
$x =~ /(a(?{print "Yow\n";}))*aa/;
produces
Yow
Yow
Yow
Yow
The result $^R is automatically localized, so that it will behave properly in
the presence of backtracking.
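Because $^R is localized for us, the counting example can also be sketched without any explicit local. This is only an illustration under that assumption; $count and count_trailing_pairs are hypothetical names chosen here.

```perl
use strict;
use warnings;

our $count;

# Count the 'a's consumed by the starred group, keeping the running
# total in the automatically localized result variable.
sub count_trailing_pairs {
    my ($text) = @_;
    $count = undef;
    $text =~ /(?{ 0 })                  # start the total at zero
              (?: a (?{ $^R + 1 }) )*   # each matched a adds one
              aa                        # but leave two for the end
              (?{ $count = $^R })       # copy the surviving total out
             /x;
    return $count;
}

print "count is ", count_trailing_pairs("aaaa"), "\n";
```

When the engine gives back an 'a' during backtracking, the earlier value of $^R is restored, so the total that reaches the last code expression reflects only the iterations that survived.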
This example uses a code expression in a conditional to match the article 'the' in either
English or German:
$lang = 'DE'; # use German
...
$text = "das";
print "matched\n"
if $text =~ /(?(?{
$lang eq 'EN'; # is the language English?
})
the | # if so, then match 'the'
(die|das|der) # else, match 'die|das|der'
)
/xi;
Note that the syntax here is (?(?{...})yes-regexp|no-regexp), not (?((?{...}))yes-regexp|no-regexp).
In other words, in the case of a code expression, we don't need the extra parentheses around the
conditional.
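The same conditional can be wrapped in a small helper to check both branches. The sub name matches_article and the anchors ^ and $ are additions for this sketch, not part of the example above.

```perl
use strict;
use warnings;

our $lang;

# Return 1 if $text is a definite article in the given language, else 0.
sub matches_article {
    my ($language, $text) = @_;
    local $lang = $language;
    return $text =~ /^(?(?{ $lang eq 'EN' })
                        the                 # yes-regexp: English
                      | (?:die|das|der)     # no-regexp: German
                      )$/xi ? 1 : 0;
}

print matches_article('EN', 'the'), matches_article('DE', 'das'),
      matches_article('EN', 'das'), "\n";
```

Only one of the two branches is ever tried for a given value of $lang, which is why 'das' fails when the language is set to English.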
If you try to use code expressions with interpolating variables, perl may surprise you:
$bar = 5;
$pat = '(?{ 1 })';
/foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
/foo(?{ 1 })$bar/; # compile error!
/foo${pat}bar/; # compile error!
$pat = qr/(?{ $foo = 1 })/; # precompile code regexp
/foo${pat}bar/; # compiles ok
If a regexp has (1) code expressions and interpolating variables, or (2) a variable that
interpolates a code expression, perl treats the regexp as an error. If the code expression is
precompiled into a variable, however, interpolating is ok. The question is, why is this an
error?
The reason is that variable interpolation and code expressions together pose a security risk.
The combination is dangerous because programmers who write search engines often take user
input and plug it directly into a regexp:
$regexp = <>; # read user-supplied regexp
chomp $regexp; # get rid of possible newline
$text =~ /$regexp/; # search $text for the $regexp
If the $regexp variable contains a code expression, the user could then execute
arbitrary Perl code. For instance, some joker could search for system('rm -rf *');
to erase your files. In this sense, the combination of interpolation and code expressions taints
your regexp. So by default, using both interpolation and code expressions in the same regexp is
not allowed. If you're not concerned about malicious users, it is possible to bypass this
security check by invoking use re 'eval':
use re 'eval'; # throw caution out the door
$bar = 5;
$pat = '(?{ 1 })';
/foo(?{ 1 })$bar/; # compiles ok
/foo${pat}bar/; # compiles ok
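The check on an interpolated string can be observed without crashing the program, because the error is raised when the interpolated pattern is compiled at run time and eval can trap it. The following is a sketch only; match_plain and match_trusted are hypothetical helper names.

```perl
use strict;
use warnings;

# Interpolate $pat with the security check in force.
# Returns 1 or 0 for match or no match, undef if perl refused the pattern.
sub match_plain {
    my ($pat) = @_;
    return eval { "foobar" =~ /foo${pat}bar/ ? 1 : 0 };
}

# Same match, but with the check switched off inside this sub only.
sub match_trusted {
    my ($pat) = @_;
    use re 'eval';    # lexically scoped to this block
    return "foobar" =~ /foo${pat}bar/ ? 1 : 0;
}

my $string_pat = '(?{ 1 })';       # stands in for untrusted input
my $compiled   = qr/(?{ 1 })/;     # precompiled by the programmer

print defined match_plain($string_pat) ? "allowed\n" : "refused\n";
print match_plain($compiled),      "\n";
print match_trusted($string_pat),  "\n";
```

Keeping use re 'eval' inside the smallest possible block limits how much code runs with the check disabled.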
Another form of code expression is the pattern code expression. The
pattern code expression is like a regular code expression, except that the result of the code
evaluation is treated as a regular expression and matched immediately. A simple example is
$length = 5;
$char = 'a';
$x = 'aaaaabb';
$x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
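A pattern code expression can also build its pattern from something matched earlier in the same regexp. As a sketch, assume a made-up format in which a decimal length is followed by a colon and exactly that many characters; is_length_prefixed is a hypothetical name.

```perl
use strict;
use warnings;

# True if $s looks like "N:" followed by exactly N characters.
sub is_length_prefixed {
    my ($s) = @_;
    # The inner pattern is generated from $1 while the match is running.
    return $s =~ /^(\d+):(??{ ".{$1}" })$/ ? 1 : 0;
}

print is_length_prefixed("3:abc"), is_length_prefixed("3:abcd"), "\n";
```

Here the generated subpattern is .{3} for the first string, so the trailing fourth character in the second string makes the match fail.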
This final example contains both ordinary and pattern code expressions. It detects if a
binary string 1101010010001... has a Fibonacci spacing 0,1,1,2,3,5,... of the 1's:
$s0 = 0; $s1 = 1; # initial conditions
$x = "1101010010001000001";
print "It is a Fibonacci sequence\n"
if $x =~ /^1 # match an initial '1'
(
(??{'0' x $s0}) # match $s0 of '0'
1 # and then a '1'
(?{
$largest = $s0; # largest seq so far
$s2 = $s1 + $s0; # compute next term
$s0 = $s1; # in Fibonacci sequence
$s1 = $s2;
})
)+ # repeat as needed
$ # that is all there is
/x;
print "Largest sequence matched was $largest\n";
This prints
It is a Fibonacci sequence
Largest sequence matched was 5
Ha! Try that with your garden variety regexp package...
Note that the variables $s0 and $s1 are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code expression. Rather, the
code expressions are evaluated when perl encounters them during the search for a match.
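The late evaluation is easy to see with a precompiled regexp whose variable changes between matches. This is a small sketch; $n, $re and matches_n_as are names invented for the illustration.

```perl
use strict;
use warnings;

our $n = 2;

# Compiled once; $n is read each time the pattern is run, not here.
my $re = qr/^(??{ 'a' x $n })$/;

# True if $text consists of exactly $wanted copies of 'a'.
sub matches_n_as {
    my ($wanted, $text) = @_;
    local $n = $wanted;
    return $text =~ $re ? 1 : 0;
}

print matches_n_as(2, "aa"), matches_n_as(3, "aa"), matches_n_as(3, "aaa"), "\n";
```

The same compiled object accepts "aa" when $n is 2 and rejects it when $n is 3, which would not be possible if $n had been substituted at compile time.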
The regexp without the //x modifier is
/^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
and is a great start on an Obfuscated Perl entry :-) When working with code and conditional
expressions, the extended form of regexps is almost a necessity for creating and debugging them.