This section concerns the lookahead and lookbehind assertions. First, a little background.
In Perl regular expressions, most regexp elements 'eat up' a certain amount of string when
they match. For instance, the regexp element [abc] eats up one character of the
string when it matches, in the sense that perl moves to the next character position in the
string after the match. Some elements, however, don't eat up characters (that is, they don't
advance the character position) when they match. The examples we have seen so far are the anchors.
The anchor ^ matches at the beginning of the line, but doesn't eat any characters.
Similarly, the word boundary anchor \b matches, e.g., if the character to the left
is a word character and the character to the right is a non-word character, but it doesn't eat
up any characters itself. Anchors are examples of 'zero-width assertions': zero-width because
they consume no characters, and assertions because they test some property of the string. In
the context of our walk-in-the-woods analogy to regexp matching, most regexp elements move us
along a trail, but anchors have us stop a moment and check our surroundings. If the local
environment checks out, we can proceed forward. But if the local environment doesn't satisfy us,
we must backtrack.
Checking the environment entails either looking ahead on the trail, looking behind, or both. ^
looks behind, to see that there are no characters before. $ looks ahead, to see
that there are no characters after. \b looks both ahead and behind, to see if the
characters on either side differ in their 'word'-ness.
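The zero-width nature of anchors can be observed directly. The following is a minimal sketch (the names $text and @offsets are made up for illustration) that records every position at which \b matches. Each match is empty, so pos() simply reports where the boundary was found, and the width of a ^ match computed from @- and @+ is zero.

```perl
use strict;
use warnings;

my $text = "cat nap";

# Collect every offset at which the word boundary \b matches.
# Each match is zero-width, so pos() equals the offset of the boundary.
my @offsets;
while ($text =~ /\b/g) {
    push @offsets, pos($text);
}
print "@offsets\n";    # prints "0 3 4 7"

# An anchor matches, but the matched text itself is empty.
$text =~ /^/ or die "expected ^ to match";
my $anchor_width = $+[0] - $-[0];    # end offset minus start offset
print "width of ^ match: $anchor_width\n";    # prints "width of ^ match: 0"
```

The boundaries fall at the start and end of each word: before 'c', after 't', before 'n', and after 'p'.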
The lookahead and lookbehind assertions are generalizations of the anchor concept. Lookahead
and lookbehind are zero-width assertions that let us specify which characters we want to test
for. The lookahead assertion is denoted by (?=regexp) and the lookbehind assertion
is denoted by (?<=fixed-regexp). Some examples are:
    $x = "I catch the housecat 'Tom-cat' with catnip";
    $x =~ /cat(?=\s+)/;   # matches 'cat' in 'housecat'
    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
                                           # $catwords[0] = 'catch'
                                           # $catwords[1] = 'catnip'
    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
                              # middle of $x
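To make the "zero-width" part concrete, here is a small sketch that compares what ends up in the match with and without a lookahead. It reuses the string from the examples above; the variable names other than $x are illustrative, and the matched text is recovered with substr() and the @- and @+ offset arrays.

```perl
use strict;
use warnings;

my $x = "I catch the housecat 'Tom-cat' with catnip";

# With a lookahead, the whitespace is tested but not included in the match.
$x =~ /cat(?=\s+)/ or die "expected a match";
my $with_lookahead = substr($x, $-[0], $+[0] - $-[0]);
my $next_char      = substr($x, $+[0], 1);    # first character after the match
print "[$with_lookahead] then [$next_char]\n";    # prints "[cat] then [ ]"

# Without the lookahead, the same whitespace becomes part of the match.
$x =~ /cat\s+/ or die "expected a match";
my $without_lookahead = substr($x, $-[0], $+[0] - $-[0]);
print "[$without_lookahead]\n";                   # prints "[cat ]"
```

In both cases the match is found at the end of 'housecat'; the only difference is whether the following space is consumed.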
Note that the parentheses in (?=regexp) and (?<=regexp) are
non-capturing, since these are zero-width assertions. Thus in the @catwords regexp above, which has
no capturing groups, the substrings returned are those matched by the whole regexp itself. Lookahead (?=regexp) can match
arbitrary regexps, but lookbehind (?<=fixed-regexp) only works for regexps of
fixed width, i.e., a fixed number of characters long. Thus (?<=(ab|bc)) is fine,
but (?<=(ab)*) is not. The negated versions of the lookahead and lookbehind
assertions are denoted by (?!regexp) and (?<!fixed-regexp)
respectively. They evaluate true if the regexps do not match:
    $x = "foobar";
    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
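As a further illustration, the sketch below applies the negated assertions to a small, made-up word list and then checks that the assertion's parentheses do not create a capture group. The list @words and the variable names are assumptions for the example only; $#+ is used to count the groups in the last successful match.

```perl
use strict;
use warnings;

my @words = qw(foobar foobaz foo barfoo);

# Keep words containing a 'foo' that is NOT immediately followed by 'bar'.
my @no_bar_after  = grep { /foo(?!bar)/ } @words;
# Keep words containing a 'foo' that is NOT immediately preceded by 'bar'.
my @no_bar_before = grep { /(?<!bar)foo/ } @words;

print "@no_bar_after\n";     # prints "foobaz foo barfoo"
print "@no_bar_before\n";    # prints "foobar foobaz foo"

# The assertion's own parentheses do not create a capture group:
# only the explicit (foo) group is counted.
"foobaz" =~ /(foo)(?!bar)/ or die "expected a match";
my $group_count = $#+;       # number of groups in the last successful match
print "$group_count group(s), \$1 = $1\n";    # prints "1 group(s), $1 = foo"
```

Note that 'foo' on its own passes both filters: at the edge of the string there is nothing for the inner regexp to match, so the negated assertion succeeds.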
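The fixed-width restriction on lookbehind can be probed by compiling the two patterns mentioned above at run time. This is only a sketch: compiles() is a hypothetical helper written for this example, and it relies on perl raising a catchable error when it cannot compile a pattern. The second pattern should be rejected because (ab)* has no bounded length.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the pattern compiles, 0 otherwise.
# The pattern is interpolated so that compilation happens at run time,
# where a failure can be trapped by eval.
sub compiles {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? 1 : 0;
}

my $fixed     = compiles('(?<=(ab|bc))');   # both alternatives are 2 characters
my $unbounded = compiles('(?<=(ab)*)');     # (ab)* has no fixed length

print "fixed-width lookbehind compiles: $fixed\n";
print "unbounded lookbehind compiles:   $unbounded\n";
```

Checking patterns this way is handy when a regexp is built from user input and a compile error should be reported rather than allowed to abort the program.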