|
The last topic of Part 1 briefly covers how regexps are used in Perl programs. Where do they
fit into Perl syntax?
We have already introduced the matching operator in its default /regexp/ and
arbitrary delimiter m!regexp! forms. We have used the binding operator =~
and its negation !~ to test for string matches. Associated with the matching
operator, we have discussed the single line //s, multi-line //m,
case-insensitive //i and extended //x modifiers.
There are a few more things you might want to know about matching operators. First, we
pointed out earlier that variables in regexps are substituted before the regexp is evaluated:
$pattern = 'Seuss';
while (<>) {
    print if /$pattern/;
}
|
|
This will print any lines containing the word Seuss. It is not as efficient as
it could be, however, because perl has to re-evaluate $pattern each time through
the loop. If $pattern won't be changing over the lifetime of the script, we can add
the //o modifier, which directs perl to only perform variable substitutions once:
#!/usr/bin/perl
# Improved simple_grep
$regexp = shift;
while (<>) {
    print if /$regexp/o; # a good deal faster
}
|
|
If you change the variable after the first substitution happens, perl will ignore
the change. If you don't want any substitutions at all, use the special delimiter m'':
@pattern = ('Seuss');
while (<>) {
    print if m'@pattern'; # matches the literal text '@pattern', not 'Seuss'
}
|
|
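The three behaviours (ordinary substitution, //o, and m'') can be compared side by side. What follows is only a minimal sketch: the variable names and sample strings are made up for illustration. An array is used for the m'' part because a $ left uninterpolated inside a regexp is read as the end-of-line anchor.

```perl
use strict;
use warnings;

# //o: the variable is substituted on the first evaluation only.
my $pattern = 'cat';
my @hits;
for my $line ('cat', 'dog', 'cat dog') {
    push @hits, $line if $line =~ /$pattern/o;
    $pattern = 'dog';    # ignored by the //o match above
}
print "matched with //o: @hits\n";

# m'': no substitution at all, so the regexp is the text '@list' itself.
my @list = ('Seuss');
my $with_slashes = ('Dr. Seuss'      =~ m/@list/) ? 1 : 0;  # interpolated
my $with_quotes  = ('Dr. Seuss'      =~ m'@list') ? 1 : 0;  # not interpolated
my $literal      = ('the text @list' =~ m'@list') ? 1 : 0;  # literal match
print "m//: $with_slashes, m'': $with_quotes, literal text: $literal\n";
```

Without the //o, the second line ('dog') would also be collected, because the changed $pattern would be picked up on the next pass through the loop.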
m'' acts like single quotes on a regexp; all other m delimiters act
like double quotes. If the regexp evaluates to the empty string, the regexp in the last
successful match is used instead. So we have
"dog" =~ /d/;      # 'd' matches
"dogbert" =~ //;   # this matches the 'd' regexp used before
|
|
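One way to convince yourself that the empty regexp really reuses the previous pattern, rather than matching trivially, is to try it against a string that has no 'd' in it. This is a small sketch with made-up test strings:

```perl
use strict;
use warnings;

"dog" =~ /d/ or die "expected 'd' to match";

# Both of these reuse the last successfully matched regexp, /d/.
my $dogbert = ("dogbert" =~ //) ? 1 : 0;
my $cat     = ("cat"     =~ //) ? 1 : 0;

print "dogbert: $dogbert, cat: $cat\n";
```

If // were simply an empty pattern here, it would match "cat" as well.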
The final two modifiers //g and //c concern multiple matches. The
modifier //g stands for global matching and allows the matching operator to match
within a string as many times as possible. In scalar context, successive invocations against a
string will have //g jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the pos() function.
The use of //g is shown in the following example. Suppose we have a string that
consists of words separated by spaces. If we know how many words there are in advance, we could
extract the words using groupings:
$x = "cat dog house"; # 3 words
$x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
# $1 = 'cat'
# $2 = 'dog'
# $3 = 'house'
|
|
But what if we had an indeterminate number of words? This is the sort of task //g
was made for. To extract all words, form the simple regexp (\w+) and loop over all
matches with /(\w+)/g:
while ($x =~ /(\w+)/g) {
    print "Word is $1, ends at position ", pos $x, "\n";
}
|
|
prints
Word is cat, ends at position 3
Word is dog, ends at position 7
Word is house, ends at position 13
|
|
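Since pos() can be assigned to as well as read, a scalar //g match can be told where to start. The following sketch reuses the same three-word string; the array @seen and the variable $next are names chosen for this illustration only.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Walk through the string, recording each word and where it ends.
my @seen;
while ($x =~ /(\w+)/g) {
    push @seen, "$1:" . pos($x);
}
print "seen: @seen\n";

# Setting the position makes the next scalar //g match start there.
pos($x) = 4;
my $ok   = $x =~ /(\w+)/g;   # scalar context, starts at offset 4
my $next = $1;
print "after pos = 4 the next word is $next, ending at ", pos($x), "\n";
```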
A failed match or changing the target string resets the position. If you don't want the
position reset after a failed match, add the //c modifier, as in /regexp/gc.
The current position in the string is associated with the string, not the regexp. This means
that different strings have different positions and their respective positions can be set or
read independently.
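The effect of //c, and the fact that each string carries its own position, can be checked directly with pos(). This is a minimal sketch using two made-up strings:

```perl
use strict;
use warnings;

my $s = "aaa bbb";
my $t = "xx yy";

$s =~ /a+/g;                 # leaves pos($s) at 3
$t =~ /x+/g;                 # leaves pos($t) at 2, independently of $s
my @independent = (pos($s), pos($t));

$s =~ /zzz/g;                # fails without //c: the position is reset
my $after_g = pos($s);       # undef

$s =~ /a+/g;                 # matches from the start again, pos($s) is 3
$s =~ /zzz/gc;               # fails with //c: the position is kept
my $after_gc = pos($s);      # still 3

print "independent positions: @independent\n";
print "after failed //g:  ", (defined $after_g ? $after_g : 'undef'), "\n";
print "after failed //gc: $after_gc\n";
```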
In list context, //g returns a list of matched groupings, or if there are no
groupings, a list of matches to the whole regexp. So if we wanted just the words, we could use
@words = ($x =~ /(\w+)/g); # matches,
                           # $words[0] = 'cat'
                           # $words[1] = 'dog'
                           # $words[2] = 'house'
|
|
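When the regexp has more than one grouping, the list returned by //g contains the groupings of every match in order, which fits naturally with hash assignment. A small sketch, assuming a made-up string of key=value pairs:

```perl
use strict;
use warnings;

my $cfg = "a=1 b=2 c=3";

# Two groupings per match: the flat list (a, 1, b, 2, c, 3) fills a hash.
my %value_of = $cfg =~ /(\w+)=(\d+)/g;

# No groupings: the list holds the text matched by the whole regexp.
my @pairs = $cfg =~ /\w+=\d+/g;

print "$_ => $value_of{$_}\n" for sort keys %value_of;
print "pairs: @pairs\n";
```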
Closely associated with the //g modifier is the \G anchor. The \G
anchor matches at the point where the previous //g match left off. \G
allows us to easily do context-sensitive matching:
$metric = 1; # use metric units
...
$x = <FILE>; # read in measurement
$x =~ /^([+-]?\d+)\s*/g; # get magnitude
$weight = $1;
if ($metric) { # error checking
    print "Units error!" unless $x =~ /\Gkg\./g;
}
else {
    print "Units error!" unless $x =~ /\Glbs\./g;
}
$x =~ /\G\s+(widget|sprocket)/g; # continue processing
|
|
The combination of //g and \G allows us to process the string a bit
at a time and use arbitrary Perl logic to decide what to do next. Currently, the \G
anchor is only fully supported when used to anchor to the start of the pattern.
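A common way to put this "bit at a time" processing to work is a small tokenizer, in which each alternative is tried at the current position with \G and //gc so that a failed attempt does not lose the place. The following is only a sketch under simple assumptions: the input string, the token names NUM and OP, and the set of operators are all invented for the example.

```perl
use strict;
use warnings;

my $input = "12 + 34";
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)        { next }                      # skip spaces
    elsif ($input =~ /\G(\d+)/gc)      { push @tokens, "NUM:$1" }    # a number
    elsif ($input =~ /\G([-+*\/])/gc)  { push @tokens, "OP:$1" }     # an operator
    else  { die "unexpected character at position " . pos($input) }
}

print "tokens: @tokens\n";
```

The //c modifier matters here: without it, the first alternative that failed would reset the position and the following \G would anchor at the start of the string again.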
\G is also invaluable in processing fixed length records with regexps. Suppose
we have a snippet of coding region DNA, encoded as base pair letters ATCGTTGAAT...
and we want to find all the stop codons TGA. In a coding region, codons are
3-letter sequences, so we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp
# expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
$dna = "ATCGTTGAATGCAAATGACATGAC";
$dna =~ /TGA/;
|
|
doesn't work; it may match a TGA, but there is no guarantee that the match is
aligned with codon boundaries, e.g., the substring GTT GAA gives a
match. A better solution is
while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
|
|
which prints
Got a TGA stop codon at position 18
Got a TGA stop codon at position 23
|
|
Position 18 is good, but position 23 is bogus. What happened?
The answer is that our regexp works well until we get past the last real match. Then the
regexp will fail to match a synchronized TGA and start stepping ahead one character
position at a time, not what we want. The solution is to use \G to anchor the match
to the codon alignment:
while ($dna =~ /\G(\w\w\w)*?TGA/g) {
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
|
|
This prints
Got a TGA stop codon at position 18
|
|
which is the correct answer. This example illustrates that it is important not only to match
what is desired, but to reject what is not desired.
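The same comparison can be run on a sequence with more than one aligned stop codon. In this sketch the DNA string is made up for the illustration (in frame it reads ATC TGA GGG TGA CTG AC, with an out-of-frame TGA near the end), and non-capturing groups (?:...) are used simply because the captured text is not needed.

```perl
use strict;
use warnings;

my $dna = "ATCTGAGGGTGACTGAC";
my (@unanchored, @anchored);

# Without \G the regexp is free to slide forward one character at a time.
push @unanchored, pos($dna) while $dna =~ /(?:\w\w\w)*?TGA/g;

# With \G every match must start exactly where the previous one ended.
push @anchored, pos($dna) while $dna =~ /\G(?:\w\w\w)*?TGA/g;

print "without \\G: @unanchored\n";
print "with \\G:    @anchored\n";
```

The unanchored loop reports the two aligned stop codons and then the out-of-frame one; the anchored loop stops after the aligned ones.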
Search and replace
Regular expressions also play a big role in search and replace operations in Perl.
Search and replace is accomplished with the s/// operator. The general form is s/regexp/replacement/modifiers,
with everything we know about regexps and modifiers applying in this case as well. The replacement
is a Perl double quoted string that replaces in the string whatever is matched with the regexp.
The operator =~ is also used here to associate a string with s///. If
matching against $_, the $_ =~ can be dropped. If there is
a match, s/// returns the number of substitutions made, otherwise it returns false.
Here are a few examples:
$x = "Time to feed the cat!";
$x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
    $more_insistent = 1;
}
$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/; # strip single quotes,
# $y contains "quoted words"
|
|
In the last example, the whole string was matched, but only the part inside the single quotes
was grouped. With the s/// operator, the matched variables $1, $2,
etc. are immediately available for use in the replacement expression, so we use $1
to replace the quoted string with just what was quoted. With the global modifier, s///g
will search and replace all occurrences of the regexp in the string:
$x = "I batted 4 for 4";
$x =~ s/4/four/; # doesn't do it all:
# $x contains "I batted four for 4"
$x = "I batted 4 for 4";
$x =~ s/4/four/g; # does it all:
# $x contains "I batted four for four"
|
|
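Because s/// returns the number of substitutions made, the return value can be used both as a count and as a success test. A brief sketch, with variable names invented for the example:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);      # number of replacements made

my $y = "no digits here";
my $none = ($y =~ s/4/four/g);       # false: nothing was replaced

print "replaced $count times: $x\n";
print "second string ", ($none ? "changed" : "unchanged"), ": $y\n";
```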
If you prefer 'regex' over 'regexp' in this tutorial, you could use the following program to
replace it:
% cat > simple_replace
#!/usr/bin/perl
$regexp = shift;
$replacement = shift;
while (<>) {
    s/$regexp/$replacement/go;
    print;
}
^D
% simple_replace regexp regex perlretut.pod
|
|
In simple_replace we used the s///g modifier to replace all
occurrences of the regexp on each line and the s///o modifier to compile the regexp
only once. As with simple_grep, both the print and the s/$regexp/$replacement/go
use $_ implicitly.
A modifier available specifically to search and replace is the s///e evaluation
modifier. s///e treats the replacement text as Perl code rather than a double-quoted
string, and the value that the code returns is substituted for the matched substring. s///e is useful if
you need to do a bit of computation in the process of replacing text. This example counts
character frequencies in a line:
$x = "Bill the cat";
$x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
print "frequency of '$_' is $chars{$_}\n"
    foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
|
|
This prints (characters with the same count may come out in a different order)
frequency of ' ' is 2
frequency of 't' is 2
frequency of 'l' is 2
frequency of 'B' is 1
frequency of 'c' is 1
frequency of 'e' is 1
frequency of 'h' is 1
frequency of 'i' is 1
frequency of 'a' is 1
|
|
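A more everyday use of s///e is to do arithmetic or formatting on the matched text. The two substitutions below are a small sketch on made-up strings:

```perl
use strict;
use warnings;

# Double every integer in the string.
my $fruit = "3 apples and 5 pears";
$fruit =~ s/(\d+)/$1 * 2/eg;

# Round every decimal number to two places with sprintf.
my $price = "pi is roughly 3.14159";
$price =~ s/(\d+\.\d+)/sprintf("%.2f", $1)/eg;

print "$fruit\n";
print "$price\n";
```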
As with the match m// operator, s/// can use other delimiters, such
as s!!! and s{}{}, and even s{}//. If single quotes are
used, as in s''', then the regexp and replacement are treated as single quoted strings and
there are no substitutions. s/// in list context returns the same thing as in
scalar context, i.e., the number of matches.
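Alternative delimiters are most useful when the regexp itself contains slashes, and the single quote form when the replacement should be taken literally. A short sketch, with paths and names invented for the example:

```perl
use strict;
use warnings;

# Braces avoid having to escape the slashes in a path.
my $path = "/usr/local/bin";
$path =~ s{/usr/local}{/opt};

# With s/// the replacement is interpolated; with s''' it is not.
my $name = 'World';
(my $greeting = 'Hello NAME') =~ s/NAME/$name/;
(my $template = 'Hello NAME') =~ s'NAME'$name';

print "$path\n";
print "$greeting\n";
print "$template\n";
```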
The split operator
The split function can also optionally use a matching operator m//
to split a string. split /regexp/, string, limit splits string into a
list of substrings and returns that list. The regexp matches the separators, that is, the
character sequences at which the string is split. The limit, if present,
restricts the result to at most limit substrings. For example, to
split a string into words, use
$x = "Calvin and Hobbes";
@words = split /\s+/, $x; # $words[0] = 'Calvin'
                          # $words[1] = 'and'
                          # $words[2] = 'Hobbes'
|
|
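The limit argument is handy when only the first field or two should be separated and the rest kept intact. The record below is a made-up example in the style of a colon-separated password file entry:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:root:/root:/bin/sh";

# With a limit of 2, the string is cut at the first colon only.
my ($user, $rest) = split /:/, $record, 2;

# Without a limit, every colon separates a field.
my @fields = split /:/, $record;

print "user: $user\n";
print "rest: $rest\n";
print "number of fields: ", scalar(@fields), "\n";
```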
If the empty regexp // is used, the regexp always matches and the string is
split into individual characters. If the regexp has groupings, then the list produced contains the
matched substrings from the groupings as well. For instance,
$x = "/usr/bin/perl";
@dirs = split m!/!, $x; # $dirs[0] = ''
# $dirs[1] = 'usr'
# $dirs[2] = 'bin'
# $dirs[3] = 'perl'
@parts = split m!(/)!, $x; # $parts[0] = ''
# $parts[1] = '/'
# $parts[2] = 'usr'
# $parts[3] = '/'
# $parts[4] = 'bin'
# $parts[5] = '/'
# $parts[6] = 'perl'
|
|
Since the first character of $x matched the regexp, split prepended an empty
initial element to the list.
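Both special cases, the empty regexp and a regexp with a grouping, can be tried on short strings. This is a minimal sketch with made-up input:

```perl
use strict;
use warnings;

# The empty regexp splits between every pair of characters.
my @chars = split //, "perl";

# A grouping around the separator keeps the separator in the list.
my @parts = split /\s*(=)\s*/, "key = value";

print "chars: @chars\n";
print "parts: ", join('|', @parts), "\n";
```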
If you have read this far, congratulations! You now have all the basic tools needed to use
regular expressions to solve a wide range of text processing problems. If this is your first
time through the tutorial, why not stop here and play around with regexps a while... Part 2
concerns the more esoteric aspects of regular expressions and those concepts certainly aren't
needed right at the start.
|