Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default /regexp/ and arbitrary delimiter m!regexp! forms. We have used the binding operator =~ and its negation !~ to test for string matches. Associated with the matching operator, we have discussed the single line //s, multi-line //m, case-insensitive //i and extended //x modifiers.

There are a few more things you might want to know about matching operators. First, we pointed out earlier that variables in regexps are substituted before the regexp is evaluated:

 
    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }  

This will print any lines containing the word Seuss. It is not as efficient as it could be, however, because perl has to re-evaluate $pattern each time through the loop. If $pattern won't be changing over the lifetime of the script, we can add the //o modifier, which directs perl to only perform variable substitutions once:

 
    #!/usr/bin/perl
    #    Improved simple_grep
    $regexp = shift;
    while (<>) {
        print if /$regexp/o;  # a good deal faster
    }  
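A minimal sketch of what "only once" means in practice. The helper names match_frozen and match_fresh are hypothetical, introduced only for this illustration; the first uses //o and the second does not:

```perl
use strict;
use warnings;

# Hypothetical helper: with //o the pattern is interpolated and compiled
# the first time this match is executed, and never again.
sub match_frozen {
    my ($pattern, $string) = @_;
    return $string =~ /$pattern/o ? 1 : 0;
}

# Same test without //o: $pattern is interpolated on every call.
sub match_fresh {
    my ($pattern, $string) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

my @frozen = (match_frozen('cat', 'cat'), match_frozen('dog', 'dog'));
my @fresh  = (match_fresh('cat', 'cat'),  match_fresh('dog', 'dog'));

print "with //o:    @frozen\n";   # second call still tests for 'cat'
print "without //o: @fresh\n";
```

The second call to match_frozen fails because the regexp is still the 'cat' that was compiled on the first call, so //o is only safe when the variable really is constant.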

If you change $regexp after the first substitution happens, perl will ignore the change. If you don't want any substitutions at all, use the special delimiter m'':

 
    $pattern = 'Seuss';
    while (<>) {
        print if m'$pattern';  # no interpolation: the regexp is the text $pattern, not 'Seuss'
    }  

m'' acts like single quotes on a regexp; all other m delimiters act like double quotes. If the regexp evaluates to the empty string, the regexp in the last successful match is used instead. So we have

 
    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before  
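A small sketch of the same rule with a string that does not contain a 'd'. The variable names are illustrative only; the point is that the empty regexp is not "match anything" once an earlier match has succeeded:

```perl
use strict;
use warnings;

"dog" =~ /d/ or die "expected a match";   # last successful regexp is now 'd'

# Both tests below use the empty regexp, so both are really tests for 'd'.
my $dogbert = ("dogbert" =~ //) ? "match" : "no match";
my $catbert = ("catbert" =~ //) ? "match" : "no match";

print "dogbert: $dogbert\n";   # contains a 'd'
print "catbert: $catbert\n";   # no 'd', so the reused regexp fails
```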

The final two modifiers //g and //c concern multiple matches. The modifier //g stands for global matching and allows the matching operator to match within a string as many times as possible. In scalar context, successive invocations against a string will have //g jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the pos() function.

The use of //g is shown in the following example. Suppose we have a string that consists of words separated by spaces. If we know how many words there are in advance, we could extract the words using groupings:

 
    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'  

But what if we had an indeterminate number of words? This is the sort of task //g was made for. To extract all words, form the simple regexp (\w+) and loop over all matches with /(\w+)/g:

 
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }  

prints

 
    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13  

A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the //c, as in /regexp/gc. The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.
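The following sketch walks through those rules on a throwaway string (the string and variable names are assumptions made for the illustration): a successful //g match sets the position, a failed //g match resets it, pos() can set it by hand, and //gc leaves it alone on failure.

```perl
use strict;
use warnings;

my $s = "aaa bbb";

$s =~ /a+/g;                    # succeeds, position moves past "aaa"
my $after_match = pos $s;

$s =~ /z/g;                     # fails, position is reset
my $after_fail = defined pos($s) ? pos($s) : 'undef';

pos($s) = 4;                    # set the position by hand
$s =~ /z/gc;                    # fails, but //c keeps the position
my $after_fail_c = pos $s;

$s =~ /b+/g;                    # search resumes at 4 and finds "bbb"
my $after_resume = pos $s;

print "after match:       $after_match\n";
print "after failed //g:  $after_fail\n";
print "after failed //gc: $after_fail_c\n";
print "after resuming:    $after_resume\n";
```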

In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp. So if we wanted just the words, we could use

 
    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'  
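With more than one grouping, list-context //g returns the groupings from every match in order, which makes it convenient for filling a hash. A short sketch, assuming a made-up line of name=value settings:

```perl
use strict;
use warnings;

my $line = "width=80 height=24 depth=8";

# Each match contributes two elements, (name, value), so the flat
# list can be assigned directly to a hash.
my %size = ($line =~ /(\w+)=(\d+)/g);

print "$_ is $size{$_}\n" for sort keys %size;
```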

Closely associated with the //g modifier is the \G anchor. The \G anchor matches at the point where the previous //g match left off. \G allows us to easily do context-sensitive matching:

 
    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing  

The combination of //g and \G allows us to process the string a bit at a time and use arbitrary Perl logic to decide what to do next. Currently, the \G anchor is only fully supported when used to anchor to the start of the pattern.
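One common way to use this combination is a small tokenizer, sketched below. The input string and the token labels NUM, OP and UNKNOWN are invented for the example. Each branch anchors with \G at the start of its pattern and uses //gc so that a failed branch does not reset the position:

```perl
use strict;
use warnings;

my $input = "12 + 7 * x";
my @tokens;

pos($input) = 0;                       # start scanning at the beginning
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)       { next }                      # skip spaces
    elsif ($input =~ /\G(\d+)/gc)     { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([-+*\/])/gc) { push @tokens, "OP($1)" }
    else {
        $input =~ /\G(.)/gcs;          # consume one character of anything else
        push @tokens, "UNKNOWN($1)";
    }
}

print join(" ", @tokens), "\n";
```

Because every branch either advances the position or falls through to the final one-character match, the loop always makes progress.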

\G is also invaluable in processing fixed length records with regexps. Suppose we have a snippet of coding region DNA, encoded as base pair letters ATCGTTGAAT... and we want to find all the stop codons TGA. In a coding region, codons are 3-letter sequences, so we can think of the DNA snippet as a sequence of 3-letter records. The naive regexp

 
    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;  

doesn't work; it may match a TGA, but there is no guarantee that the match is aligned with codon boundaries, e.g., the substring GTT GAA gives a match. A better solution is

 
    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }  

which prints

 
    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23  

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized TGA and start stepping ahead one character position at a time, not what we want. The solution is to use \G to anchor the match to the codon alignment:

 
    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }  

This prints

 
    Got a TGA stop codon at position 18  

which is the correct answer. This example illustrates that it is important not only to match what is desired, but to reject what is not desired.

Search and replace

Regular expressions also play a big role in search and replace operations in Perl. Search and replace is accomplished with the s/// operator. The general form is s/regexp/replacement/modifiers, with everything we know about regexps and modifiers applying in this case as well. The replacement is a Perl double quoted string that replaces whatever the regexp matched in the string. The operator =~ is also used here to associate a string with s///. If matching against $_, the $_ =~ can be dropped. If there is a match, s/// returns the number of substitutions made; otherwise it returns false. Here are a few examples:

 
    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"  

In the last example, the whole string was matched, but only the part inside the single quotes was grouped. With the s/// operator, the matched variables $1, $2, etc. are immediately available for use in the replacement expression, so we use $1 to replace the quoted string with just what was quoted. With the global modifier, s///g will search and replace all occurrences of the regexp in the string:

 
    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"  
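The return value mentioned earlier is easy to check with the same string. A brief sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);   # number of substitutions made
my $none  = ($x =~ s/9/nine/g);   # no match: returns a false value

print "replaced $count occurrences\n";
print "second attempt: ", ($none ? "changed" : "nothing to replace"), "\n";
print "$x\n";
```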

If you prefer 'regex' over 'regexp' in this tutorial, you could use the following program to replace it:

 
    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/go;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod  

In simple_replace we used the s///g modifier to replace all occurrences of the regexp on each line and the s///o modifier to compile the regexp only once. As with simple_grep, both the print and the s/$regexp/$replacement/go use $_ implicitly.

A modifier available specifically to search and replace is the s///e evaluation modifier. s///e wraps an eval{...} around the replacement string and the evaluated result is substituted for the matched substring. s///e is useful if you need to do a bit of computation in the process of replacing text. This example counts character frequencies in a line:

 
    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);  

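This prints the following (characters with equal counts may come out in a different order, because the order of hash keys is not fixed)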

 
    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1  
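For a case where the computed value itself becomes the replacement, here is a short sketch that doubles every number in a string. The sample text is made up for the illustration:

```perl
use strict;
use warnings;

my $recipe = "2 eggs, 3 cups flour, 1 pinch salt";

# Copy the string, then let s///e evaluate "$1 * 2" as Perl code
# for every number found.
(my $doubled = $recipe) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";
```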

As with the match m// operator, s/// can use other delimiters, such as s!!! and s{}{}, and even s{}//. If single quotes are used s''', then the regexp and replacement are treated as single quoted strings and there are no substitutions. s/// in list context returns the same thing as in scalar context, i.e., the number of matches.
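A few of these forms side by side, as a sketch with invented data. The last two lines contrast an ordinary s/// replacement, which interpolates $price, with s''', which does not:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# Bracketing delimiters avoid having to escape the slashes in a path.
(my $moved = $path) =~ s{/usr/local}{/opt};

# Any other punctuation works too.
(my $colons = $path) =~ s!/!::!g;

my $price = "42";
(my $interpolated = "cost: X") =~ s/X/$price/;   # replacement is "42"
(my $literal      = "cost: X") =~ s'X'$price';   # replacement is the text $price

print "$moved\n$colons\n$interpolated\n$literal\n";
```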

The split operator

The split function can also optionally use a matching operator m// to split a string. split /regexp/, string, limit splits string into a list of substrings and returns that list. The regexp is used to match the character sequence that the string is split with respect to. The limit, if present, constrains splitting into no more than limit number of strings. For example, to split a string into words, use

 
    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'  
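The limit argument is not used in the example above, so here is a small sketch of its effect. The colon-separated record is a made-up sample:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:System Administrator:/root:/bin/sh";

my @all    = split /:/, $record;      # split at every colon
my @first3 = split /:/, $record, 3;   # at most 3 pieces; the last keeps the rest

print scalar(@all), " fields without a limit\n";
print scalar(@first3), " fields with a limit of 3\n";
print "last limited field: $first3[2]\n";
```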

If the empty regexp // is used, the regexp always matches and the string is split into individual characters. If the regexp has groupings, then the list produced contains the matched substrings from the groupings as well. For instance,

 
    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;  # $dirs[0] = ''
                             # $dirs[1] = 'usr'
                             # $dirs[2] = 'bin'
                             # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'
                                # $parts[5] = '/'
                                # $parts[6] = 'perl'  

Since the first character of $x matched the regexp, split prepended an empty initial element to the list.
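The empty-regexp case described above has no example of its own, so here is a brief sketch, together with one more use of a grouping to keep the separators. The input strings are invented for the illustration:

```perl
use strict;
use warnings;

# Empty regexp: split between every pair of characters.
my @chars = split //, "perl";

# Grouping in the separator: the operators are kept in the list.
my @terms = split /\s*([-+])\s*/, "1 + 2 - 3";

print join("|", @chars), "\n";
print join("|", @terms), "\n";
```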

If you have read this far, congratulations! You now have all the basic tools needed to use regular expressions to solve a wide range of text processing problems. If this is your first time through the tutorial, why not stop here and play around with regexps a while... Part 2 concerns the more esoteric aspects of regular expressions and those concepts certainly aren't needed right at the start.

 

 

 


Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com