Website hosting service by Active-Venture.com
  

 Back to Index

Extracting matches

The grouping metacharacters () also serve another completely different function: they allow the extraction of the parts of a string that matched. This is very useful to find out what matched and for text processing in general. For each grouping, the part that matched inside goes into the special variables $1, $2, etc. They can be used just as ordinary variables:

 
    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;  

Now, we know that in scalar context, $time =~ /(\d\d):(\d\d):(\d\d)/  returns a true or false value. In list context, however, it returns the list of matched values ($1,$2,$3). So we could write the code more compactly as

 
    # extract hours, minutes, seconds
    ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);  

If the groupings in a regexp are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc. For example, here is a complex regexp and the matching variables indicated below it:

 
    /(ab(cd|ef)((gi)|j))/;
     1  2      34  

so that if the regexp matched, e.g., $2 would contain 'cd' or 'ef'. For convenience, perl sets $+ to the string held by the highest numbered $1, $2, ... that got assigned (and, somewhat related, $^N to the value of the $1, $2, ... most-recently assigned; i.e. the $1, $2, ... associated with the rightmost closing parenthesis used in the match).

Closely associated with the matching variables $1, $2, ... are the backreferences \1, \2, ... . Backreferences are simply matching variables that can be used inside a regexp. This is a really nice feature - what matches later in a regexp can depend on what matched earlier in the regexp. Suppose we wanted to look for doubled words in text, like 'the the'. The following regexp finds all 3-letter doubles with a space in between:

 
    /(\w\w\w)\s\1/;  

The grouping assigns a value to \1, so that the same 3 letter sequence is used for both parts. Here are some words with repeated parts:

 
    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa  

The regexp has a single grouping which considers 4-letter combinations, then 3-letter combinations, etc. and uses \1 to look for a repeat. Although $1 and \1 represent the same thing, care should be taken to use matched variables $1, $2, ... only outside a regexp and backreferences \1, \2, ... only inside a regexp; not doing so may lead to surprising and/or undefined results.

In addition to what was matched, Perl 5.6.0 also provides the positions of what was matched with the @- and @+ arrays. $-[0] is the position of the start of the entire match and $+[0] is the position of the end. Similarly, $-[n] is the position of the start of the $n match and $+[n] is the position of the end. If $n is undefined, so are $-[n] and $+[n]. Then this code

 
    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }  

prints

 
    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)  

Even if there are no groupings in a regexp, it is still possible to find out what exactly matched in a string. If you use them, perl will set $` to the part of the string before the match, will set $& to the part of the string that matched, and will set $' to the part of the string after the match. An example:

 
    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'  

In the second match, $` = ''  because the regexp matched at the first character position in the string and stopped, it never saw the second 'the'. It is important to note that using $` and $' slows down regexp matching quite a bit, and $& slows it down to a lesser extent, because if they are used in one regexp in a program, they are generated for <all> regexps in the program. So if raw performance is a goal of your application, they should be avoided. If you need them, use @- and @+ instead:

 
    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )  

 

 

 

Domain name registration service & domain search - 
Register cheap domain name from $7.95 and enjoy free domain services 
 

Cheap domain name search service -
Domain name services at just
$8.95/year only
 

Register domain name -
Buy domain name registration and cheap domain transfer at low, affordable price.

© 2002-2004 Active-Venture.com Web Site Hosting Service

 

[ Around computers it is difficult to find the correct unit of time to measure progress. Some cathedrals took a century to complete. Can you imagine the grandeur and scope of a program that would take as long?   ]

 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com