
Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If matching regular expressions is analogous to a walk in the woods, then the tools discussed in Part 1 are analogous to topo maps and a compass, basic tools we use all the time. Most of the tools in Part 2 are analogous to flare guns and satellite phones. They aren't used too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric capabilities of perl regexps. In Part 2, we will assume you are comfortable with the basics and concentrate on the new features.

More on characters, strings, and character classes

There are a number of escape sequences and character classes that we haven't covered yet.

There are several escape sequences that convert characters or strings between upper and lower case. \l and \u convert the next character to lower or upper case, respectively:

 
    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'

\L and \U convert a whole substring, delimited by \L or \U and \E, to lower or upper case:

 
    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string  

If there is no \E, case is converted until the end of the string. The regexps \L\u$word or \u\L$word convert the first character of $word to uppercase and the rest of the characters to lowercase.
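As a small illustration of the case escapes above, the following sketch normalizes a word with \u\L and uses the same escapes inside a regexp. The variable names and sample strings are made up for the example.

```perl
use strict;
use warnings;

my $word = "pERL";                  # hypothetical input with mixed case

# \L lowercases the whole word, \u then uppercases its first character
my $name = "\u\L$word";
print "$name\n";                    # prints "Perl"

# The same escapes work inside a regexp; this pattern interpolates to /Perl/
my $text  = "I like Perl a lot";
my $found = $text =~ /\u\L$word/ ? 1 : 0;
print "found: $found\n";            # prints "found: 1"

# With no \E, \U applies until the end of the string
my $shout = "\Uquiet words";
print "$shout\n";                   # prints "QUIET WORDS"
```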

Control characters can be escaped with \c, so that a control-Z character would be matched with \cZ. The escape sequence \Q...\E quotes, or protects, most non-alphabetic characters. For instance,

 
    $x = "That !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language  

It does not protect $ or @, so that variables can still be substituted.
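To see what the quoting buys us, here is a minimal sketch that matches the same interpolated variable with and without \Q...\E. The variable $literal and the test strings are assumptions chosen for the example; quotemeta is the built-in function that does the same job as \Q.

```perl
use strict;
use warnings;

my $literal = "a.b*c";               # hypothetical text we want to match literally
my $quoted  = quotemeta($literal);   # same effect as \Q...\E
print "$quoted\n";                   # prints "a\.b\*c"

# Without quoting, '.' and '*' act as metacharacters
my $plain_hit  = "aXbbbc" =~ /^$literal$/     ? 1 : 0;   # 1
# With \Q...\E, only the literal text matches
my $quoted_hit = "aXbbbc" =~ /^\Q$literal\E$/ ? 1 : 0;   # 0
my $exact_hit  = "a.b*c"  =~ /^\Q$literal\E$/ ? 1 : 0;   # 1

print "$plain_hit $quoted_hit $exact_hit\n";             # prints "1 0 1"
```

Note that $literal is still interpolated inside \Q...\E, which is exactly why $ and @ are left unprotected.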

With the advent of 5.6.0, perl regexps can handle more than just the standard ASCII character set. Perl now supports Unicode, a standard for encoding the character sets from many of the world's written languages. Unicode does this by allowing characters to be more than one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters are still encoded as one byte, but characters greater than chr(127) may be stored as two or more bytes.

What does this mean for regexps? Well, regexp users don't need to know much about perl's internal representation of strings. But they do need to know 1) how to represent Unicode characters in a regexp and 2) when a matching operation will treat the string to be searched as a sequence of bytes (the old way) or as a sequence of Unicode characters (the new way). The answer to 1) is that Unicode characters greater than chr(127) may be represented using the \x{hex} notation, with hex a hexadecimal integer:

 
    /\x{263a}/;  # match a Unicode smiley face :)  

Unicode characters in the range of 128-255 use two hexadecimal digits with braces: \x{ab}. Note that this is different from \xab, which is just a hexadecimal byte with no Unicode significance.
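The \x{hex} notation works the same way in double-quoted strings and in regexps. A short sketch, with a made-up sample string:

```perl
use strict;
use warnings;

my $smiley = "\x{263a}";                  # WHITE SMILING FACE
my $text   = "hello $smiley world";       # hypothetical sample text

my $has_smiley = $text =~ /\x{263a}/ ? 1 : 0;
my $length     = length($smiley);         # counted in characters, not bytes

printf "U+%04X, length %d, found %d\n", ord($smiley), $length, $has_smiley;
# prints "U+263A, length 1, found 1"
```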

NOTE: in Perl 5.6.0 one needed to say use utf8 to use any Unicode features. This is no longer the case: for almost all Unicode processing, the explicit utf8 pragma is not needed. (The only case where it matters is if your Perl script is in Unicode and encoded in UTF-8; then an explicit use utf8 is needed.)

Figuring out the hexadecimal sequence of a Unicode character you want or deciphering someone else's hexadecimal Unicode regexp is about as much fun as programming in machine code. So another way to specify Unicode characters is to use the named character escape sequence \N{name}. name is a name for the Unicode character, as specified in the Unicode standard. For instance, if we wanted to represent or match the astrological sign for the planet Mercury, we could use

 
    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;   # matches  

One can also use short names or restrict names to a certain alphabet:

 
    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";  

A list of full names is found in the file Names.txt in the lib/perl5/5.X.X/unicore directory.
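Rather than searching Names.txt by hand, the charnames module itself can translate between names and code points. The sketch below uses charnames::viacode and charnames::vianame for that; the choice of MERCURY is only to stay with the example above.

```perl
use strict;
use warnings;
use charnames ':full';

my $name = charnames::viacode(0x263F);      # code point -> official name
my $code = charnames::vianame("MERCURY");   # official name -> code point

# \N{name} and \x{hex} denote the same character
my $same = "\N{MERCURY}" eq "\x{263F}" ? 1 : 0;

printf "%s is U+%04X (same: %d)\n", $name, $code, $same;
# prints "MERCURY is U+263F (same: 1)"
```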

The answer to requirement 2), as of 5.6.0, is that if a regexp contains Unicode characters, the string is searched as a sequence of Unicode characters. Otherwise, the string is searched as a sequence of bytes. If the string is being searched as a sequence of Unicode characters, but matching a single byte is required, we can use the \C escape sequence. \C is a character class akin to . except that it matches any byte 0-255. So

 
    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;  # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # one character, stored as three bytes in UTF-8
    $x =~ /\C/;  # matches, but dangerous!  

The last regexp matches, but is dangerous because the string character position is no longer synchronized to the string byte position. This generates the warning 'Malformed UTF-8 character'. \C is best used for matching the binary data in strings with binary data intermixed with Unicode characters.
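The difference between characters and bytes can be inspected without \C, which matters because \C was removed from the language in Perl 5.24. A minimal sketch, assuming the core Encode module is available, that encodes the character explicitly and looks at the resulting bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{263F}";                # MERCURY, as in the example above
my $bytes = encode("UTF-8", $char);    # explicit UTF-8 byte string

my $char_len = length($char);          # length in characters
my $byte_len = length($bytes);         # length in bytes

# Show each byte as two hexadecimal digits
my @octets = map { sprintf "%02X", ord } split //, $bytes;

print "$char_len character, $byte_len bytes: @octets\n";
# prints "1 character, 3 bytes: E2 98 BF"
```

Matching against $bytes works byte by byte, while matching against $char works character by character, so the two views never get out of step.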

Let us now discuss the rest of the character classes. Just as with Unicode characters, there are named Unicode character classes represented by the \p{name} escape sequence. Closely associated is the \P{name} character class, which is the negation of the \p{name} class. For example, to match lower and uppercase characters,

 
    use charnames ":full"; # use named chars with Unicode full names
    $x = "BOB";
    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase  
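The named classes can be used anywhere a class like \d can. As a rough sketch, with a made-up sample string, here is how one might collect the uppercase characters of a string and the characters that are not lowercase:

```perl
use strict;
use warnings;

my $x = "Hello World";                       # hypothetical sample text

my @upper     = $x =~ /\p{IsUpper}/g;        # 'H', 'W'
my @not_lower = $x =~ /\P{IsLower}/g;        # 'H', ' ', 'W'

print scalar(@upper), " uppercase, ", scalar(@not_lower), " not lowercase\n";
# prints "2 uppercase, 3 not lowercase"
```

The space is counted by \P{IsLower} because the negated class matches anything that is not a lowercase letter, not just uppercase letters.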

Here is the association between some Perl named classes and the traditional Unicode classes:

 
    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/  

You can also use the official Unicode class names with the \p and \P, like \p{L} for Unicode 'letters', or \p{Lu} for uppercase letters, or \P{Nd} for non-digits. If a name is just one letter, the braces can be dropped. For instance, \pM is the character class of Unicode 'marks', for example accent marks. For the full list see perlunicode.

Unicode has also been separated into various sets of characters which you can test with \p{In...} (in) and \P{In...} (not in), for example \p{Latin}, \p{Greek}, or \P{Katakana}. For the full list see perlunicode.
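The official class names and the script names combine naturally in one program. The following sketch counts a few kinds of characters in a string; the sample string, which mixes Latin letters, Greek alpha and beta, and digits, is an assumption made for the example.

```perl
use strict;
use warnings;

# Latin letters, Greek alpha and beta, then digits
my $s = "abc \x{3b1}\x{3b2} 123";

my @letters     = $s =~ /\p{L}/g;       # all Unicode letters
my @greek       = $s =~ /\p{Greek}/g;   # letters in the Greek script
my @digits      = $s =~ /\p{Nd}/g;      # decimal digits
my @non_letters = $s =~ /\PL/g;         # one-letter name, braces dropped

printf "%d letters, %d Greek, %d digits, %d non-letters\n",
    scalar(@letters), scalar(@greek), scalar(@digits), scalar(@non_letters);
# prints "5 letters, 2 Greek, 3 digits, 5 non-letters"
```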

\X is an abbreviation for a character class sequence that includes the Unicode 'combining character sequences'. A 'combining character sequence' is a base character followed by any number of combining characters. An example of a combining character is an accent. Using the Unicode full names, e.g., A + COMBINING RING ABOVE is a combining character sequence with base character A and combining character COMBINING RING ABOVE, which translates in Danish to A with the circle atop it, as in the word Angstrom. \X is equivalent to \PM\pM*, i.e., a non-mark followed by zero or more marks.
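The practical effect of \X is easiest to see by counting. In this sketch, the word is spelled with a plain A followed by the combining ring (U+030A), and we compare what . and \X each treat as one unit. The variable names are made up for the example.

```perl
use strict;
use warnings;

# 'A' followed by the combining ring, then plain letters
my $angstrom = "A\x{30A}ngstrom";

my @code_points = $angstrom =~ /./g;    # . matches one character at a time
my @clusters    = $angstrom =~ /\X/g;   # \X keeps a base and its marks together

printf "%d code points, %d clusters, first cluster has %d characters\n",
    scalar(@code_points), scalar(@clusters), length($clusters[0]);
# prints "9 code points, 8 clusters, first cluster has 2 characters"
```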

For the full and latest information about Unicode see the latest Unicode standard, or the Unicode Consortium's website http://www.unicode.org/

As if all those classes weren't enough, Perl also defines POSIX style character classes. These have the form [:name:], with name the name of the POSIX class. The POSIX classes are alpha, alnum, ascii, cntrl, digit, graph, lower, print, punct, space, upper, and xdigit, and two extensions, word (a Perl extension to match \w), and blank (a GNU extension). If utf8 is being used, then these classes are defined the same as their corresponding perl Unicode classes: [:upper:] is the same as \p{IsUpper}, etc. The POSIX character classes, however, don't require using utf8. The [:digit:], [:word:], and [:space:] correspond to the familiar \d, \w, and \s character classes. To negate a POSIX class, put a ^ in front of the name, so that, e.g., [:^digit:] corresponds to \D and under utf8, \P{IsDigit}. The Unicode and POSIX character classes can be used just like \d, with the exception that POSIX character classes can only be used inside of a character class:

 
    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    use charnames ":full";
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit  
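A few of these POSIX classes in action, as a minimal sketch with made-up input strings. Note the doubled brackets: the outer pair is the character class, the inner pair belongs to the POSIX class name.

```perl
use strict;
use warnings;

my $line = "=item 3 widgets";                 # hypothetical POD line

# '=item', a space, then a digit
my $is_item = $line =~ /^=item\s[[:digit:]]/ ? 1 : 0;

# Negated POSIX class: everything that is not a digit
my @non_digits = "a1b2" =~ /[[:^digit:]]/g;   # 'a', 'b'

# POSIX class mixed with ordinary members of a character class
my @in_set = "x9!" =~ /[abc[:digit:]xyz]/g;   # 'x', '9'

print "$is_item | @non_digits | @in_set\n";   # prints "1 | a b | x 9"
```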

Whew! That is all the rest of the characters and character classes.

 

 

 


© 2002-2004 Active-Venture.com Web Site Hosting Service

 


 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com