OK, you know the basics of regexps and you want to know more. If matching regular expressions
is analogous to a walk in the woods, then the tools discussed in Part 1 are analogous to topo
maps and a compass, basic tools we use all the time. Most of the tools in Part 2 are analogous
to flare guns and satellite phones. They aren't used too often on a hike, but when we are stuck,
they can be invaluable.
What follows are the more advanced, less used, or sometimes esoteric capabilities of perl
regexps. In Part 2, we will assume you are comfortable with the basics and concentrate on the
new features.
There are a number of escape sequences and character classes that we haven't covered yet.
There are several escape sequences that convert characters or strings between upper and lower
case. \l and \u convert the next character to lower or upper case,
respectively:
$x = "perl";
$string =~ /\u$x/; # matches 'Perl' in $string
$x = "M(rs?|s)\\."; # note the double backslash
$string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.'
\L and \U convert a whole substring, delimited by \L
or \U and \E, to lower or upper case:
$x = "This word is in lower case:\L SHOUT\E";
$x =~ /shout/; # matches
$x = "I STILL KEYPUNCH CARDS FOR MY 360";
$x =~ /\Ukeypunch/; # matches punch card string
If there is no \E, case is converted until the end of the string. The regexps \L\u$word
or \u\L$word convert the first character of $word to uppercase and the
rest of the characters to lowercase.
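As a small sketch of the \u\L idiom (the variable names and sample text are made up for illustration), the following normalizes a word to an initial capital, first in a double-quoted string and then inside a regexp:

```perl
use strict;
use warnings;

# Hypothetical input; any mixed-case word would do.
my $word = "pERL";

# In a double-quoted string, \u\L uppercases the first character
# and lowercases the rest.
my $name = "\u\L$word";

# The same escapes work inside a regexp: they are applied to the
# interpolated text before the pattern is compiled.
my $text  = "I program in Perl every day";
my $found = $text =~ /\u\L$word/ ? 1 : 0;

print "$name\n";
print "found: $found\n";
```

Because the case conversion happens at interpolation time, the regexp that is actually compiled here is simply /Perl/.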
Control characters can be escaped with \c, so that a control-Z character would
be matched with \cZ. The escape sequence \Q...\E quotes, or protects,
most non-alphabetic characters. For instance,
$x = "That !^*&%~& cat!";
$x =~ /\Q!^*&%~&\E/; # check for rough language
It does not protect $ or @, so that variables can still be
substituted.
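Since \Q leaves $ alone, it is commonly combined with an interpolated variable. The sketch below, using a made-up search term and made-up sample lines, contrasts interpolating a string as a pattern with interpolating it as literal text:

```perl
use strict;
use warnings;

# Hypothetical search term containing regexp metacharacters.
my $term = "a.b*c";

my @lines = ("xx a.b*c yy", "xx aXbbbc yy");

# Without \Q, '.' and '*' keep their special meaning, so the term
# is treated as the pattern /a.b*c/.
my @loose = grep { /$term/ } @lines;

# With \Q...\E, the interpolated text is matched literally.
my @exact = grep { /\Q$term\E/ } @lines;

print "loose: @loose\n";
print "exact: @exact\n";
```

Note that the unquoted pattern does not even match the line that contains the term verbatim, which is why \Q is a sensible default whenever the interpolated text is meant literally.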
With the advent of 5.6.0, perl regexps can handle more than just the standard ASCII character
set. Perl now supports Unicode, a standard for encoding the character sets from many of
the world's written languages. Unicode does this by allowing characters to be more than one byte
wide. Perl uses the UTF-8 encoding, in which ASCII characters are still encoded as one byte, but
characters greater than chr(127) may be stored as two or more bytes.
What does this mean for regexps? Well, regexp users don't need to know much about perl's
internal representation of strings. But they do need to know 1) how to represent Unicode
characters in a regexp and 2) when a matching operation will treat the string to be searched as
a sequence of bytes (the old way) or as a sequence of Unicode characters (the new way). The
answer to 1) is that Unicode characters greater than chr(127) may be represented
using the \x{hex} notation, with hex a hexadecimal integer:
/\x{263a}/; # match a Unicode smiley face :)
Unicode characters in the range of 128-255 use two hexadecimal digits with braces: \x{ab}.
Note that this is different from \xab, which is just a hexadecimal byte with no
Unicode significance.
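A minimal sketch of the \x{hex} notation in use follows. The sample text is invented, and the character is built with chr() so that the script itself stays plain ASCII:

```perl
use strict;
use warnings;

# Build a string containing U+263A from its code point.
my $smiley = chr(0x263a);
my $text   = "have a nice day " . $smiley;

# The regexp names the same character with \x{...}.
my $matched = $text =~ /\x{263a}/ ? 1 : 0;

# length() counts characters, not bytes, so the smiley counts once.
my $len = length($smiley);

print "matched: $matched, length: $len\n";
```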
NOTE: in Perl 5.6.0 one needed to say use utf8 to use
any Unicode features. This is no longer the case: for almost all Unicode processing, the explicit utf8
pragma is not needed. (The only case where it matters is when your Perl script itself is in Unicode and
encoded in UTF-8; then an explicit use utf8 is needed.)
Figuring out the hexadecimal sequence of a Unicode character you want or deciphering someone
else's hexadecimal Unicode regexp is about as much fun as programming in machine code. So
another way to specify Unicode characters is to use the named character escape
sequence \N{name}. name is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to represent or match the
astrological sign for the planet Mercury, we could use
use charnames ":full"; # use named chars with Unicode full names
$x = "abc\N{MERCURY}def";
$x =~ /\N{MERCURY}/; # matches
One can also use short names or restrict names to a certain alphabet:
use charnames ':full';
print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
use charnames ":short";
print "\N{greek:Sigma} is an upper-case sigma.\n";
use charnames qw(greek);
print "\N{sigma} is Greek sigma\n";
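To illustrate that a named character and its hexadecimal form denote the same thing, here is a short sketch. It assumes the charnames module's viacode function is available for looking up a name from a code point; the variable names are illustrative only:

```perl
use strict;
use warnings;
use charnames ":full";

# The same character written by name and by code point.
my $by_name = "\N{WHITE SMILING FACE}";
my $by_hex  = "\x{263a}";
my $same    = $by_name eq $by_hex ? 1 : 0;

# Going the other way: look up the name for a code point.
my $name = charnames::viacode(0x263F);

print "same: $same\n";
print "U+263F is named $name\n";
```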
A list of full names is found in the file Names.txt in the lib/perl5/5.X.X/unicore directory.
The answer to requirement 2), as of 5.6.0, is that if a regexp contains Unicode characters,
the string is searched as a sequence of Unicode characters. Otherwise, the string is searched as
a sequence of bytes. If the string is being searched as a sequence of Unicode characters, but
matching a single byte is required, we can use the \C escape sequence. \C
is a character class akin to . except that it matches any byte 0-255. So
use charnames ":full"; # use named chars with Unicode full names
$x = "a";
$x =~ /\C/; # matches 'a', eats one byte
$x = "";
$x =~ /\C/; # doesn't match, no bytes to match
$x = "\N{MERCURY}"; # multi-byte Unicode character (three bytes in UTF-8)
$x =~ /\C/; # matches, but dangerous!
The last regexp matches, but is dangerous because the string character position is no
longer synchronized to the string byte position. This generates the warning 'Malformed
UTF-8 character'. \C is best used for matching the binary data in strings with
binary data intermixed with Unicode characters.
Let us now discuss the rest of the character classes. Just as with Unicode characters, there
are named Unicode character classes represented by the \p{name} escape sequence.
Closely associated is the \P{name} character class, which is the negation of the \p{name}
class. For example, to match lower and uppercase characters,
use charnames ":full"; # use named chars with Unicode full names
$x = "BOB";
$x =~ /^\p{IsUpper}/; # matches, uppercase char class
$x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase
$x =~ /^\p{IsLower}/; # doesn't match, lowercase char class
$x =~ /^\P{IsLower}/; # matches, char class sans lowercase
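The same classes work on non-ASCII text. As a sketch (the sample string is invented), the following counts upper- and lowercase letters in a string that mixes Latin and Greek:

```perl
use strict;
use warnings;

# Hypothetical sample: "Perl", a space, then GREEK CAPITAL LETTER SIGMA
# (U+03A3) and GREEK SMALL LETTER ALPHA (U+03B1).
my $text = "Perl \x{3a3}\x{3b1}";

# Count matches by assigning the list of matches to an empty list.
my $upper = () = $text =~ /\p{IsUpper}/g;   # 'P' and the sigma
my $lower = () = $text =~ /\p{IsLower}/g;   # 'e', 'r', 'l' and the alpha
my $other = () = $text =~ /\P{L}/g;         # the space

print "upper=$upper lower=$lower other=$other\n";
```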
Here is the association between some Perl named classes and the traditional Unicode classes:
Perl class name   Unicode class name or regular expression
IsAlpha           /^[LM]/
IsAlnum           /^[LMN]/
IsASCII           $code <= 127
IsCntrl           /^C/
IsBlank           $code =~ /^(0020|0009)$/ || /^Z[^lp]/
IsDigit           Nd
IsGraph           /^([LMNPS]|Co)/
IsLower           Ll
IsPrint           /^([LMNPS]|Co|Zs)/
IsPunct           /^P/
IsSpace           /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
IsSpacePerl       /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
IsUpper           /^L[ut]/
IsWord            /^[LMN]/ || $code eq "005F"
IsXDigit          $code =~ /^00(3[0-9]|[46][1-6])$/
You can also use the official Unicode class names with the \p and \P,
like \p{L} for Unicode 'letters', or \p{Lu} for uppercase letters, or \P{Nd}
for non-digits. If a name is just one letter, the braces can be dropped. For
instance, \pM is the character class of Unicode 'marks', for example accent marks.
For the full list see perlunicode.
Unicode has also been separated into various sets of characters which you can test with \p{In...}
(in) and \P{In...} (not in), for example \p{Latin}, \p{Greek},
or \P{Katakana}. For the full list see perlunicode.
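As a rough sketch of testing for one of these sets, the following pulls a run of Greek letters out of an otherwise Latin string. The sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: Latin letters around GREEK SMALL LETTER
# ALPHA, BETA and GAMMA (U+03B1..U+03B3).
my $mixed = "abc\x{3b1}\x{3b2}\x{3b3}def";

# Capture the first run of Greek characters.
my ($greek) = $mixed =~ /(\p{Greek}+)/;
my $greek_len = length $greek;

# \P{Greek} is the negation: a string made only of non-Greek characters.
my $no_greek = "abcdef" =~ /^\P{Greek}+$/ ? 1 : 0;

print "greek run length: $greek_len\n";
print "abcdef has no Greek: $no_greek\n";
```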
\X is an abbreviation for a character class sequence that includes the Unicode
'combining character sequences'. A 'combining character sequence' is a base character followed
by any number of combining characters. An example of a combining character is an accent. Using
the Unicode full names, A + COMBINING RING ABOVE, for example, is a
combining character sequence with base character A and combining character COMBINING RING ABOVE,
which translates in Danish to A with the circle atop it, as in the word Angstrom. \X
is equivalent to \PM\pM*, i.e., a non-mark followed by zero or more marks.
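The difference between \X and . can be seen in a short sketch. The sample word is spelled here with an explicit combining character, U+030A COMBINING RING ABOVE, and the variable names are illustrative:

```perl
use strict;
use warnings;

# 'A' followed by COMBINING RING ABOVE (U+030A), then "ngstrom".
my $word = "A\x{30a}ngstrom";

# \X takes the base character together with its combining mark...
my ($cluster) = $word =~ /^(\X)/;

# ...whereas . takes only the base character.
my ($single) = $word =~ /^(.)/;

# Count combining character sequences versus raw characters.
my $clusters = () = $word =~ /\X/g;
my $chars    = length $word;

print "cluster length: ", length($cluster), "\n";
print "single length:  ", length($single), "\n";
print "clusters=$clusters chars=$chars\n";
```

Counting with \X gives what a reader would call the number of letters, while length() reports one more because the combining mark is a character in its own right.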
For the full and latest information about Unicode see the latest Unicode standard, or the
Unicode Consortium's website http://www.unicode.org/
As if all those classes weren't enough, Perl also defines POSIX style character classes.
These have the form [:name:], with name the name of the POSIX class.
The POSIX classes are alpha, alnum, ascii, cntrl,
digit, graph, lower, print, punct,
space, upper, and xdigit, and two extensions, word
(a Perl extension to match \w), and blank (a GNU extension). If utf8
is being used, then these classes are defined the same as their corresponding perl Unicode
classes: [:upper:] is the same as \p{IsUpper}, etc. The POSIX
character classes, however, don't require using utf8. The [:digit:], [:word:],
and [:space:] correspond to the familiar \d, \w, and \s
character classes. To negate a POSIX class, put a ^ in front of the name, so that,
e.g., [:^digit:] corresponds to \D and, under utf8, to \P{IsDigit}.
The Unicode and POSIX character classes can be used just like \d, with the
exception that POSIX character classes can only be used inside of a character class:
/\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
/^=item\s[[:digit:]]/; # match '=item',
# followed by a space and a digit
use charnames ":full";
/\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
/^=item\s\p{IsDigit}/; # match '=item',
# followed by a space and a digit
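A few more POSIX classes in action, as a sketch with invented sample data. Note that each [:name:] appears inside an enclosing bracketed character class:

```perl
use strict;
use warnings;

# Hypothetical POD-like lines.
my @items = ("=item 1", "=item x", "=item 42");

# Keep only the items whose label starts with a digit.
my @numbered = grep { /^=item\s[[:digit:]]/ } @items;

# A negated POSIX class: strip everything that is not a letter.
(my $letters = "a1b2c3") =~ s/[[:^alpha:]]//g;

# [:xdigit:] matches hexadecimal digits.
my $is_hex = "0xBEEF" =~ /^0x[[:xdigit:]]+$/ ? 1 : 0;

print "numbered: @numbered\n";
print "letters: $letters\n";
print "is_hex: $is_hex\n";
```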
Whew! That is all the rest of the characters and character classes.