Perl 5.6.0 introduced Unicode support. It's important for porters and XS writers to
understand this support and make sure that the code they write does not corrupt Unicode data.
In the olden, less enlightened times, we all used to use ASCII. Most of us did, anyway. The
big problem with ASCII is that it's American. Well, no, that's not actually the problem; the
problem is that it's not particularly useful for people who don't use the Roman alphabet. What
used to happen was that particular languages would stick their own alphabet in the upper range
of the sequence, between 128 and 255. Of course, we then ended up with plenty of variants that
weren't quite ASCII, and the whole point of it being a standard was lost.
Worse still, if you've got a language like Chinese or Japanese that has hundreds or
thousands of characters, then you really can't fit them into a mere 256, so they had to forget
about ASCII altogether, and build their own systems using pairs of numbers to refer to one
character.
To fix this, some people formed Unicode, Inc. and produced a new character set containing
all the characters you can possibly think of and more. There are several ways of representing
these characters, and the one Perl uses is called UTF8. UTF8 uses a variable number of bytes
to represent a character, instead of just one. You can learn more about Unicode at http://www.unicode.org/.
You can't recognise a UTF8 string just by looking at it. This is because UTF8 data is stored
in bytes just like non-UTF8 data. The Unicode character 200, (0xC8 for you hex types) capital
E with a grave accent, is represented by the two bytes v195.136. Unfortunately, the
non-Unicode string chr(195).chr(136) has that byte sequence as well. So you can't tell just by
looking - this is what makes Unicode input an interesting problem.
The API function is_utf8_string can help; it'll tell you if a string contains
only valid UTF8 characters. However, it can't do the work for you. On a character-by-character
basis, is_utf8_char will tell you whether the current character in a string is
valid UTF8.
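To make the limits of such a check concrete, here is a minimal stand-alone sketch of a structural validity test. The name sketch_is_utf8 is hypothetical and is not part of the Perl API; it only checks lead and continuation bytes for 1- to 4-byte sequences, and makes no claim to apply the same rules as is_utf8_string.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Returns 1 if the len bytes at s are structurally well-formed UTF-8
 * (1- to 4-byte sequences), 0 otherwise. It checks only the shape of
 * lead and continuation bytes. */
int sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;                      /* continuation bytes expected */

        if (c < 0x80)
            need = 0;                     /* plain one-byte character */
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;                     /* lead byte of a 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            need = 2;                     /* lead byte of a 3-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;                     /* lead byte of a 4-byte sequence */
        else
            return 0;                     /* stray continuation or bad lead */

        if (need > len - i - 1)
            return 0;                     /* sequence is cut short */

        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                 /* not a continuation byte */

        i += need + 1;
    }
    return 1;
}
```

Note that the two bytes 0xC3 0x88 pass this test whether they were meant as one UTF8 character or as two single-byte characters, which is exactly why a validity check can't do the work for you.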
As mentioned above, UTF8 uses a variable number of bytes to store a character. Characters
with values 0...127 are stored in one byte, just like good ol' ASCII. Character 128 is stored
as v194.128; this continues up to character 191, which is v194.191.
Now we've run out of bits (191 is binary 10111111) so we move on; 192 is v195.128.
And so it goes on, moving to three bytes at character 2048.
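The byte values quoted above can be reproduced with a small stand-alone encoder. This is only a sketch: sketch_encode is a hypothetical name, not a Perl API function, and it assumes values below 0x10000 (one to three bytes).

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Writes the UTF-8 encoding of cp into out (which must have room for
 * 3 bytes) and returns the number of bytes written, or 0 if cp is
 * outside the range this sketch handles. */
size_t sketch_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 0...127: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 128...2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 2048...65535: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                             /* not handled by this sketch */
}
```

With this, 128 comes out as the bytes 194 and 128, 191 as 194 and 191, 192 as 195 and 128, and 2048 is the first value that needs three bytes.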
Assuming you know you're dealing with a UTF8 string, you can find out how long the first
character in it is with the UTF8SKIP macro:

    char *utf = "\305\233\340\240\201";
    I32 len;

    len = UTF8SKIP(utf); /* len is 2 here */
    utf += len;
    len = UTF8SKIP(utf); /* len is 3 here */

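The length of a character is determined entirely by its first byte. The following stand-alone sketch shows the idea; sketch_skip and sketch_count_chars are hypothetical names, and this is not how UTF8SKIP itself is implemented. It assumes sequences of at most four bytes and, as a choice of this sketch only, steps over a malformed byte one at a time.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Returns how many bytes the character starting with lead occupies. */
size_t sketch_skip(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                         /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0)
        return 2;                         /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                         /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;                         /* 11110xxx */
    return 1;                             /* malformed: step one byte */
}

/* Counts characters in len bytes of UTF-8 by hopping from lead byte to
 * lead byte. Like the hop described in the text, it trusts its input;
 * the loop condition only stops it once it has reached or passed len. */
size_t sketch_count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0, n = 0;

    while (i < len) {
        i += sketch_skip(s[i]);
        n++;
    }
    return n;
}
```

For the five bytes "\305\233\340\240\201" used above, the first lead byte gives 2, the second gives 3, and the count is two characters.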
Another way to skip over characters in a UTF8 string is to use utf8_hop, which
takes a string and a number of characters to skip over. You're on your own about bounds
checking, though, so don't use it lightly.
All bytes in a multi-byte UTF8 character will have the high bit set, so you can test if you
need to do something special with this character like this:

    UV uv;

    if (*utf & 0x80)
        /* Must treat this as UTF8 */
        uv = utf8_to_uv(utf);
    else
        /* OK to treat this character as a byte */
        uv = *utf;

You can also see in that example that we use utf8_to_uv to get the value of
the character; the inverse function uv_to_utf8 is available for putting a UV into
UTF8:

    if (uv >= 0x80)
        /* Must treat this as UTF8 */
        utf8 = uv_to_utf8(utf8, uv);
    else
        /* OK to treat this character as a byte */
        *utf8++ = uv;

You must convert characters to UVs using the above functions if you're ever in a
situation where you have to match UTF8 and non-UTF8 characters. You may not skip over UTF8
characters in this case. If you do this, you'll lose the ability to match hi-bit non-UTF8
characters; for instance, if your UTF8 string contains v195.136, and you skip
that character, you can never match a chr(200) in a non-UTF8 string. So don't do
that!
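Here is a stand-alone sketch of what "convert to a value, then compare" looks like when one string is UTF8 and the other is not. The names sketch_decode and sketch_bytes_eq_utf8 are hypothetical, not Perl API functions, and the decoder assumes sequences of one or two bytes only, which is enough for characters up to 255.

```c
#include <stddef.h>

/* Hypothetical decoder, NOT part of the Perl API. Handles 1- and 2-byte
 * sequences only. Stores the character value in *cp and returns the
 * number of bytes consumed, or 0 if the input is malformed or longer
 * than this sketch supports. */
static size_t sketch_decode(const unsigned char *s, size_t len,
                            unsigned long *cp)
{
    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && len >= 2 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    return 0;
}

/* Returns 1 if the non-UTF8 string b (one byte per character) and the
 * UTF8 string u hold the same characters, 0 otherwise. */
int sketch_bytes_eq_utf8(const unsigned char *b, size_t blen,
                         const unsigned char *u, size_t ulen)
{
    size_t i = 0, j = 0;

    while (i < blen && j < ulen) {
        unsigned long cp;
        size_t n = sketch_decode(u + j, ulen - j, &cp);

        if (n == 0 || cp != b[i])
            return 0;                     /* compare values, not bytes */
        i++;
        j += n;
    }
    return i == blen && j == ulen;        /* both must end together */
}
```

The single byte 200 matches the UTF8 bytes 195 and 136, while the byte string chr(195).chr(136) does not match them, even though the bytes are identical.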
Currently, Perl deals with Unicode strings and non-Unicode strings slightly differently. If
a string has been identified as being UTF-8 encoded, Perl will set a flag in the SV, SVf_UTF8.
You can check and manipulate this flag with the following macros:

    SvUTF8(sv)
    SvUTF8_on(sv)
    SvUTF8_off(sv)

This flag has an important effect on Perl's treatment of the string: if Unicode data is not
properly distinguished, regular expressions, length, substr and
other string handling operations will have undesirable results.
The problem comes when you have, for instance, a string that isn't flagged as UTF8, and
contains a byte sequence that could be UTF8 - especially when combining non-UTF8 and UTF8
strings.
Never forget that the SVf_UTF8 flag is separate from the PV value; you need to be
sure you don't accidentally knock it off while you're manipulating SVs. More specifically, you
cannot expect to do this:

    SV *sv;
    SV *nsv;
    STRLEN len;
    char *p;

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);

The char* string does not tell you the whole story, and you can't copy or
reconstruct an SV just by copying the string value. Check if the old SV has the UTF8 flag set,
and act accordingly:

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);
    if (SvUTF8(sv))
        SvUTF8_on(nsv);

In fact, your frobnicate function should be made aware of whether or not it's
dealing with UTF8 data, so that it can handle the string appropriately.
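The same point can be shown without any Perl headers. In the sketch below, struct sketch_str is a hypothetical stand-in for an SV, with a byte buffer, a length and a flag; it is an illustration only and says nothing about how SVs are actually laid out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an SV, NOT a Perl type: the bytes and the
 * "these bytes are UTF8" flag are stored side by side, and the flag
 * cannot be recovered from the bytes. */
struct sketch_str {
    unsigned char *pv;    /* the bytes */
    size_t         len;   /* number of bytes */
    int            utf8;  /* non-zero if pv is UTF8-encoded */
};

/* Copies src into dst, allocating a new buffer that the caller must
 * free. Returns 1 on success, 0 if allocation fails. */
int sketch_copy(struct sketch_str *dst, const struct sketch_str *src)
{
    dst->pv = malloc(src->len + 1);
    if (dst->pv == NULL)
        return 0;
    memcpy(dst->pv, src->pv, src->len);
    dst->pv[src->len] = '\0';
    dst->len  = src->len;
    dst->utf8 = src->utf8;    /* the step that is easy to forget */
    return 1;
}
```

Leave out the last assignment and the copy still holds the right bytes, but anything that reads it afterwards will interpret them as single-byte characters.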
If you're mixing UTF8 and non-UTF8 strings, you might find it necessary to upgrade one of
the strings to UTF8. If you've got an SV, the easiest way to do this is:

    sv_utf8_upgrade(sv);

However, you must not do this, for example:

    if (!SvUTF8(left))
        sv_utf8_upgrade(left);

If you do this in a binary operator, you will actually change one of the strings that came
into the operator, and, while it shouldn't be noticeable by the end user, it can cause
problems.
Instead, bytes_to_utf8 will give you a UTF8-encoded copy of its string
argument. This is useful for having the data available for comparisons and so on, without
harming the original SV. There's also utf8_to_bytes to go the other way, but
naturally, this will fail if the string contains any characters above 255 that can't be
represented in a single byte.
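To illustrate the idea of converting a copy rather than the original, here is a stand-alone sketch of both directions. The names sketch_bytes_to_utf8 and sketch_utf8_to_bytes are hypothetical and their signatures are assumptions of this example; they are not the signatures of bytes_to_utf8 and utf8_to_bytes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Returns a newly allocated UTF-8 copy of the len single-byte characters
 * at s and stores its length in *out_len. The input is left untouched.
 * The caller frees the result. Returns NULL if allocation fails. */
unsigned char *sketch_bytes_to_utf8(const unsigned char *s, size_t len,
                                    size_t *out_len)
{
    unsigned char *d = malloc(len * 2 + 1);   /* worst case: all 2-byte */
    size_t n = 0;

    if (d == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            d[n++] = s[i];
        } else {
            d[n++] = (unsigned char)(0xC0 | (s[i] >> 6));
            d[n++] = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    d[n] = '\0';
    *out_len = n;
    return d;
}

/* Hypothetical helper, NOT part of the Perl API.
 * Converts UTF-8 back to single bytes in place and updates *len.
 * Returns 1 on success. Returns 0, leaving the buffer and *len
 * untouched, if any character does not fit in one byte. */
int sketch_utf8_to_bytes(unsigned char *s, size_t *len)
{
    size_t i, n = 0;

    /* First pass: only 0xC2 and 0xC3 lead bytes encode 128...255. */
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] != 0xC2 && s[i] != 0xC3) || i + 1 >= *len
            || (s[i + 1] & 0xC0) != 0x80)
            return 0;
        i++;                                  /* skip continuation byte */
    }

    /* Second pass: safe to rewrite, since output never overtakes input. */
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80) {
            s[n++] = s[i];
        } else {
            s[n++] = (unsigned char)(((s[i] & 0x03) << 6)
                                     | (s[i + 1] & 0x3F));
            i++;
        }
    }
    *len = n;
    return 1;
}
```

Upgrading the two bytes 200 and 65 gives the three bytes 195, 136 and 65 in a fresh buffer; downgrading that buffer gives the original two bytes back, while a three-byte character such as 224, 160, 129 makes the downgrade fail.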
There isn't much else you need to know. Just remember these things:
- There's no way to tell if a string is UTF8 or not. You can tell if an SV is UTF8 by
  looking at its SvUTF8 flag. Don't forget to set the flag if something should be UTF8.
  Treat the flag as part of the PV, even though it's not - if you pass on the PV to
  somewhere, pass on the flag too.
- If a string is UTF8, always use utf8_to_uv to get at the value, unless !(*s & 0x80)
  in which case you can use *s.
- When writing to a UTF8 string, always use uv_to_utf8, unless uv < 0x80 in which case
  you can use *s = uv.
- Mixing UTF8 and non-UTF8 strings is tricky. Use bytes_to_utf8 to get a new string
  which is UTF8 encoded. There are tricks you can use to delay deciding whether you need
  to use a UTF8 string until you get to a high character - HALF_UPGRADE is one of those.