Unicode Support

Perl 5.6.0 introduced Unicode support. It's important for porters and XS writers to understand this support and make sure that the code they write does not corrupt Unicode data.

What is Unicode, anyway?

In the olden, less enlightened times, we all used to use ASCII. Most of us did, anyway. The big problem with ASCII is that it's American. Well, no, that's not actually the problem; the problem is that it's not particularly useful for people who don't use the Roman alphabet. What used to happen was that particular languages would stick their own alphabet in the upper range of the sequence, between 128 and 255. Of course, we then ended up with plenty of variants that weren't quite ASCII, and the whole point of it being a standard was lost.

Worse still, if you've got a language like Chinese or Japanese that has hundreds or thousands of characters, then you really can't fit them into a mere 256, so they had to forget about ASCII altogether, and build their own systems using pairs of numbers to refer to one character.

To fix this, some people formed Unicode, Inc. and produced a new character set containing all the characters you can possibly think of and more. There are several ways of representing these characters, and the one Perl uses is called UTF8. UTF8 uses a variable number of bytes to represent a character, instead of just one. You can learn more about Unicode at http://www.unicode.org/

How can I recognise a UTF8 string?

You can't. This is because UTF8 data is stored in bytes just like non-UTF8 data. The Unicode character 200 (0xC8 for you hex types), capital E with a grave accent, is represented by the two bytes v195.136. Unfortunately, the non-Unicode string chr(195).chr(136) has that byte sequence as well. So you can't tell just by looking - this is what makes Unicode input an interesting problem.

The API function is_utf8_string can help; it'll tell you if a string contains only valid UTF8 characters. However, it can't do the work for you. On a character-by-character basis, is_utf8_char will tell you whether the current character in a string is valid UTF8.
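
To make the limits of such a check concrete, here is a minimal standalone sketch of a well-formedness test. The name sketch_is_utf8 is hypothetical and this is not how Perl's is_utf8_string is implemented; it only checks lead bytes and continuation bytes for sequences of up to four bytes, and it does not catch every overlong or surrogate encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API: return 1 if the len
 * bytes at s look like well-formed UTF-8, 0 otherwise. This is a loose
 * check; it does not reject every overlong or surrogate sequence. */
int sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        size_t need;                      /* continuation bytes expected */
        unsigned char c = s[i++];

        if (c < 0x80)
            need = 0;                     /* plain ASCII byte */
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;                     /* two-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            need = 2;                     /* three-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;                     /* four-byte sequence */
        else
            return 0;                     /* stray continuation or bad lead */

        if (need > len - i)
            return 0;                     /* sequence is cut short */
        while (need-- > 0) {
            if ((s[i++] & 0xC0) != 0x80)
                return 0;                 /* not a 10xxxxxx byte */
        }
    }
    return 1;
}
```

Note that the two bytes 0xC3 0x88 pass this test whether the author meant the single character 200 or the two byte-characters chr(195).chr(136). A passing result says the bytes could be UTF8, not that they are.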

How does UTF8 represent Unicode characters?

As mentioned above, UTF8 uses a variable number of bytes to store a character. Characters with values 0...127 are stored in one byte, just like good ol' ASCII. Character 128 is stored as v194.128; this continues up to character 191, which is v194.191. Now we've run out of bits (191 is binary 10111111) so we move on; 192 is v195.128. And so it goes on, moving to three bytes at character 2048.
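
The byte values quoted above can be reproduced with a small standalone encoder. This is only a sketch: sketch_encode_utf8 is a hypothetical name, not a Perl API function, it stops at three-byte sequences (code points up to 0xFFFF), and it makes no attempt to reject surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API: encode code point cp
 * (at most 0xFFFF) into out and return the number of bytes written.
 * out must have room for at least 3 bytes. */
size_t sketch_encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                  /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                 /* 128..2047: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    assert(cp <= 0xFFFF);             /* 2048..65535: three bytes */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

With this sketch, 128 comes out as the bytes 194, 128; 191 as 194, 191; 192 as 195, 128; and 2048 as the three bytes 224, 160, 128.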

Assuming you know you're dealing with a UTF8 string, you can find out how long the first character in it is with the UTF8SKIP macro:

 
    char *utf = "\305\233\340\240\201";
    I32 len;

    len = UTF8SKIP(utf); /* len is 2 here */
    utf += len;
    len = UTF8SKIP(utf); /* len is 3 here */  
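
Outside of Perl, the same idea can be sketched by reading the sequence length off the lead byte. The helpers below are hypothetical stand-ins, not the UTF8SKIP macro itself; they assume standard UTF8 sequences of one to four bytes and step a single byte past anything that is not a valid lead byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for what UTF8SKIP computes: the length in
 * bytes of the character whose first byte is *s. */
size_t sketch_utf8_skip(const unsigned char *s)
{
    assert(s != NULL);
    if (*s < 0x80)           return 1;   /* 0xxxxxxx */
    if ((*s & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((*s & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((*s & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 1;                            /* not a lead byte: step one byte */
}

/* Count the characters in a NUL-terminated UTF8 string. The inner loop
 * refuses to step past the terminating NUL, so a truncated final
 * sequence cannot run off the end of the buffer. */
size_t sketch_utf8_length(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s) {
        size_t step = sketch_utf8_skip(s);
        while (step-- > 0 && *s)
            s++;
        n++;
    }
    return n;
}
```

For the five-byte string used in the example above, sketch_utf8_skip gives 2 for the first character and 3 for the second, and sketch_utf8_length reports two characters.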

Another way to skip over characters in a UTF8 string is to use utf8_hop, which takes a string and a number of characters to skip over. You're on your own about bounds checking, though, so don't use it lightly.

All bytes in a multi-byte UTF8 character will have the high bit set, so you can test if you need to do something special with this character like this:

 
    UV uv;

    if (*utf & 0x80)
        /* Must treat this as UTF8 */
        uv = utf8_to_uv(utf);
    else
        /* OK to treat this character as a byte */
        uv = *utf;  

You can also see in that example that we use utf8_to_uv to get the value of the character; the inverse function uv_to_utf8 is available for putting a UV into UTF8:

 
    if (uv >= 0x80)
        /* Must treat this as UTF8 */
        utf8 = uv_to_utf8(utf8, uv);
    else
        /* OK to treat this character as a byte */
        *utf8++ = uv;  

You must convert characters to UVs using the above functions if you're ever in a situation where you have to match UTF8 and non-UTF8 characters. You may not skip over UTF8 characters in this case. If you do this, you'll lose the ability to match hi-bit non-UTF8 characters; for instance, if your UTF8 string contains v195.136, and you skip that character, you can never match a chr(200) in a non-UTF8 string. So don't do that!
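
As a rough standalone illustration of why decoding matters for matching, the sketch below turns a UTF8 sequence into a number before comparing it with a byte from a non-UTF8 string. Both function names are hypothetical and are not the Perl API; the decoder assumes well-formed input of at most three bytes per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder, not utf8_to_uv: return the code point that
 * starts at s and store its length in bytes in *len. Assumes s points
 * at well-formed UTF8 of at most three bytes per character. */
unsigned long sketch_decode_utf8(const unsigned char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    if (s[0] < 0x80) {                    /* one byte: the value itself */
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {          /* two bytes: 5 + 6 payload bits */
        *len = 2;
        return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    *len = 3;                             /* three bytes: 4 + 6 + 6 bits */
    return ((unsigned long)(s[0] & 0x0F) << 12)
         | ((unsigned long)(s[1] & 0x3F) << 6)
         | (s[2] & 0x3F);
}

/* Compare the first character of a UTF8 string u with the first
 * byte-character of a non-UTF8 string b by value, not by raw bytes. */
int sketch_first_char_equal(const unsigned char *u, const unsigned char *b)
{
    size_t len;

    assert(b != NULL);
    return sketch_decode_utf8(u, &len) == (unsigned long)b[0];
}
```

Here the UTF8 bytes 0xC3 0x88 decode to 200, so they match the single byte chr(200) from a non-UTF8 string even though no raw byte is shared between the two.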

How does Perl store UTF8 strings?

Currently, Perl deals with Unicode strings and non-Unicode strings slightly differently. If a string has been identified as being UTF-8 encoded, Perl will set a flag in the SV, SVf_UTF8. You can check and manipulate this flag with the following macros:

 
    SvUTF8(sv)
    SvUTF8_on(sv)
    SvUTF8_off(sv)  

This flag has an important effect on Perl's treatment of the string: if Unicode data is not properly distinguished, regular expressions, length, substr and other string handling operations will have undesirable results.

The problem comes when you have, for instance, a string that isn't flagged as UTF8, and contains a byte sequence that could be UTF8 - especially when combining non-UTF8 and UTF8 strings.

Never forget that the SVf_UTF8 flag is separate from the PV value; you need to be sure you don't accidentally knock it off while you're manipulating SVs. More specifically, you cannot expect to do this:

 
    SV *sv;
    SV *nsv;
    STRLEN len;
    char *p;

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);  

The char* string does not tell you the whole story, and you can't copy or reconstruct an SV just by copying the string value. Check if the old SV has the UTF8 flag set, and act accordingly:

 
    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);
    if (SvUTF8(sv))
        SvUTF8_on(nsv);  

In fact, your frobnicate function should be made aware of whether or not it's dealing with UTF8 data, so that it can handle the string appropriately.

How do I convert a string to UTF8?

If you're mixing UTF8 and non-UTF8 strings, you might find it necessary to upgrade one of the strings to UTF8. If you've got an SV, the easiest way to do this is:

 
    sv_utf8_upgrade(sv);  

However, you must not do this, for example:

 
    if (!SvUTF8(left))
        sv_utf8_upgrade(left);  

If you do this in a binary operator, you will actually change one of the strings that came into the operator, and, while it shouldn't be noticeable by the end user, it can cause problems.

Instead, bytes_to_utf8 will give you a UTF8-encoded copy of its string argument. This is useful for having the data available for comparisons and so on, without harming the original SV. There's also utf8_to_bytes to go the other way, but naturally, this will fail if the string contains any characters above 255 that can't be represented in a single byte.
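
The conversion in both directions can be sketched in plain C as follows. These helpers are hypothetical and do not reproduce the signatures of bytes_to_utf8 or utf8_to_bytes; they write into caller-supplied buffers and treat the non-UTF8 input as one character per byte with values 0 to 255.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write a UTF8-encoded copy of the len bytes at
 * in to out, leaving in untouched. out must have room for 2 * len
 * bytes. Returns the number of bytes written. */
size_t sketch_bytes_to_utf8(const unsigned char *in, size_t len,
                            unsigned char *out)
{
    size_t i, o = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                       /* ASCII is unchanged */
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* Hypothetical helper for the other direction. Returns 0 and stores
 * the byte count in *outlen on success, or -1 if the input is
 * malformed or holds a character above 255. out needs len bytes. */
int sketch_utf8_to_bytes(const unsigned char *in, size_t len,
                         unsigned char *out, size_t *outlen)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL && outlen != NULL);
    while (i < len) {
        if (in[i] < 0x80) {
            out[o++] = in[i++];
        } else if ((in[i] == 0xC2 || in[i] == 0xC3)  /* 128..255 only */
                   && i + 1 < len && (in[i + 1] & 0xC0) == 0x80) {
            out[o++] = (unsigned char)(((in[i] & 0x03) << 6)
                                       | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;                    /* does not fit in one byte */
        }
    }
    *outlen = o;
    return 0;
}
```

Only the lead bytes 0xC2 and 0xC3 can introduce a character in the range 128 to 255, which is why the downgrade rejects everything else with the high bit set.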

Is there anything else I need to know?

Not really. Just remember these things:

  • There's no way to tell if a string is UTF8 or not. You can tell if an SV is UTF8 by looking at its SvUTF8 flag. Don't forget to set the flag if something should be UTF8. Treat the flag as part of the PV, even though it's not - if you pass on the PV to somewhere, pass on the flag too.
  • If a string is UTF8, always use utf8_to_uv to get at the value, unless !(*s & 0x80) in which case you can use *s.
  • When writing to a UTF8 string, always use uv_to_utf8, unless uv < 0x80 in which case you can use *s = uv.
  • Mixing UTF8 and non-UTF8 strings is tricky. Use bytes_to_utf8 to get a new string which is UTF8 encoded. There are tricks you can use to delay deciding whether you need to use a UTF8 string until you get to a high character - HALF_UPGRADE is one of those.

 

  

 


 

 
 
 
