LOCALE CATEGORIES
The following subsections describe basic locale categories. Beyond these, some combination
categories allow manipulation of more than one basic category at a time. See "ENVIRONMENT" for a discussion of these.
In the scope of use locale, Perl looks to the LC_COLLATE
environment variable to determine the application's notions on collation (ordering) of
characters. For example, 'b' follows 'a' in Latin alphabets, but where do 'á' and 'å'
belong? And while 'color' follows 'chocolate' in English, what about in Spanish?
The following collations all make sense and you may meet any of them if you "use
locale".
A B C D E a b c d e
A a B b C c D d E e
a A b B c C d D e E
a b c d e A B C D E
Here is a code snippet to tell what "word" characters are in the current locale,
in that locale's order:
use locale;
print +(sort grep /\w/, map { chr } 0..255), "\n";
Compare this with the characters that you see and their order if you state explicitly that
the locale should be ignored:
no locale;
print +(sort grep /\w/, map { chr } 0..255), "\n";
This machine-native collation (which is what you get unless use locale has
appeared earlier in the same block) must be used for sorting raw binary data, whereas the
locale-dependent collation of the first example is useful for natural text.
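Because use locale is lexically scoped, both collations can coexist in one program. The following is a minimal sketch, not taken from the original text; the word list is arbitrary, and the locale-ordered result depends entirely on your LC_COLLATE setting, so no particular output is claimed for it.

```perl
use strict;
use warnings;

my @words = qw(banana Apple cherry apple Banana);

# Outside any "use locale" scope: machine-native (code point) order.
my @native = sort @words;

my @by_locale;
{
    # "use locale" affects only the enclosing block.
    use locale;
    @by_locale = sort @words;
}

print "native: @native\n";
print "locale: @by_locale\n";
```

In the "C" locale the two lines are normally identical; in most natural-language locales the second one interleaves upper and lower case.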
As noted in USING LOCALES, cmp compares according
to the current collation locale when use locale is in effect, but falls back to a
char-by-char comparison for strings that the locale says are equal. You can use POSIX::strcoll()
if you don't want this fall-back:
use POSIX qw(strcoll);
$equal_in_locale =
!strcoll("space and case ignored", "SpaceAndCaseIgnored");
$equal_in_locale will be true if the collation locale specifies a dictionary-like ordering
that ignores space characters completely and which folds case.
If you have a single string that you want to check for "equality in locale"
against several others, you might think you could gain a little efficiency by using
POSIX::strxfrm() in conjunction with eq:
use POSIX qw(strxfrm);
$xfrm_string = strxfrm("Mixed-case string");
print "locale collation ignores spaces\n"
if $xfrm_string eq strxfrm("Mixed-casestring");
print "locale collation ignores hyphens\n"
if $xfrm_string eq strxfrm("Mixedcase string");
print "locale collation ignores case\n"
if $xfrm_string eq strxfrm("mixed-case string");
strxfrm() takes a string and maps it into a transformed string for use in char-by-char
comparisons against other transformed strings during collation. "Under the hood",
locale-affected Perl comparison operators call strxfrm() for both operands, then do a
char-by-char comparison of the transformed strings. By calling strxfrm() explicitly and using
a non locale-affected comparison, the example attempts to save a couple of transformations.
But in fact, it doesn't save anything: Perl magic (see "Magic Variables" in perlguts)
creates the transformed version of a string the first time it's needed in a
comparison, then keeps this version around in case it's needed again. An example rewritten the
easy way with cmp runs just about as fast. It also copes with null characters
embedded in strings; if you call strxfrm() directly, it treats the first null it finds as a
terminator. Don't expect the transformed strings it produces to be portable across systems--or
even from one revision of your operating system to the next. In short, don't call strxfrm()
directly: let Perl do it for you.
Note: use locale isn't shown in some of these examples because it isn't
needed: strcoll() and strxfrm() exist only to generate locale-dependent results, and so always
obey the current LC_COLLATE locale.
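If you want to check one string against several for equality in the locale without calling strxfrm() yourself, strcoll() can simply be applied in a loop. This is a rough sketch under that assumption; equal_in_locale() is a hypothetical helper name, and whether any of the variants compare equal depends on your LC_COLLATE locale.

```perl
use strict;
use warnings;
use POSIX qw(strcoll);

# Hypothetical helper: true if the collation locale treats the two
# strings as equal. No char-by-char fall-back is applied.
sub equal_in_locale {
    my ($x, $y) = @_;
    return strcoll($x, $y) == 0 ? 1 : 0;
}

my $reference = "Mixed-case string";
my %variants  = (
    spaces  => "Mixed-casestring",
    hyphens => "Mixedcase string",
    case    => "mixed-case string",
);

for my $what (sort keys %variants) {
    print "locale collation ignores $what\n"
        if equal_in_locale($reference, $variants{$what});
}
```

In the "C" locale none of the messages is expected to appear, because that locale compares strings by character code.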
In the scope of use locale, Perl obeys the LC_CTYPE locale
setting. This controls the application's notion of which characters are alphabetic. This
affects Perl's \w regular expression metanotation, which stands for alphanumeric
characters--that is, alphabetic and numeric characters, plus the
underscore. (Consult perlre
for more information about regular expressions.) Thanks to LC_CTYPE, depending on
your locale setting, characters like 'æ', 'ð', 'ß', and 'ø' may be understood as \w
characters.
The LC_CTYPE locale also provides the map used in transliterating characters
between lower and uppercase. This affects the case-mapping functions--lc(), lcfirst(), uc(), and
ucfirst(); case-mapping interpolation with \l, \L, \u,
or \U in double-quoted strings and s/// substitutions; and
case-independent regular expression pattern matching using the i modifier.
Finally, LC_CTYPE affects the POSIX character-class test functions--isalpha(),
islower(), and so on. For example, if you move from the "C" locale to a 7-bit
Scandinavian one, you may find--possibly to your surprise--that "|" moves from the
ispunct() class to isalpha().
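As an illustration of these effects, the sketch below upper-cases a string and tests it against \w inside a use locale block. It assumes the input is a byte string in the encoding your locale expects (Latin-1 is used here purely as an example); the result for the non-ASCII character varies with the LC_CTYPE locale, so only the character codes are printed.

```perl
use strict;
use warnings;

# "aeble" spelled with a Latin-1 ae ligature (0xE6) as its first character.
my $word = "\xE6ble";

my ($uc_locale, $is_word_locale);
{
    use locale;
    $uc_locale      = uc $word;                      # case map from LC_CTYPE
    $is_word_locale = ($word =~ /^\w+$/) ? 1 : 0;    # \w as defined by LC_CTYPE
}

# %vd prints the ordinal of each character, joined by dots.
printf "uppercased under locale: %vd\n", $uc_locale;
print  "all word characters under locale: $is_word_locale\n";
```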
Note: A broken or malicious LC_CTYPE locale definition may result in
clearly ineligible characters being considered to be alphanumeric by your application. For
strict matching of (mundane) letters and digits--for example, in command strings--locale-aware
applications should use \w inside a no locale block. See "SECURITY".
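A minimal sketch of such a strict check follows. is_safe_command() is a hypothetical name, and the sketch goes slightly further than the advice above by spelling out an explicit ASCII character class instead of relying on \w, which is an assumption about how strict you want to be.

```perl
use strict;
use warnings;
use locale;    # the rest of the program is locale-aware

# Hypothetical validator for command words.
sub is_safe_command {
    my ($cmd) = @_;
    no locale;    # lexically scoped: locale rules are off inside this sub
    return $cmd =~ /\A[A-Za-z0-9_]+\z/ ? 1 : 0;
}

for my $cmd ("reload_config", "rm -rf", "caf\xE9") {
    printf "%-15s %s\n", $cmd, is_safe_command($cmd) ? "accepted" : "rejected";
}
```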
In the scope of use locale, Perl obeys the LC_NUMERIC locale
information, which controls an application's idea of how numbers should be formatted for human
readability by the printf(), sprintf(), and write() functions. String-to-numeric conversion by
the POSIX::strtod() function is also affected. In most implementations the only effect is to
change the character used for the decimal point--perhaps from '.' to ','. These functions
aren't aware of such niceties as thousands separation and so on. (See "The localeconv function" if you care about these things.)
Output produced by print() is also affected by the current locale: within the scope of use
locale it uses the locale's decimal point, whereas under no locale it corresponds to what
you'd get from printf() in the "C" locale. The same is true for Perl's internal conversions
between numeric and string formats:
use POSIX qw(strtod);
use locale;
$n = 5/2; # Assign numeric 2.5 to $n
$a = " $n"; # Locale-dependent conversion to string
print "half five is $n\n"; # Locale-dependent output
printf "half five is %g\n", $n; # Locale-dependent output
print "DECIMAL POINT IS COMMA\n"
if $n == (strtod("2,5"))[0]; # Locale-dependent conversion
See also I18N::Langinfo and RADIXCHAR.
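If you do want thousands separation, one possible approach is to read the separators from POSIX::localeconv() and apply them yourself. The sketch below is only an illustration: format_number() is a hypothetical helper, and it assumes groups of three digits rather than honoring the locale's grouping field.

```perl
use strict;
use warnings;
use POSIX qw(localeconv);

# localeconv() returns a hash reference; fields that are empty in the
# current locale may be absent, so defaults are supplied.
my $lconv = localeconv();
my $radix = $lconv->{decimal_point} // '.';
my $sep   = $lconv->{thousands_sep} // '';

# Hypothetical helper, not part of POSIX.
sub format_number {
    my ($n, $decimals) = @_;
    # No "use locale" in scope, so sprintf emits "." as the radix here.
    my ($int, $frac) = split /\./, sprintf('%.*f', $decimals, $n);
    # Insert the separator every three digits; skip if it is empty.
    if (length $sep) {
        1 while $int =~ s/^(-?\d+)(\d{3})/$1$sep$2/;
    }
    return defined $frac ? "$int$radix$frac" : $int;
}

print format_number(1234567.891, 2), "\n";
```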
The C standard defines the LC_MONETARY category, but no function that is
affected by its contents. (Those with experience of standards committees will recognize that
the working group decided to punt on the issue.) Consequently, Perl takes no notice of it. If
you really want to use LC_MONETARY, you can query its contents--see "The localeconv function"--and use the information that it
returns in your application's own formatting of currency amounts. However, you may well find
that the information, voluminous and complex though it may be, still does not quite meet your
requirements: currency formatting is a hard nut to crack.
See also I18N::Langinfo and CRNCYSTR.
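To give an idea of what such do-it-yourself formatting might look like, here is a deliberately simplified sketch. format_money() is a hypothetical helper; it uses only a handful of the localeconv() fields, ignores grouping and the negative-amount conventions, treats any nonzero p_sep_by_space as a single space, and falls back to two decimal places when the locale gives no usable value.

```perl
use strict;
use warnings;
use POSIX qw(localeconv);

# Hypothetical helper: a rough currency formatter.
sub format_money {
    my ($amount) = @_;
    my $lc = localeconv();

    my $symbol   = $lc->{currency_symbol}   // '';
    my $point    = $lc->{mon_decimal_point} // '.';
    my $digits   = $lc->{frac_digits}       // 2;
    my $precedes = $lc->{p_cs_precedes}     // 1;
    my $space    = ($lc->{p_sep_by_space} // 0) ? ' ' : '';

    $digits = 2 if $digits > 9;           # guard against "not available"
    $space  = '' unless length $symbol;   # no dangling space

    my $sign   = $amount < 0 ? '-' : '';
    my $number = sprintf '%.*f', $digits, abs $amount;
    $number =~ s/\./$point/;              # swap in the monetary radix

    return $precedes ? "$sign$symbol$space$number"
                     : "$sign$number$space$symbol";
}

print format_money(1234.5), "\n";
```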
Output produced by POSIX::strftime(), which builds a formatted human-readable date/time
string, is affected by the current LC_TIME locale. Thus, in a French locale, the
output produced by the %B format element (full month name) for the first month of
the year would be "janvier". Here's how to get a list of long month names in the
current locale:
use POSIX qw(strftime);
for (0..11) {
$long_month_name[$_] =
strftime("%B", 0, 0, 0, 1, $_, 96);
}
Note: use locale isn't needed in this example: as a function that exists only
to generate locale-dependent results, strftime() always obeys the current LC_TIME
locale.
See also I18N::Langinfo and ABDAY_1..ABDAY_7,
DAY_1..DAY_7, ABMON_1..ABMON_12, and MON_1..MON_12.
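As an alternative to strftime(), the names can be fetched directly with I18N::Langinfo. The following is a small sketch assuming your platform provides langinfo() and the constants shown; the strings returned depend on the current locale.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo DAY_1 DAY_2 DAY_3 DAY_4 DAY_5 DAY_6 DAY_7
                      MON_1 RADIXCHAR);

# DAY_1 is the locale's name for Sunday, DAY_7 for Saturday.
my @day_names = map { langinfo($_) }
    DAY_1(), DAY_2(), DAY_3(), DAY_4(), DAY_5(), DAY_6(), DAY_7();

my $first_month = langinfo(MON_1());       # full name of the first month
my $radix       = langinfo(RADIXCHAR());   # decimal point character

print "days:        @day_names\n";
print "first month: $first_month\n";
print "radix:       $radix\n";
```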
The remaining locale category, LC_MESSAGES (possibly supplemented by others in
particular implementations) is not currently used by Perl--except possibly to affect the
behavior of library functions called by extensions outside the standard Perl distribution and
by the operating system and its utilities. Note especially that the string value of $!
and the error messages given by external utilities may be changed by LC_MESSAGES.
If you want to have portable error codes, use %!. See Errno.
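A short sketch of the difference: the text of $! may be localized, whereas the %! hash and the Errno constants are not. The path used here is a made-up one that is assumed not to exist.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

my $missing = "/nonexistent/path/for/demo";   # hypothetical path

my ($was_enoent, $same_code) = (0, 0);
if (!open(my $fh, '<', $missing)) {
    # Capture the portable information right away, before $! can change.
    $was_enoent = $!{ENOENT} ? 1 : 0;         # symbolic check via %!
    $same_code  = ($! == ENOENT) ? 1 : 0;     # numeric check via Errno
    print "message (possibly localized): $!\n";
}

print "file does not exist\n" if $was_enoent;
```

Branching on $!{ENOENT} rather than on the message text keeps the logic working whatever LC_MESSAGES is set to.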