|
Bits are the atoms in the memory world. You may need to access individual bits either
as a last resort or because it is the most convenient way to handle your data. Bit string (un)packing
converts between strings containing a series of 0 and 1 characters and
a sequence of bytes each containing a group of 8 bits. This is almost as simple as it sounds,
except that there are two ways the contents of a byte may be written as a bit string. Let's have
a look at an annotated byte:
7 6 5 4 3 2 1 0
+-----------------+
| 1 0 0 0 1 1 0 0 |
+-----------------+
MSB LSB
|
|
It's egg-eating all over again: Some think that as a bit string this should be written
"10001100" i.e. beginning with the most significant bit, others insist on
"00110001". Well, Perl isn't biased, so that's why we have two bit string codes:
$byte = pack( 'B8', '10001100' ); # start with MSB
$byte = pack( 'b8', '00110001' ); # start with LSB
|
|
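As a quick check, the following minimal sketch confirms that both templates describe the same byte and shows how each one unpacks it again. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Both bit strings describe the same byte, 0x8C.
my $msb_first = pack( 'B8', '10001100' );   # most significant bit first
my $lsb_first = pack( 'b8', '00110001' );   # least significant bit first

my $same    = $msb_first eq $lsb_first;     # true
my $as_B    = unpack( 'B8', $msb_first );   # '10001100'
my $as_b    = unpack( 'b8', $msb_first );   # '00110001'
my $ordinal = ord( $msb_first );            # 140, i.e. 0x8C

print "$as_B $as_b $ordinal\n";
```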
It is not possible to pack or unpack bit fields - just integral bytes. pack
always starts at the next byte boundary and "rounds up" to the next multiple of 8 by
adding zero bits as required. (If you do want bit fields, there is vec, described in perlfunc. Or you could
implement bit field handling at the character string level, using split, substr, and
concatenation on unpacked bit strings.)
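The padding rule and the two bit-field alternatives mentioned above can be sketched as follows. This is only an illustration; the sample byte and the choice of a 2-bit field are assumptions made for the example.

```perl
use strict;
use warnings;

# Three bits are padded with zero bits to a full byte: 10100000.
my $padded = pack( 'B3', '101' );
my $hex    = unpack( 'H2', $padded );              # 'a0'

# vec addresses single bits; bit 0 is the least significant bit.
my $byte = "\x8C";                                 # 10001100
my $bit2 = vec( $byte, 2, 1 );                     # 1
my $bit0 = vec( $byte, 0, 1 );                     # 0

# String-level alternative: cut a 2-bit field out of the unpacked string.
my $bits  = unpack( 'B8', $byte );                 # '10001100'
my $field = oct( '0b' . substr( $bits, 4, 2 ) );   # '11' -> 3

print "$hex $bit2 $bit0 $field\n";
```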
To illustrate unpacking for bit strings, we'll decompose a simple status register (a
"-" stands for a "reserved" bit):
+-----------------+-----------------+
| S Z - A - P - C | - - - - O D I T |
+-----------------+-----------------+
MSB LSB MSB LSB
|
|
Converting these two bytes to a string can be done with the unpack template 'b16'.
To obtain the individual bit values from the bit string we use split with the
"empty" separator pattern, which dissects the string into individual characters. Bit values from
the "reserved" positions are simply assigned to undef, a convenient
notation for "I don't care where this goes".
($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
$trace, $interrupt, $direction, $overflow) =
split( //, unpack( 'b16', $status ) );
|
|
We could have used an unpack template 'b12' just as well, since the last 4 bits
can be ignored anyway.
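To see the decomposition at work, here is a small sketch with a made-up register value. The contents of $status are an assumption chosen for the example; the flag order follows the diagram, read from the least significant bit of each byte.

```perl
use strict;
use warnings;

# Sample register: S=1 Z=0 A=0 P=1 C=1 in the first byte,
# O=1 D=0 I=1 T=0 in the second byte.
my $status = pack( 'B8 B8', '10000101', '00001010' );

my $all    = unpack( 'b16', $status );   # '1010000101010000'
my $twelve = unpack( 'b12', $status );   # '101000010101'

# The last four (reserved) bits are not needed, so 'b12' is enough.
my ( $carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
     $trace, $interrupt, $direction, $overflow ) = split( //, $twelve );

print "carry=$carry zero=$zero sign=$sign overflow=$overflow\n";
```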
Another odd-man-out in the template alphabet is u, which packs an
"uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that you
won't ever need this encoding technique, which was invented to overcome the shortcomings of
old-fashioned transmission media that support nothing but simple ASCII data. The
essential recipe is simple: Take three bytes, or 24 bits. Split them into 4 six-packs, adding a
space (0x20) to each. Repeat until all of the data is blended. Fold groups of 4 bytes into lines
no longer than 60 and garnish them in front with the original byte count (incremented by 0x20)
and a "\n" at the end. The pack chef will prepare this for
you, a la minute, when you select pack code u on the menu:
my $uubuf = pack( 'u', $bindat );
|
|
A repeat count after u sets the number of bytes to put into a uuencoded line,
which is the maximum of 45 by default, but could be set to some (smaller) integer multiple of
three. unpack simply ignores the repeat count.
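The recipe and the effect of the repeat count can be tried out with a short sketch. The sample data (the string 'Cat' and a buffer of 100 bytes) is an assumption made for the example.

```perl
use strict;
use warnings;

# Three bytes become four characters, preceded by the length character
# '#' (3 + 0x20) and followed by a newline.
my $line = pack( 'u', 'Cat' );                      # "#0V%T\n"

# 100 bytes at the default of 45 bytes per line: 45 + 45 + 10.
my $bindat  = join '', map { chr( $_ ) } 0 .. 99;
my @default = split /\n/, pack( 'u', $bindat );     # 3 lines

# A repeat count of 30 puts at most 30 bytes on each line.
my @short   = split /\n/, pack( 'u30', $bindat );   # 4 lines

# unpack needs no repeat count to restore the data.
my $roundtrip = unpack( 'u', pack( 'u30', $bindat ) ) eq $bindat;

print $line;
```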
An even stranger template code is %<number>. First, because it's
used as a prefix to some other template code. Second, because it cannot be used in pack
at all. And third, because in unpack it doesn't return the data as defined by the template
code it precedes. Instead it'll give you an integer of number bits that is computed from
the data value by doing sums. For numeric unpack codes, no big feat is achieved:
my $buf = pack( 'iii', 100, 20, 3 );
print unpack( '%32i3', $buf ), "\n"; # prints 123
|
|
For string values, % returns the sum of the byte values saving you the trouble
of a sum loop with substr and ord:
print unpack( '%32A*', "\x01\x10" ), "\n"; # prints 17
|
|
Although the % code is documented as returning a "checksum", don't put
your trust in such values! Even when applied to a small number of bytes, they won't guarantee a
noticeable Hamming distance.
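The following sketch compares the % prefix with an explicit sum loop and shows why the result is a weak checksum. It uses the C code (unsigned char) instead of A, which is an assumption made here so that every byte is counted as it stands.

```perl
use strict;
use warnings;

my $data = "\x01\x10\xff";

# The explicit loop that the % prefix saves you from writing.
my $sum = 0;
$sum += ord( substr( $data, $_, 1 ) ) for 0 .. length( $data ) - 1;

my $checksum = unpack( '%32C*', $data );   # 272, the same as $sum

# Reordering the bytes leaves the sum unchanged, so the change goes unnoticed.
my $collision = unpack( '%32C*', 'ab' ) == unpack( '%32C*', 'ba' );

print "$sum $checksum\n";
```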
In connection with b or B, % simply adds bits, and
this can be put to good use to count set bits efficiently:
my $bitcount = unpack( '%32b*', $mask );
|
|
And an even parity bit can be determined like this:
my $evenparity = unpack( '%1b*', $mask );
|
|
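Both idioms can be checked against a plain character count. The mask values below are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $mask = "\x8C\x01";                       # 10001100 00000001

my $bitcount   = unpack( '%32b*', $mask );   # 4 bits are set
my $evenparity = unpack( '%1b*', $mask );    # 0, the count is already even

# Cross-check by counting the '1' characters in the bit string.
my $bits  = unpack( 'b*', $mask );
my $ones  = ( $bits =~ tr/1// );

# A single byte with three set bits needs a parity bit of 1.
my $parity_odd = unpack( '%1b*', "\x8C" );

print "$bitcount $evenparity $parity_odd\n";
```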
Unicode is a character set that can represent most characters in most of the world's
languages, providing room for over one million different characters. Unicode 3.1 specifies
94,140 characters: The Basic Latin characters are assigned to the numbers 0 - 127. The Latin-1
Supplement with characters that are used in several European languages is in the next range, up
to 255. After some more Latin extensions we find the character sets from languages using
non-Roman alphabets, interspersed with a variety of symbol sets such as currency symbols, Zapf
Dingbats or Braille. (You might want to visit www.unicode.org for a look
at some of them - my personal favourites are Telugu and Kannada.)
The Unicode character set associates characters with integers. Encoding these numbers in an
equal number of bytes would more than double the requirements for storing texts written in Latin
alphabets. The UTF-8 encoding avoids this by storing the most common (from a western point of
view) characters in a single byte while encoding the rarer ones in two or more bytes.
So what has this got to do with pack? Well, if you want to convert between a
Unicode number and its UTF-8 representation you can do so by using template code U.
As an example, let's produce the UTF-8 representation of the Euro currency symbol (code number
0x20AC):
$UTF8{Euro} = pack( 'U', 0x20AC );
|
|
Inspecting $UTF8{Euro} shows that it contains 3 bytes: "\xe2\x82\xac".
The round trip can be completed with unpack:
$Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
|
|
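On current Perl versions the packed value behaves as a string of one character, and the three bytes show up once it is encoded explicitly. The sketch below assumes such a Perl and uses the built-in utf8::encode function to obtain the octets; the variable names are illustrative.

```perl
use strict;
use warnings;

my $euro  = pack( 'U', 0x20AC );
my $chars = length( $euro );          # 1 character
my $code  = unpack( 'U', $euro );     # 0x20AC again

# Work on a copy, because utf8::encode converts its argument in place.
my $octets = $euro;
utf8::encode( $octets );
my $hex = unpack( 'H*', $octets );    # 'e282ac'

printf "U+%04X is encoded as %s\n", $code, $hex;
```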
Usually you'll want to pack or unpack UTF-8 strings:
# pack and unpack the Hebrew alphabet
my $alefbet = pack( 'U*', 0x05d0..0x05ea );
my @hebrew = unpack( 'U*', $alefbet );
|
|
The pack code w has been added to support a portable binary data encoding scheme
that goes way beyond simple integers. (Details can be found at Casbah.org, the Scarab project.)
A BER (Binary Encoded Representation) compressed unsigned integer stores base 128 digits, most
significant digit first, with as few digits as possible. Bit eight (the high bit) is set on each
byte except the last. There is no size limit to BER encoding, but Perl won't go to extremes.
my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );
|
|
A hex dump of $berbuf, with spaces inserted at the right places, shows 01 8100
8101 81807F. Since the last byte is always less than 128, unpack knows where to
stop.
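To make the encoding rule concrete, the sketch below builds the same bytes by hand and compares them with what pack produces. The helper ber_encode is hypothetical, written only for this illustration, and handles non-negative integers that fit into a native Perl integer.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one non-negative integer as base 128 digits,
# most significant first, with the high bit set on all but the last byte.
sub ber_encode {
    my ( $n ) = @_;
    my @digits = ( $n & 0x7f );              # last digit, high bit clear
    while ( $n >>= 7 ) {
        unshift @digits, ( $n & 0x7f ) | 0x80;
    }
    return pack( 'C*', @digits );
}

my @values = ( 1, 128, 128 + 1, 128 * 128 + 127 );
my $berbuf = pack( 'w*', @values );
my $hex    = uc( unpack( 'H*', $berbuf ) );   # '018100810181807F'
my $byhand = join '', map { ber_encode( $_ ) } @values;
my @back   = unpack( 'w*', $berbuf );         # (1, 128, 129, 16511)

print "$hex\n";
```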
|
|