Website hosting service by Active-Venture.com
  

 Back to Index

Exotic Templates

Bit Strings

Bits are the atoms in the memory world. Access to individual bits may have to be used either as a last resort or because it is the most convenient way to handle your data. Bit string (un)packing converts between strings containing a series of 0 and 1 characters and a sequence of bytes each containing a group of 8 bits. This is almost as simple as it sounds, except that there are two ways the contents of a byte may be written as a bit string. Let's have a look at an annotated byte:

 
     7 6 5 4 3 2 1 0
   +-----------------+
   | 1 0 0 0 1 1 0 0 |
   +-----------------+
    MSB           LSB  

It's egg-eating all over again: Some think that as a bit string this should be written "10001100" i.e. beginning with the most significant bit, others insist on "00110001". Well, Perl isn't biased, so that's why we have two bit string codes:

 
   $byte = pack( 'B8', '10001100' ); # start with MSB
   $byte = pack( 'b8', '00110001' ); # start with LSB  

It is not possible to pack or unpack bit fields - just integral bytes. pack always starts at the next byte boundary and "rounds up" to the next multiple of 8 by adding zero bits as required. (If you do want bit fields, there is perlfunc/vec. Or you could implement bit field handling at the character string level, using split, substr, and concatenation on unpacked bit strings.)

To illustrate unpacking for bit strings, we'll decompose a simple status register (a "-" stands for a "reserved" bit):

 
   +-----------------+-----------------+
   | S Z - A - P - C | - - - - O D I T |
   +-----------------+-----------------+
    MSB           LSB MSB           LSB  

Converting these two bytes to a string can be done with the unpack template 'b16'. To obtain the individual bit values from the bit string we use split with the "empty" separator pattern which dissects into individual characters. Bit values from the "reserved" positions are simply assigned to undef, a convenient notation for "I don't care where this goes".

 
   ($carry, undef, $parity, undef, $auxcarry, undef, $sign,
    $trace, $interrupt, $direction, $overflow) =
      split( //, unpack( 'b16', $status ) );  

We could have used an unpack template 'b12' just as well, since the last 4 bits can be ignored anyway.

Uuencoding

Another odd-man-out in the template alphabet is u, which packs an "uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that you won't ever need this encoding technique which was invented to overcome the shortcomings of old-fashioned transmission mediums that do not support other than simple ASCII data. The essential recipe is simple: Take three bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to each. Repeat until all of the data is blended. Fold groups of 4 bytes into lines no longer than 60 and garnish them in front with the original byte count (incremented by 0x20) and a "\n" at the end. - The pack chef will prepare this for you, a la minute, when you select pack code u on the menu:

 
   my $uubuf = pack( 'u', $bindat );  

A repeat count after u sets the number of bytes to put into an uuencoded line, which is the maximum of 45 by default, but could be set to some (smaller) integer multiple of three. unpack simply ignores the repeat count.

Doing Sums

An even stranger template code is %<number>. First, because it's used as a prefix to some other template code. Second, because it cannot be used in pack at all, and third, in unpack, doesn't return the data as defined by the template code it precedes. Instead it'll give you an integer of number bits that is computed from the data value by doing sums. For numeric unpack codes, no big feat is achieved:

 
    my $buf = pack( 'iii', 100, 20, 3 );
    print unpack( '%32i3', $buf ), "\n";  # prints 123  

For string values, % returns the sum of the byte values saving you the trouble of a sum loop with substr and ord:

 
    print unpack( '%32A*', "\x01\x10" ), "\n";  # prints 17  

Although the % code is documented as returning a "checksum": don't put your trust in such values! Even when applied to a small number of bytes, they won't guarantee a noticeable Hamming distance.

In connection with b or B, % simply adds bits, and this can be put to good use to count set bits efficiently:

 
    my $bitcount = unpack( '%32b*', $mask );  

And an even parity bit can be determined like this:

 
    my $evenparity = unpack( '%1b*', $mask );
  

Unicode

Unicode is a character set that can represent most characters in most of the world's languages, providing room for over one million different characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with characters that are used in several European languages is in the next range, up to 255. After some more Latin extensions we find the character sets from languages using non-Roman alphabets, interspersed with a variety of symbol sets such as currency symbols, Zapf Dingbats or Braille. (You might want to visit www.unicode.org for a look at some of them - my personal favourites are Telugu and Kannada.)

The Unicode character sets associates characters with integers. Encoding these numbers in an equal number of bytes would more than double the requirements for storing texts written in Latin alphabets. The UTF-8 encoding avoids this by storing the most common (from a western point of view) characters in a single byte while encoding the rarer ones in three or more bytes.

So what has this got to do with pack? Well, if you want to convert between a Unicode number and its UTF-8 representation you can do so by using template code U. As an example, let's produce the UTF-8 representation of the Euro currency symbol (code number 0x20AC):

 
   $UTF8{Euro} = pack( 'U', 0x20AC );  

Inspecting $UTF8{Euro} shows that it contains 3 bytes: "\xe2\x82\xac". The round trip can be completed with unpack:

 
   $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );  

Usually you'll want to pack or unpack UTF-8 strings:

 
   # pack and unpack the Hebrew alphabet
   my $alefbet = pack( 'U*', 0x05d0..0x05ea );
   my @hebrew = unpack( 'U*', $utf );
  

Another Portable Binary Encoding

The pack code w has been added to support a portable binary data encoding scheme that goes way beyond simple integers. (Details can be found at Casbah.org, the Scarab project.) A BER (Binary Encoded Representation) compressed unsigned integer stores base 128 digits, most significant digit first, with as few digits as possible. Bit eight (the high bit) is set on each byte except the last. There is no size limit to BER encoding, but Perl won't go to extremes.

 
   my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );  

A hex dump of $berbuf, with spaces inserted at the right places, shows 01 8100 8101 81807F. Since the last byte is always less than 128, unpack knows where to stop.

 

 

 

Domain name registration service & domain search - 
Register cheap domain name from $7.95 and enjoy free domain services 
 

Cheap domain name search service -
Domain name services at just
$8.95/year only
 

Register domain name -
Buy domain name registration and cheap domain transfer at low, affordable price.

© 2002-2004 Active-Venture.com Web Site Hosting Service

 

[ Computers do not solve problems, they execute solutions.   ]

 

 
 
 

Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com