There are several ways of transforming data with an intra-character-set mapping, and they
serve a variety of purposes. Sorting was discussed in the previous section; a few of the
other more popular mapping techniques are discussed next.
Note that some URLs contain hexadecimal ASCII code points in an attempt to overcome
character or protocol limitations. For example, the tilde character is not on every
keyboard, hence a URL of the form:
http://www.pvhp.com/~pvhp/
may also be expressed as either of:
http://www.pvhp.com/%7Epvhp/
http://www.pvhp.com/%7epvhp/
where 7E is the hexadecimal ASCII code point for '~'. Here is an example of decoding such a
URL under CCSID 1047:
$url = 'http://www.pvhp.com/%7Epvhp/';
# this array assumes code page 1047
my @a2e_1047 = (
0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15,
16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7,
32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27,
48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255,
65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
);
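# "C" packs an unsigned byte; EBCDIC code points above 127 (such as
# 161 for '~') would wrap under the signed "c" template
$url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;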
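As an alternative to a hand-written table, the same decoding can be sketched with the core function utf8::unicode_to_native(), which is the identity on ASCII platforms and maps ASCII/Latin-1 code points to native ones on EBCDIC platforms. This swaps in a different technique from the table lookup above, and the helper name url_unescape is hypothetical:

```perl
use strict;
use warnings;

# Decode %XX escapes into native characters.
# hex($1) is an ASCII code point; utf8::unicode_to_native() converts
# it to the platform's own code point before chr() is applied.
sub url_unescape {
    my ($url) = @_;
    $url =~ s/%([0-9a-fA-F]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $url;
}

print url_unescape('http://www.pvhp.com/%7Epvhp/'), "\n";
```

Because the conversion is delegated to Perl itself, this sketch does not depend on a particular code page.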
Conversely, here is a partial solution for the task of encoding such a URL under the 1047
code page:
$url = 'http://www.pvhp.com/~pvhp/';
# this array assumes code page 1047
my @e2a_1047 = (
0, 1, 2, 3,156, 9,134,127,151,141,142, 11, 12, 13, 14, 15,
16, 17, 18, 19,157, 10, 8,135, 24, 25,146,143, 28, 29, 30, 31,
128,129,130,131,132,133, 23, 27,136,137,138,139,140, 5, 6, 7,
144,145, 22,147,148,149,150, 4,152,153,154,155, 20, 21,158, 26,
32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124,
38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94,
45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63,
248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34,
216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177,
176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164,
181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174,
172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215,
123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245,
125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255,
92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159
);
# The following regular expression does not address the
# mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A')
$url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge;
where a more complete solution would split the URL into components and apply a full s///
substitution only to the appropriate parts.
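One way such a component-wise approach might look is sketched here. It is an illustration under stated assumptions, not a complete URL encoder: only the path is escaped, query strings and fragments are not handled, the set of characters left unescaped is a choice made for this sketch, and the names escape_segment and escape_url_path are hypothetical. It uses the core function utf8::native_to_unicode() in place of an @e2a table:

```perl
use strict;
use warnings;

# Percent-encode one path segment. Letters, digits, '.', '_' and '-'
# are left alone (an assumption of this sketch); everything else,
# including '%', is written as %XX using its ASCII code point.
sub escape_segment {
    my ($seg) = @_;
    $seg =~ s/([^A-Za-z0-9._-])/sprintf("%%%02X", utf8::native_to_unicode(ord($1)))/ge;
    return $seg;
}

# Leave "scheme://authority" untouched and escape each path segment,
# so that the '/' separators themselves are never encoded.
sub escape_url_path {
    my ($url) = @_;
    my ($prefix, $path) = $url =~ m{^([A-Za-z][A-Za-z0-9+.-]*://[^/]*)(.*)$}
        or return $url;    # not in the expected form: return unchanged
    # The -1 limit keeps trailing empty fields, so a trailing '/' survives
    return $prefix . join('/', map { escape_segment($_) } split m{/}, $path, -1);
}

print escape_url_path('http://www.pvhp.com/~pvhp/'), "\n";
```

Since '%' is itself escaped, this sketch should only be applied to a URL that has not been encoded already.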
In the remaining examples a @e2a or @a2e array may be employed but the assignment will not
be shown explicitly. For code page 1047 you could use the @a2e_1047 or @e2a_1047 arrays just
shown.
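Because the two tables are inverses of each other, one can be derived from the other instead of being typed in by hand. A minimal sketch, assuming the @a2e_1047 table shown earlier (repeated here only so that the block is self-contained); the name @e2a_derived is hypothetical:

```perl
use strict;
use warnings;

# ASCII/Latin-1 to EBCDIC map for code page 1047, as shown earlier
my @a2e_1047 = (
      0,  1,  2,  3, 55, 45, 46, 47, 22,  5, 21, 11, 12, 13, 14, 15,
     16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
     64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
    240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
    124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
    215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
    121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
    151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161,  7,
     32, 33, 34, 35, 36, 37,  6, 23, 40, 41, 42, 43, 44,  9, 10, 27,
     48, 49, 26, 51, 52, 53, 54,  8, 56, 57, 58, 59,  4, 20, 62,255,
     65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
    144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
    100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
    172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
     68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
    140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
);

# Invert the map: if ASCII code $i maps to EBCDIC code $a2e_1047[$i],
# then that EBCDIC code maps back to $i.
my @e2a_derived;
$e2a_derived[ $a2e_1047[$_] ] = $_ for 0 .. 255;

printf "EBCDIC %d maps to ASCII %d\n", 193, $e2a_derived[193];
```

The same loop with the roles of the two arrays exchanged would derive an a2e table from an e2a table.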
The u template to pack() or unpack() will render EBCDIC data in EBCDIC
characters equivalent to their ASCII counterparts. For example, the following will print
"Yes indeed\n" on either an ASCII or EBCDIC computer:
$all_byte_chrs = '';
for (0..255) { $all_byte_chrs .= chr($_); }
$uuencode_byte_chrs = pack('u', $all_byte_chrs);
($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm;
M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL
M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9
M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6&
MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S
MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@
?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P``
ENDOFHEREDOC
if ($uuencode_byte_chrs eq $uu) {
    print "Yes ";
}
$uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
if ($uudecode_byte_chrs eq $all_byte_chrs) {
    print "indeed\n";
}
Here is a very spartan uudecoder that will work on EBCDIC provided that the @e2a array is
filled in appropriately:
#!/usr/local/bin/perl
@e2a = ( # this must be filled in
);
$_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/;
open(OUT, "> $file") if $file ne "";
while(<>) {
    last if /^end/;
    next if /[a-z]/;
    next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) ==
                int(length() / 4);
    print OUT unpack("u", $_);
}
close(OUT);
chmod oct($mode), $file;
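The length test in the loop above may be easier to follow when spelled out. The first character of a uuencoded line carries the number of payload bytes (its ASCII code minus 32, masked to six bits), and every three payload bytes occupy four encoded characters. A minimal sketch of that check, with the hypothetical helper name uu_line_ok and with utf8::native_to_unicode() standing in for the @e2a lookup:

```perl
use strict;
use warnings;

# Return true when the length character of a uuencoded line agrees
# with the total length of the line (including its newline).
sub uu_line_ok {
    my ($line) = @_;
    # ASCII code of the first character, whatever the native encoding
    my $first  = utf8::native_to_unicode(ord($line));
    my $nbytes = ($first - 32) & 077;       # payload bytes on this line
    my $groups = int(($nbytes + 2) / 3);    # 3 bytes -> 4 encoded chars
    return $groups == int(length($line) / 4);
}

my $line = pack('u', 'Cat');    # one short uuencoded line
print uu_line_ok($line) ? "consistent\n" : "inconsistent\n";
```

As in the original decoder, this is a plausibility check rather than a strict validation: lines whose length is off by fewer than four characters still pass.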
On ASCII-encoded machines it is possible to encode characters outside of the printable set
as Quoted-Printable (QP) using:
# This QP encoder works on ASCII only
$qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;
Whereas a QP encoder that works on both ASCII and EBCDIC machines would look somewhat like
the following (where the EBCDIC branch @e2a array is omitted for brevity):
if (ord('A') == 65) {    # ASCII
    $delete = "\x7F";    # ASCII
    @e2a = (0 .. 255);   # ASCII to ASCII identity map
}
else {                   # EBCDIC
    $delete = "\x07";    # EBCDIC
    @e2a = (             # EBCDIC to ASCII map (as shown above)
    );
}
$qp_string =~
s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge;
(although in production code the substitutions might be done in the EBCDIC branch with the
@e2a array and separately in the ASCII branch without the expense of the identity map).
Such QP strings can be decoded with:
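# This QP decoder is limited to ASCII only
# Remove a trailing soft line break first, so that a decoded '='
# followed by a newline is not mistaken for one
$string =~ s/=[\n\r]+$//;
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge;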
Whereas a QP decoder that works on both ASCII and EBCDIC machines would look somewhat like
the following (where the @a2e array is omitted for brevity):
# Remove a trailing soft line break before decoding the =XX escapes
$string =~ s/=[\n\r]+$//;
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge;
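The encoder and decoder can be exercised together in a round trip. The following is a sketch under stated assumptions rather than a full QP implementation: it works on byte strings only, it does not wrap long lines, it encodes every control character (including newlines and DEL), and it uses the core functions utf8::native_to_unicode() and utf8::unicode_to_native() in place of the @e2a and @a2e tables. The names qp_encode and qp_decode are hypothetical:

```perl
use strict;
use warnings;

# Encode '=' and every byte outside printable ASCII as =XX,
# where XX is the byte's ASCII (not native) code point in hex.
sub qp_encode {
    my ($s) = @_;
    $s =~ s{(.)}{
        my $ascii = utf8::native_to_unicode(ord($1));
        ($ascii < 32 || $ascii > 126 || $ascii == 61)
            ? sprintf("=%02X", $ascii)
            : $1
    }gse;
    return $s;
}

# Reverse the encoding: drop a trailing soft line break, then turn
# each =XX escape back into the corresponding native character.
sub qp_decode {
    my ($s) = @_;
    $s =~ s/=[\n\r]+$//;
    $s =~ s/=([0-9A-Fa-f]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $s;
}

my $all_byte_chrs = join '', map { chr } 0 .. 255;
print "round trip ok\n"
    if qp_decode(qp_encode($all_byte_chrs)) eq $all_byte_chrs;
```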
The practice of shifting an alphabet one or more characters for encipherment dates back
thousands of years and was explicitly detailed by Gaius Julius Caesar in his Gallic Wars
text. A single alphabet shift is sometimes referred to as a rotation, and the shift amount
is given as a number $n after the string 'rot', or "rot$n". Rot0 and rot26 would
designate identity maps on the 26-letter English version of the Latin alphabet. Rot13 has the
interesting property that applying it twice in succession yields the identity map (thus
rot13 is its own non-trivial inverse in the group of 26 alphabet rotations). Hence the
following is a rot13 encoder and decoder that will work on ASCII and EBCDIC machines:
#!/usr/local/bin/perl
while(<>){
    tr/n-za-mN-ZA-M/a-zA-Z/;
    print;
}
In one-liner form:
perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print'
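The properties mentioned above (rot0 and rot26 as identity maps, rot13 as its own inverse) can be checked with a general rot$n routine. Since tr/// does not interpolate variables, this sketch looks letters up by their position in an alphabet string instead, which also avoids any dependence on code point order and so should behave the same on ASCII and EBCDIC machines. The function name rot is hypothetical:

```perl
use strict;
use warnings;

my $lower = 'abcdefghijklmnopqrstuvwxyz';
my $upper = uc $lower;

# Rotate every English letter in $text by $n places, preserving case
# and leaving all other characters untouched.
sub rot {
    my ($n, $text) = @_;
    $n %= 26;    # also normalizes negative shifts into 0..25
    $text =~ s{([A-Za-z])}{
        my $alphabet = index($lower, $1) >= 0 ? $lower : $upper;
        substr($alphabet, (index($alphabet, $1) + $n) % 26, 1)
    }ge;
    return $text;
}

my $msg = 'Hello, World!';
print rot(13, $msg), "\n";             # Uryyb, Jbeyq!
print rot(13, rot(13, $msg)), "\n";    # Hello, World!
```

A shift of -$n undoes a shift of $n, so rot(-3, rot(3, $msg)) also returns the original string.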