Charset to UTF
Good Old Old Days: Is there any language other than American? EBCDIC, ASCII.
Good Old Days: ASCII, extended with one extra alphabet at a time: Latin (French, Italian, German, etc.), or Greek, or Hebrew, or Russian, etc.
Multibyte: Japanese (SJIS, EUC), Chinese (Big5, GB), Korean.
Babel’s Tower
Many Languages: Hebrew, Japanese, Arabic, all in the same doc/line/screen.
Unicode: All languages. Each char takes 2 bytes (in the original 16-bit design). Problem: the result is not a regular byte string but wide chars.
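The "not a string" problem can be seen with a small sketch. It assumes the 2-byte form is UTF-16BE (big-endian, no byte-order mark), which is only one possible layout, and uses the standard Encode module. Plain ASCII text picks up zero bytes, so code that treats a zero byte as end-of-string stops immediately.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the ASCII text "AB" as 2-byte units (assumed layout: UTF-16BE).
my $wide = encode("UTF-16BE", "AB");

# Dump the bytes as hex: every ASCII char gains a leading zero byte.
my $hex = unpack("H*", $wide);
print "$hex\n";

# Position of the first zero byte; a C-style string would end there.
my $first_nul = index($wide, "\0");
print "first zero byte at offset $first_nul\n";
```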
UTF-8: One to one with Unicode. Each 2-byte Unicode char becomes 1-3 regular (8-bit) chars. Well-defined algorithm.
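A minimal sketch of that algorithm, written by hand for illustration. The helper name utf8_bytes is hypothetical, it handles only code points up to U+FFFF (the 2-byte range discussed here), and it does not check for surrogates. In practice the Encode module does this work.

```perl
use strict;
use warnings;

# Hand-rolled UTF-8 encoder for code points up to U+FFFF.
# Returns the list of byte values. Hypothetical helper, for illustration only.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0xxxxxxx
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F))    # 110xxxxx 10xxxxxx
        if $cp < 0x800;
    return (0xE0 | ($cp >> 12),                        # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),                # 10xxxxxx
            0x80 | ($cp & 0x3F));                      # 10xxxxxx
}

# HEBREW LETTER ALEF, U+05D0, falls in the two-byte range.
print join(" ", map { sprintf("%02X", $_) } utf8_bytes(0x05D0)), "\n";   # D7 90
```

The number of output bytes depends only on the size of the code point, which is why ASCII text stays byte-for-byte unchanged.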
Hebrew to Unicode
Unicode  Byte  Name
05D0     60    HEBREW LETTER ALEF
05D1     61    HEBREW LETTER BET
05D2     62    HEBREW LETTER GIMEL
05D3     63    HEBREW LETTER DALET
05D4     64    HEBREW LETTER HE
05D5     65    HEBREW LETTER VAV
05D6     66    HEBREW LETTER ZAYIN
05D7     67    HEBREW LETTER HET
05D8     68    HEBREW LETTER TET
05D9     69    HEBREW LETTER YOD
05DA     6A    HEBREW LETTER FINAL KAF
05DB     6B    HEBREW LETTER KAF
05DC     6C    HEBREW LETTER LAMED
05DD     6D    HEBREW LETTER FINAL MEM
05DE     6E    HEBREW LETTER MEM
and likewise for each charset.
Need for Conversion: Existing data. New data: editors work in specific charsets, not in UTF/Unicode.
Brute Force: For each original char, convert it to UTF.
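A rough sketch of the brute-force approach, using only the rows listed in the Hebrew to Unicode table (bytes 60..6E map to U+05D0..U+05DE). The helper name brute_force_to_utf8 is hypothetical, and passing unlisted bytes through unchanged is an assumption, not something the table states.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte => code point, covering only the rows listed in the Hebrew table
# (60..6E => U+05D0..U+05DE). A real table would cover the whole charset.
my %to_unicode = map { $_ => 0x05D0 + ($_ - 0x60) } 0x60 .. 0x6E;

# Hypothetical helper: convert a byte string one char at a time.
# Assumption: bytes that are not in the table are passed through unchanged.
sub brute_force_to_utf8 {
    my ($bytes) = @_;
    my $text = join "", map {
        my $byte = ord($_);
        chr($to_unicode{$byte} // $byte);
    } split //, $bytes;
    return encode("UTF-8", $text);   # Unicode chars -> UTF-8 bytes
}

# Bytes 60 61 are ALEF, BET in the table.
my $utf8 = brute_force_to_utf8("\x60\x61");
print unpack("H*", $utf8), "\n";
```

One such table is needed per source charset, which is what makes this approach brute force.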
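Perl way 1
use Encode;
my ($if, $of) = @ARGV;    # input file, output file
# Assumption: iso-8859-8 as the source charset; use the name of your own input charset.
open my $in,  "<:encoding(iso-8859-8)", $if or die "Cannot read $if: $!";
open my $out, ">:encoding(utf8)",       $of or die "Cannot write $of: $!";
while (<$in>) { print $out $_; }    # decoded on read, encoded as UTF-8 on write
close $in;
close $out;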
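Perl way 2
perl -MEncode -e '($if, $of) = @ARGV; open my $in, "<:encoding(iso-8859-8)", $if; open my $out, ">:encoding(utf8)", $of; while (<$in>) { print $out $_; }' infile outfile
Here iso-8859-8 is an assumed example and stands for whatever charset the input file uses.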
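Both ways above convert whole files through I/O layers. For data already in memory, the same Encode module offers decode and encode. A small sketch, assuming iso-8859-8 as the source charset (where bytes E0 and E1 are ALEF and BET):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $legacy = "\xE0\xE1";                      # ALEF, BET as iso-8859-8 bytes
my $text   = decode("iso-8859-8", $legacy);   # bytes -> Perl character string
my $utf8   = encode("UTF-8", $text);          # character string -> UTF-8 bytes

print unpack("H*", $utf8), "\n";              # hex dump of the UTF-8 bytes
```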