Text encoding: or how to get 黃慧儀 and Ίων Ανδρουτσόπουλος into the same document. Chris Brew Linguistics, Ohio State
Letters and fonts We all know some letters. Here are some lower-case letters: abcdefghijklmnopqrstuvwxyz And the upper-case versions of the same: ABCDEFGHIJKLMNOPQRSTUVWXYZ The letters above are in a variable-width font called Bradley Hand ITC. Note that upper-case and lower-case letters take different amounts of horizontal space.
Different fonts Here are the same letters in a fixed-width font called Courier New. Now they take the same amount of horizontal space. abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ How do we know that a and a are the same letter? Do we want to say that A and a are the same letter? Until the late 50s, only printers (and perhaps philosophers) needed to even ask these questions.
American Standard Code for Information Interchange (ASCII) In the old days (1963), every computer manufacturer had its own way of encoding numbers, letters and punctuation. This became a problem once computers from different manufacturers began to communicate. A message is a sequence of letters and numbers. But messages between computers are made of bits and bytes, not of letters and numbers. The solution is to adopt a standard character code. This is a precise definition of how to convert digits and letters into code points and back. An example of a character code is the venerable A=1, B=2, …, Z=26.
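As a minimal sketch of what "convert letters into code points and back" means, here is the A=1, B=2, …, Z=26 code written out in Python. The function names are hypothetical, chosen only for this illustration.

```python
import string

# The toy character code from the slide: A=1, B=2, ..., Z=26.
LETTER_TO_CODE = {letter: i for i, letter in enumerate(string.ascii_uppercase, start=1)}
CODE_TO_LETTER = {code: letter for letter, code in LETTER_TO_CODE.items()}


def to_code_points(message):
    """Convert a message of upper-case letters into a list of code points."""
    return [LETTER_TO_CODE[ch] for ch in message]


def from_code_points(codes):
    """Convert a list of code points back into the original message."""
    return "".join(CODE_TO_LETTER[c] for c in codes)


codes = to_code_points("HELLO")
print(codes)
print(from_code_points(codes))
```

This code has no way to represent a space, a digit, or a lower-case letter: `to_code_points("Hello world")` raises a `KeyError`. A repertoire that small is one reason a richer standard was needed.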
American Standard Code for Information Interchange (ASCII) Another is the ASCII character set
Extended ASCII
ASCII Realizing that the primary ASCII content is but a tiny subset of the alphabets and symbols which must be accommodated in a worldwide communication system, Bemer devised (in 1960) the universal switching concept in use today, via the Escape character he caused to be placed in ASCII and its registered alternates. ESC followed by (N tells the receiving device that the following characters will be the Cyrillic equivalent of ASCII, until further changed similarly. Why? Because that sequence introduces Set #37 of the International Register of Coded Character Sets to be used with Escape Sequences, maintained in Geneva.
ASCII ESC [31;42m tells video screens all over the world to change to red letters on a green background. Why? Because that is in the standard, both national and international, for controlling display terminals. Not all video screens now accept and display the Cyrillic alphabet. But they do in Russia, on the Internet. Nearly 200 other sets are now registered. Soon all users worldwide will have equipment that displays a wide range of the world's symbols and characters, alphabets or ideographs (Japanese was registered in 1969).
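The escape sequence quoted above can be tried directly on most terminal emulators. The following is a small sketch, assuming a terminal that honours these display control sequences; the helper name `sgr` is hypothetical.

```python
ESC = "\x1b"  # the Escape character, code value 27


def sgr(*params):
    """Build an ESC [ ... m sequence from numeric parameters."""
    return ESC + "[" + ";".join(str(p) for p in params) + "m"


RED_ON_GREEN = sgr(31, 42)   # 31 = red foreground, 42 = green background
RESET = sgr(0)               # 0 = back to the terminal's defaults

print(RED_ON_GREEN + "red letters on a green background" + RESET)
```

The point is that the colour change travels inside the character stream itself, as ordinary code values introduced by ESC.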
Character codes for interchange The essential feature of a character code is that it preserve the distinctions that are needed by the messages that we want to transmit. Suddenly the philosophical issues about “what is a letter” might have real bite, because if two characters map to the same code point, any differences between their original forms will be lost in transmission. Also, if this is going to be a long term standard, we have to understand the ways in which user needs are likely to change, and cater for them ahead of time. We might also want our character codes to have other properties, such as conciseness, suitability for use in a particular transmission medium, etc.
Terminology
Character repertoire: a set of distinct characters.
Character code: a map from a character repertoire to a set of non-negative integers. The non-negative integers are called code points, code numbers, code values, or code elements.
Character encoding: an algorithm for mapping sequences of code points into sequences of octets for storage on disk.
Examples
Character repertoire: "a", "!", and "ä"
Character code: in the ISO 10646 character code the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240.
Character encoding: in one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
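The numbers in this example can be checked in Python, whose strings are sequences of Unicode code points. This is only a sketch: the codec name `utf-16-be` is Python's label for a byte layout that, for these four characters, coincides with the two-octet encoding described above.

```python
text = "a!\u00e4\u2030"  # the string a ! ä ‰

# Character code: each character maps to a non-negative integer (its code point).
code_points = [ord(ch) for ch in text]
print(code_points)

# Character encoding: two octets per code point, most significant octet first.
octets = list(text.encode("utf-16-be"))
print(octets)
```

The first list is the character code at work, the second is the encoding: the same four code points become eight octets.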
The ASCII character code The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names and descriptions, but in fact their usage varies a lot.
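The printable part of the code can be regenerated from the rule just stated (consecutive values from 32 to 126). A short sketch follows; the choice of 16 characters per row is arbitrary and not taken from the standard.

```python
# Code values 32 (space) through 126 (tilde) are the printable ASCII characters.
printable = [chr(code) for code in range(32, 127)]

# Print them rowwise, 16 per row, preceded by the code value of the first character.
for start in range(32, 127, 16):
    row = "".join(chr(c) for c in range(start, min(start + 16, 127)))
    print(f"{start:3d}  {row}")
```

Because the values are consecutive, the code of any character is its row start plus its offset in the row.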
The ASCII character encoding The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value. Octets 128 through 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
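Since ASCII code values never exceed 127, the most significant bit of each octet is free. The sketch below shows one way that bit could carry even parity; the function names are hypothetical and the scheme is an illustration, not something the ASCII standard prescribes.

```python
def add_even_parity(code):
    """Return a 7-bit ASCII code with an even-parity bit in the top position."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code value")
    ones = bin(code).count("1")
    parity_bit = ones % 2          # 1 if the count of 1 bits is odd
    return code | (parity_bit << 7)


def strip_parity(octet):
    """Recover the ASCII code value by clearing the most significant bit."""
    return octet & 0x7F


for ch in "AC":
    octet = add_even_parity(ord(ch))
    print(ch, format(ord(ch), "07b"), format(octet, "08b"))
```

"A" (1000001) already has an even number of 1 bits and is sent unchanged, while "C" (1000011) gets its top bit set so that the octet as a whole has an even number of 1 bits.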
National variants of ASCII There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.
ISO 646 The international standard ISO 646 defines a character set similar to US-ASCII but with the code points corresponding to the US-ASCII characters @ [ \ ] { | } designated as "national use positions". It also gives some liberties with characters # $ ^ ` ~. The standard also defines an "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII.
Using ASCII Mainly due to the "national variants" discussed above, some characters are less "safe" than others, i.e. more often transferred or interpreted incorrectly. In addition to the letters of the English alphabet ("A" to "Z", and "a" to "z"), the digits ("0" to "9") and the space (" "), only the following characters can be regarded as really "safe" in data transmission: ! " % & ' ( ) * + , - . / : ; ?
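A check for "safe" characters is easy to write down. The sketch below uses exactly the set listed on this slide; the names `SAFE_CHARACTERS` and `unsafe_characters` are hypothetical.

```python
import string

# Letters, digits, space, and the punctuation listed on the slide.
SAFE_PUNCTUATION = "!\"%&'()*+,-./:;?"
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + " " + SAFE_PUNCTUATION)


def unsafe_characters(text):
    """Return the characters in text that fall outside the 'safe' set, in order."""
    return [ch for ch in text if ch not in SAFE_CHARACTERS]


print(unsafe_characters("Price: 10 units (approx.)"))
print(unsafe_characters("user@example.com costs $5 #1"))
```

The second string trips on "@", "$" and "#", which are among the positions that national variants were allowed to reassign.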
ISO Latin 1 In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 through 255 (the no-break space at 160 and the soft hyphen at 173 are not visible in the list), and they are: ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
The Windows character set(s) In ISO 8859-1, code positions 128 through 159 are explicitly reserved for control purposes; they "correspond to bit combinations that do not represent graphic characters". The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is not identical with ISO 8859-1.
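The difference between ISO 8859-1 and Windows code page 1252 shows up as soon as an octet in the range 128 through 159 is decoded. This is a small sketch using Python's built-in `latin-1` and `cp1252` codecs.

```python
octet = bytes([0x93])  # code position 147, inside the 128-159 range

as_latin1 = octet.decode("latin-1")   # ISO 8859-1: a control character
as_cp1252 = octet.decode("cp1252")    # Windows code page 1252: a printable character

print(repr(as_latin1), repr(as_cp1252))

# Outside 128-159 the two codes agree, for example at position 228.
assert bytes([228]).decode("latin-1") == bytes([228]).decode("cp1252")
```

Text written on Windows with "curly" quotation marks and then labelled as ISO 8859-1 is a classic source of this mismatch.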
The Windows character set(s) The Windows character set exists in different variations, or "code pages" (CP), which generally differ from the corresponding ISO 8859 standard in that they contain the same characters in positions 128 through 159 as code page 1252. (However, there are some more differences between ISO 8859-7 and win-1253 (WinGreek).)
The ISO 8859 family There are several character codes which are extensions to ASCII in the same sense as ISO 8859-1 and the Windows character set. The ISO 8859 codes extend the ASCII repertoire in different ways with different special characters (used in different languages and cultures). Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of western (and northern) Europe, there is ISO 8859-2 alias ISO Latin 2 constructed similarly for languages of central/eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 through 127 contain the same characters as in ASCII, positions 128 through 159 are unused (reserved for control characters), and positions 160 through 255 are the varying part, used differently in different members of the ISO 8859 family.
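The "same lower half, varying upper half" structure can be observed by decoding one octet under several members of the family. This sketch uses the codecs that ship with Python; the choice of octet 230 and of these four members is arbitrary.

```python
octet = bytes([0xE6])  # code position 230, in the varying part 160-255

members = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")
for codec in members:
    print(codec, octet.decode(codec))

# Positions 0-127 are shared: every member decodes them exactly as ASCII does.
ascii_part = bytes(range(128))
shared = all(
    ascii_part.decode(codec) == ascii_part.decode("ascii") for codec in members
)
print(shared)
```

The same octet comes out as a Western European ligature, a Central European accented letter, a Cyrillic letter, or a Greek letter, depending only on which member the receiver assumes.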
European Hegemony Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. And in practice, ISO 8859-15 alias ISO Latin 9 (!) will probably replace ISO 8859-1 to a great extent, since it contains the politically important symbol for the euro.
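The euro sign makes the difference between the two codes concrete. A short sketch with Python's built-in codecs:

```python
euro = "\u20ac"  # the euro sign

print(euro.encode("iso8859_15"))   # ISO 8859-15 has a code position for it

try:
    euro.encode("iso8859_1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False         # ISO 8859-1 has no euro sign
print(latin1_has_euro)
```

ISO 8859-15 places the euro at position 164, which in ISO 8859-1 holds the generic currency sign ¤.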
Other 8-bit codes All the character codes discussed above are "8-bit codes": eight bits are sufficient for presenting the code numbers, and in practice the encoding (at least the normal encoding) is the obvious (trivial) one where each code position (thereby, each character) is presented as one octet (byte). This means that there are 256 code positions, but several positions are reserved for control codes or left unused (unassigned, undefined). To illustrate that other kinds of 8-bit codes can be defined than extensions to ASCII, we briefly consider the EBCDIC code, defined by IBM and once in widespread use on "mainframes" (and still in use). EBCDIC contains all ASCII characters but in quite different code positions. As an interesting detail, in EBCDIC the normal letters A - Z do not all appear in consecutive code positions. EBCDIC exists in different national variants.
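The gaps in the EBCDIC alphabet can be located with a few lines of Python. This sketch assumes code page 037, one EBCDIC variant for which Python ships a codec (`cp037`); other variants may differ in detail.

```python
import string

# Code position of each upper-case letter in the cp037 variant of EBCDIC.
ebcdic = {letter: letter.encode("cp037")[0] for letter in string.ascii_uppercase}

for letter in "AIJRSZ":
    print(letter, "ASCII", ord(letter), "EBCDIC", ebcdic[letter])

# Find the places where consecutive letters are not in consecutive positions.
gaps = [
    (a, b)
    for a, b in zip(string.ascii_uppercase, string.ascii_uppercase[1:])
    if ebcdic[b] != ebcdic[a] + 1
]
print(gaps)
```

The alphabet falls into three runs (A to I, J to R, S to Z), so a test such as "is this code between the code for A and the code for Z" does not identify letters in EBCDIC the way it does in ASCII.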
ISO 10646, UCS, and Unicode ISO 10646 (officially: ISO/IEC 10646) is an international standard, by ISO and IEC. It defines UCS, Universal Character Set, which is a very large and growing character repertoire, and a character code for it. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often. It contains, among other things, all characters in the character repertoires discussed above. The number of the standard intentionally reminds us of 646, the number of the ISO standard corresponding to ASCII.
Unicode, the more practical definition of UCS Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it. ISO 10646 is more general (abstract) in nature. The ISO 10646 and Unicode character repertoire can be regarded as a superset of most character repertoires in use. However, the code points of characters are usually different from those in the older character codes.
Encodings for Unicode The "native" Unicode encoding, UCS-2, presents each code number as two consecutive octets m and n so that the number equals 256m+n. The code number is presented as a two-byte integer. This is a very obvious and simple encoding. However, it can be inefficient in terms of the number of octets needed. If we have normal English text or other text which contains ISO Latin 1 characters only, the length of the Unicode encoded octet sequence is twice the length of the string in ISO 8859-1 encoding. For this reason, encodings other than UCS-2 are used.
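The 256m+n rule can be written out directly. The following is a minimal sketch with hypothetical function names; it handles only code points that fit in two octets, which is all that UCS-2 covers.

```python
def ucs2_encode(code_points):
    """Encode each code point (0..65535) as two octets m, n with value 256*m + n."""
    octets = []
    for cp in code_points:
        if not 0 <= cp <= 0xFFFF:
            raise ValueError("code point does not fit in two octets")
        m, n = divmod(cp, 256)
        octets.extend([m, n])
    return octets


def ucs2_decode(octets):
    """Rebuild the code points from consecutive octet pairs."""
    return [256 * m + n for m, n in zip(octets[0::2], octets[1::2])]


text = "Ohio"
encoded = ucs2_encode([ord(ch) for ch in text])
print(encoded)
print(len(encoded), len(text.encode("latin-1")))
```

For plain English text every other octet is zero, which is the inefficiency the slide describes: eight octets here against four in ISO Latin 1.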
UTF-8 Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to six octets, each of which is in the range 128 through 255. (Since RFC 3629 restricted UTF-8 to the Unicode code space, at most four octets are used.) This means that in a sequence of octets, octets in the range 0 through 127 ("bytes with most significant bit set to 0") directly represent ASCII characters, whereas octets in the range 128 through 255 ("bytes with most significant bit set to 1") are to be interpreted as really encoded presentations of characters.
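The "relatively complicated method" is compact once written as code. The sketch below follows the current definition of UTF-8 (at most four octets); the function name is hypothetical, and for brevity it does not reject surrogate code points as a strict encoder would.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 octets (up to four, as in current UTF-8)."""
    if cp < 0:
        raise ValueError("code point out of range")
    if cp < 0x80:                      # 0xxxxxxx: ASCII, presented as such
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


for ch in "a\u00e4\u2030\u9ec3":       # a, ä, ‰, 黃
    print(hex(ord(ch)), [format(b, "08b") for b in utf8_encode(ord(ch))])
```

Every octet of a multi-octet sequence has its most significant bit set, so an ASCII octet can never be mistaken for part of another character.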
Chinese There are too many Chinese characters for an 8-bit character code. For Unicode, this is no problem in principle. But there is a mess of competing standards and national interests at work.
国家标准 GUOJIA BIAOZHUN National Standard (for PRC)
GB (organized by row)
  01 special symbols (94)
  02 paragraph numbers (72)
  03 GB (94, = ISO 646-CN)
  04 hiragana (83)
  05 katakana (86)
  06 Greek (48)
  07 Cyrillic (66)
  08 pinyin accented vowels (26) and zhuyin symbols (37)
  09 box and table drawing pieces (76)
  hanzi level 1 (3755, ordered by pinyin)
  hanzi level 2 (3008, ordered by radical, then stroke)
国家标准 GUOJIA BIAOZHUN Chinese characters tend to have simplified forms. There are variants of GB that replace simplified characters by government-sanctioned traditional characters. GBK has lots more characters; in repertoire it is essentially Unicode's Chinese characters, with a few extra bits.
Big-5 Not the Taiwan national standard, but much more widely used. Lots of characters (Taiwanese writers don’t use simplified characters much). CNS is based on Big-5 and has 48,711 characters (684 non-hanzi).
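The competing standards assign different octets to the same character. The sketch below encodes one common hanzi with Python's `gb2312`, `big5` and `utf-8` codecs; the choice of character is arbitrary.

```python
hanzi = "\u4e2d"  # 中, a character present in all three repertoires

for codec in ("gb2312", "big5", "utf-8"):
    octets = hanzi.encode(codec)
    print(codec, [hex(b) for b in octets])

# Decoding GB octets as if they were Big-5 does not give back the original character.
gb_octets = hanzi.encode("gb2312")
misread = gb_octets.decode("big5", errors="replace")
print(misread == hanzi)
```

A receiver therefore has to know which standard the sender used; guessing wrong yields other characters rather than an error message.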
Hong Kong GCCS (Government Chinese Character Set) Hong Kong GCCS (Government Chinese Character Set) was created in 1994 by Hong Kong (now HKSAR) as an extension to the Big5 character set, because: 1) Big5 does not include characters necessary for non-specialist use in Hong Kong, since it was invented in Taiwan, and 2) there wasn't a standard extension to Big5 to address those needs, only a myriad of incompatible vendor extensions. GCCS has 3,049 characters, which includes the following kinds: 1) placenames in Hong Kong, 2) Cantonese dialectal characters, 3) Japanese, 4) simplified Chinese (jiantizi), and 5) graphic variants (yitizi) of characters already in Big5. Of these 3,049 characters, about 1,500 are not in Unicode, which presents some conversion problems. (It also means that Unicode is insufficient for Hong Kong use, and does not support Cantonese.)
Korean KS X 1001:1992 Hangul + Hanja. Hangul are syllables with internal structure: they are made of jamo. Hanja are Chinese characters. Perhaps fortunately, they are less used than they were.
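The internal structure of Hangul syllables can be made visible in code. Note the assumption: the sketch below relies on the arithmetic layout of precomposed syllables in Unicode (starting at U+AC00), not on the arrangement used in KS X 1001; the function name is hypothetical.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:                      # tail index 0 means "no final consonant"
        jamo += chr(T_BASE + tail)
    return jamo


for syllable in "\ud55c\uae00":   # the two syllables of the word "Hangul"
    parts = decompose_hangul(syllable)
    print(syllable, [hex(ord(j)) for j in parts])
    # Cross-check against the canonical decomposition from the standard library.
    print(parts == unicodedata.normalize("NFD", syllable))
```

Each syllable breaks into a leading consonant, a vowel, and an optional final consonant, which is the structure the slide refers to.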
Encoding methods of CJKV
ISO-2022
A modal encoding: at any point the stream is either in 1-byte mode or in 2-byte mode.
Designator sequence: indicates which 2-byte character set is meant when 2-byte mode is on.
Single shift sequence: turns on 2-byte mode for one character only.
Shifting character: toggles between 1-byte and 2-byte mode.
Escape sequence: requires a specific 2-byte character set and then invokes it.
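The modal behaviour can be observed with ISO-2022-JP, one widely used profile of ISO-2022 for which Python ships a codec (`iso2022_jp`). This is only an illustration of mode switching by escape sequence; other ISO-2022 profiles use other sequences and shifting mechanisms.

```python
text = "abc\u65e5\u672c"  # "abc" followed by the two kanji 日本

encoded = text.encode("iso2022_jp")
print(encoded)

ESC = b"\x1b"
# The stream starts in 1-byte (ASCII) mode, so "abc" is sent as such.
# An escape sequence then switches to a 2-byte set, and a second escape
# sequence switches back to ASCII before the stream ends.
print(encoded.count(ESC))

# Every octet stays below 128, so the stream survives 7-bit channels.
print(all(b < 128 for b in encoded))
```

The price of this design is state: the meaning of an octet depends on which escape sequence was seen last, so a decoder cannot start reading in the middle of a stream.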
Encoding methods of CJKV EUC More elaborate finite-state machine for shifting to different parts of the set.
Credits gtce-icu20-te2.pdf ware/info/cjk-codes/GB.html Anna, Martin, Peggy, Shravan, Mary, Kyuchul, Henry Thompson, Richard Tobin