Internationalization: An Introduction. Part I: Unicode and Character Encodings
License This presentation and its associated materials are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License. You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice. [Derivative works may be permitted with permission of the author.] This work is copyright © 2008-2012 by Addison P. Phillips
Who is this guy? Globalization Architect, Lab126 (we make the technology behind the Kindle). Chair, W3C Internationalization WG.
Character Encodings The basics of text processing in software. Probably the most recognizable and most common internationalization activity is “enabling”.
The Biggest Source of Woe “Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.” Glen Perkins Globalization Architect Character encodings are probably the single largest initial barrier (when considering remedial work) and the single largest source of support issues (when considering code in the field) related to internationalization. Understanding how text is encoded, what the various options are, and how to handle encodings is critical to producing internationalized software. This tutorial is not intended to replace the basic Unicode tutorials, but the information presented here is of such a critical nature that we’ll necessarily cover some of the same concepts and terminology.
A Lot of Jargon Multibyte, kanji, variable width, double-byte language, wide character, character encoding, coded character set, bidi or bidirectional, glyph, character, code unit, Unicode, extended ASCII, ANSI, OEM, encoding agnostic Characters, encodings, and their terminology inhabit a dense thicket of jargon, misinformation, and outright confusion. Just using the correct terminology greatly helps when dealing with the issues involved. For a comprehensive look at the terminology presented here, see: The Character Model for the World Wide Web {aka CharMod}: http://www.w3.org/TR/CharMod Unicode: http://www.unicode.org
ÀéçЉД文字निخ 011010010101001010010101011101010101010110100010101011110101010111011011
01000001 (0x41) Code Unit byte Underneath everything else, computer systems are nothing more than a collection of little switches—tiny switches capable of holding just two positions: “1” for on and “0” for off. Each possible state is called a bit and bits are all that computers really “understand” or know about. As a convenience for both the machine and the human, we usually consider the bits in groups rather than in isolation. For various historical reasons, the most common collection of bits considered as a unit is called a byte or an octet and takes eight bits. Computers were originally designed as calculating machines, so it’s common and useful to think of a series of bits as representing a number. There are 256 unique combinations of bits in a byte. Typically, these represent the numbers from 0 to 255, with all zeros representing 0 and all ones representing 255. A unit of physical storage and information interchange Other code units exist (16-bit, 32-bit, etc.)
Glyph À àààààààààààààà नि じ A خ A single shape (in text)
Grapheme À नि じ A خ A single visual unit of text: the smallest abstract unit of meaning in a writing system.
À Character न ि नि じ A خ A single logical unit of text Ni = na + i (vowel sign) Zi = si + dakuten (voicing mark)
À Character Set Abstract Character Repertoire A set of characters
À Coded Character Set Code Point U+00C0 A set of characters in which each character is assigned a numeric identifier.
À Character Encoding Form U+00C0 -> 11000011 10000000 (0xC3 0x80) UTF-8 Maps code points to code units
À U+00C0 ÀéçЉД文字निخ 11000011 10000000 0xC3 0x80 UTF-8
*(the most important slide in this presentation) In memory, on disk, on the network, etc. All text has a character encoding When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.
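A minimal Python 3 sketch of that diagnostic (the string and byte values here are illustrative, not taken from the slides): get the raw bytes, then check whether they decode under the encoding you assumed.

```python
# The same five bytes, interpreted under two different encodings.
data = "café".encode("utf-8")        # b'caf\xc3\xa9' -- 4 characters, 5 bytes

print(data.decode("utf-8"))          # café   (bytes match the assumed encoding)
print(data.decode("iso-8859-1"))     # cafÃ©  (same bytes, wrong assumption: mojibake)
```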
Counting Things Be aware of whether you need to count graphemes, characters, or bytes (code units): Is the limit “screen positions”, “characters”, or “bytes of storage”? Should you be using a different limit? Which one are you actually counting? varchar(110) यूनिकोड (4 glyphs) य ू न ि क ो ड (7 characters) E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes)
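A sketch of the three different counts in Python 3, using the word from the slide. Counting grapheme clusters is assumed to use the third-party regex package (the standard library only counts code points).

```python
import regex   # third-party; assumed installed here for grapheme-cluster counting

s = "यूनिकोड"

print(len(regex.findall(r"\X", s)))   # 4  -- grapheme clusters (the slide's "glyphs")
print(len(s))                         # 7  -- characters (code points)
print(len(s.encode("utf-8")))         # 21 -- bytes of UTF-8 storage
```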
Common Encoding Problems Tofu hollow boxes Mojibake garbage characters Question Marks (conversion not supported) There are different symptoms for what can go wrong when working with text and encodings. The three most basic are: 1. “Tofu”. These are hollow boxes, one per character, where one expects to see the specific characters. These usually represent a font problem: it is the computer’s way of telling you “I know what this character is, but don’t have a picture of it to show you.” Tofu isn’t always a bug in software, even though it is annoying. 2. “Question Marks” (which sometimes are underscores or blanks), one per character where one expects to see text. This usually represents a problem converting text from one character encoding to another. It is the computer’s way of telling you “this character doesn’t exist in the encoding I was converting to, so I replaced it with this thing.” Question marks aren’t always a bug in software, although they usually represent a place where better encoding support could be supplied. If you remember our “animal picture”, this might take place when reading data from, say, the text file into a template in a different encoding. 3. Between tofu and question marks is “mojibake” (moh gee bah kay), which is a Japanese word that means approximately “screen garbage”. Mojibake happens when you look at text in one encoding, interpreting the bytes as if they were in a different encoding.
It can happen to anyone…
Tofu Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example) Not usually a bug: it’s a display problem Can mask or masquerade as character corruption.
Mojibake: When Good Characters Go Bad
Sources of Mojibake View text using the wrong encoding Apply a transfer encoding and forget to remove it Convert to an encoding twice Convert to or from the wrong encoding Overzealous escaping Conversion to entities (“entitization”) Multiple conversions
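Two of these recipes, sketched in Python 3 with an illustrative string (not from the slides): viewing UTF-8 bytes through the wrong encoding, and then re-encoding the damaged result.

```python
s = "À é 文字"

# 1. View text using the wrong encoding:
wrong = s.encode("utf-8").decode("windows-1252")
print(wrong)                     # something like Ã€ Ã© æ-‡å­—

# 2. Convert to an encoding twice (double encoding):
baked = wrong.encode("utf-8")    # the garbage is now itself encoded as UTF-8
print(baked.decode("utf-8"))     # still mojibake; the original is not recoverable
```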
Character Encoding Forms Their theory, structure, and use
EBCDIC It’s worth mentioning that not all modern encoding standards have ASCII as their base. This is the EBCDIC standard (and one of its regional variations), which was developed by IBM and is still used on many IBM computers. It has more or less the same characters as ASCII, but has assigned them to totally different numbers. Note that the EBCDIC character encoding encodes the same printable characters as the US-ASCII character encoding does. That is, they share the same character repertoire, even though they assign different integer values to those characters. Thus, they encode the same repertoire, but are different coded character sets and different character encodings.
ASCII 7 bits = 2^7 = 128 characters Enough for “U.S. English” The most basic character set for most computer systems is the US-ASCII set (ANSI X3.4, ISO 646, ECMA-6). The US-ASCII set contains the uppercase and lowercase letters used in English, as well as the digits zero through nine, a collection of punctuation symbols, and a few control characters, such as NULL, escape, and BEL. Here is the layout of the US-ASCII character set, showing the integer values assigned to each character. The US-ASCII character set contains only 128 unique characters. This means that it only requires 7 bits. While this character set was enough to support U.S. English on early teletypes, terminals, and other devices, it’s clear that it doesn’t support very many other languages. Since most computers use eight-bit bytes, a carefully written program could use the additional 128 characters between 0x80 and 0xFF to represent additional characters. These encodings are sometimes called extended ASCII, although that term is highly inexact and should be avoided (there are literally thousands of encodings that are “extended ASCII” encodings of some sort).
Latin-1 (ISO 8859-1) ASCII for characters 0x00 through 0x7F Accented letters and other symbols 0x80 through 0xFF One common character encoding that extends US-ASCII is Latin-1. This encoding was standardized by the International Organization for Standardization or “ISO” as ISO 8859-1. Latin-1 assigns an additional 96 characters over and above those assigned by US-ASCII. It also reserves a range of characters called the “C1 controls” that match the 32 control characters in the lower, ASCII, part of the encoding (only with the 8th bit set). Latin-1 can adequately represent most Western European languages, such as German, Spanish, Italian, Swedish, and so forth. It also includes some symbols, such as the copyright symbol, the “cents” sign, and some others, that are useful to software programs or for certain languages.
Code Page Originally an IBM character encoding term. IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encoding forms as “code pages”. Microsoft borrowed code pages to create PC-DOS. Microsoft defines two kinds of code pages: “ANSI” code pages are the ones used by Windows GUI programs. “OEM” code pages are the ones used by command shell/command line programs. Neither “ANSI” nor “OEM” refer to a particular encoding standard or standards body in this context. Avoid the use of ANSI and OEM when referring to encodings.
windows-1252 Windows’s encodings (called “code pages”) are generally based on standard encodings—plus some additional characters.
Beyond Single Byte Encodings So far we’ve been looking at single-byte encodings: one byte per character 1 byte = 1 character (= 1 glyph?) 256 character maximum Good enough for most alphabetic languages Some languages need more characters. What about the “double-byte” languages? Don’t those take two bytes per character? 丏丣並 So far we’ve seen character encodings that assign each distinct 8-bit byte a unique character value. These encodings can store up to 256 total characters, including any control characters, since an 8-bit byte has 256 potential values (0x00 through 0xFF). Obviously, some languages must have more extensive needs, since most developers have heard of something called “double-byte characters” or “double-byte languages”. What happens if a language needs more than 256 characters to represent the language (or more than 128 characters in addition to US-ASCII, since most encodings are based on the US-ASCII repertoire)?
Beyond Single-Byte Escape sequences to select another character set Example: ISO 2022 uses escape sequences to select various encodings Use a larger code unit (“wide” character encoding) Example: IBM DBCS code pages or Unicode UTF-16 2^16 = 64K characters 2^32 = 4.2 billion characters Use a variable-width encoding Variable width encodings use different numbers of code units to represent different types of characters within the same encoding form.
Multibyte Encodings One or more bytes per character 1 byte != 1 character May use 1, 2, 3, or 4 bytes per character -> maximum number of bytes per character varies by encoding form. May use shift or escape sequences May encode more than one character set Single-byte encodings are a special case of multibyte! The majority of character encodings that use more than one byte per character are called multibyte encodings. A multibyte encoding can use one, two, three, or more bytes per character. The exact number of bytes per character is usually variable for the encoding. In other words, one character might take one byte and a different character might take two bytes to encode (and yet another might require three, and so on). The maximum number of bytes per character varies according to the specific encoding (so some multibyte encodings require a maximum of only two bytes, for example). Single-byte encodings, such as US-ASCII or Latin-1, are a special case of multibyte in which the largest character requires only one byte. There are various types of multibyte encodings. We won’t examine them all here, but we’ll look at a couple of encodings so you can see how they work. Multibyte Encoding: “variable-width” encoding that uses the byte as its code unit.
Simple Multibyte Encoding Forms Specific byte ranges encode characters that take more than one byte. A “lead byte” One or more “trailing bytes” Code point != code unit A 0x41 single byte あ 0x82 0xA0 lead byte trail byte The simplest multibyte encodings are, of course, single-byte encodings. The next step up in complexity are the “simple” multibyte encodings. These encodings use a specific range of bytes to represent single byte characters (such as the ASCII range) and different byte values to encoding characters that require more than one byte each. Usually a multibyte character is introduced by a lead byte, which is a byte value that falls into a specific range. The lead byte is followed by one or more trail bytes. Unlike with single-byte encodings, multibyte encodings and multibyte character sets may not match up exactly. That is, the code point in the character set and the code units (that is, “the byte sequence”) used to represent the code point in the character encoding may not be the same.
Shift-JIS: A Multibyte Encoding In order to reach more characters, Shift-JIS characters start with a limited range of “lead bytes” These can be followed by a larger range of byte values (“trail byte”) The Shift-JIS character encoding is a good example of a simple multibyte encoding. Shift-JIS is a character encoding of the JIS X 0201 and JIS X 0208 character sets, which, in turn, are used to represent the Japanese language. JIS stands for “Japanese Industrial Standard”. Here is a picture of the Shift-JIS encoding. The gray shaded values represent the “lead bytes” for Shift-JIS. Each lead byte is then followed, in the case of Shift-JIS, by exactly one trailing byte. In the case of Shift-JIS, the trailing bytes fall into the range 0x40 through 0xFE. Notice that the characters in the range 0xA1 through 0xDF are single-byte characters.
Shift-JIS Here are two of the additional “pages” of characters from JIS X 208 addressed by Shift-JIS. Lead-byte 0x82 assigns a number of kana characters, as well as some familiar looking characters: these look like the ASCII letters and numbers! The lead byte 0xE0 is a different example. In this case, there are a variety of kanji (or Han ideographic characters).
Shift-JIS Lead bytes can be trail byte values Trail bytes include ASCII values Trail bytes include special values such as 0x5C (“\”) char *pos = strchr(mybuf, '@'); Notice a few things about Shift-JIS that might complicate processing: The lead byte range is included in the trailing byte range! This means that a random pointer into a Shift-JIS byte array (a char*, for you C programmers) can’t tell if it is looking at a lead or trail byte. This is (almost-but-not-quite*) always true of (non-single-byte) multibyte encodings. The trailing byte range includes single-byte characters! This means that the same byte value might be included in a single-byte or a multibyte character. This isn’t true of all multibyte encodings. Some, such as the EUC encodings (Extended Unix Code: guess which operating system you often find this family of encodings on?) reserve a range of bytes for multibyte values which do not overlap at all with single-byte characters. In this case, the trailing byte range includes some characters, such as 0x5C (backslash), which have “special meaning” to some programs such as the shell or to the C compiler. This may require special handling on the part of the developer or of code in order to avoid problems. * One well-known exception is Unicode UTF-8, about which more later.
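The danger in that snippet can be shown without C. A minimal Python 3 sketch (the text is illustrative): the Shift-JIS trail byte 0x40 is also ASCII '@', so a byte-level search reports a match that isn't really in the text.

```python
# IDEOGRAPHIC SPACE (U+3000) encodes in Shift-JIS as the two bytes 0x81 0x40.
data = "x\u3000y".encode("shift_jis")

print(data)                   # b'x\x81@y' -- note the 0x40 ('@') trail byte
print(data.find(b"@"))        # 2  -- a naive byte search "finds" an '@' inside a character
print("x\u3000y".find("@"))   # -1 -- searching decoded characters gives the right answer
```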
More Complex Multibyte Systems Example: IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters] Example: ISO 2022 [escape sequence changes character set being encoded]
Ad hoc and Font Encodings
Common Encoding Conversion Tools and Libraries [Diagram: Templates (ISO 8859-1), Content (UTF-8), and Data (Shift_JIS) feed a process that produces Output (HTML, XML, etc.)] iconv (Unix), ICU (C, C++, Java), perl Encode, Java (native2ascii, IO/NIO), etc. Document formats often require that a single character encoding be used for all parts of the document.
Encoding Conversion as Filter [Diagram: the same multilingual text (ÀàС£, детски, »èçينس文字, 文字化け) converted between ISO 8859-1, UTF-8, and Shift_JIS; characters missing from the target encoding come out as ?] ? (0x3F) is the replacement character for ISO 8859-1 Encoding conversion acts as a “filter” Replacement characters (“question marks”) replace characters from the source character set that are not present in the target character set.
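A small Python 3 sketch of the filter effect (the sample string is illustrative): characters with no mapping in the target encoding become the replacement character.

```python
s = "À детски 文字"

latin1 = s.encode("iso-8859-1", errors="replace")
print(latin1)                        # b'\xc0 ?????? ??' -- ? (0x3F) marks every loss
print(latin1.decode("iso-8859-1"))   # À ?????? ??  -- the original data is gone for good

# A Unicode encoding can represent everything, so nothing is filtered out:
print(s.encode("utf-8").decode("utf-8"))   # À детски 文字
```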
Too Many Fish in the Sea Need for more converters and conversion maps Difficulty of passing, storing, and processing data in multiple encodings Too many character sets… Over time, more and more character encodings came into being. There were (and are) hundreds and hundreds of encodings, character sets, and minor vendor variations of each. Conversion from one to another required customized mapping tables, which might be very difficult to create and maintain: the mojibake problem was getting worse and worse. It got very hard to create, store, pass, and process data, especially since nothing seemed to be truly generic. This painful state of affairs was called “code page hell” (a code page is an IBM—and later Microsoft—term for a character encoding), and it made writing internationalized software very difficult.
Unicode / ISO-10646 So, back in the late 1980’s, some folks came up with the idea for Unicode, the Universal Character Set (UCS), to solve the mojibake problem. The first version of Unicode was ready by the early 1990’s and has continued to evolve, adding more characters, ever since.
The Idea Behind Unicode A universal character set to encode the world’s scripts. Encodes characters (not glyphs). Consistent interchange and interpretation: one set of rules for all text, everywhere The idea behind Unicode, essentially, is to solve the mojibake problem. It created a whole new encoding standard that encompasses all the characters in all the writing systems in the world, not just those covered by other encoding standards, but also those that haven’t yet had encoding standards created for them. And it gives every character its own unique bit pattern. This way, if you know everything is in Unicode, there’s no confusion as to what character a particular bit pattern stands for—there’s only one choice. And because every character is in Unicode, you can use it for internal processing and still interoperate with systems that use one of the other encodings (which we call legacy encodings). Of course, this means using more than one byte per character, but on modern machines with vast amounts of memory and storage space, that’s a small price to pay. Unicode also meant that, for the first time, there was a single “pivot point” that other encodings could use. Prior to Unicode, mapping between character encodings required substantial research and there was no guarantee that any two systems would do it quite the same way. I personally experienced this in the early 1990s, when the sole source for mapping a particular IBM multibyte encoding was an engineer in Colorado who had a PostScript file that defined the mapping. I got his phone number by writing to Ken Lunde (the expert on Asian character sets and encodings; you will want to own his CJKV Information Processing instead of writing him personally) and called him up to get the file. I then had to build a mapping to my target encoding. Such arcana was not the exception in this era before the Web and especially before Unicode.
Unicode: the Universal Character Set An organized collection of characters. A “coded character set”: each character has a code point aka Unicode Scalar Value (USV) U+0041 <= hex notation
Unicode or ISO 10646? Unicode and ISO 10646 are maintained in sync. Unicode is maintained by an industry consortium. ISO 10646 is maintained by the ISO. Unicode is an international standard. By mutual agreement, it is identical to ISO 10646 and sometimes one sees the ISO name referred to instead. Actually, “Unicode” refers specifically to the Unicode Consortium’s work, which is broader than just the character set and its encodings. Unicode has an address space of 21 bits, allowing it to address up to 1.1 million characters. These characters are divided into regions called planes. Each plane contains 65,536 code points (the last two in each plane, U+#FFFE and U+#FFFF, where # equals the plane number, are noncharacters). There are 17 such planes, starting at zero and ending at 0x10. The first plane (plane 0) is called the Basic Multilingual Plane or BMP. 99.9% of the characters in average data walking about on the Earth are encoded in the BMP. Characters outside the BMP are called supplemental characters and reside in supplemental planes. (Wags sometimes call these the astral planes.)
Unicode Code space of up to 0x10FFFF (about 1.1 million) characters Currently encodes 110,116 characters
The Unicode Standard Core Standard (TUS) Reports (http://www.unicode.org/reports) Unicode Standard Annexes (UAX) Unicode Technical Standards (UTS) Unicode Technical Reports (UTR) Unicode Character Database (UCD) Unicode Technical Notes (UTN) [not part of standard]
Encodes the World’s Scripts Modern scripts Historical scripts Ancient and extinct scripts Minority languages Some fun stuff too!
Characters, Not Glyphs AaAa AaAa AaAa AaAa (the same characters shown in many different typefaces: many glyphs, one set of characters)
Characters, Not Glyphs: Han Unification “Unihan” unifies abstract Han ideographs, even if specific writing traditions (such as Japanese kanji vs. Simplified Chinese) appear different.
Encoding Work Continues Unicode 6.1 added 732 characters, including several new scripts. … but the pace of change has slowed and most living scripts are encoded.
Planes Unicode is divided into “planes” of code points 17 planes (0 through 0x10) 64K (65,536) code points per plane Plane 0 is called the Basic Multilingual Plane (BMP). Planes 1 through 0x10 are called supplementary planes: Plane 1: supplementary multilingual plane (SMP) Plane 2: supplementary ideographic plane (SIP) Plane 14 (0x0E): supplementary special-purpose plane (SSP) Planes 15,16: private use Supplementary planes: Contain additional, rarer characters Plane 1: Rare, historical, or extinct scripts: Cuneiform, Gothic, etc. Plane 2: Less common Chinese/Japanese/Korean ideographs
Scripts and Blocks Most characters belong to a script, a distinct writing system. Some characters, such as many of the punctuation characters, are used by multiple scripts Characters are assigned in Unicode to blocks. Most blocks are used to encode (and named for) a specific script. Some scripts have multiple blocks (Latin, Han ideographs)
Unicode Blocks Ranges of code points allocated together for assignment. Not all code points in a block are assigned (some reserved for future assignment)
Unicode Blocks See: http://www.unicode.org/charts Block names are stabilized and not always fully indicative of block usage Example: Phags-pa block
Various Character Types Unicode Controls Compatibility Characters Byte Order Mark Replacement Character Combining Marks Variation Selectors Private Use Surrogates
Å Combining Marks A + ˚ = Å a + ˆ + . = ậ a + . + ˆ = ậ Composition can create “new” characters Base + non-spacing (“combining”) characters A + ˚ = Å U+0041 + U+030A = U+00C5 a + ˆ + . = ậ U+0061 + U+0302 + U+0323 = U+1EAD a + . + ˆ = ậ U+0061 + U+0323 + U+0302 = U+1EAD Å Some characters can be represented in more than one way in Unicode. For example, the LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5) can be composed from the base letter LATIN CAPITAL LETTER A (U+0041) followed by COMBINING RING ABOVE (U+030A). Some characters, such as U+1EAD, are composed of more than one part and can be composed in more than one way.
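A quick Python 3 check of the first equivalence above, using the built-in unicodedata module: the precomposed and combining-mark forms are different code point sequences but normalize to the same thing.

```python
import unicodedata

precomposed = "\u00C5"          # Å   LATIN CAPITAL LETTER A WITH RING ABOVE
combining   = "\u0041\u030A"    # A + COMBINING RING ABOVE

print(precomposed, combining)                  # both display as Å
print(precomposed == combining)                # False -- different code point sequences
print(len(precomposed), len(combining))        # 1 2
print(unicodedata.normalize("NFC", combining) == precomposed)   # True -- canonically equivalent
```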
glyph = base + vowel modifier Combining Marks: Thai Unicode คืออะไร? คื = ค + ื glyph = base + vowel modifier The Latin script example on the previous page may strike you as “interesting but irrelevant”. After all, most keyboarding systems and software for Western European and Far East Asian languages produce only precomposed characters, so you might never have seen a combining mark in everyday data. Some languages, however, cannot be written without them: the Thai vowel sign shown here is a combining mark attached to its base consonant.
Combining Marks: Devanagari यूनिकोड क्या है? यू नि को ड य ू न ि क ो ड न + ि = नि This phrase is “What is Unicode?” from the Unicode web site.
Combining Marks: Tamil யூனிக்கோடு என்றால் என்ன? யூனிக்கோடு கோ க + ோ U+0B95 U+0BCB As with the Thai and Devanagari examples, Tamil syllables are built from a base consonant plus a combining vowel sign: the visual unit கோ is encoded as the two code points U+0B95 + U+0BCB.
Byte Order Mark (BOM) U+FEFF Used to indicate the “byte-order” of UTF-16 code units 0xFE FF; 0xFF FE Also used as a Unicode signature by some software (Windows’s Notepad editor, for example) for UTF-8 0xEF BB BF Appears as a character or renders as junk in some formats or on some systems. Has an annoying secondary meaning: “zero width non-breaking space”
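A sketch of BOM handling in Python 3 (the byte string is illustrative): the plain utf-8 codec keeps the signature as a character, the utf-8-sig codec strips it, and for UTF-16 the BOM records byte order.

```python
data = b"\xef\xbb\xbfhello"             # UTF-8 bytes with a leading signature, e.g. as saved by Notepad

print(repr(data.decode("utf-8")))       # '\ufeffhello' -- the BOM survives as U+FEFF
print(repr(data.decode("utf-8-sig")))   # 'hello'       -- the signature is removed

print("A".encode("utf-16"))             # b'\xff\xfeA\x00' on a little-endian machine: BOM + code unit
```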
The Replacement Character U+FFFD Indicates a bad byte sequence or a character that could not be converted. Equivalent to “question marks” in legacy encoding conversions � there was a character here, but it is gone now
Compatibility Characters Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings: ①②③45Ⅵ ¾Lj¼Nj½dž ︴︷︻︽﹁﹄ ヲィゥォェュ゙ ﺲ ﺳ ﻫ ﺽ ﵬ ﷺ fi fl ffi ffl ſt ﬔ Compatibility characters include presentation forms legacy encoding: a term for non-Unicode character encodings.
Half and Full Width Forms Compatibility characters for East Asian legacy encodings that vary in character “width” Half width forms ヲァィゥェォャ Abcdefh Full width forms ァアィイゥ ぁあぃいぅ ABCDefg
Variation Selectors UTS#37 defines the Ideographic Variation Database (IVD)
Unicode Controls
Private Use
Surrogate Code Points Reserved code points in two blocks needed for the UTF-16 character encoding. Don’t encode characters Never to be used as characters on their own
Unicode Properties code point name character class combining level bidi class case mappings canonical decomposition mirroring default grapheme clustering
The Unicode Character Database (UCD) ӑ (U+04D1) CYRILLIC SMALL LETTER A WITH BREVE letter non-combining left-to-right decomposes to U+0430 U+0306 Ӑ U+04D0 is uppercase (and titlecase)
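The same lookups can be made programmatically. A sketch using Python's built-in unicodedata module (its data reflects whatever Unicode version the Python build ships with):

```python
import unicodedata as ud

ch = "\u04D1"                            # ӑ
print(ud.name(ch))                       # CYRILLIC SMALL LETTER A WITH BREVE
print(ud.category(ch))                   # Ll -- letter, lowercase
print(ud.combining(ch))                  # 0  -- not a combining mark
print(ud.bidirectional(ch))              # L  -- left-to-right
print(ud.decomposition(ch))              # 0430 0306
print(ch.upper(), f"U+{ord(ch.upper()):04X}")   # Ӑ U+04D0
```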
Unicode's Encoding Forms
Unicode Encoding Forms UTF-32 Uses 32-bit code units. All characters are the same width. UTF-16 Uses 16-bit code units. BMP characters use one 16-bit code unit. Supplementary characters use two special 16-bit code units: a “surrogate pair”. UTF-8 Uses 8-bit code units (bytes!) It’s a multi-byte encoding! Characters use between 1 and 4 bytes. ASCII is ASCII in UTF-8 There are three main encodings of Unicode: UTF-32, UTF-16, and UTF-8. UTF-32 uses 32-bit code units. Since Unicode code points only use up to 21 bits, this means that UTF-32 can store every Unicode character in a single code unit. UTF-16 uses 16-bit code units. A sixteen-bit code unit can address a full plane of Unicode, but can’t reach all 21 bits. To provide UTF-16 with the ability to address the full range of Unicode, there are some special code points, called surrogates, reserved in the Basic Multilingual Plane. These are divided into two sections: High Surrogates and Low Surrogates. A high surrogate followed by a low surrogate is called a surrogate pair. Surrogates have no other function: they exist solely for encoding supplemental characters using UTF-16. Since surrogate code points never do anything else, UTF-16 doesn’t have the lead- and trail-byte problems found with other encodings. UTF-8 is a multibyte encoding. Its code units are, as the name implies, eight bits long. A character encoded as UTF-8 can take one, two, three, or, at most, four bytes. One of the interesting design features of UTF-8 is that the US-ASCII characters (the 7-bit ones) are encoded as themselves in UTF-8. That is, a 7-bit US-ASCII file is also a UTF-8 file. All other bytes in a UTF-8 file are non-ASCII (that is, they are all greater than 0x7F). This means that certain kinds of processes might be able to handle UTF-8 data when it is contained in a well-known ASCII-based file format.
Unicode Encodings Compared A (U+0041) UTF-32: 0x00000041 UTF-16: 0x0041 UTF-8: 0x41 À (U+00C0) UTF-32: 0x000000C0 UTF-16: 0x00C0 UTF-8: 0xC3 0x80 ቐ (U+1251) UTF-32: 0x00001251 UTF-16: 0x1251 UTF-8: 0xE1 0x89 0x91 𐌸(U+10338) 0x00010338 0xD800 0xDF38 0xF0 0x90 0x8C 0xB8 Let’s look at how the different encodings of Unicode compare. Consider the last two characters above. The first is an Ethiopic character in the Basic Multilingual Plane. This character has a Unicode Scalar Value of 0x1251, which we write using the “U+” notation as “U+1251”. Both UTF-32 and UTF-16 represent it using a single code unit. In UTF-8 this character uses three code units (8-bit bytes). The second character is one from the Gothic script, an archaic script encoded in the first supplemental plane of Unicode. This character has the Unicode Scalar Value of U+10338. UTF-32 still uses a single code unit. UTF-16 uses a surrogate pair (two 16-bit code units). And UTF-8 uses four 8-bit code units.
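The table can be reproduced with Python 3's built-in codecs (a sketch; the "-be" codec variants are used so no BOM is prepended and the bytes line up with the slide):

```python
for ch in ["A", "\u00C0", "\u1251", "\U00010338"]:
    print(f"U+{ord(ch):04X}",
          ch.encode("utf-32-be").hex(" "),
          ch.encode("utf-16-be").hex(" "),
          ch.encode("utf-8").hex(" "))
# U+0041   00 00 00 41  00 41        41
# U+00C0   00 00 00 c0  00 c0        c3 80
# U+1251   00 00 12 51  12 51        e1 89 91
# U+10338  00 01 03 38  d8 00 df 38  f0 90 8c b8
```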
UTF-32 Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”) Each character takes exactly one code unit. U+1251 ቑ 0x00001251 U+10338 𐌸 0x00010338
Advantages and Disadvantages of UTF-32 Easy to process each logical character takes one code unit can use pointer arithmetic Not as commonly used Not efficient for storage 11 bits are never used BMP characters are the most common—16 bits wasted for each of these Affected by processor architecture (Big-Endian vs. Little-Endian) Disallowed for HTML5
UTF-16 Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”) BMP characters use one unit Supplementary characters use a “surrogate pair”, special code points that don’t do anything else. Let’s take a closer look at UTF-16. As noted, the BMP characters are represented by a single 16-bit code unit. Supplemental characters use a surrogate pair. Surrogates are special Unicode code points that are reserved in the Basic Multilingual Plane. These code points are not assigned to characters and there are two sets of them. The first set is called the “High Surrogates”, ranging from 0xD800 through 0xDBFF. The second set is called the “Low Surrogates”, ranging from 0xDC00 through 0xDFFF. Notice that, unlike the multibyte encodings we encountered earlier, these ranges do not overlap nor are they shared with any other Unicode characters. A surrogate serves only one function: allowing the UTF-16 encoding to (non-statefully) reach the supplemental characters. 0x1251 U+1251 ቑ 0xD800 0xDF38 U+10338 𐌸 High Surrogate Low Surrogate Unique Ranges! 0xD800-DBFF 0xDC00-DFFF
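The surrogate pair for U+10338 can be worked out by hand, as a sketch (in practice the codec does this for you): subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
cp = 0x10338
v = cp - 0x10000                  # 0x00338 -- 20 bits remain
high = 0xD800 + (v >> 10)         # top 10 bits    -> 0xD800
low  = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> 0xDF38
print(hex(high), hex(low))        # 0xd800 0xdf38

print("\U00010338".encode("utf-16-be").hex(" "))   # d8 00 df 38 -- the codec agrees
```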
Advantages and Disadvantages of UTF-16 Most common languages and scripts are encoded in the BMP. Less wasteful than UTF-32 Simpler to process (excepting surrogates) Commonly supported in major operating environments, programming languages, and libraries May not be suitable for all applications Affected by processor architecture (Big-Endian vs. Little-Endian) Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.
UTF-8 7-bit ASCII is itself All other characters take 2, 3, or 4 bytes each lead bytes have a special pattern trailing bytes range from 0x80->0xBF Here’s how Unicode maps to UTF-8, to give you a better idea of how that works. Code points below 0x80: 0xxxxxxx. Below 0x800: 110xxxxx 10xxxxxx. Below 0x10000: 1110xxxx 10xxxxxx 10xxxxxx. Supplementary: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Notice the clever bit patterning: unlike most multibyte encodings, if you know that a text buffer is encoded in UTF-8, you can drop a pointer anywhere in the buffer and find where the characters begin and end. It also makes UTF-8 encoded text highly patterned, making it relatively easy to detect as an encoding.
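Those bit patterns can be implemented directly. A hand-rolled encoder, as a sketch for illustration only (real code should just call str.encode("utf-8"); this version skips error checks such as rejecting surrogate code points):

```python
def utf8_bytes(cp: int) -> bytes:
    if cp < 0x80:                                # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                             # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

for cp in (0x41, 0xC0, 0x1251, 0x10338):
    assert utf8_bytes(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", utf8_bytes(cp).hex(" "))
```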
Advantages and Disadvantages of UTF-8 ASCII-compatible Default or recommended encoding for many Internet standards Bit pattern highly detectable (over longer runs) Non-endian Streaming C char* friendly Easy to navigate Multibyte encoding requires additional processing awareness Non-shortest form checking needed Less efficient than UTF-16 for large runs of Asian text
HTML Set Web server to declare UTF-8 in HTTP Content-Type header Declare UTF-8 in META tag header Actually use UTF-8 as the encoding!! <html lang="en" dir="ltr"> <head> <meta charset="utf-8"> <title>Вибір і застосування кодування</title>
Working With Unicode It’s more than just a character set and some encodings…
Unicode Properties, Annexes, and Standards Unicode provides additional information: Character name Character class “ctype” information, such as if it’s a digit, number, alphabetic, etc. Directionality (LTR, RTL, etc.) and the Bidi Algorithm Case mappings (UPPER, lower, and Titlecase) Default Collation and the Unicode Collation Algorithm (UCA) Identifier names Regular Expression syntaxes Normalization Compatibility information Many of these items are in the form of Unicode Technical Reports http://www.unicode.org/reports The Unicode Consortium provides a lot more than just the characters and their code points. They also provide names, character information (is it a digit? Uppercase? Lowercase? Titlecase? From a particular script? Is it punctuation? Is it right-to-left? Does the character get drawn backwards [mirrored] when drawn right-to-left? And so much more…)
Unicode 6.1 Annexes 9 Unicode Bidirectional Algorithm 11 East Asian Width 14 Unicode Line Breaking Algorithm 15 Unicode Normalization Forms 24 Unicode Script Property 29 Unicode Text Segmentation 31 Unicode Identifier and Pattern Syntax 34 Unicode Named Character Sequences 38 Unicode Han Database (Unihan) 41 Common References for Unicode Standard Annexes 42 Unicode Character Database in XML 44 Unicode Character Database
Ǻ UAX#15: Normalization Unicode Normalization has to deal with more issues: single or multiple combining marks compatibility characters presentation forms Ǻ can be written as U+01FA, U+00C5 U+0301, U+00C1 U+030A, U+212B U+0301, U+0041 U+0301 U+030A, or U+0041 U+030A U+0301 Abc ABC abc abC aBc abc
Four Normalization Forms Ǻ Form D canonical decomposition Form C canonical decomposition followed by composition Form KD kompatibility decomposition Form KC kompatibility decomposition followed by composition ways to represent: U+01FA U+00C5 U+0301 U+00C1 U+030A U+212B U+0301 U+0041 U+0301 U+030A U+0041 U+030A U+0301
Normalization in Action Ǻ Original Form C Form D Form KC Form KD U+01FA U+0041 U+0301 U+030A U+00C5 U+0301 U+00C1 U+030A U+212B U+0301 U+0041 U+030A U+0301
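The normalization forms can be checked with Python's built-in unicodedata. A sketch using three of the representations listed above (NFKC/NFKD give the same results here, since none of these characters has a compatibility decomposition):

```python
import unicodedata as ud

def cps(s):
    """Show a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for original in ["\u01FA", "\u212B\u0301", "\u0041\u030A\u0301"]:
    print(cps(original),
          "| NFC:", cps(ud.normalize("NFC", original)),
          "| NFD:", cps(ud.normalize("NFD", original)))
# All three normalize the same way: NFC -> U+01FA, NFD -> U+0041 U+030A U+0301.
```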
Normalization: Not a Panacea Not all compatibility characters have a compatibility decomposition. Not all characters that look alike or have similar semantics have a compatibility decomposition. For example, there are many ‘dots’ used as a period. Not all character variations are handled by normalization. For example, upper, title, and lowercase variations. Normalization can remove meaning
A Bit of Bidi UAX#9: Unicode Bidirectional Algorithm
Bi-directional Scripts Some scripts are written predominantly from left-to-right (LTR). Some scripts are written predominantly from right-to-left (RTL).
Other Writing Directions Writing direction is a separate consideration from text direction. Both of the texts shown here are “left-to-right”
Character Direction Unicode defines a character’s direction Left-to-right Right-to-left Neutral Characters can be “weakly” or “strongly” directional
Unicode Bidi Algorithm Depends on “base direction” Breaks text into “runs”
Embedding and “Logical Order” Characters are encoded in logical order. Visual order is determined by the layout. Override and bidi control characters “Indeterminate” characters
Bidirectional Embedding Paste in Arabic
Unicode Controls and Markup
Natural Language Processing
Unicode Collation Algorithm Defines a collation algorithm (UTS#10) Defines “DUCET” (Default Unicode Collation Element Table) Must be tailored by language and “locale” (culture) and other variations (maintained by CLDR): Language Swedish: z < ö German: ö < z Usage German Dictionary: öf < of German Telephone: of < öf Customizations Upper-first A < a Lower-First a < A
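A sketch of language-tailored collation through ICU (the library named earlier in this deck), assuming the third-party PyICU package is installed:

```python
from icu import Collator, Locale   # PyICU -- assumed to be installed

words = ["zebra", "öl"]
for lang in ("sv_SE", "de_DE"):
    collator = Collator.createInstance(Locale(lang))
    print(lang, sorted(words, key=collator.getSortKey))
# Swedish tailoring sorts ö after z; German tailoring sorts ö with o, before z.
```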
Line Breaking (UAX#14) Defines rules for general-purpose non-dictionary line-breaking. Tailored by language Doesn’t work for languages such as Thai that require morphological analysis (aka “a dictionary”)
Text Segmentation: Thai ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท (word boundaries) Example text from: http://www.thai-language.com/let/173. It translates as "The proposed motion was carried unanimously."
Text Segmentation (UAX#29) Find grapheme, word, and line-break boundaries in text. Tailored by language Provides good basic default handling
Unicode Consortium Does some other things CLDR: Common Locale Data Repository ULI: Localization Interoperability ISO 15924: Script registry
“That’s great: I’ll just use Unicode” Remember “all text has an encoding”? user input via forms, email, data feeds, existing legacy data, database instances, uploads Use UTF-8 for HTML and Web forms Use UTF-8 in your APIs Check that data really is UTF-8 Control encoding via code; avoid hard-coding the encoding Watch out for legacy encodings Convert to Unicode as soon as practical. Convert from Unicode as late as possible. Wrap Unicode-unfriendly technologies
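A minimal Python 3 sketch of the "convert early, convert late" advice; the file names, helper names, and the declared legacy encoding are illustrative assumptions, not part of the slides:

```python
def read_legacy_feed(path, declared_encoding="shift_jis"):
    """Convert to Unicode as soon as practical: decode at the edge, using the declared encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(declared_encoding)

def write_output(path, text):
    """Convert from Unicode as late as possible, and always name the output encoding explicitly."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

text = read_legacy_feed("feed.sjis.txt")     # hypothetical legacy input
write_output("feed.utf8.txt", text)          # everything downstream sees UTF-8
```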
Map Your System APIs: use a Unicode encoding; hide the internal storage encoding. Data Stores, Local I/O: consider an encoding conversion plan. Front Ends: capture the encoding; detect and convert input to Unicode. Back Ends, External Data: do they use Unicode? If not, what encoding? Store the encoding! Detect and convert legacy encodings at the Unicode interface (including cloud APIs and legacy interfaces).
Summary
Character Encodings Code unit Code point Character Glyph/grapheme Multibyte encoding Tofu Mojibake Question Marks “All text has an encoding”
Unicode 17 planes of goodness 1.1 million potential code points about 110,000 assigned characters 3 encodings UTF-32 UTF-16 UTF-8 Unicode Standard, Annexes, and Reports CLDR for language specific tailoring Unicode Character Database
Q&A Would you write the code for I18N on the whiteboard before you go? #define UNICODE #import I18N.h