Supplementary Character Support in Microsoft Products
  Michael S. Kaplan
  Globalization Infrastructure and Font Technology
  Windows International
  Microsoft

What are supplementary characters?
  Characters beyond the BMP (U+10000 to U+10FFFF), each represented in UTF-16 by a surrogate pair:
  "a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"
  24-26 March 2003 Prague, Czech Republic (IUC23)

High/low surrogate?
  High: U+D800 - U+DBFF
  Low: U+DC00 - U+DFFF
  Terminology: "surrogate pair" preferred over "surrogate character"
  See http://www.trigeminal.com/16to32AndBack.asp

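The two ranges above translate directly into range checks on 16-bit code units. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```python
def is_high_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit lies in the high range U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit lies in the low range U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def is_surrogate_pair(first: int, second: int) -> bool:
    """A pair is well formed only as high surrogate followed by low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

For example, `is_surrogate_pair(0xD841, 0xDC0A)` holds, while the same two units in reverse order do not form a pair.
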
Conversion example #1
  The first surrogate pair (D800, DC00) as UTF-32:
  1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
  2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
  3. Concatenate 0000000000 + 0000000000 = x00000 (twenty bits)
  4. Add x10000
  Result: U+10000. This makes sense, since the first supplementary character follows immediately after the last code point in the 16-bit Unicode range (U+FFFF).

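The four steps of example #1 can be sketched in a few lines of Python. The function name is hypothetical; the arithmetic follows the steps on the slide.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UTF-32 code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    high_bits = high & 0x3FF  # steps 1-2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    twenty_bits = (high_bits << 10) | low_bits  # step 3: concatenate
    return twenty_bits + 0x10000  # step 4: add x10000


print(f"U+{surrogate_pair_to_code_point(0xD800, 0xDC00):X}")  # U+10000
```
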
Conversion example #2
  You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
  1. Subtract x10000. Result: x1040A
  2. Split into two ten-bit pieces: 0001000001 0000001010
  3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001). Result: 1101100001000001 (D841)
  4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010). Result: 1101110000001010 (DC0A)
  Your surrogate pair: D841, DC0A

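The reverse direction in example #2 can be sketched the same way. The function name is hypothetical; the steps mirror the slide.

```python
from typing import Tuple


def code_point_to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # step 1: subtract x10000
    high_bits = offset >> 10           # step 2: upper ten bits
    low_bits = offset & 0x3FF          #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3-4


high, low = code_point_to_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A
```
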
UTF-8 conversions
  Illegal conversion: six-byte UTF-8 (the two surrogate code points of UTF-16, converted separately)
  Legal conversion: four-byte UTF-8 (one UTF-32 code point)
  CESU-8 is the inverse of the above: it uses the six-byte form for supplementary characters

UTF-8 example
  Converting each half of a Unicode surrogate pair separately:
    aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
  gives incorrect UTF-8 totaling 6 bytes:
    1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
  Instead, you should take a Unicode surrogate pair:
    110110wwwwzzzzyy, 110111yyyyxxxxxx
  and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww + 1):
    11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

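To make the two bit layouts concrete, here is a small sketch that produces both byte sequences for the pair D841, DC0A from the earlier example. Both function names are hypothetical; the six-byte result is shown only to illustrate the incorrect form.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Correct form: combine the pair first, then emit four UTF-8 bytes."""
    code_point = (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
    return bytes([
        0xF0 | (code_point >> 18),           # 11110uuu
        0x80 | ((code_point >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((code_point >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (code_point & 0x3F),          # 10xxxxxx
    ])


def code_unit_to_three_bytes(unit: int) -> bytes:
    """The three-byte pattern 1110aaaa 10bbbbbb 10cccccc for one 16-bit unit."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


correct = surrogate_pair_to_utf8(0xD841, 0xDC0A)
six_byte = code_unit_to_three_bytes(0xD841) + code_unit_to_three_bytes(0xDC0A)
print(correct.hex(" "))   # f0 a0 90 8a
print(six_byte.hex(" "))  # ed a1 81 ed b0 8a
```

Python's own strict UTF-8 decoder accepts the four-byte sequence and rejects the six-byte one.
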
Encoding choices for MS
  UTF-16, mostly
  Occasionally UTF-8
  Even more occasionally, UTF-32
  Reasons:
    There was obviously an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16.
    A completely new API set was not required.
    A move to UTF-32 would require twice as much space for all characters.
    A move to UTF-8 would require even more than twice as much space in many cases.

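One rough way to look at the space trade-off is to measure a few strings in each encoding. This is only a sketch: the sample strings are hypothetical, and the ratios depend entirely on the text being stored.

```python
CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16-le", "UTF-32": "utf-32-le"}

# Hypothetical samples: ASCII, two BMP CJK characters, one supplementary character.
SAMPLES = {
    "ASCII": "Hello",
    "BMP CJK": "\u4e2d\u6587",
    "Supplementary": "\U0002040A",
}


def encoded_sizes(text: str) -> dict:
    """Byte length of the text in each encoding (no byte order mark)."""
    return {name: len(text.encode(codec)) for name, codec in CODECS.items()}


for label, text in SAMPLES.items():
    print(label, encoded_sizes(text))
```

For BMP text, UTF-32 is always double the size of UTF-16, while a supplementary character takes four bytes in all three encodings.
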
The products...
  Mostly the new generation of products:
    Windows 2000/XP
    Office XP (some support in Office 2000)
    Visual Studio .NET
    .NET's Common Language Runtime (CLR)
  Most (all) of these products supported Unicode already
    a little bit of extra work needed for supplementary characters
    usually just UTF-8 changes were needed

Windows 2000
  Uniscribe support for rendering
  Each surrogate pair is a single grapheme
  APIs like CharPrev/CharNext not changed
  No specific surrogate font/IME
  Must be turned on: http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp

Windows XP
  Everything (*.*) from Windows 2000
  Turned on by default!
  GDI+ support for rendering
  Font CMAP extensions
  Lots of UTF-8 issues fixed
  No specific surrogate font/IME (yet)
  Extensions to fallback fonts [limited]:
    HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
    HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
    HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
    (etc.)

Other system components
  MLang
  Internet Explorer
  http://i18nWithVB.com/surrogate_ime/
  IIS 5.0/6.0

The downlevel story
  No good support for Unicode, let alone supplementary characters
  Uniscribe/RichEdit does improve the downlevel story for display purposes
  Officially, no support on Win9x

The Office suite
  Word
  FrontPage
  Excel/Access
  Outlook
  RichEdit 4.0

Office - Specific Features
  Insertion/Deletion of text - All
  Cursor movement - All
  Font linking/fallback - All (Word's is best)
  UTF-8 issues fixed - All
  Enhanced word breaking - All (Word/RichEdit)
  Vertical text - Word/PowerPoint/Publisher/RichEdit
  Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

CHS/CHT/CHP Office
  The product and the langpacks support an extended Unicode IME that handles supplementary characters
  An Extension B font is also included

.NET CLR/Visual Studio .NET
  String class and globalization namespace
  StringInfo
    GetTextElementEnumerator
    Handles supplementary characters
    Also handles composite characters
  GDI+
  VS IDE support

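The idea behind enumerating text elements rather than code units can be sketched in Python. This is not the .NET implementation: `text_elements` is a hypothetical helper that groups surrogate pairs only and does not handle composite characters.

```python
from typing import Iterator, List


def text_elements(units: List[int]) -> Iterator[List[int]]:
    """Yield UTF-16 code units grouped so that a surrogate pair stays together."""
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        step = 2 if is_pair else 1  # an unpaired surrogate is passed through alone
        yield units[i:i + step]
        i += step


# "A", U+2040A as a surrogate pair, "B"
utf16 = [0x0041, 0xD841, 0xDC0A, 0x0042]
elements = list(text_elements(utf16))
print(len(utf16), len(elements))  # 4 3
```

Four code units become three elements, which is the behavior a caller needs for cursor movement or deletion.
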
SQL Server
  Past - no support (for Unicode, even!)
  Present - surrogate "safe" (neutral)
  Future - surrogate "aware"

Items not [currently] supported
  Character Map
  Graph 10
  Outlook 10 mail headers
  Fonts/IMEs
  "Collations" for supplementary characters

Collation plan for supplementary characters in the UCA?
  All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
  All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
  All characters in Planes 3-14 (currently not assigned) will be treated like any other unassigned characters.
  Plane 14 language tags will be treated as if they were unassigned.
  All characters encoded in Planes 15-16 (private use) will be sorted after all other characters.

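The plan above is organized by plane, and the plane of a code point is simply its value shifted right by 16 bits. The sketch below maps a code point to the treatment listed on the slide. The function name and the returned labels are hypothetical paraphrases, and no actual sort keys are computed.

```python
def collation_treatment(code_point: int) -> str:
    """Describe how the plan on the slide would treat a code point, by plane."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    plane = code_point >> 16  # planes 0..16
    if plane == 0:
        return "BMP: existing ordering"
    if plane == 1:
        return "after other non-ideographic scripts, before ideographs"
    if plane == 2:
        return "after the BMP ideographs"
    if plane <= 14:
        return "treated as unassigned"  # includes Plane 14 language tags
    return "private use: after all other characters"


print(collation_treatment(0x2040A))  # after the BMP ideographs
```
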
Questions?

Don't forget to fill out your evals!