Published by Christopher Stevens. Modified over 11 years ago.
1
Supplementary Character Support in Microsoft Products
Michael S. Kaplan
Globalization Infrastructure and Font Technology
Windows International
Microsoft
2
What are supplementary characters?
"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate" 24-26 March 2003 Prague, Czech Republic (IUC23)
3
High/low surrogate?
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology: "surrogate pair" preferred over "surrogate character"
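The two ranges above translate directly into range checks on 16-bit code units. The following is a minimal sketch; the function names are illustrative and do not come from the presentation or from any Microsoft API.

```python
# Range boundaries taken from the slide above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """True only for a high surrogate immediately followed by a low one."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair, so `is_surrogate_pair(0xDC0A, 0xD841)` is false.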
4
Conversion example #1
Example #1: The first character in the surrogate range (D800, DC00) as UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate: 00000000000000000000 = 0x00000
4. Add 0x10000
Result: U+10000
This makes sense, since the first supplementary character follows immediately after the last character in the 16-bit Unicode range (U+FFFF).
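The four steps of this example (take the lower ten bits of each code unit, concatenate, add 0x10000) can be sketched as follows. The function name is an assumption made for illustration.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    high_bits = high & 0x3FF  # steps 1 and 2: keep the lower ten bits
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits  # step 3: concatenate to 20 bits
    return combined + 0x10000  # step 4: add 0x10000
```

For the pair in the example, `decode_surrogate_pair(0xD800, 0xDC00)` gives 0x10000.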
5
Conversion example #2
Example #2: You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16.
1. Subtract 0x10000 - Result: 0x1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001) - Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010) - Result: 1101110000001010 (DC0A)
Your surrogate pair: D841, DC0A
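The reverse direction follows the same arithmetic. This is a minimal sketch with an illustrative function name; it can be cross-checked against Python's built-in UTF-16 codec.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    offset = code_point - 0x10000  # step 1: subtract 0x10000
    high_bits = offset >> 10  # step 2: upper ten bits
    low_bits = offset & 0x3FF  # step 2: lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = encode_surrogate_pair(0x2040A)
# Cross-check against the standard library's UTF-16 (big-endian) encoder.
expected = "\U0002040A".encode("utf-16-be")
```

Here `high` and `low` are 0xD841 and 0xDC0A, and `expected` holds the same two code units as four big-endian bytes.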
6
UTF-8 conversions
Illegal conversion: six-byte UTF-8 (the two surrogate code units of UTF-16, each converted separately)
Legal conversion: four-byte UTF-8 (one UTF-32 code point)
CESU-8 is the inverse of the above: it uses the six-byte form and does not allow the four-byte form
7
UTF-8 example
A Unicode surrogate pair:
aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
converted one code unit at a time becomes incorrect UTF-8 totaling 6 bytes:
1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, you should take the Unicode surrogate pair:
110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww + 1):
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
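Both bit layouts in this example can be written out as code. The sketch below builds the correct four-byte form and, for comparison only, the six-byte form that results from converting each surrogate separately. The function names are illustrative assumptions.

```python
def utf8_four_bytes(high: int, low: int) -> bytes:
    """Correct UTF-8: decode the pair first, then emit four bytes."""
    cp = (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
    return bytes([
        0xF0 | (cp >> 18),           # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])


def three_bytes(unit: int) -> bytes:
    """Three-byte UTF-8 layout applied to a single 16-bit code unit."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def six_byte_form(high: int, low: int) -> bytes:
    """Each surrogate converted separately: illegal in UTF-8, used by CESU-8."""
    return three_bytes(high) + three_bytes(low)
```

For the pair D841, DC0A the four-byte form is F0 A0 90 8A, which matches `"\U0002040A".encode("utf-8")`. The six-byte form is ED A1 81 ED B0 8A, which a strict UTF-8 decoder rejects.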
8
Encoding choices for MS
UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32
REASONS:
There was obviously an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16. A completely new API set was not required.
A move to UTF-32 would require twice as much space for every BMP character.
A move to UTF-8 would require more space in many cases: three bytes instead of two for BMP characters from U+0800 upward, which covers most East Asian text.
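The space argument can be checked by encoding sample characters in each form. This is a small sketch using Python's standard codecs; the helper name and the sample characters are chosen here for illustration.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Byte counts for the same text in the three encoding forms (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("A")                   # ASCII letter
cjk_sizes = encoded_sizes("\u4e2d")                # BMP ideograph U+4E2D
supplementary_sizes = encoded_sizes("\U0002040A")  # Plane 2 ideograph
```

ASCII text is smallest in UTF-8, a BMP ideograph is smallest in UTF-16, and a supplementary character takes four bytes in all three forms.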
9
The products...
Mostly the new generation of products:
  Windows 2000/XP
  Office XP (some support in Office 2000)
  Visual Studio.Net
  .NET’s Common Language Runtime (CLR)
Most (all) of these products supported Unicode already
  a little bit of extra work was needed for supplementary characters
  usually just UTF-8 changes were needed
10
Windows 2000
Uniscribe support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
No specific surrogate font/IME
Must be turned on (not enabled by default)
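Because CharPrev/CharNext were not changed, code that steps through UTF-16 text one unit at a time can land between the two halves of a pair. The sketch below shows what surrogate-aware stepping could look like over a list of 16-bit code units. The helpers are hypothetical and do not describe the behavior of any Windows API.

```python
from typing import Sequence


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_index(units: Sequence[int], i: int) -> int:
    """Index of the code unit that starts the next character after i."""
    if i >= len(units):
        return len(units)
    if _is_high(units[i]) and i + 1 < len(units) and _is_low(units[i + 1]):
        return i + 2  # skip both halves of a well-formed pair
    return i + 1      # BMP character or unpaired surrogate


def prev_index(units: Sequence[int], i: int) -> int:
    """Index of the code unit that starts the character before i."""
    if i <= 0:
        return 0
    if i >= 2 and _is_low(units[i - 1]) and _is_high(units[i - 2]):
        return i - 2
    return i - 1
```

With `units = [0x41, 0xD841, 0xDC0A, 0x42]`, stepping forward from index 1 moves to index 3, so the pair is never split.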
11
Windows XP
Everything (*.*) from Windows 2000
Turned on by default!
GDI+ support for rendering
Font CMAP extensions
Lots of UTF-8 issues fixed
No specific surrogate font/IME (yet)
Extensions to fallback fonts [limited]:
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
  (etc.)
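The fallback keys are organized per plane, and the plane of a code point is simply its value shifted right by 16 bits. The sketch below maps a code point to the matching key path. It assumes the PlaneN naming continues as the "(etc.)" suggests, and it only builds the path as a string; the values stored under each key are not described in the slide and are not modeled here. The helper names are hypothetical.

```python
from typing import Optional

# Key path copied from the slide above.
BASE_KEY = (
    r"HKLM\Software\Microsoft\Windows NT\CurrentVersion"
    r"\LanguagePack\SurrogateFallback"
)


def plane_of(code_point: int) -> int:
    """Plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


def fallback_key_for(code_point: int) -> Optional[str]:
    """Key path for the code point's plane, or None for the BMP (plane 0)."""
    plane = plane_of(code_point)
    if plane == 0:
        return None  # BMP characters are not covered by these keys
    # Assumption: the PlaneN pattern holds for every supplementary plane.
    return BASE_KEY + "\\Plane" + str(plane)
```

For example, U+2040A is in Plane 2, so `fallback_key_for(0x2040A)` ends in `SurrogateFallback\Plane2`.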
12
Other system components
MLang
Internet Explorer
IIS 5.0/6.0
13
The downlevel story
No good support for Unicode, let alone supplementary characters
Uniscribe/RichEdit does improve the downlevel story for display purposes
Officially, no support on Win9x
14
The Office suite
Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0
15
Office - Specific Features
Insertion/Deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit
16
CHS/CHT/CHP Office
The product and the langpacks support an extended Unicode IME that handles supplementary characters
An Extension B font is also included
17
.NET CLR/Visual Studio.NET
String class and globalization namespace
StringInfo
  GetTextElementEnumerator
  Handles supplementary characters
  Also handles composite characters
GDI+
VS IDE support
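The idea behind a text element enumerator is to walk a string in user-perceived units, so that a base character and its combining marks come back together. The Python sketch below approximates that idea with the standard `unicodedata` module. It is not the .NET algorithm, and the grouping rule (attach any character with a nonzero canonical combining class to the preceding element) is a simplifying assumption. Python strings are already indexed by code point, so supplementary characters need no pairing step here, unlike in a UTF-16 string.

```python
import unicodedata
from typing import List


def text_elements(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch   # combining mark: attach to previous element
        else:
            elements.append(ch)  # new base character (BMP or supplementary)
    return elements


sample = "e\u0301\U0002040Ax"  # e + combining acute, U+2040A, x
elements = text_elements(sample)
```

The sample has four code points but three text elements: the accented "e", the Plane 2 ideograph, and "x".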
18
SQL Server
Past - no support (for Unicode, even!)
Present - surrogate "safe" (neutral)
Future - surrogate "aware"
19
Items not [currently] supported
Character Map
Graph 10
Outlook 10 mail headers
Fonts/IMEs
"Collations" for supplementary characters
20
Collation plan for supplementary characters in the UCA?
All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters.
Plane 14 language tags will be treated as if they were unassigned.
All characters encoded in Planes 15-16 (private use) will be sorted after all other characters.
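The ordering described above can be pictured as a coarse bucket assigned to each code point before any finer comparison. The sketch below is only an illustration of that ordering, not the UCA and not any Microsoft collation. Two points are assumptions: BMP ideographs are taken to be just the CJK Unified Ideographs and Extension A blocks, and unassigned Plane 3-14 code points are placed after all assigned characters but before private use, which the slide does not state.

```python
from typing import Tuple


def is_bmp_ideograph(cp: int) -> bool:
    # Assumption: only these two BMP blocks count as ideographs here.
    return 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF


def collation_bucket(cp: int) -> int:
    """Coarse ordering bucket following the plan in the slide."""
    plane = cp >> 16
    if plane == 0:
        return 2 if is_bmp_ideograph(cp) else 0
    if plane == 1:
        return 1  # after other non-ideographic scripts, before ideographs
    if plane == 2:
        return 3  # after the BMP ideographs
    if plane <= 14:
        return 4  # treated as unassigned (placement is an assumption)
    return 5      # Planes 15-16, private use: after everything else


def sort_key(cp: int) -> Tuple[int, int]:
    """Bucket first, then code point order within the bucket."""
    return collation_bucket(cp), cp


mixed = [0x2040A, 0x4E2D, 0x10400, 0x41, 0x100000, 0x30000]
ordered = sorted(mixed, key=sort_key)
```

Sorting `mixed` this way yields the Latin letter, then the Plane 1 character, the BMP ideograph, the Plane 2 ideograph, the unassigned code point, and finally the private-use one.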
21
Questions?
22
Supplementary Character Support in Microsoft Products
Don’t forget to fill out your evals!