Supplementary Character Support in Microsoft Products
Michael S. Kaplan, Software Design Engineer, Microsoft

12 September 2002, San Jose, California (IUC22)

What are supplementary characters?
In UTF-16, each supplementary character (a code point from U+10000 to U+10FFFF) is represented by a surrogate pair:
"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"

High/low surrogate?
- High: U+D800 - U+DBFF
- Low: U+DC00 - U+DFFF
- Terminology: "surrogate pair" is preferred over "surrogate character"
- See 16to32AndBack.asp
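The two ranges above translate directly into range checks on 16-bit code units. The following is a minimal Python sketch; the function names are illustrative only and are not taken from any Microsoft API.

```python
# Surrogate ranges as given on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """True only for a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair.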

Conversion example #1
The first character in the surrogate range (D800, DC00) as UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate: 00000000000000000000 = 0x00000
4. Add 0x10000
Result: U+10000
This makes sense, since the first character in the surrogate range follows immediately after the last character in the 16-bit Unicode range (U+FFFF).
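The same steps (take the lower ten bits of each unit, concatenate, add 0x10000) can be sketched in Python. This is an illustration of the arithmetic only; the function name is hypothetical.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    high_bits = high & 0x3FF  # lower ten bits of the high surrogate
    low_bits = low & 0x3FF    # lower ten bits of the low surrogate
    # Concatenate the two ten-bit pieces, then add 0x10000.
    return ((high_bits << 10) | low_bits) + 0x10000
```

For the pair on the slide, `surrogate_pair_to_code_point(0xD800, 0xDC00)` gives `0x10000`.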

Conversion example #2
You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
1. Subtract 0x10000. Result: 0x1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001). Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010). Result: 1101110000001010 (DC0A)
Your surrogate pair: D841, DC0A
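The reverse direction follows the four steps above. Again a minimal Python sketch with a hypothetical function name, not code from any product discussed here.

```python
def code_point_to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    offset = code_point - 0x10000           # step 1: subtract 0x10000
    high = 0xD800 + (offset >> 10)          # steps 2-3: upper ten bits + D800
    low = 0xDC00 + (offset & 0x3FF)         # steps 2, 4: lower ten bits + DC00
    return high, low
```

Python's own UTF-16 codec can serve as a cross-check: `chr(0x2040A).encode("utf-16-be")` yields the bytes D8 41 DC 0A.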

UTF-8 conversions
- Illegal conversion: six-byte UTF-8 (the two surrogate code units of UTF-16, each converted separately)
- Legal conversion: four-byte UTF-8 (one UTF-32 code point)
- CESU-8 is the inverse of the above: it uses the six-byte form

UTF-8 example
A Unicode surrogate pair:
  aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
becomes incorrect UTF-8 totaling 6 bytes:
  1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, you should take a Unicode surrogate pair:
  110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww + 1):
  11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
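To make the difference concrete, here is a short Python sketch that produces both forms for one supplementary code point. The function names are hypothetical; the six-byte routine exists only to show the form the slide calls incorrect (it is the layout CESU-8 uses) and should not be used to produce UTF-8.

```python
def encode_utf8_supplementary(code_point: int) -> bytes:
    """Correct four-byte UTF-8 for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110uuu
        0x80 | ((code_point >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((code_point >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (code_point & 0x3F),          # 10xxxxxx
    ])


def _three_byte_form(unit: int) -> bytes:
    """Three-byte pattern 1110aaaa 10bbbbbb 10cccccc for one 16-bit unit."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_six_byte_form(code_point: int) -> bytes:
    """Incorrect-as-UTF-8 form: each surrogate converted separately."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return _three_byte_form(high) + _three_byte_form(low)
```

Python's strict UTF-8 decoder accepts the four-byte result and rejects the six-byte one, which matches the legal/illegal distinction drawn in these slides.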

Encoding choices for MS
- UTF-16, mostly
- Occasionally UTF-8
- Even more occasionally, UTF-32
Reasons:
- There was already an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16. A completely new API set was not required.
- A move to UTF-32 would require twice as much space for every BMP character.
- A move to UTF-8 would require more space in many cases (three bytes instead of two for BMP characters at or above U+0800, which includes most CJK text).
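The space trade-off can be checked with Python's built-in codecs. This is a small sketch for comparison only; `encoded_sizes` is a hypothetical helper, and the little-endian codec names are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte counts for the same text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM added
        "utf-32": len(text.encode("utf-32-le")),  # -le: no BOM added
    }
```

An ASCII letter takes 1, 2 and 4 bytes respectively; a BMP ideograph such as U+4E2D takes 3, 2 and 4; a supplementary character such as U+2040A takes 4 bytes in all three.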

The products...
Mostly the new generation of products:
- Windows 2000/XP
- Office XP (some support in Office 2000)
- Visual Studio .NET
Most (all) of these products supported Unicode already:
- a little bit of extra work was needed for supplementary characters
- usually just UTF-8 changes were needed

Windows 2000
- Uniscribe support for rendering
- Each surrogate pair is a single grapheme
- APIs like CharPrev/CharNext not changed
- No specific surrogate font/IME
- Must be turned on:

Windows XP
- *.* from Windows 2000
- Turned on by default!
- GDI+ support for rendering
- Font CMAP extensions
- Lots of UTF-8 issues fixed
- No specific surrogate font/IME (yet)
- Extensions to fallback fonts [limited]:
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
  (etc.)

Other system components
- MLang
- Internet Explorer
- IIS 5.0/6.0

The downlevel story
- No good support for Unicode, let alone supplementary characters
- Uniscribe/RichEdit does improve the downlevel story for display purposes
- Officially, no support on Win9x

The Office suite
- Word
- FrontPage
- Excel/Access
- Outlook
- RichEdit 4.0

Office - Specific Features
- Insertion/deletion of text: All
- Cursor movement: All
- Font linking/fallback: All (Word's is best)
- UTF-8 issues fixed: All
- Enhanced word breaking: All (Word/RichEdit)
- Vertical text: Word/PowerPoint/Publisher/RichEdit
- Direct entry (Alt+nnnnnn, hhhhh + Alt+x): Word/RichEdit

CHS/CHT/CHP Office
- The product and the langpacks support an extended Unicode IME that handles supplementary characters
- An Extension B font is also included

Visual Studio[.NET]
- String class and globalization namespace
- StringInfo
- GetTextElementEnumerator
  - Handles supplementary characters
  - Also handles composite characters
- GDI+
- IDE support
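The idea behind enumerating text elements rather than code units can be sketched in Python. This is not the .NET implementation and does not call it: `iter_text_elements` is a hypothetical helper that groups surrogate pairs only, and it does not handle the composite (base plus combining mark) sequences that GetTextElementEnumerator is described as covering.

```python
from typing import Iterator, Sequence, Tuple


def iter_text_elements(units: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """Yield UTF-16 code units grouped so a surrogate pair stays together."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF           # high surrogate...
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF  # ...followed by a low one
        )
        if is_pair:
            yield (unit, units[i + 1])
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            yield (unit,)
            i += 1
```

For the units `[0x41, 0xD841, 0xDC0A, 0x42]` this yields three elements, with the surrogate pair kept as one. Code that inserts, deletes, or moves a cursor by element rather than by unit avoids splitting a pair.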

SQL Server
- Past: no support (for Unicode, even!)
- Present: surrogate "safe" (neutral)
- Future: surrogate "aware"

Items not [currently] supported
- Character Map
- Graph 10
- Outlook 10 mail headers
- Fonts/IMEs
- "Collations" for supplementary characters

Collation plan for supplementary characters in the UCA?
- All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
- All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
- All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters.
- Plane 14 language tags will be treated as if they were unassigned.
- All characters encoded in Planes 15-16 (private use) will be sorted after all other characters.

Questions?

Supplementary Character Support in Microsoft Products
Don't forget to fill out your evals!