What's New in Globalization? Mark Davis, President & Cofounder, The Unicode Consortium
The Unicode Standard, Version 5.0
"Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years." Donald E. Knuth
"For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users." Bill Gates
"The path W3C follows to making text on the Web truly global is Unicode." Sir Tim Berners-Lee, KBE
"Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world." James Gosling

The Unicode Standard, Version 5.0
Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few
Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8, …
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code

U5.0 Unicode Character Database
Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
  Scripts: Unassigned code points Zzzz
  Casing Stability: Upper folded
  BIDI: Consistent Bidi_Mirrored
  Now Normative: kIICore
  Line Break: SE Asian Complex_Context
  New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
Code points by type
  General          99,089
  Private Use     137,468
  Surrogate         2,048
  Noncharacter         66
  Reserved        875,441

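As a quick sanity check on the figures above, the five counts should add up to the whole Unicode code space, U+0000 through U+10FFFF. The sketch below just does that arithmetic; the dictionary keys mirror the slide's labels and the variable names are illustrative.

```python
# Code point counts by type, as listed on the slide for Unicode 5.0.
counts = {
    "General": 99_089,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

# The code space runs from U+0000 to U+10FFFF: 17 planes of 65,536 code points.
CODE_SPACE = 0x10FFFF + 1

total = sum(counts.values())
print(total, total == CODE_SPACE)

# Share of the code space taken by each type, as a percentage.
shares = {name: round(100 * n / CODE_SPACE, 2) for name, n in counts.items()}
print(shares)
```

The total comes to 1,114,112, so the slide's categories partition the code space exactly.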
U5.0 Conformance
Stable Case-Folded Upper Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics

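To make the case-folding item concrete, here is a minimal sketch of caseless comparison using Python's built-in `str.casefold()`. It relies on whatever Unicode data ships with the Python runtime, not specifically Unicode 5.0, and the helper name `caseless_equal` is an illustrative choice.

```python
def caseless_equal(a, b):
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

# Lowercasing leaves the German sharp s alone, so a lowercase comparison misses this match.
lower_match = "MASSE".lower() == "Maße".lower()

# Case folding maps the sharp s to "ss", so the two strings compare equal.
folded_match = caseless_equal("MASSE", "Maße")

print("Straße".casefold())
print(lower_match, folded_match)
```

A stable folding matters because folded strings are often stored as keys in indexes and identifier tables; if the folding of an existing character changed between versions, those stored keys would silently stop matching.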
U5.0 Annexes: Core
UAX #9: Bidirectional Algorithm
  Tightened conformance requirements
UAX #15: Unicode Normalization Forms
  New Stream-Safe Text Format
  Appendix of characters requiring special handling
  Expanded info on stability guarantees
  Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
  Added profiles & information on usage

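For readers unfamiliar with the normalization forms covered by UAX #15, the following sketch uses Python's standard `unicodedata` module to show how canonically equivalent strings converge under NFC and NFD. The sample characters are chosen for illustration and do not come from the slides.

```python
import unicodedata

composed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN, canonically equivalent to U+00C5

# The raw strings differ code point by code point...
print(composed == decomposed)

# ...but normalize to the same sequence in either form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
nfc_angstrom = unicodedata.normalize("NFC", angstrom)
print(nfc == composed, nfd == decomposed, nfc_angstrom == composed)

# Compatibility normalization also folds presentation forms such as ligatures.
nfkc_ligature = unicodedata.normalize("NFKC", "\uFB01")  # LATIN SMALL LIGATURE FI
print(nfkc_ligature)
```

Comparing or hashing strings only after normalizing both sides to the same form avoids treating these equivalent spellings as different text.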
U5.0 Annexes: Boundaries
UAX #14: Line Breaking Properties
  Rules modified to improve behavior
  Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
  Edge cases improved
  Tailorings for text boundaries now in Unicode CLDR
  Format of the rules changed to ease implementation
  Additional guidelines on regex, identifiers,…

U5.0 Characters by Script
Unicode Character Timeline
Unicode Guide for Programmers
Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key Gotchas: Recognize and Avoid
Details on
  Encoding & conversions: UTF-8, 16, 32 & BOM
  Using character properties
  Text Operations

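The encoding items above can be illustrated with a short sketch: the same three code points take a different number of bytes in UTF-8, UTF-16, and UTF-32, and a byte order mark (BOM) may precede the data. The sample string is an assumption chosen to include one ASCII character, one two-byte UTF-8 character, and one supplementary code point.

```python
import codecs

text = "A\u00E9\U00010000"  # "A", "é", and U+10000 (first supplementary code point)

utf8 = text.encode("utf-8")       # 1 + 2 + 4 bytes
utf16 = text.encode("utf-16-be")  # 2 + 2 + 4 bytes (U+10000 needs a surrogate pair)
utf32 = text.encode("utf-32-be")  # 4 bytes per code point
print(len(utf8), len(utf16), len(utf32))

# The generic "utf-16" codec writes a BOM in the platform's native byte order.
with_bom = "abc".encode("utf-16")
print(with_bom.startswith(codecs.BOM_UTF16))

# A UTF-8 BOM survives a plain "utf-8" decode as U+FEFF; "utf-8-sig" strips it.
data = codecs.BOM_UTF8 + b"abc"
plain = data.decode("utf-8")
stripped = data.decode("utf-8-sig")
print(repr(plain), repr(stripped))
```

The leftover U+FEFF in the plain decode is one of the classic gotchas: it is invisible when printed but breaks string comparisons and parsers that expect the first character to be data.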
Unicode Common Locale Data Repository: CLDR
Key locale data for world languages
Most extensive standard repository of locale data
XML format
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September , ,57руб.
Arabic – arabski Bulgarian – bułgarski Czech – czeski …
Africa – Central America – Eastern Africa – Northern Africa – …
AED – د.إ. BHD – د.ب. DZD – د.ج. EGP – ج.م. EUR – …
Z < Å

Unicode CLDR languages and 142 territories – 360 locales in all
25% more locale data; over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks…)
Transliterations (e.g. Ελληνικά → Ellēniká)
Data for lenient date/time formatting and parsing
  Programmer asks for numeric day + abbreviated month
  Best format pattern returned, e.g. dd.MMM
Quarters in dates (e.g. 2006Q1)
BCP 47 compatibility + extensions

BCP 47 Language Tags
Usage: HTTP, HTML, XML; CLDR Locale IDs…
RFC 4646; obsoletes RFCs 1766, 3066
Addresses problems in RFC 3066
  ISO standards: stability / accessibility / ambiguity
  Parseability, Extensibility; Registration speed
  Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.

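The parseability point can be illustrated with a deliberately simplified sketch: because subtags have fixed shapes (a four-letter script, a two-letter or three-digit region), a tag can be split without consulting a registry. The function name `parse_language_tag` is hypothetical, and the code handles only the language, script, and region subtags; it is not a full implementation of the RFC 4646 grammar (no extended language subtags, variants, extensions, or private use).

```python
def parse_language_tag(tag):
    """Split a simple language tag into language, script, and region."""
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower(), "script": None, "region": None}
    for sub in subtags[1:]:
        is_alpha = sub.isascii() and sub.isalpha()
        is_digit = sub.isascii() and sub.isdigit()
        if len(sub) == 4 and is_alpha and not parsed["script"] and not parsed["region"]:
            # Script subtags are four letters, conventionally title-cased.
            parsed["script"] = sub.title()
        elif not parsed["region"] and ((len(sub) == 2 and is_alpha) or (len(sub) == 3 and is_digit)):
            # Region subtags are two letters or three digits, conventionally upper-cased.
            parsed["region"] = sub.upper()
        else:
            break  # anything else is out of scope for this sketch
    return parsed

for tag in ("zh-Hant", "sr-Latn", "az-Cyrl", "sr-Latn-RS", "es-419"):
    print(tag, parse_language_tag(tag))
```

Tags are case-insensitive, so the sketch normalizes case on output; `az-cyrl` and `az-Cyrl` parse to the same result.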
Unicode Security
Examples:
  Visual Confusables: paypal.com with Cyrillic a…
  Non-visual problems: buffer overflows, non-shortest form,…
UTR #36: Unicode Security Considerations
  Guidelines & Recommendations
UTS #39: Unicode Security Mechanisms
  Algorithms & Data
  Limitations on Repertoire
  Testing for Confusables

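Both example problems can be demonstrated in a few lines. The sketch below is a rough heuristic only, not the confusable-detection algorithm of UTS #39: it approximates each letter's script by the first word of its character name, since Python's standard library does not expose the Script property. The helper name `letter_scripts` is illustrative.

```python
import unicodedata

def letter_scripts(text):
    """Return rough script labels for the letters in text.

    Heuristic: the first word of each character name (for example
    "LATIN" or "CYRILLIC") stands in for the Script property.
    """
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}

genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # U+0430 CYRILLIC SMALL LETTER A replaces the first "a"

# The two strings look alike in many fonts but mix different scripts.
print(letter_scripts(genuine))
print(letter_scripts(spoofed))

# Non-shortest form: the bytes C0 AF are an overlong UTF-8 encoding of "/" (U+002F).
# A conformant decoder must reject them rather than produce "/".
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)
```

Rejecting overlong forms matters because a security check that scans raw bytes for "/" would miss the overlong version, while a lenient decoder further down the pipeline would turn it back into a path separator.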
Internationalized Domain Names
One instance of broad problem
  Many RFCs use Nameprep – limited to Unicode 3.2
Unicode recommendations
  Narrow the repertoire: exclude symbols, punctuation
  Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
  Some positive developments, but misreads Unicode, needs more work

URL → IRI
Internationalized Resource Identifier (IRI)
UTF-8, %-escaped
Example: JP /.html JP%E7%B4%8D... %E8%B1%86.html

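The mapping from an IRI to a URI can be sketched with Python's standard `urllib.parse`: non-ASCII characters are encoded as UTF-8 and each byte is %-escaped. The two escapes that survive in the example above, %E7%B4%8D and %E8%B1%86, decode to 納 and 豆; the file name used below is a hypothetical stand-in, since the original path is not fully preserved.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment containing the two characters from the slide's escapes.
iri_path = "納豆.html"

# quote() encodes as UTF-8 by default and %-escapes every non-ASCII byte.
uri_path = quote(iri_path)
print(uri_path)

# Unescaping and decoding as UTF-8 recovers the original characters.
round_trip = unquote(uri_path)
print(round_trip == iri_path)

# Each escape on its own decodes to a single ideograph.
first = unquote("%E7%B4%8D")
second = unquote("%E8%B1%86")
print(first, second)
```

Each of these ideographs takes three bytes in UTF-8, which is why one visible character becomes nine characters of escaped text.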
Ideographic Variation Database
U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants

Ideographic Variation Database
Variation Selector
  Identifies a restriction on the appearance of a character
  Character + Variation Selector = Variation Sequence
Han ideographs
  Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…
  Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
  A given variation sequence is used in at most one collection
  Makes interchange of variation sequences reliable.
  Registration, not Assessment

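In code, a variation sequence is simply the base character followed by a variation selector code point. The sketch below builds one for U+82A6 and shows how to strip selectors again for plain-text matching. The choice of VARIATION SELECTOR-17 (U+E0100) is an assumption for illustration; which selector corresponds to which glyph form is defined by the registered collections, not by this code, and the helper name is hypothetical.

```python
import unicodedata

base = "\u82A6"          # the ideograph from the slides
selector = "\U000E0100"  # VARIATION SELECTOR-17, chosen here only for illustration
sequence = base + selector

print(len(sequence))                  # number of code points
print(unicodedata.name(selector))     # character name of the selector
print(len(sequence.encode("utf-8")))  # size in UTF-8 bytes

def strip_variation_selectors(text):
    """Remove variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF)."""
    return "".join(
        ch for ch in text
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

# Without the selector, the sequence matches the plain character again.
print(strip_variation_selectors(sequence) == base)
```

Because the selector is a separate code point, software that ignores it still sees the correct base character, which is what lets variation sequences degrade gracefully in interchange.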
ICU 3.6
Mature, portable C/C++/Java intl libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
  Charset Detection
  Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…
ICU4J
  Globalization Preferences
  Flexible date/time formats*, Charset conversion*

Near-Term Issues
Unicode 5.0.1, Unicode 5.1
CLDR / BCP 47bis
LDAP
Collation Registry
IANA Charset Registry

Unicode possibilities
Characters
  CJK Unified Ideographs Extension C
  Minority Scripts: Cham and Lanna
  Malayalam chillu
  …
Properties/Behavior
  Normalization process for stable strings
  …

CLDR 1.5 / BCP 47bis
CLDR 1.5
  Data Submission Starting November
  New structures / data
BCP 47
  Adding ~7,000 (!) new language subtags
  Possibly other changes…

LDAP
Now has definitive comparison (good)
Stuck at Unicode 3.2 (bad)

Collation Registry
Nearing approval
Adds ability to register comparisons
Workable for basic cases
draft-newman-i18n-comparator-14.txt

IANA Charset Registry
Currently limited usefulness
  Ill-defined
  Missing mapping tables
  Incomplete
  Inaccurate
Regime Change
  Hope for future improvements!
