Globalization Gotchas


Globalization Gotchas Mark Davis

Unicode Basics
Unicode encodes characters, not glyphs: U+0067 → g g g g g g g g g g g g g ... Unicode does not encode characters by language: the j of French, German, and English has the same code point even though all three pronounce it differently, and Chinese 大 (da) has the same code point as Japanese 大 (dai). UTF-8, UTF-16, and UTF-32 are all Unicode. The word "character" means different things to different people, so make clear which one you mean: glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters), ...

Unicode in APIs
Code points range from U+0000 to U+10FFFF: be prepared to handle (or at least not corrupt!) any incoming code point. A back-level system may receive code points that are unassigned in its own version of Unicode but assigned in a later one. Watch for "UCS-2" implementations: they use UTF-16 text but don't support characters above U+FFFF, and they may accidentally produce isolated surrogates. Some APIs and protocols count lengths in code points, others in bytes (or other code units); make sure you don't mix them up. Don't limit API parameters to a single character (and definitely not to a single code unit!): what users think of as a single character (e.g. ẍ, ch) may be a sequence in Unicode. Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees.
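As a minimal sketch of why length units must not be mixed, the Python below counts the same text in code points, UTF-8 bytes, and UTF-16 code units, and shows how cutting a supplementary character at a code-unit boundary leaves an isolated surrogate. The helper names (`lengths`, `has_isolated_surrogate`) are illustrative, not part of any API named in the slides.

```python
def lengths(text: str) -> dict:
    """Return the length of `text` measured in three different units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

def has_isolated_surrogate(text: str) -> bool:
    """True if `text` contains a surrogate code point on its own."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

# "x" + U+0308 COMBINING DIAERESIS: one user-perceived character, two code points.
x_diaeresis = "x\u0308"
# U+1D11E MUSICAL SYMBOL G CLEF lies above U+FFFF (two UTF-16 code units).
g_clef = "\U0001D11E"

# What a UCS-2-style truncation to one code unit would leave behind:
# only the high surrogate of the pair.
high_half = chr(0xD800 + ((ord(g_clef) - 0x10000) >> 10))

print(lengths(x_diaeresis))
print(lengths(g_clef))
print(has_isolated_surrogate(high_half))
```

An API that takes a "maximum length" should say which of these units it means.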

Choice of Characters
Character and block names may be misleading, e.g., U+034F COMBINING GRAPHEME JOINER doesn't join graphemes. ► http://www.unicode.org/faq/ Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function. Never use unassigned code points; those will be used in future versions of Unicode. If you need code points for your own purposes, use only private-use (PUA) code points or noncharacters, and only if necessary. If you do, minimize the opportunity for collision by picking an unusual range.

Character Conversion
Always use "shortest form" UTF-8. It's the Law. And if that isn't enough, consider security attacks. If a protocol allows a choice of charsets, always tag correctly. Not all text is correctly tagged, so character detection may be necessary. But remember, it's always a guess! Converting a database of mixed, untagged data is extremely painful. Bad assumptions: length [bytes] = N × length [code points]; 1 character [charset X] = 1 character [Unicode]. The ordering may also be different.
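To illustrate the "shortest form" rule, here is a small Python sketch. It relies on the fact that Python's built-in UTF-8 decoder rejects non-shortest sequences; the sample bytes are the classic overlong encoding of "/" (U+002F). The function name is hypothetical.

```python
def decode_strict_utf8(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any ill-formed sequence."""
    return data.decode("utf-8")  # errors="strict" is the default

# 0xC0 0xAF is a two-byte, non-shortest encoding of "/" (U+002F).
overlong_slash = b"\xc0\xaf"

try:
    decode_strict_utf8(overlong_slash)
    rejected = False
except UnicodeDecodeError:
    rejected = True

print("overlong sequence rejected:", rejected)

# Byte length is not a fixed multiple of code point length:
sample = "a\u00e9\u5927"  # 1-byte, 2-byte, and 3-byte characters in UTF-8
print(len(sample), len(sample.encode("utf-8")))
```

A decoder that accepted the overlong form would let "/" slip past a filter that only looks for the byte 0x2F.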

Character Conversion II
IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways. Shift-JIS: 0x5C → U+005C (\) or U+00A5 (¥). Don't simply omit unconvertible data; to reduce security problems, at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes). ► http://www.w3.org/TR/japanese-xml/ ► http://icu.sourceforge.net/charts/charset/
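The substitution advice can be sketched in Python as follows. Decoding with `errors="replace"` yields U+FFFD; for the other direction Python's built-in `"replace"` handler emits "?", so this sketch registers a custom error handler that emits 0x1A instead. The handler name `sub_0x1a` and the helper functions are assumptions made for illustration, and the sketch assumes an ASCII-compatible target charset.

```python
import codecs

def to_unicode(data: bytes, charset: str) -> str:
    """Decode, substituting U+FFFD for anything that cannot be converted."""
    return data.decode(charset, errors="replace")

def _substitute_sub(err):
    # Replace each unencodable character with 0x1A (SUB) and resume after it.
    if isinstance(err, UnicodeEncodeError):
        return ("\x1a" * (err.end - err.start), err.end)
    raise err

# Hypothetical handler name; registered once at import time.
codecs.register_error("sub_0x1a", _substitute_sub)

def to_bytes(text: str, charset: str) -> bytes:
    """Encode, substituting 0x1A for anything the charset cannot represent."""
    return text.encode(charset, errors="sub_0x1a")

print(to_unicode(b"ab\xffcd", "utf-8"))
print(to_bytes("caf\u00e9 \u4e2d", "ascii"))
```

Either way the reader of the converted text can see that something was lost, which silent omission would hide.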

Properties
Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(x), or in a regex \p{Alphabetic} or [:Alphabetic:], not ("A" ≤ x ≤ "Z" OR "a" ≤ x ≤ "z"). Some properties aren't what you think; use: White_Space, not General_Category=Zs; Alphabetic, not General_Category=L; Lowercase, not General_Category=Ll; Script=Greek, not Block=Greek. Characters may change property values between versions of Unicode. ► http://unicode.org/standard/stability_policy.html
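A rough Python illustration of both points follows. Note the assumption: Python's standard library does not expose the Alphabetic property directly, and `str.isalpha()` is defined in terms of General_Category (letter categories), so it is itself an example of the approximation the slide warns about. The sketch shows the hard-coded ASCII test failing, and then one case where the category-based test disagrees with Alphabetic.

```python
import unicodedata

def ascii_only_is_letter(ch: str) -> bool:
    """The hard-coded test the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

samples = ["e", "\u00e9", "\u0416", "\u5927"]  # e, é, Ж, 大

ascii_results = [ascii_only_is_letter(c) for c in samples]
category_results = [c.isalpha() for c in samples]

# U+2170 SMALL ROMAN NUMERAL ONE has General_Category=Nl (a letter number).
# It has the Alphabetic property in Unicode, yet the category-based
# str.isalpha() says no.
roman_one = "\u2170"
roman_category = unicodedata.category(roman_one)
roman_isalpha = roman_one.isalpha()

# Tab is white space, but its General_Category is Cc, not Zs.
tab_category = unicodedata.category("\t")
tab_is_space = "\t".isspace()

print(ascii_results, category_results)
print(roman_category, roman_isalpha, tab_category, tab_is_space)
```

For real property lookups (Alphabetic, White_Space, Script), a library that ships the Unicode Character Database properties, such as ICU, is the safer route.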

Identifiers & Tokens
When designing syntax, use as a base: Pattern_Syntax for operators / relations; Pattern_White_Space for gaps; XID_Start and XID_Continue for identifiers. All are backwards compatible across versions. Profiles may expand or narrow from the base. Watch out for security attacks: "paypal.com" with a Cyrillic "a". ► See Unicode Security at this conference.

Comparison (Collation): Searching, Sorting, Matching
There are two binary orders: code point order = UTF-8 order = UTF-32 order ≠ UTF-16 order. Don't present users with binary order! No users expect A < Z < a < z < Ç < ä. Apply normalization to get a unique form, so that Å (U+00C5) = Å (A + U+030A). Security issues: protocols must precisely define the comparison operations. E.g., LDAP doesn't, so lookup may fail (or falsely succeed!). Aside from wrong results, this is an opening for security attacks.
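The two binary orders can be checked with a short Python sketch: sorting by encoded bytes in UTF-8 and UTF-32 agrees with code point order, while UTF-16 puts supplementary characters (encoded with surrogates D800..DFFF) before the top of the BMP. The sample characters are chosen for illustration only.

```python
import unicodedata

bmp_high = "\uff61"          # a BMP character above the surrogate range
supplementary = "\U00010000" # first supplementary code point

def sort_by_encoding(encoding: str, items: list) -> list:
    """Sort strings by the binary order of their encoded form."""
    return sorted(items, key=lambda s: s.encode(encoding))

items = [supplementary, bmp_high]

code_point_order = sorted(items)  # Python compares str by code point
utf8_order = sort_by_encoding("utf-8", items)
utf32_order = sort_by_encoding("utf-32-be", items)
utf16_order = sort_by_encoding("utf-16-be", items)

# Three spellings of A-with-ring: precomposed, A + combining ring,
# and U+212B ANGSTROM SIGN. NFC maps all of them to one form.
a_ring_forms = ["\u00c5", "A\u030a", "\u212b"]
normalized = {unicodedata.normalize("NFC", s) for s in a_ring_forms}

print(utf8_order == code_point_order, utf16_order == code_point_order)
print(len(normalized))
```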

Language-Sensitive Comparison
Use UCA order as a base to meet user expectations: a < A < ä < Ç = C◌̧ < z < Z. Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language: china < China < chinas < danish; ae < æ < af; z < æ (Danish); c < d < ... h < ch < i (Slovak). Follow UCA for substring match offsets; there are some gotchas here. Don't mix up "stable" and "deterministic" sorting: they are very different. ► http://unicode.org/reports/tr10/ ► http://unicode.org/cldr

Normalization (NFC,…)
Standardized normalization forms are defined by Unicode. The ordering of accents in a normalization form may not be the typical type-in order; fonts should handle both orders. Normalization is context-dependent: don't assume NFC(x + y) = NFC(x) + NFC(y). People assume that NFC always composes, but some characters decompose in NFC. Trivia: in Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ, ϔ, ẛ.
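These three claims can be checked with Python's `unicodedata` module. The sketch below is only a spot check on hand-picked characters (U+0958 is one of the composition exclusions), not a survey of the whole repertoire; `nfc` and `all_forms_differ` are illustrative helper names.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# 1. Normalization does not distribute over concatenation.
x, y = "a", "\u0301"                      # "a" + COMBINING ACUTE ACCENT
concat_then_normalize = nfc(x + y)        # composes to U+00E1
normalize_then_concat = nfc(x) + nfc(y)   # stays as two code points

# 2. NFC does not always compose: U+0958 DEVANAGARI LETTER QA
#    is excluded from composition, so NFC leaves it decomposed.
nfc_of_qa = nfc("\u0958")

# 3. The trivia characters differ in all four normalization forms.
def all_forms_differ(ch: str) -> bool:
    forms = {unicodedata.normalize(f, ch)
             for f in ("NFC", "NFD", "NFKC", "NFKD")}
    return len(forms) == 4

trivia = [ch for ch in "\u03d3\u03d4\u1e9b" if all_forms_differ(ch)]

print(concat_then_normalize == normalize_then_concat)
print([hex(ord(c)) for c in nfc_of_qa])
print(len(trivia))
```

When appending to an already-normalized buffer, renormalize across the seam rather than assuming the result is still normalized.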

Maximum Expansion (U4.1)
Operation | UTF | Factor | Sample
NFC | 8 | 3X | 𝅘𝅥𝅮 U+1D160
NFC | 16, 32 | 3X | שּׁ U+FB2C
NFD | 8 | 3X | ΐ U+0390
NFD | 16, 32 | 4X | ᾂ U+1F82
NFKC / NFKD | 8 | 11X | ﷺ U+FDFA
NFKC / NFKD | 16, 32 | 18X | ﷺ U+FDFA
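The expansion factors for the sample characters can be recomputed with a few lines of Python. This sketch measures only the listed samples in the Unicode version bundled with the interpreter; it does not search for the maximum. The function name `expansion` is illustrative.

```python
import unicodedata

def expansion(ch: str, form: str, encoding: str) -> float:
    """Ratio of encoded length after normalization to encoded length before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before

# (sample, normalization form, encoding); the "-le" codecs avoid a BOM.
cases = [
    ("\U0001D160", "NFC", "utf-8"),
    ("\ufb2c", "NFC", "utf-16-le"),
    ("\u0390", "NFD", "utf-8"),
    ("\u1f82", "NFD", "utf-32-le"),
    ("\ufdfa", "NFKC", "utf-8"),
    ("\ufdfa", "NFKD", "utf-16-le"),
]

for ch, form, encoding in cases:
    print(f"U+{ord(ch):04X} {form} {encoding}: {expansion(ch, form, encoding)}X")
```

Factors like these are what you need when sizing an output buffer for normalization.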

Case Conversion
Not a simple 1:1 mapping. Title case: dz ↔ DZ ↔ Dz. Expansion: heiß → HEISS → heiss. Context-dependent: ΌΣΟΣ → όσος. Language-dependent: istanbul ↔ İSTANBUL. Warning: never use language-dependent casing for language-independent structures, like file-system B-trees.
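Several of these cases can be observed with Python's built-in string methods, which apply the language-independent Unicode case mappings (including the final-sigma rule). The sketch assumes the title-case example refers to the single-character digraphs U+01C4..U+01C6; Python has no built-in Turkish casing, so the last lines show the default behavior only.

```python
# Expansion: one character becomes two, and the round trip is lossy.
upper_eszett = "hei\u00df".upper()        # "heiß" -> "HEISS"
round_trip = upper_eszett.lower()          # "heiss", not "heiß"

# Context-dependent: capital sigma lowercases to final sigma at a word end.
greek_lower = "\u038c\u03a3\u039f\u03a3".lower()   # "ΌΣΟΣ"

# Title case is a third case form for digraph characters.
dz_small = "\u01c6"                        # LATIN SMALL LETTER DZ WITH CARON
dz_title = dz_small.title()
dz_upper = dz_small.upper()

# Language-dependent: the default mapping is not the Turkish one.
default_upper = "istanbul".upper()         # dotless capital I, not İ
dotted_capital_lower = "\u0130".lower()    # "i" + U+0307 COMBINING DOT ABOVE

print(upper_eszett, round_trip, greek_lower)
print(hex(ord(dz_title)), hex(ord(dz_upper)), default_upper)
```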

Casing: Maximum Expansion
Operation | UTF | Factor | Sample
Lower | 8 | 1.5X | Ⱥ U+023A
Lower | 16, 32 | 1X | A U+0041
Upper / Title / Fold | 8, 16, 32 | 3X | ΐ U+0390

Case Conversion II
Case folding was not stable: toCaseFold(S) could give different results between two versions of Unicode. Stability is now guaranteed as of Unicode 5.0. Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category: these were constrained to be in a partition. Use the separate binary properties Lowercase and Uppercase instead.

Lowercase / Uppercase: Form vs Function
Lowercase, the binary property: the character is lowercase in form, but not necessarily in function. Functionally lowercase: isCased(x) & isLowercase(x). See Section 3.13 of TUS.

Lowercase: Form vs Function
LC | F. LC | Ll | Count | Examples (U4.1)
Y | N | N | 114 | ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y | N | Y | 705 | ª U+00AA FEMININE ORDINAL INDICATOR
Y | Y | N | 43 | ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y | Y | Y | 903 | a U+0061 LATIN SMALL LETTER A
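The gap between General_Category=Ll and functional lowercase can be sketched in Python. The `is_functionally_lowercase` helper below is a simplified approximation of the isCased / isLowercase definitions (it skips the NFD step those definitions use), and both helper names are hypothetical.

```python
import unicodedata

def is_ll(ch: str) -> bool:
    """General_Category is Lowercase_Letter."""
    return unicodedata.category(ch) == "Ll"

def is_functionally_lowercase(ch: str) -> bool:
    """Approximation: some case mapping changes ch, but lowercasing does not."""
    is_cased = ch.lower() != ch or ch.upper() != ch or ch.title() != ch
    return is_cased and ch.lower() == ch

small_a = "a"
roman_one = "\u2170"        # SMALL ROMAN NUMERAL ONE, General_Category=Nl
modifier_gamma = "\u02e0"   # MODIFIER LETTER SMALL GAMMA, General_Category=Lm

for ch in (small_a, roman_one, modifier_gamma):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_ll(ch), is_functionally_lowercase(ch))
```

The small roman numeral uppercases to U+2160, so it behaves as a lowercase character even though a test for Ll misses it.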

Segmentation
What a user thinks of as a character is often a sequence. Words are not just sequences of letters. Lines don't just break at spaces. All may be language-dependent. ► http://www.unicode.org/reports/tr14/ ► http://www.unicode.org/reports/tr29/
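To make the first point concrete, here is a deliberately simplified Python sketch that groups each base character with the combining marks that follow it. It is an assumption-laden approximation, not the segmentation algorithm of UAX #29: it ignores Hangul syllables, joiner sequences, regional indicators, and language tailoring. The function name is hypothetical.

```python
import unicodedata

def approximate_clusters(text: str) -> list:
    """Group each code point with following marks (General_Category M*).

    Rough approximation only; real grapheme cluster boundaries follow UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "x\u0308e\u0301\u0327z"   # x + diaeresis, e + acute + cedilla, z
clusters = approximate_clusters(word)

print(len(word), "code points,", len(clusters), "approximate clusters")
```

Cursor movement, deletion, and truncation should work on units like these rather than on individual code points.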

Transliteration
Transliteration (Ελληνικά ↔ Ellēniká) ≠ translation (Ελληνικά ↔ Greek). Transliteration may vary by language: Путин ↔ Putin, Poutine, ...; Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow, ... Watch for terminology: "lossy" vs "lossless". Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα. In ISO terms: "transliteration" = lossless transliteration; "transcription" = lossy transliteration. ► http://unicode.org/draft/reports/tr35/tr35.html

Rendering is Contextual
Processing character-by-character gives the wrong results! Glyphs may change shape. Multiple characters → 1 glyph. One character → multiple glyphs.

Rendering II
Good rendering systems will handle customary type-in order for text plus canonical order. Excellent ones will handle any canonically equivalent order, but those are rare. There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished. Security issues: never render a missing glyph as "?". Don't simply overlay diacritics: it can cause security problems. ► http://www.unicode.org/notes/tn2/ ► http://unicode.org/reports/tr14/

Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability). Unicode provides the basis for software globalization, but there's more work to be done... Use globalization APIs: formatting and parsing of dates, times, numbers, currencies; comparison of text; calendar systems; ... are locale-dependent. Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java). Don't put any translatable strings into your code; separate them into resource files. Provide context to translators: is "Mark" a noun, a verb, or a name? Don't use the same string in different contexts unless the meaning is identical (including references). Note: user-interface language (menus, dialogs, help system, ...) ≠ data language (body text, spreadsheet cells). Programs need to handle, as data, more languages than the localized UI supports.

Common Globalization Mistakes
Never compile Windows apps as "ANSI" (the default!). Don't simply concatenate strings to make messages: the order of components differs by language, so use Java MessageFormat, or structure the UI as separate fields. Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet. Allocate space flexibly: "OK" in English → "Aceptar" in Spanish. English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs). Beware of discrepancies in "fallback" behavior: Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP, ... ► http://unicode.org/cldr/ ► http://ibm.com/software/globalization/icu/
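The point about concatenation can be sketched with named placeholders, here in Python rather than Java MessageFormat. The `TEMPLATES` table stands in for a resource file and is entirely hypothetical; the German string is an illustrative rendering, not a professional translation, and the sketch ignores plural forms.

```python
# Hypothetical resource table: one full-sentence template per language.
# Translators are free to reorder the named placeholders.
TEMPLATES = {
    "en": "{folder} contains {count} files.",
    "de": "Es gibt {count} Dateien in {folder}.",
}

def render(language: str, **values) -> str:
    """Fill a whole-sentence template instead of concatenating fragments."""
    return TEMPLATES[language].format(**values)

# The concatenation anti-pattern hard-codes English word order:
def concatenated(folder: str, count: int) -> str:
    return folder + " contains " + str(count) + " files."

print(render("en", folder="Docs", count=3))
print(render("de", folder="Docs", count=3))
```

With concatenation the translator only ever sees the fragments " contains " and " files.", which cannot be reordered.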

Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as "close" to the user as possible.
Type | Example | Rec. Standard
Language/Locale* | en-US (en_US) | RFC 3066 bis / CLDR
Territory | AU | RFC 3066 bis
Currency | EUR | ISO 4217
Timezone | Australia/Melbourne | TZDB
Calendar | islamic-civil | CLDR Calendar ID
Custom Date | yyyy-mmm-dd | CLDR Pattern Format
Binary Time | 8C80E9E3967A4B0 | Windows File Time

Identification
Locale IDs are extensions of language IDs; use CLDR. ► http://unicode.org/cldr/ Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217): <RUR, 1.23457×10³> ↔ 1 234,57р. in Russian, but Rub 1,234.57 in English. Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names. ► http://www.twinsun.com/tz/tz-link.htm If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value.
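The currency example can be sketched in Python: the stored value is a neutral pair of ISO 4217 code and number, and formatting happens only at display time. The `DISPLAY` table is a hand-written stand-in for locale data (a real application would take these rules from CLDR, for example via ICU), and every name in it is hypothetical.

```python
from decimal import Decimal

# Neutral stored form: explicit currency ID plus a plain number.
stored = ("RUR", Decimal("1234.57"))

# Hypothetical display rules for two locales; not real CLDR data.
DISPLAY = {
    "ru": {"group": " ", "decimal": ",", "pattern": "{number}{symbol}",
           "symbols": {"RUR": "\u0440."}},
    "en": {"group": ",", "decimal": ".", "pattern": "{symbol} {number}",
           "symbols": {"RUR": "Rub"}},
}

def format_money(currency: str, amount: Decimal, locale: str) -> str:
    """Format a neutral (currency, amount) pair for one display locale."""
    rules = DISPLAY[locale]
    raw = f"{amount:,.2f}"  # e.g. "1,234.57" with neutral separators
    # Swap both separators in a single pass so they cannot clobber each other.
    number = raw.translate(str.maketrans({",": rules["group"],
                                          ".": rules["decimal"]}))
    return rules["pattern"].format(number=number,
                                   symbol=rules["symbols"][currency])

print(format_money(*stored, locale="ru"))
print(format_money(*stored, locale="en"))
```

The currency stays RUR in both outputs; only the presentation follows the display locale.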

Unicode Guide
Authoritative but lightweight. Introduction, overview, and quick reference. Main principles of the Unicode Standard. Best practices in software globalization.

Other Resources
Unicode Site: http://unicode.org
An Overview of ICU: http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software: http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization: http://www.w3.org/International/
Microsoft Global Software Development: http://www.microsoft.com/globaldev/default.asp

Q&A

Backup Slides

User Input
If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean, ... If you are using "type-ahead" to get to a position in a list (e.g. typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields. If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.

Dotted and Dotless I
[Diagram: uppercase and lowercase mappings, normal vs. Turkic, among I (U+0049), i (U+0069), İ (U+0130), ı (U+0131), I + ˙ (U+0049 U+0307), i + ˙ (U+0069 U+0307), and İ + ˙ (U+0130 U+0307).]
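As a rough sketch of the Turkic mappings, the Python below layers the dotted/dotless rules on top of the default case mappings. The function names are hypothetical and the handling is simplified (it covers only the four I characters and the I + U+0307 sequence); production code should use a library with locale-aware casing such as ICU.

```python
def turkic_lower(text: str) -> str:
    """Lowercase with Turkic rules: I -> ı (U+0131), İ (U+0130) -> i."""
    return (text.replace("I\u0307", "i")   # I + combining dot above -> i
                .replace("\u0130", "i")    # İ -> i
                .replace("I", "\u0131")    # I -> ı
                .lower())                  # default mappings for the rest

def turkic_upper(text: str) -> str:
    """Uppercase with Turkic rules: i -> İ (U+0130), ı (U+0131) -> I."""
    return (text.replace("i", "\u0130")
                .replace("\u0131", "I")
                .upper())

# Default (language-independent) mappings, for comparison.
default_lower = "ISTANBUL".lower()
default_upper = "istanbul".upper()

print(turkic_upper("istanbul"), default_upper)
print(turkic_lower("ISTANBUL"), default_lower)
```

Because the two rule sets disagree on plain ASCII letters, keys in language-independent structures must never be case-mapped with the Turkic variant.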

Java
In MessageFormat, watch for words like can't, since the ASCII apostrophe (') has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t. In Date and Calendar, the months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1. Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently. Java globalization support is pretty outdated: use ICU to supplement it. Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details.
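The two differences in Java's serialized text (often called "modified UTF-8") can be sketched in Python: U+0000 is written as the two bytes C0 80, and a supplementary code point is written as two three-byte sequences, one per UTF-16 surrogate. The function below is an illustrative re-implementation of the byte payload only (it omits the length prefix Java writes), and its name is hypothetical.

```python
def java_modified_utf8(text: str) -> bytes:
    """Encode `text` the way Java's modified UTF-8 encodes the payload."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                 # U+0000 as a two-byte form
        elif cp > 0xFFFF:
            v = cp - 0x10000
            # Encode each UTF-16 surrogate separately, three bytes apiece.
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)

deseret = "\U00010400"   # a supplementary code point
print(java_modified_utf8("A\x00").hex())
print(java_modified_utf8(deseret).hex(), deseret.encode("utf-8").hex())
```

A strict UTF-8 decoder will reject both forms, so convert at the boundary instead of passing the serialized bytes along as UTF-8.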

JavaScript
Always encode characters above U+007F with escapes (\uxxxx). There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented. The JDK tool native2ascii can be used to convert the files to use escapes.
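A minimal sketch of that conversion, written in Python rather than using native2ascii: every character above U+007F becomes a \uxxxx escape, and supplementary characters become a surrogate pair of escapes, since \uxxxx holds one UTF-16 code unit. The function name is hypothetical, and the sketch escapes blindly, so it is only appropriate where \u escapes are valid (string literals and identifiers).

```python
def escape_non_ascii(source: str) -> str:
    """Replace every character above U+007F with \\uxxxx escapes."""
    out = []
    for ch in source:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            # Supplementary code point: emit its two UTF-16 surrogates.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)

print(escape_non_ascii('var s = "caf\u00e9";'))
print(escape_non_ascii("\U0001F600"))
```

The escaped file is pure ASCII, so it survives whatever charset the server or browser assumes.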