1 Character Conversions and Mapping Tables Presented By: Markus Scherer George Rhoten Raghuram (Ram) Viswanadha.

Presentation transcript:

1 Character Conversions and Mapping Tables Presented By: Markus Scherer George Rhoten Raghuram (Ram) Viswanadha Globalization Center of Competency, Cupertino, CA

2 Agenda
Introduction
Terminology & Concepts
Problems
Solutions & Tools
Summary

3 Introduction
Text data used to be contained on a single computer system
Now text data is exchanged among different systems
Each type of system used different ways to encode text
Exchanging this text data requires a conversion
Text data is increasingly machine processed
Main emphasis on Unicode text processing

4 Terminology
System
Character set
Code point
Encoding/Charset
Character mapping
Alias

5 Concept of Character Mapping A Á A UnicodeCharacter Set fallback roundtrip

6 Character Mapping (continued) VV UnicodeCharacter Set reverse fallback roundtrip
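
The roundtrip, fallback, and reverse fallback mappings named on the two slides above can be sketched with a toy table. Everything here is an assumption made for illustration: the byte values, the table layout, and the function names are hypothetical and do not come from ICU or from any real mapping table. The sketch only shows why a fallback loses information while a roundtrip mapping does not.

```python
# Hypothetical mapping table for a tiny single-byte character set.
# Each entry is (Unicode code point, charset byte, kind of mapping).
ROUNDTRIP = "roundtrip"                # valid in both directions
FALLBACK = "fallback"                  # Unicode -> charset only (lossy)
REVERSE_FALLBACK = "reverse fallback"  # charset -> Unicode only

MAPPING_TABLE = [
    (0x0041, 0x41, ROUNDTRIP),         # A <-> A
    (0x00C1, 0x41, FALLBACK),          # A-acute -> A, accent is lost
    (0x0056, 0x56, ROUNDTRIP),         # V <-> V
    (0x0056, 0xB6, REVERSE_FALLBACK),  # made-up second byte that also reads as V
]


def build_maps(table):
    """Split the table into strict, fallback, and to-Unicode lookups."""
    strict, fallbacks, to_uni = {}, {}, {}
    for code_point, byte, kind in table:
        if kind == ROUNDTRIP:
            strict[code_point] = byte
            to_uni[byte] = code_point
        elif kind == FALLBACK:
            fallbacks[code_point] = byte
        elif kind == REVERSE_FALLBACK:
            to_uni[byte] = code_point
    return strict, fallbacks, to_uni


STRICT, FALLBACKS, TO_UNICODE = build_maps(MAPPING_TABLE)


def from_unicode(text, allow_fallback=False):
    """Convert text to charset bytes; fallbacks are used only on request."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in STRICT:
            out.append(STRICT[cp])
        elif allow_fallback and cp in FALLBACKS:
            out.append(FALLBACKS[cp])
        else:
            raise ValueError(f"U+{cp:04X} has no mapping")
    return bytes(out)


def to_unicode(data):
    """Convert charset bytes back to text (roundtrip and reverse fallback)."""
    return "".join(chr(TO_UNICODE[b]) for b in data)


if __name__ == "__main__":
    lossy = from_unicode("\u00c1", allow_fallback=True)
    print(lossy, to_unicode(lossy))  # the accent does not come back
```

With fallbacks disabled, the unmappable character raises an error instead of being silently replaced, which is the kind of control over fallbacks that the later solutions slide recommends.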

7 Doing a Conversion Unicode ISO Repertoires: superset/subset

8 Doing a Conversion (continued) Unicode IBM SJIS Sun SJIS 99.8% Same
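
A rough way to see the "mostly the same" effect is to compare two Shift-JIS variants byte pair by byte pair. The sketch below is an assumption-laden stand-in: it compares Python's built-in `shift_jis` and `cp932` codecs, not the IBM and Sun tables from the slide, and the lead and trail byte ranges are a simplification. The helper names are hypothetical.

```python
def decode_or_none(data, codec):
    """Decode strictly; return None when the codec has no mapping."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def compare_codecs(codec_a, codec_b):
    """Count double-byte sequences that both codecs decode, same vs. different."""
    # Simplified Shift-JIS lead byte ranges (assumption for this sketch).
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    same = 0
    diffs = []
    for lead in lead_bytes:
        for trail in range(0x40, 0xFD):
            data = bytes([lead, trail])
            a = decode_or_none(data, codec_a)
            b = decode_or_none(data, codec_b)
            if a is None or b is None:
                continue  # compare only sequences defined in both codecs
            if a == b:
                same += 1
            else:
                diffs.append((data, a, b))
    return same, diffs


if __name__ == "__main__":
    same, diffs = compare_codecs("shift_jis", "cp932")
    print(f"identical: {same}, different: {len(diffs)}")
    for data, a, b in diffs:
        print(data.hex(), f"U+{ord(a):04X}", f"U+{ord(b):04X}")
```

The few sequences that differ are exactly the ones that cause trouble when two systems use the same charset name for slightly different tables.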

9 Text Data Exchange Problems
Unable to read text from another system
Unable to write correct text for other processes
Loss of text data because of mistakes
– Maybe partial loss of data due to rare and obscure details
– Happens more often to multibyte and stateful encodings
New Unicode character added and mapping changes
– Character was mapped to PUA
– Character is now mapped to a new Unicode character

10 Problems (continued)
Support of different repertoires of characters
Different text encoding models
– Different bidi text models
    Visual order
    Logical order
    Explicit embedding
– Composed and decomposed characters
– Shaping (Arabic)
– Reordering (Indic, Thai, etc.)
– Different ligatures

11 Examples
µ μ    Micro symbol (U+00B5) vs. Greek Mu (U+03BC)
- –    Hyphen-Minus (U+002D) vs. En Dash (U+2013)
\ ¥    Backslash (U+005C) vs. Yen symbol (U+00A5)
~ ¯    Tilde (U+007E) vs. Overline (U+00AF)
NUL    Graphical display of control characters
NL LF  Newline swapped with Linefeed
ISO Control rotation 0x1C 0x1A 0x7F
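
The first pair in the examples above can be checked directly. This is a minimal sketch that uses Python's standard `unicodedata` module and built-in codecs as stand-ins for two different systems; the helper names are hypothetical. It shows that the two look-alike characters are distinct code points and that a given charset may accept only one of them.

```python
import unicodedata

MICRO = "\u00b5"  # micro symbol
MU = "\u03bc"     # Greek small letter mu


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def can_encode(ch, codec):
    """Report whether the codec has a strict mapping for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in (MICRO, MU):
        print(describe(ch))
        for codec in ("latin-1", "iso8859_7"):
            print(f"  {codec}: {can_encode(ch, codec)}")
```

A converter that silently substitutes one of these characters for the other produces text that looks right on screen but no longer compares equal to the original.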

12 Reasons For Problems
Different mapping tables
Fallback supported inconsistently
Mapping tables were not shared
Mapping tables were not published in machine-readable format
Aliases
– Existing registries (IANA, MIME, …) do not specify precise mappings
– Different mapping tables for the same name (CP943, SJIS)
– Different names for the same character set

13 Solutions
Use precise names
Use precise mapping tables
Avoid fallbacks (controllable e.g. with ICU)
Share the character set mappings
– e.g. format:

14 Solutions (continued)
Do safe conversions
– Exact subsets and supersets
– Use precise replacements for unavailable characters (NCRs and escapes)
– Algorithmic
    JIS X 0208: SJIS, EUC-JP, ISO 2022-JP
    All Unicode encodings among each other
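
Two of the safe conversions above can be illustrated with Python's standard codecs. This is a sketch under stated assumptions, not the presenters' tooling: the function names are hypothetical, and ASCII stands in for any target charset that lacks the characters. Numeric character references (NCRs) and escapes keep the original code points recoverable, while a generic substitution character does not. Conversions among the Unicode encodings are algorithmic and lossless.

```python
import html


def to_ncr(text):
    """Encode to ASCII, writing unavailable characters as NCRs (&#NNN;)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def from_ncr(data):
    """Recover the original text from ASCII bytes that contain NCRs."""
    return html.unescape(data.decode("ascii"))


def to_escapes(text):
    """Encode to ASCII, writing unavailable characters as backslash escapes."""
    return text.encode("ascii", errors="backslashreplace")


def to_lossy(text):
    """Encode to ASCII with '?' substitution; the originals are not recoverable."""
    return text.encode("ascii", errors="replace")


def unicode_roundtrip(text, encoding):
    """Convert through a Unicode encoding and back."""
    return text.encode(encoding).decode(encoding)


if __name__ == "__main__":
    sample = "Gr\u00fc\u00dfe, \u4e16\u754c"
    print(to_ncr(sample))
    print(to_escapes(sample))
    print(to_lossy(sample))
    print(from_ncr(to_ncr(sample)) == sample)
    for enc in ("utf-8", "utf-16", "utf-32"):
        print(enc, unicode_roundtrip(sample, enc) == sample)
```

Note that NCRs are only safe when the receiving side is known to interpret them, for example in HTML or XML content.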

15 Tools
ICU (International Components for Unicode)
– Feature-rich converter API
– Allows matching the conversion behavior of most other systems
–
Unicode mapping table repositories
–
–
iconv() and other platform converters

16 Summary
Text data exchange can result in loss of data
Using Unicode is safe without a conversion
Conversion mapping tables are unsafe
Use ICU
Thank you for listening
Are there any questions?