San Jose, California, September 2002 Compact Encodings of Unicode Markus W. Scherer Unicode/G11N Software Engineer IBM Globalization Center of Competency.

Presentation transcript:

San Jose, California, September 2002, 22nd International Unicode Conference

Slide 2: Agenda
Encodings in files and protocols
  – Not: Processing encoding forms
Unicode “is too big”
  – Issues and non-issues
How to reduce size of Unicode text
  – Choice of encoding
  – Optional compression
Examples and comparisons

Slide 3: What is ICU?
Internationalization libraries for C, C++, Java*
  – Open source – non-viral
  – Sponsored by IBM
Unicode standard compliant
  – Full supplementary support
Cross-platform; extensible and customizable
High performance and thread-safe
  – Multiple locales in same thread – simultaneously
Converters for all Unicode charsets & hundreds of legacy codepages
* Sun’s Java licenses an earlier ICU version; ICU4J updates it.

Slide 4: Encodings of Unicode
Common Unicode character set
External encodings
  – Files and protocols
  – Almost always byte-serialized
  – Character Encoding Schemes/charsets
Processing encodings
  – Character Encoding Forms, often 16/32-bit
  – Different requirements
  – Topic for different presentation…

Slide 5: Unicode “is too big”?
Perceived large size of Unicode text
  – Compared with legacy codepages
Size matters
  – Low-speed connections (dial-up, mobile)
  – Little memory (PDA, cell phone, embedded)
Size does not matter when…
  – Images & other binaries swamp text size
  – High-speed network
  – Temporary documents
  – Large amounts of memory

Slide 6: How big is it?
Size depends on language/script
Bytes/char for some language groups:

Languages        Legacy   UTF-8   UTF-16
Western/Latin    1        1       2
Russian/Arabic   1        2       2
Hindi/Thai       1        3       2
CJK              2        3       2
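The UTF-8 and UTF-16 columns of this table can be checked with a few lines of Python. This is a minimal sketch: the sample words are illustrative assumptions rather than the data behind the slide, and the legacy column is left out because it depends on which codepage is chosen.

```python
# Sample words are assumptions chosen for illustration; each stays within one script.
samples = {
    "Western/Latin": "Unicode",
    "Russian": "Юникод",
    "Hindi": "यूनिकोड",
    "CJK": "統一碼",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


# "utf-16-be" is used so that no byte order mark is counted.
table = {
    name: (bytes_per_char(text, "utf-8"), bytes_per_char(text, "utf-16-be"))
    for name, text in samples.items()
}

for name, (utf8, utf16) in table.items():
    print(f"{name:14} UTF-8: {utf8:.2f}  UTF-16: {utf16:.2f}")
```

Real text mixes in spaces, digits and punctuation, so measured averages for non-Latin scripts in UTF-8 usually land somewhat below the whole numbers in the table.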

Slide 7: Legacy codepages
Compact because
  – Designed for single/few languages
  – Few characters compared with Unicode
Conversion problems
  – Fallback/substitution of unmappable chars
  – Mapping table differences
  – Loss of parts of text common
Large number/size of mapping tables

Slide 8: Reduce Unicode text size
Choice of encoding
  – Encodings designed for different purposes
  – Compactness vs. direct applicability vs. software support etc.
General-purpose compression
  – Best on top of compact encoding
  – Not available in all applications

Slide 9: UTF-8/16
Designed for processing but all-purpose
UTF-8:
  – Byte-based, ASCII-compatible
  – BMP: up to 3 bytes/char
UTF-16 (BE/LE):
  – Byte-serialization of 16-bit form, not ASCII-compatible
  – BE/LE forms or Byte Order Mark
  – BMP: always 2 bytes/char
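The byte-level differences between these forms are easy to see on a two-character string. The sketch below uses Python's built-in codecs; the sample string is an assumption chosen to contain one ASCII and one non-ASCII BMP character.

```python
text = "A€"  # U+0041 (ASCII) and U+20AC (BMP, outside ASCII)

utf8 = text.encode("utf-8")          # ASCII stays one byte; the euro sign takes three
utf16_be = text.encode("utf-16-be")  # big-endian, no byte order mark
utf16_le = text.encode("utf-16-le")  # little-endian, no byte order mark
utf16_bom = text.encode("utf-16")    # Python prepends a BOM in the platform's byte order

for label, data in [
    ("UTF-8", utf8),
    ("UTF-16BE", utf16_be),
    ("UTF-16LE", utf16_le),
    ("UTF-16 + BOM", utf16_bom),
]:
    print(f"{label:13} {data.hex(' ')}")
```

Note that the letter A is a single byte 41 only in UTF-8; in both UTF-16 serializations it is accompanied by a zero byte, which is why UTF-16 is not ASCII-compatible.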

Slide 10: UTF-7
7-bit encoding designed for email
  – Obsolete: email is now 8-bit-safe
Partially ASCII-compatible
BMP: 2.67 bytes/char plus overhead
  – Base64-encoded UTF-16BE
Stateful
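The 2.67 bytes/char figure follows from base64 arithmetic: each BMP character is 16 bits of UTF-16BE, and each base64 digit carries 6 bits, so 16 / 6 ≈ 2.67. The sketch below checks this with Python's built-in "utf-7" codec. The helper name `utf7_run_length` and the sample word are assumptions for illustration, and the helper counts both the opening "+" and a closing "-", which an encoder may omit in some positions.

```python
import math


def utf7_run_length(n_chars):
    """Estimated bytes for a run of n non-ASCII BMP characters in UTF-7.

    Base64 digits for the UTF-16BE code units, plus the '+' that opens
    the run and the '-' that closes it (hypothetical helper).
    """
    return math.ceil(n_chars * 16 / 6) + 2


text = "Юникод"  # six Cyrillic letters, no ASCII
encoded = text.encode("utf-7")

print(encoded)
print(len(encoded), "bytes; estimate:", utf7_run_length(len(text)))
print("bytes per char before overhead:", round(16 / 6, 2))
```

The shift overhead is paid once per run, so text that alternates between ASCII and non-ASCII characters pays it repeatedly.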

Slide 11: SCSU & BOCU-1
About as compact as legacy codepages
  – 1 byte/char for small scripts, 2 for CJK; stateful
  – Compress short strings better than LZW (zip) etc.
SCSU:
  – Limited ASCII compatibility (initial state)
  – Complex state, many encoding choices
  – Non-deterministic; arbitrary byte values
  – Established encoding, supported in
      various tools & editors (SC UniPad)
      ICU
      Symbian OS (cell phones/PDAs)

Slide 12: BOCU-1
BOCU-1:
  – Delta-encoding; avoids control codes
  – MIME text-compatible but not ASCII
  – Deterministic
  – Preserves binary order (for sorting, databases)
  – New encoding; supported by ICU
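The idea behind delta-encoding can be illustrated without implementing BOCU-1 itself. The sketch below only computes differences between consecutive code points to show why text in a small script yields small numbers that fit in one byte each. It is a simplification under stated assumptions: the function names and the starting value 0x40 are chosen for this example, and the real encoding does more, including adjusting the reference value and mapping each difference to byte sequences that avoid control codes.

```python
def code_point_deltas(text, start=0x40):
    """Differences between consecutive code points (simplified; not real BOCU-1)."""
    deltas = []
    prev = start
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = cp
    return deltas


def from_deltas(deltas, start=0x40):
    """Inverse of code_point_deltas: rebuild the text from the differences."""
    chars = []
    prev = start
    for d in deltas:
        prev += d
        chars.append(chr(prev))
    return "".join(chars)


russian = "Юникод"
deltas = code_point_deltas(russian)
print(deltas)  # one large jump into the Cyrillic block, then small steps
```

Only the first character needs a large value; every later letter sits a few positions away from its predecessor, which is the property that lets a delta encoding approach 1 byte/char for small scripts.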

Slide 13: SCSU & BOCU-1 text sizes
Average bytes/char relative to UTF-8

Languages        SCSU    BOCU-1
English/French   100% (one figure given)
Russian/Arabic   55%     60%
Hindi            40% (one figure given)
Thai             40%     45%
Japanese         55%     60%
Korean           85%     75%
Chinese          70%     75%

Slide 14: Encoding vs. compression
For example: BOCU-1 with WinZip (sum of seven files)

                UTF-8         BOCU-1
Uncompressed    32024 bytes   20723 bytes
Compressed      11659 bytes   10722 bytes
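The figures on this slide can be turned into relative savings with simple arithmetic. The sketch below uses only the four byte counts from the slide; the variable and function names are assumptions for illustration.

```python
# Byte counts copied from the slide (sum of seven files).
sizes = {
    "UTF-8": {"uncompressed": 32024, "compressed": 11659},
    "BOCU-1": {"uncompressed": 20723, "compressed": 10722},
}


def saving(smaller, larger):
    """Fraction by which `smaller` undercuts `larger`."""
    return 1 - smaller / larger


# Effect of the encoding choice alone, and of the encoding choice after zipping.
encoding_saving = saving(sizes["BOCU-1"]["uncompressed"], sizes["UTF-8"]["uncompressed"])
combined_saving = saving(sizes["BOCU-1"]["compressed"], sizes["UTF-8"]["compressed"])

# Effect of zipping each encoding.
zip_on_utf8 = saving(sizes["UTF-8"]["compressed"], sizes["UTF-8"]["uncompressed"])
zip_on_bocu1 = saving(sizes["BOCU-1"]["compressed"], sizes["BOCU-1"]["uncompressed"])

print(f"BOCU-1 vs UTF-8, uncompressed: {encoding_saving:.0%} smaller")
print(f"BOCU-1 vs UTF-8, both zipped:  {combined_saving:.0%} smaller")
print(f"zip applied to UTF-8:  {zip_on_utf8:.0%} smaller")
print(f"zip applied to BOCU-1: {zip_on_bocu1:.0%} smaller")
```

Read this way, the compact encoding gives a large saving on its own and a much smaller additional saving once general-purpose compression is applied on top.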

Slide 15: Performance
Converter performance
  – Roundtrip to/from UTF-16 with ICU:
      SCSU: 45%..125% of UTF-8 roundtrip time
      BOCU-1: 40%..160% of UTF-8 roundtrip time
Depends on encoding ratio
  – Fast for small scripts, 1 byte/char
Separate compression adds to I/O time
Conversion time typically swamped by
  – Transmission (low-bandwidth connections)
      Shorter texts transmit faster!
  – Parsing/processing

Slide 16: Further considerations
In-document encoding declarations require ASCII readability (XML, HTML)
Protocol may limit byte values (SMTP)
  – TES (transfer encoding syntax) required for some encodings
      base64 for SCSU or UTF-16 in emails
      Increases text size
Compression removes ASCII readability and uses arbitrary byte values
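The size cost of a base64 transfer encoding can be measured directly: every 3 input bytes become 4 ASCII characters, an increase of about one third. The sketch below wraps UTF-16BE bytes with Python's standard `base64` module; the sample text is an assumption, and line breaks that a mail system would add are not counted.

```python
import base64

text = "Юникод" * 10                 # 60 Cyrillic letters (illustrative sample)
raw = text.encode("utf-16-be")       # 2 bytes per BMP character
wrapped = base64.b64encode(raw)      # 7-bit-safe form of the same bytes

print(len(raw), "bytes before base64")
print(len(wrapped), "bytes after base64")
print("growth factor:", round(len(wrapped) / len(raw), 2))
```

The wrapped form contains only ASCII bytes, which satisfies a protocol that restricts byte values, but the text is no longer readable and is larger than the unwrapped encoding.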

Slide 17: Conclusion
UTF-8 and/or UTF-16 work in most cases
Size of text often not critical
When small text size needed:
  – Use SCSU or BOCU-1
  – Consider compression
  – Make sure receiver can handle it

Slide 18: References
Forms of Unicode:
Character Encoding Model: UTR #
SCSU: UTS #6
BOCU-1: nversion/bocu1/bocu1.html
ICU homepage:
Unicode Consortium:
IBM developerWorks: