From UCS-2 to UTF-16
Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16.


Why is this an issue?
The concept of the Unicode standard changed during its first few years. Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M. APIs and libraries need to follow this change and support the full range. Upcoming character assignments (Unicode 3.1, 2001) fall into the added range.

"Unicode is a 16-bit character set"
Original concept: a 16-bit, fixed-width character set that saves space by not including precomposed, rarely-used, obsolete, … characters. Compatibility, transition strategies, and acceptance forced a loosening of these principles. Unicode 3.1: >90k assigned characters.

16-bit APIs
APIs developed for Unicode 1.1 used 16-bit characters and strings (UCS-2), assuming a 1:1 relationship between characters and code units. Examples: Win32, Java, COM, ICU, Qt/KDE. Byte-based UTF-8 (1993) was used mostly for MBCS compatibility and transfer protocols.

Extending the range
Two blocks of 1k 16-bit values each, the surrogates, are set aside for extension. 1k x 1k = 1M additional code points, each encoded as a pair of code units. The 16-bit form is now variable-width: UTF-16. Unicode scalar values: 0..10ffff₁₆. Proposed: 1994; part of Unicode 2.0 (1996).
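The pair arithmetic described on this slide can be sketched in a few lines of C. This is a minimal illustration rather than ICU's code: the function names are hypothetical, and the block start values 0xD800 and 0xDC00 are the standard UTF-16 surrogate ranges, which the slide does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Two blocks of 0x400 (1k) values each: lead and trail surrogates. */
#define LEAD_START          0xD800u
#define TRAIL_START         0xDC00u
#define SUPPLEMENTARY_START 0x10000u

/* Hypothetical helpers: classify a 16-bit code unit. */
int is_lead(uint16_t unit)  { return (unit & 0xFC00u) == LEAD_START; }
int is_trail(uint16_t unit) { return (unit & 0xFC00u) == TRAIL_START; }

/* Split a supplementary code point (0x10000..0x10FFFF) into a pair. */
void to_pair(uint32_t c, uint16_t *lead, uint16_t *trail) {
    assert(c >= SUPPLEMENTARY_START && c <= 0x10FFFFu);
    uint32_t offset = c - SUPPLEMENTARY_START;          /* 20 bits */
    *lead  = (uint16_t)(LEAD_START  + (offset >> 10));    /* upper 10 bits */
    *trail = (uint16_t)(TRAIL_START + (offset & 0x3FFu)); /* lower 10 bits */
}

/* Combine a lead/trail pair back into the scalar value. */
uint32_t from_pair(uint16_t lead, uint16_t trail) {
    assert(is_lead(lead) && is_trail(trail));
    return SUPPLEMENTARY_START
         + (((uint32_t)(lead - LEAD_START) << 10)
            | (uint32_t)(trail - TRAIL_START));
}
```

The 1k x 1k pairs cover exactly the 0x100000 code points from 0x10000 to 0x10FFFF, which is why the scalar range ends at 10ffff₁₆.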

Parallel with ISO 10646
ISO 10646 uses 31-bit codes: UCS-4. UCS-2: 16-bit codes for the subset 0..ffff₁₆. UTF-16: transformation of the subset 0..10ffff₁₆. UTF-8 covers all 31 bits. Private Use areas above 10ffff₁₆ are slated for removal from ISO 10646 for UTF interoperability and synchronization with Unicode.

21-bit code points
Code points (Unicode scalar values) up to 10ffff₁₆ use 21 bits. 16-bit code units are still good for strings: variable-width, like MBCS. The default string unit size is not big enough for code points. Dual types for programming?

C: char/wchar_t dual types
C/C++ standards: dual types. Strings mostly use char units (8 bits). Code points: wchar_t (width is implementation-dependent). Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; the code point type is implementation-dependent.

Unicode: dual types, too?
Strings could continue with 16-bit units. Single code points could get a 32-bit data type. Dual-type model like C/C++ MBCS.
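As a rough illustration of the dual-type model, the sketch below keeps strings in 16-bit units and hands out single code points in a 32-bit type. The type and function names are hypothetical (they are not ICU's), and returning unpaired surrogates unchanged is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;      /* hypothetical string unit type */
typedef uint32_t CodePoint32; /* hypothetical single code point type */

/* Read the code point that starts at s[*i] and advance *i past it.
   Unpaired surrogates are returned as-is (an assumption of this sketch). */
CodePoint32 next_code_point(const Unit16 *s, size_t length, size_t *i) {
    assert(*i < length);
    CodePoint32 c = s[(*i)++];
    if ((c & 0xFC00u) == 0xD800u && *i < length
            && (s[*i] & 0xFC00u) == 0xDC00u) {
        /* Lead followed by trail: combine into one supplementary code point. */
        c = 0x10000u + ((c - 0xD800u) << 10) + (s[(*i)++] - 0xDC00u);
    }
    return c;
}

/* Count code points (not code units) in a UTF-16 string. */
size_t count_code_points(const Unit16 *s, size_t length) {
    size_t i = 0, n = 0;
    while (i < length) {
        next_code_point(s, length, &i);
        ++n;
    }
    return n;
}
```

For the four-unit string 0041 D834 DD1E 4E00, the iteration yields three code points: 0x41, 0x1D11E, and 0x4E00.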

Alternatives to dual 16/32 types
UTF-32: all types 32 bits wide, fixed-width. UTF-8: same complexity after the range extension beyond just the BMP; closer to the C/C++ model (byte-based). Use pairs of 16-bit units. Use strings for everything. Make the string unit size flexible: 8/16/32 bits.

UCS-2 to UTF-32
Fixed-width, single base type for strings and code points. UCS-2 programming assumptions stay mostly intact. Wastes at least 33% of the space, typically 50%. Performance bottleneck between CPU and memory.

UCS-2 to UTF-8
UCS-2 programming assumes that many characters fit in single code units; a move to UTF-8 breaks a lot of code. Same question of a type for code points: follow the C model with a 32-bit wchar_t? A more difficult transition than the other choices.

Surrogate pairs for single chars
The caller avoids the code point calculation. But caller and callee both need to detect and handle pairs: the caller when choosing argument values, the callee when checking for errors. Harder to use with code point constants because they are published as scalar values. Significant change for callers that currently use scalars.

Strings for single chars
Always pass in a string (and offset). Most general: handles graphemes in addition to code points. Harder to use with code point constants because they are published as scalar values. Significant change for callers that currently use scalars.

UTF-flexible
In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF size as a compile-time choice? Adds interoperability with UTF-8/32 APIs. Almost no assumptions possible. The complexity of the transition is even higher than for a transition to pure UTF-8; performance?
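One way to picture the compile-time choice is a macro that selects the string unit type. The sketch below is only an assumption about how such a switch could look: SKETCH_UTF_SIZE, StringUnit, MAX_UNITS_PER_CODE_POINT, and units_for are hypothetical names, not taken from ICU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build-time switch; defaults to the 16-bit form. */
#ifndef SKETCH_UTF_SIZE
#define SKETCH_UTF_SIZE 16
#endif

#if SKETCH_UTF_SIZE == 8
typedef uint8_t StringUnit;
#define MAX_UNITS_PER_CODE_POINT 4
#elif SKETCH_UTF_SIZE == 16
typedef uint16_t StringUnit;
#define MAX_UNITS_PER_CODE_POINT 2
#elif SKETCH_UTF_SIZE == 32
typedef uint32_t StringUnit;
#define MAX_UNITS_PER_CODE_POINT 1
#else
#error "SKETCH_UTF_SIZE must be 8, 16, or 32"
#endif

/* Number of string units needed for code point c in the selected form. */
int units_for(uint32_t c) {
    assert(c <= 0x10FFFFu);
#if SKETCH_UTF_SIZE == 8
    return c < 0x80u ? 1 : c < 0x800u ? 2 : c < 0x10000u ? 3 : 4;
#elif SKETCH_UTF_SIZE == 16
    return c < 0x10000u ? 1 : 2;
#else
    return 1;
#endif
}
```

Code written against StringUnit can assume only that a code point takes between 1 and MAX_UNITS_PER_CODE_POINT units, which illustrates the slide's point that almost no assumptions remain possible.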

Interoperability
Break existing API users no more than necessary. Interoperability with other APIs: Win32, Java, COM, now also XML DOM. UTF-16 is the Unicode default: a good compromise (speed/ease/space). String units should stay 16 bits wide.

Does everything need to change?
String operations (search, substring, concatenation, …) work with any UTF without change. Character property lookup and similar functions need to support the extended range. Formatting should handle more code points or even graphemes. All public APIs need careful evaluation.
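To illustrate why plain string operations survive the transition, here is a naive substring search that compares 16-bit units and never looks at surrogates. The function name is hypothetical. The sketch relies on a property of UTF-16: lead surrogates, trail surrogates, and single-unit characters occupy disjoint value ranges, so a well-formed needle cannot match starting in the middle of a pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the unit index of the first occurrence of needle in hay, or -1.
   Compares code units only; no surrogate handling is needed. */
ptrdiff_t find_units(const uint16_t *hay, size_t hay_len,
                     const uint16_t *needle, size_t needle_len) {
    assert(hay != NULL && needle != NULL);
    if (needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j]) {
            ++j;
        }
        if (j == needle_len) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

The same loop works unchanged on UCS-2 data, which is the sense in which such APIs need no new versions.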

ICU: some of all
Strings: UTF-16; the UChar type remains 16-bit. New UChar32 type for code points. Macros for C to deal with all UTFs: iteration, random access, … C++ CharacterIterator: many new functions. Property lookup/low-level: UChar32. Formatting: strings for graphemes.
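Random access into a UTF-16 string needs one small adjustment that UCS-2 code never made: an arbitrary unit offset may land on the second half of a pair. The sketch below shows that adjustment as a plain function. The name is hypothetical and this is not a reproduction of ICU's macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If offset i points at the trail unit of a surrogate pair, step back to
   the lead unit so that i addresses the start of a code point. */
size_t code_point_start(const uint16_t *s, size_t i) {
    assert(s != NULL);
    if (i > 0
            && (s[i] & 0xFC00u) == 0xDC00u        /* s[i] is a trail */
            && (s[i - 1] & 0xFC00u) == 0xD800u) { /* preceded by a lead */
        --i;
    }
    return i;
}
```

Because a pair is at most two units long, the adjustment looks back at most one unit, so random access stays constant-time.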

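Scalar code points: property lookup
Old, 16-bit: UChar u_tolower(UChar c) { u[v[c(15..7)] + c(6..0)]; }
New, 21-bit: UChar32 u_tolower(UChar32 c) { u[v[w[c(20..10)] + c(9..4)] + c(3..0)]; }
Here c(hi..lo) denotes bits hi through lo of c. The upper ranges 15..7 and 20..10 are missing from the transcript and are inferred from the 16-bit and 21-bit widths.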
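The three-stage lookup can be tried out with toy tables. The sketch below assumes the bit split 20..10 / 9..4 / 3..0 and stores a delta to the lowercase form in the last stage; both the table contents (ASCII only) and the delta encoding are assumptions made for illustration, and toy_tolower is a hypothetical stand-in for the slide's u_tolower.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t UChar32; /* mirrors the slide's type name */

/* Stage 1, indexed by bits 20..10: start of a 64-entry block in v.
   Only w[0] (U+0000..U+03FF) is non-default in this toy data. */
static const uint16_t w[0x110000 >> 10] = { 64 };

/* Stage 2, indexed by bits 9..4: start of a 16-entry block in u.
   Block 0 is the shared default block; block 64 covers U+0000..U+03FF. */
static const uint16_t v[128] = {
    [64 + 4] = 16, /* U+0040..U+004F */
    [64 + 5] = 32  /* U+0050..U+005F */
};

/* Stage 3, indexed by bits 3..0: delta to add to get the lowercase form. */
static const int16_t u[48] = {
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, /* default */
    0, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, /* @, A..O */
   32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,  0,  0,  0,  0,  0  /* P..Z, [.._ */
};

UChar32 toy_tolower(UChar32 c) {
    if (c < 0 || c > 0x10FFFF) {
        return c; /* outside the code point range: leave unchanged */
    }
    unsigned block2 = w[c >> 10];
    unsigned block3 = v[block2 + ((c >> 4) & 0x3F)];
    unsigned index  = block3 + (c & 0xF);
    assert(index < sizeof u / sizeof u[0]);
    return c + u[index];
}
```

All ranges without data share the default blocks, so covering the whole 21-bit range costs little more than the 1088-entry first stage.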

Formatting: grapheme strings
Old: void setDecimalSymbol(UChar c);
New: void setDecimalSymbol(const UnicodeString &s);

Codepage conversion
To Unicode: results are one or two UTF-16 code units; surrogates are stored directly in the conversion table. From Unicode: triple-stage compact array access from 21-bit code points, like property lookup. Single-character conversion to Unicode now returns UChar32 values.

API first…
Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi). Public APIs have been reviewed and changed (the luxury of an early project stage) or deprecated and superseded by new versions. Higher-level implementations are to follow before Unicode 3.1 is published.

More implementations follow…
Collation: need to prepare for >64k primary keys. Normalization and transliteration. Word/sentence break iteration. Etc. No non-BMP data before Unicode 3.1 is stable.

Other libraries
Java: planning stage for the transition. Win32: rendering and Uniscribe API largely UTF-16-ready. Linux: standardizing on a 32-bit Unicode wchar_t; has UTF-8 locales, like other Unixes, for char* APIs. W3C: standards assume the full UTF-16 range.

Summary
The transition from UCS-2 to UTF-16 gains importance four years after UTF-16 became part of the standard. APIs for single characters need changes or new versions. String APIs: no change. Implementations need to handle 21-bit code points. There is a range of options.

Resources
Unicode FAQ; Unicode on IBM developerWorks; ICU