An ICU Overview Mark Davis Chief Globalization Architect, IBM

Presentation transcript:

An ICU Overview
Mark Davis, Chief Globalization Architect, IBM
IBM Globalization Center of Competency
9/20/2018 23rd International Unicode Conference, Atlanta

Agenda
- What is ICU?
- Architecture Overview
- Significant New ICU Features
- Near Future Features
- References
- Q and A

ICU (International Components for Unicode) is a collaborative, open-source development project jointly managed by a group of companies and individual volunteers throughout the world, using the Internet and the Web to communicate, plan, and develop the software and documentation. It is sponsored and used by IBM.

Comprehensive support for the Unicode Standard is the basis for multilingual, single-binary software. ICU uses the most current versions of the standard and provides full support for supplementary characters.

As computing environments become more heterogeneous, software portability becomes more important. ICU lets you produce the same results across all the platforms you support. It offers great flexibility to extend and customize the supplied system services. For more information, see the ICU website.

Why Globalization?

Unicode
- All world languages
- Efficient and effective processing
- Lossless data exchange
- Enables single-binary global software

But… all languages ⇒ large, complex standard
- 1,400 pages + Annexes + additional standards
- 90,000+ characters
- Major update every 3 years
- 70 character properties, many multi-valued
- Affects many processes: display, line-break, regex, …

Locales
- Features vary widely across languages & countries
- Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
- Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
- Globalization / Unicode / Locales
- Mature, widely used set of C/C++ and Java libraries
- Basis for Java 1.1 internationalization – but goes far beyond
- Very portable – identical results on all platforms / programming languages
  - C/C++: 30+ platforms/compilers
  - Java: IBM & Sun JDK
- Full threading model; customizable; modular
- Open source – but not viral
- ICU 3.0: 78 languages; 118 countries; 870 codepages

Who uses ICU?

Products within IBM: PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMA-Q, OTI, Pervasive Computing WECMS, SS&S Websphere Banking Solutions, Tivoli Presentation Services, WBI Adapter/Connect/Modeler and Monitor/Solution Technology Development/WBI-Financial TePI, Websphere Application Server/Studio Workload Simulator/Transcoding Publisher, XML Parser.

Other companies and organizations: Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping, LLC.

ICU Features
- Unicode text handling
- Charset conversions (700+)
- Collation & Searching
- Locales (170+)
- Resource Bundles
- Calendar & Time zones
- Complex-text layout engine
- Unicode Regular Expressions
- Breaks: word, line, …
- Formatting: date & time, messages, numbers & currencies
- Transforms: normalization, casing, transliterations

In addition to basic Unicode Standard conformance, both ICU4J and ICU4C provide the full set of internationalization features listed above.

Notes on C/C++ vs. Java

The ICU C/C++ and Java APIs differ slightly because of the differences between the programming languages. Sometimes feature development in ICU4C leapfrogs ICU4J, or vice versa, by 1-2 releases. Although the JDK already supports codepage conversion natively, ICU4C offers more character conversion features than the JDK. The ICU4C character set conversion features are available to Java users via ICU4JNI.

Since ICU is open source and closely tracks the Unicode Standard, ICU can support changes and additions to the Unicode Standard much more quickly than Java. Java support for Unicode is tied to major releases of the JDK and can lag the Unicode Standard by a year or more.

Architecture Overview 1

Locale Based Services
- Locale is an identifier, not a container
- Keywords for variants: de@collation=phonebook
- Resource inheritance: shared resources

[Diagram: resource inheritance tree. root; Language: en, de, zh; Script: Hant, Hans; Country: US, IE, DE, CH, TW, CN, CN, TW]

Unlike the POSIX locale model, the locale in ICU is just a light-weight identifier. Locale IDs are composed of language, country and variant information, and follow the ISO-639 and ISO-3166 standards for naming. For example, Italian and Italy are designated as it_IT. All the locale-sensitive ICU services use the locale information to determine the language and other locale-specific parameters of their function. Other parts of the library use the locale as an indicator to customize their behavior. The ICU service data uses the locale ID as a key to map to data. The service data can be added, removed or modified without source code changes.

The locale object also defines the concept of a default locale. The default locale is the locale, used by many programs, that regulates the rest of the computer's behavior by default. In ICU it is set to the platform's system locale at startup time.

Just as in Java, the resource management mechanism adopts the resource bundle inheritance model based on locales. This is important because any information that exists in the parent resource bundle is not duplicated in the child resource bundle. The lookup mechanism works as follows: if a specific resource is not found in the requested bundle, ICU tries to locate it in a more general bundle, then in the default locale. If the resource cannot be found in the default bundle either, the "root" bundle is used to locate it. Finally, if it does not exist in "root" either, an error is returned to the requester. The error code indicates whether the fallback mechanism described above was performed or not.
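The fallback order described in these notes can be sketched in a few lines of plain Java. This is only an illustration of the lookup order, not ICU's API: the class `LocaleFallbackSketch`, its `Result` type, and the in-memory maps are all hypothetical, and a missing resource is reported with an exception here where ICU reports an error code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of locale-keyed resource lookup with fallback; not ICU's API. */
final class LocaleFallbackSketch {
    /** A found value plus the ID of the bundle that actually supplied it. */
    static final class Result {
        final String value;
        final String foundIn;
        Result(String value, String foundIn) {
            this.value = value;
            this.foundIn = foundIn;
        }
    }

    // Service data keyed by locale ID: the locale is only an identifier.
    private final Map<String, Map<String, String>> bundles = new HashMap<>();
    private final String defaultLocale;

    LocaleFallbackSketch(String defaultLocale) {
        this.defaultLocale = defaultLocale;
    }

    void put(String locale, String key, String value) {
        bundles.computeIfAbsent(locale, k -> new HashMap<>()).put(key, value);
    }

    /** "de_CH@collation=phonebook" -> [de_CH, de]: drop keywords, then trim one field at a time. */
    static List<String> parents(String localeId) {
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;
        List<String> chain = new ArrayList<>();
        while (!id.isEmpty()) {
            chain.add(id);
            int cut = id.lastIndexOf('_');
            id = cut >= 0 ? id.substring(0, cut) : "";
        }
        return chain;
    }

    /** Searches the requested chain, then the default locale's chain, then "root". */
    Result lookup(String requested, String key) {
        List<String> order = new ArrayList<>(parents(requested));
        order.addAll(parents(defaultLocale));
        order.add("root");
        for (String locale : order) {
            Map<String, String> bundle = bundles.get(locale);
            if (bundle != null && bundle.containsKey(key)) {
                return new Result(bundle.get(key), locale);
            }
        }
        throw new IllegalArgumentException("missing resource: " + key);
    }
}
```

A caller can compare `foundIn` with the requested ID to see whether fallback happened, which plays the role of the informational error code mentioned in the notes.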

Architecture Overview 2

Open and Close Service Model
- Better performance by avoiding setup costs per operation
- Warning: use properly for maximum performance

ICU Threading Model
- Multiple versions in use simultaneously
- Large resources shared in read-only cache

The "open and close" model supports multi-threading. It enables ICU users to use the same kind of service for different locales, either in the same thread or in different threads. For example, a thread can open many collators for different languages, and different threads can use different collators for the same locale simultaneously. Constant data can be shared so that only the current state is allocated for each service object.

When a service object of a previously requested locale is instantiated, a small chunk of memory is used for the state of that service object, with pointers to the same shared, read-only data. Thus, the majority of the memory usage is shared. When a service is closed, its chunk of memory is de-allocated. Other connections that point to the same shared data stay valid.

Thus, the normal mode of operation is to:
1. Open a service with a given locale.
2. Use the service as long as needed. However, do not keep opening and closing a service within a tight loop.
3. Clone a service if it needs to be used in parallel in another thread.
4. Close any clones that you open as well as any instances of the services that you own.
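The open / use / clone pattern above can be illustrated with the JDK's `java.text.Collator`, used here as a stand-in for an ICU collator. This is a minimal sketch under that assumption: the class and method names are hypothetical, and because Java objects are garbage-collected there is no explicit close step as there is in ICU4C.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: open a service once, reuse it, clone it for another thread. */
final class OpenCloneSketch {
    /** Sorts a copy of the input with the given collator. */
    static List<String> sorted(List<String> words, Collator collator) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    /** Returns the result from the calling thread and from a worker thread. */
    static List<List<String>> sortInTwoThreads(List<String> words) {
        Collator main = Collator.getInstance(Locale.ENGLISH); // "open" once, outside any loop
        Collator forWorker = (Collator) main.clone();          // separate state for the other thread

        List<String> fromWorker = new ArrayList<>();
        Thread worker = new Thread(() -> fromWorker.addAll(sorted(words, forWorker)));
        worker.start();
        List<String> fromMain = sorted(words, main);
        try {
            worker.join(); // wait, so the worker's result is visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sorting", e);
        }
        return Arrays.asList(fromMain, fromWorker);
    }
}
```

Each thread works with its own collator object, so neither has to synchronize on the other's state.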

Architecture Overview 3

Data Driven Services
- Customize at build-time or run-time
- Interchange with other platforms; same results on each
- Rule-based: Collation, Word-breaks, Transforms
- Pattern-based: Formats, UnicodeSet
- Table-based: Character Conversion

The ICU services are data driven, so you can customize any ICU service either at build time or at run time. For example, you can add a user-defined collation sequence for en_US to override the system collation behavior for the en_US locale. This can be done at build time, by providing a user package and rolling it in with the package tools, or at run time, by creating a collator instance from the collation sequence. Since ICU is ported to many platforms, you are guaranteed to get the same results from any ICU service on each of them.

The data format of each service varies depending on its functionality. Some are rule based, e.g. collation and break iteration. Some are pattern based, e.g. number/date formatting. The rule-based objects typically contain a larger and more complicated textual description of the service behavior, whereas the pattern-based objects are usually more light-weight. In addition, there are table-based objects such as character set converters.
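As a small illustration of run-time customization, the sketch below builds a collator from a rule string using the JDK's `java.text.RuleBasedCollator`. It is a stand-in for the ICU equivalent, and the wrapper class `RuntimeRulesSketch` and the toy rules are assumptions made for the example.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

/** Hypothetical sketch: a rule-based service configured from data at run time. */
final class RuntimeRulesSketch {
    /** Builds a collator from a rule string supplied at run time. */
    static RuleBasedCollator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad collation rules: " + rules, e);
        }
    }
}
```

For example, `RuntimeRulesSketch.fromRules("< c < b < a")` yields a collator that orders "c" before "b" before "a", with no change to the program code, only to the rule data.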

Architecture Overview – ICU4C

Simple Error Handling
- C++ subset for portability
- Support for multi-threaded environment

Version Management
- Multiple versions at the same time
- Data and library versioning

String Buffer Management
- Preflighting and overflow protection

Misc: Load/Unload ICU

Recent Additions: Runtime-settable memory allocation and mutex functions

There are additional architecture features in ICU4C:

Error handling: To maximize portability, only a subset of the C++ language is used in ICU, so that ICU4C compiles correctly on older C++ compilers and provides a usable C interface. Thus, the C++ exception mechanism is not used in the code or in the Application Programming Interface (API). To communicate errors reliably and support multi-threading, ICU uses an error code parameter mechanism. Every function that can fail takes an error-code parameter by reference. This parameter is always the last parameter listed for the function. ICU enables you to call several functions (that take error codes) successively without having to check the error code after each function. Each function usually must check the error code before doing any other processing, since it is supposed to stop immediately after receiving an error code.

Version management: Starting with ICU 2.0, a version management mechanism is in place so that an application can use multiple versions of ICU on the same computer at the same time. A program can keep using ICU 2.0 collation, for example, while using ICU 2.2 for other services. This allows applications to use older versions of services for exact bit-for-bit fidelity, while linking to newer versions to access new features. In addition, both the ICU data and the libraries are versioned to ease configuration and release management.

String buffer management: In ICU, strings are either terminated with a NUL character (code point 0, U+0000) or their length is specified. In the latter case, it is possible to have one or more NUL characters inside the string. The C APIs are designed so that the user can retrieve preflighting information: typical use cases can use a stack buffer for performance, or allocate additional space to recover from an operation that previously failed because of buffer overflow.
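The error-code and preflighting conventions described above are C idioms; the sketch below only mimics their shape in Java for illustration. Everything in it (`ErrorCodeSketch`, `Status`, `toUpper`) is hypothetical and is not an ICU function. In ICU4C the status type is `UErrorCode` and the overflow value is `U_BUFFER_OVERFLOW_ERROR`.

```java
import java.util.Locale;

/** Hypothetical sketch of an error-code out-parameter combined with preflighting. */
final class ErrorCodeSketch {
    /** Stand-in for a C error code passed by reference. */
    static final class Status {
        static final int OK = 0;
        static final int BUFFER_OVERFLOW = 1;
        int code = OK;
        boolean failed() {
            return code != OK;
        }
    }

    /**
     * Writes the upper-cased form of src into dest and returns the length the
     * full result needs. Does nothing if status already holds an error.
     */
    static int toUpper(String src, char[] dest, Status status) {
        if (status.failed()) {
            return 0; // an earlier call failed: stop immediately
        }
        String result = src.toUpperCase(Locale.ROOT);
        if (result.length() > dest.length) {
            status.code = Status.BUFFER_OVERFLOW; // report, but still return the needed size
            return result.length();
        }
        result.getChars(0, result.length(), dest, 0);
        return result.length();
    }
}
```

A caller would first try a small fixed buffer, and on overflow reset the status, allocate the returned length, and call again. Because every function checks the status on entry, several calls can be chained and the status inspected once at the end.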

ICU4J: Supplement for Java
- Core globalization (no char. conversion, no GUI components)
  - We do supply complex text support for Sun
- Modularized: products may add just needed functionality
- CLDR 1.1 (Common Locale Data Repository)
- Up-to-date globalization: standards-compliant; latest Unicode
  - Supplementary character support (GB 18030, JIS X 213, HKSCS)
  - Full properties – JDK has only a fraction
  - Local calendars (Thailand, Japan, …); ISO dates
  - Currencies, String Search, Int'l Domain Names
  - Transforms: Case, Scripts, Normalization
- Much faster turn-around on bug-fixes, enhancements

Unicode Text Handling
- C: UChar*, null-terminated or with length
- C++: UnicodeString, a full-featured string class
- Java: uses normal JDK String, adds utilities
- All handle supplementary characters
  - Required for GB 18030 and JIS 213 repertoire
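To make "supplementary characters" concrete: in a UTF-16 string such a character occupies two code units, so code that counts `char` values miscounts it. The sketch below uses only standard JDK `String` and `Character` methods; the helper class name is hypothetical.

```java
/** Hypothetical helpers showing code points versus UTF-16 code units. */
final class SupplementarySketch {
    /** Number of Unicode code points, as opposed to UTF-16 code units. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Builds a string from a single code point, BMP or supplementary. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

For U+20000, a CJK ideograph outside the Basic Multilingual Plane, `fromCodePoint(0x20000).length()` is 2 while `codePoints(...)` reports 1.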

Unicode Text Handling 2
- All Unicode 4.0 properties
  - direct API
  - values, names, enumerations
- UnicodeSet
  - fast, compact set operations
  - all properties: [\p{lowercase}-[a-z]] [\p{greek} & \p{uppercase}]
- Non-han properties
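The two UnicodeSet examples on this slide combine properties with set difference and intersection. As a rough stand-in, the sketch below expresses the same two sets with `java.util.regex` character classes, whose syntax differs from UnicodeSet (`&&` for intersection, `[^...]` for complement). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: property-based character sets via JDK regex classes. */
final class PropertySetSketch {
    // Lowercase characters except ASCII a-z; UnicodeSet would write [\p{lowercase}-[a-z]].
    static final Pattern LOWER_NON_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Characters that are both Greek script and uppercase; UnicodeSet: [\p{greek} & \p{uppercase}].
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    /** True if the single code point is a member of the set. */
    static boolean contains(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }
}
```

Unlike a UnicodeSet, a compiled pattern cannot be iterated or combined further as a set; it only answers membership questions, which is enough for this illustration.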

Data: Recent Additions
- Conforms to CLDR 1.1
  - 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones
  - Improved collation for Eastern Europe, Chinese pinyin
- Reduced multiplatform install image size
- Improved XLIFF-ICU conversion tools
- Locale canonicalization spec defined and implemented (C+J)
  - Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
- Precise alias information: when you ask for "SJIS", you can request the precise definition by platform: windows, ibm, solaris, …
- Buffer management automatically handles characters that cross buffers
- Customizations allowed for:
  - illegal sequences
  - undefined characters
- Unicode Text Compression – SCSU, BOCU
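The customization point for illegal sequences and undefined characters can be illustrated with the JDK's `CharsetDecoder`, where the caller chooses what happens to bad input. This is a JDK analogue rather than the ICU converter API, and the wrapper class `ConversionSketch` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: caller-selected handling of bad input during conversion. */
final class ConversionSketch {
    /** Decodes UTF-8, applying one policy to malformed and unmappable input. */
    static String decode(byte[] bytes, CodingErrorAction onError) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onError)
                    .onUnmappableCharacter(onError)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT.
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }
}
```

With the byte 0xFF between "A" and "B", `CodingErrorAction.REPLACE` substitutes U+FFFD, `IGNORE` drops the byte, and `REPORT` turns it into an error.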

Collation and Searching
- Fast international comparison and string search; fully UCA compliant
- Compressed sort keys, optimized string comparison, sublinear string search
  - incremental sortkeys for radix-sort
- Precise binary sortkey stability over time
- Fully data driven
- API / rule customizations
  - strength, normalization, upper vs. lowercase first, ignore punctuation, …
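Two of the ideas on this slide, comparison strength and sort keys, also exist in the JDK's `java.text.Collator`, which is used below as a stand-in for the ICU collator. The helper names are hypothetical, and the JDK does not provide the compressed or incremental sort keys mentioned on the slide.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

/** Hypothetical sketch: collation strength and sort keys with the JDK collator. */
final class StrengthSketch {
    /** An English collator that distinguishes strings only up to the given strength. */
    static Collator english(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    /** Compares through sort keys, which can be computed once and reused across many comparisons. */
    static int compareViaKeys(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return keyA.compareTo(keyB);
    }
}
```

At `Collator.PRIMARY` strength "abc" and "ABC" compare as equal, while at `Collator.TERTIARY` the case difference is significant.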

Collation and Searching: Recent Additions
- Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  - e.g., filenames would sort "ab-2" < "ab-10"
  - without material performance cost, and with reduced sortkey length
- Significantly improved sorting orders for many other languages
- Data in separate tree, for easier modularization and maintenance
- getFunctionalEquivalent API allows for better caching and UI support
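To show what numeric sorting changes, here is a plain-Java comparator that compares runs of ASCII digits by value and everything else character by character. It is an approximation written for this illustration, not ICU's collator option: it ignores locale rules, handles only the digits 0-9, and all names in it are hypothetical.

```java
import java.math.BigInteger;
import java.util.Comparator;

/** Hypothetical sketch: order strings so that digit runs compare by numeric value. */
final class NumericOrderSketch {
    static final Comparator<String> NUMERIC = NumericOrderSketch::compare;

    static int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i);
            char cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                // Compare the two whole digit runs as numbers.
                int endA = endOfDigits(a, i);
                int endB = endOfDigits(b, j);
                int byValue = new BigInteger(a.substring(i, endA))
                        .compareTo(new BigInteger(b.substring(j, endB)));
                if (byValue != 0) {
                    return byValue;
                }
                i = endA;
                j = endB;
            } else {
                if (ca != cb) {
                    return Character.compare(ca, cb);
                }
                i++;
                j++;
            }
        }
        // The string with characters left over sorts after its prefix.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static int endOfDigits(String s, int start) {
        int end = start;
        while (end < s.length() && isDigit(s.charAt(end))) {
            end++;
        }
        return end;
    }
}
```

With this comparator "ab-2" sorts before "ab-10", whereas plain `String.compareTo` puts "ab-10" first because it compares '1' with '2'. Digit runs that differ only in leading zeros, such as "a01" and "a1", compare as equal here.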

Calendar & Time Zones
- International Calendars – Arabic, Buddhist, Hebrew, Japanese
  - Required for correct presentation of dates in some countries
- Olson timezone support, with localizations
- Recent Additions: RFC822 time zone format support in DateFormat (C+J) for compatibility
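For a feel of what non-Gregorian calendars and Olson zone IDs look like in code, the sketch below uses the `java.time` classes that ship with current JDKs (they postdate this talk and are not ICU4J). The JDK covers the Thai Buddhist and Japanese calendars but has no Hebrew calendar; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

/** Hypothetical sketch: the same day seen through different calendars and zones. */
final class CalendarSketch {
    /** Year of the given ISO date in the Thai Buddhist calendar. */
    static int thaiBuddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    /** Year within the current era of the Japanese imperial calendar. */
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    /** RFC 822 style offset (such as +0100) at noon on the given day in an Olson zone. */
    static String rfc822Offset(LocalDate day, String olsonId) {
        ZonedDateTime noon = day.atTime(12, 0).atZone(ZoneId.of(olsonId));
        return DateTimeFormatter.ofPattern("Z").format(noon);
    }
}
```

For 8 September 2004, the Thai Buddhist year is 2547, the Japanese year of era is 16 (Heisei), and the offset for "America/Los_Angeles" is "-0700".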

Formatting
- Date & time: 8 formats per locale
- Messages: completely localizable, plural support
- Numbers & currencies: scientific notation, spelled-out (checks, etc.)
- Full orthogonal currency support, e.g. INR:
  - In Hindi: [example missing in source]
  - In English: Rs. 1,234.57
  - In German: Rs. 1.234,57
- Recent Additions
  - POSIX migration library
  - Allows parsing multiple currencies with one formatter
  - Short and stand-alone month/day names
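"Orthogonal" currency support means the currency is chosen independently of the locale that controls digits and separators. A minimal sketch with the JDK's `NumberFormat` follows; it is a stand-in for the ICU formatter, and the helper class is hypothetical. The currency symbol that appears depends on the locale data in the runtime, so it may not be the "Rs." shown on the slide.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

/** Hypothetical sketch: any currency formatted under any locale's conventions. */
final class CurrencySketch {
    /** Formats the amount in the given ISO 4217 currency using the locale's number conventions. */
    static String format(double amount, String isoCurrency, Locale locale) {
        NumberFormat formatter = NumberFormat.getCurrencyInstance(locale);
        formatter.setCurrency(Currency.getInstance(isoCurrency));
        return formatter.format(amount);
    }
}
```

Formatting 1234.57 as "INR" under `Locale.US` gives digits grouped as 1,234.57, and under `Locale.GERMANY` as 1.234,57; only the separators and symbol placement follow the locale.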

Transforms
- Unicode Normalization
  - Highly optimized for performance
  - Performance utilities: concatenation, detection, comparison
- Casing (upper, lower, title, folding)
- General Transforms
  - Script transliterations
  - Half-width/Full-width, Hex, etc.
  - Chain transforms together, filter source characters
  - Rule-based, customizable at runtime
- IDNA: International Domain Names
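Normalization and IDNA both have direct JDK counterparts, `java.text.Normalizer` and `java.net.IDN`, which are used in the sketch below as stand-ins for the ICU services. The wrapper class and its method names are hypothetical.

```java
import java.net.IDN;
import java.text.Normalizer;

/** Hypothetical sketch: normalization and internationalized domain names with the JDK. */
final class TransformSketch {
    /** Canonical composition (NFC). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Canonical decomposition (NFD). */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** True if the two strings are canonically equivalent. */
    static boolean canonicallyEquivalent(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    /** Converts a Unicode domain name to its ASCII-compatible form. */
    static String toAsciiDomain(String domain) {
        return IDN.toASCII(domain);
    }
}
```

Here "a" followed by U+0301 (combining acute accent) and the single character U+00E1 are canonically equivalent, and a domain such as "bücher.example" becomes "xn--bcher-kva.example".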

Segmentation: word, line & sentence
- Fast state-table implementation
- Customizable
  - Rule-based – customizable at runtime
  - Special customizations, e.g. Thai
- Recent Additions:
  - Greatly improved performance when going backwards (the common case when doing line breaks)
  - Java: the rule syntax has been extended
  - Java: rules can now return information about the types of characters they encountered
  - Java: common compiled (binary) rule format with ICU4C
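Word segmentation can be tried out with the JDK's `java.text.BreakIterator`, which is used below in place of the ICU break iterator. The helper class is hypothetical, and treating a segment as a word when it starts with a letter or digit is a simplification made for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: extracting words with a locale-aware break iterator. */
final class WordBreakSketch {
    /** Returns the segments between word boundaries that start with a letter or digit. */
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment); // skip punctuation and whitespace segments
            }
        }
        return out;
    }
}
```

For the text "Hello, world!" the iterator also reports boundaries around the comma, the space and the exclamation mark; the filter keeps only "Hello" and "world".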

Unicode Regular Expressions
- Full Regex Implementation
  - C only: Java 1.4 has own package (though not as powerful)
- All Unicode 4.0 Properties supported through UnicodeSet
- Good performance, competitive with non-Unicode regex
- Recent Additions: now features a C API, instead of just C++

Complex-text layout engine
- Glyph processing, positioning & adjustment
  - ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
- Support for: Drawing, Caret Display, Hit Testing, Selection Highlighting, Caret Movement, Layout Metrics, Line Break
- ICU 3.0: Canonical Equivalence: a + ´ or á

References
- ICU main site: http://oss.software.ibm.com/icu/
  - Links to Download ICU, User Guide, Technical FAQ, Support, Bug Reports
- Unicode Consortium: http://www.unicode.org
  - Unicode glossary, Unicode character database

Questions and Answers