An ICU Overview
Mark Davis, Chief Globalization Architect, IBM
IBM Globalization Center of Competency
23rd International Unicode Conference, Atlanta, 9/20/2018

Agenda
- What is ICU?
- Architecture Overview
- Significant New ICU Features
- Near Future Features
- References
- Q and A

ICU (International Components for Unicode) is a collaborative, open-source development project jointly managed by a group of companies and individual volunteers throughout the world, using the Internet and the Web to communicate, plan, and develop the software and documentation. It is sponsored and used by IBM.

Comprehensive support for the Unicode Standard is the basis for multilingual, single-binary software. ICU uses the most current versions of the standard, and provides full support for supplementary characters.

As computing environments become more heterogeneous, software portability becomes more important. ICU lets you produce the same results across all the platforms you support. It offers great flexibility to extend and customize the supplied system services. For more information, see the ICU website.

Why Globalization?

Unicode
- All world languages
- Efficient and effective processing
- Lossless data exchange
- Enables single-binary global software
But: all languages ⇒ large, complex standard
- 1,400 pages + Annexes + additional standards
- 90,000+ characters
- Major update every 3 years
- 70 character properties, many multi-valued
- Affects many processes: display, line-break, regex, …

Locales
- Features vary widely across languages & countries
- Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
- Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
- Globalization / Unicode / Locales
- Mature, widely used set of C/C++ and Java libraries
- Basis for Java 1.1 internationalization, but goes far beyond
- Very portable: identical results on all platforms / programming languages
  - C/C++: 30+ platforms/compilers
  - Java: IBM & Sun JDK
- Full threading model; customizable; modular
- Open source, but not viral
- ICU 3.0: 78 languages; 118 countries; 870 codepages

Who uses ICU?

Products within IBM:
PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMA-Q, OTI, Pervasive Computing WECMS, SS&S Websphere Banking Solutions, Tivoli Presentation Services, WBI Adapter/Connect/Modeler and Monitor/Solution Technology Development/WBI-Financial TePI, Websphere Application Server/Studio Workload Simulator/Transcoding Publisher, XML Parser

Other companies and organizations:
Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping, LLC.

ICU Features
- Unicode text handling
- Charset conversions (700+)
- Collation & Searching
- Locales (170+)
- Resource Bundles
- Calendar & Time zones
- Complex-text layout engine
- Unicode Regular Expressions
- Breaks: word, line, …
- Formatting
  - Date & time
  - Messages
  - Numbers & currencies
- Transforms
  - Normalization
  - Casing
  - Transliterations

In addition to basic Unicode Standard conformance, both ICU4J and ICU4C provide the full set of internationalization features listed above.

Notes on C/C++ vs. Java

The ICU C/C++ and Java APIs differ slightly because of the differences between the programming languages. Sometimes feature development in ICU4C leapfrogs ICU4J, or vice versa, by 1-2 releases.

Although the JDK already supports codepage conversion natively, ICU4C offers more character conversion features than the JDK. The ICU4C character set conversion features are available to Java users via ICU4JNI.

Since ICU is open source and closely tracks the Unicode Standard, ICU can support changes and additions to the Unicode Standard much more quickly than Java. Java support for Unicode is tied to major releases of the JDK, and can lag the Unicode Standard by a year or more.

Architecture Overview 1

Locale Based Services
- Locale is an identifier, not a container
- Keywords for variants: de@collation=phonebook
- Resource inheritance: shared resources

[Diagram: resource inheritance tree. Levels: root; Language (en, de, zh); Script (Hant, Hans); Country (US, IE, DE, CH, TW, CN, CN, TW)]

Unlike the POSIX locale model, the locale in ICU is just a light-weight identifier. Locale IDs are composed of language, country and variant information, and follow the ISO-639 and ISO-3166 standards for naming. For example, Italian and Italy are designated as it_IT.

All the locale-sensitive ICU services use the locale information to determine the language and other locale-specific parameters of their function. Other parts of the library use the locale as an indicator to customize their behavior. The ICU service data uses the locale ID as a key to map to data. The service data can be added, removed or modified without source code changes.

The locale object also defines the concept of a default locale: the locale, used by many programs, that regulates the rest of the computer's behavior by default. In ICU it is set to the platform's system locale at startup time.

Just as in Java, the resource management mechanism adopts the resource bundle inheritance model based on locales. This is important because any information that exists in the parent resource bundle is not duplicated in the child resource bundle. The lookup mechanism works as follows: if a specific resource is not found in the requested bundle, ICU tries to locate it in a more general bundle and then in the bundle of the default locale. If the resource cannot be found there either, the "root" bundle is used to locate it. Finally, if it does not exist in "root" either, an error is returned to the requester. The error code indicates whether the fallback mechanism described above was performed or not.

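
The lookup order described in the notes can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not ICU's implementation: the class `LocaleFallbackSketch`, its in-memory `BUNDLES` table and the sample strings are all hypothetical, and a `null` return stands in for the error that ICU reports when even "root" lacks the resource.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocaleFallbackSketch {
    // Hypothetical in-memory bundles keyed by locale ID (real ICU data lives in resource files).
    static final Map<String, Map<String, String>> BUNDLES = new HashMap<>();
    static {
        bundle("root").put("greeting", "Hello");
        bundle("root").put("farewell", "Goodbye");
        bundle("en").put("farewell", "Bye");
        bundle("de").put("greeting", "Hallo");
        bundle("de_CH").put("greeting", "Hoi");
    }

    static Map<String, String> bundle(String id) {
        return BUNDLES.computeIfAbsent(id, k -> new HashMap<>());
    }

    // Appends id and each more general parent ("de_CH", then "de") to the chain.
    static void addWithParents(List<String> chain, String id) {
        while (id != null && !id.isEmpty()) {
            if (!chain.contains(id)) chain.add(id);
            int cut = id.lastIndexOf('_');
            id = cut < 0 ? null : id.substring(0, cut);
        }
    }

    // Search order from the notes: requested locale, more general bundles,
    // default locale, and finally "root".
    static List<String> searchOrder(String requested, String defaultLocale) {
        List<String> chain = new ArrayList<>();
        addWithParents(chain, requested);
        addWithParents(chain, defaultLocale);
        chain.add("root");
        return chain;
    }

    // Returns "<supplying bundle>=<value>", or null when no bundle has the key.
    static String lookup(String key, String requested, String defaultLocale) {
        for (String id : searchOrder(requested, defaultLocale)) {
            Map<String, String> b = BUNDLES.get(id);
            if (b != null && b.containsKey(key)) return id + "=" + b.get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("greeting", "de_CH", "en_US"));
        System.out.println(lookup("farewell", "de_DE", "en_US"));
        System.out.println(lookup("farewell", "de_DE", "fr_FR"));
    }
}
```

Returning the ID of the bundle that supplied the value mirrors the point in the notes that the caller can tell whether a fallback took place.
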
Architecture Overview 2

Open and Close Service Model
- Better performance by avoiding setup costs per operation
- Warning: use properly for maximum performance

ICU Threading Model
- Multiple versions in use simultaneously
- Large resources shared in read-only cache

The "open and close" model supports multi-threading. It enables ICU users to use the same kind of service for different locales, either in the same thread or in different threads. For example, a thread can open many collators for different languages, and different threads can use different collators for the same locale simultaneously.

Constant data can be shared so that only the current state is allocated for each service object. When a service object for a previously requested locale is instantiated, a small chunk of memory is used for the state of that service object, with pointers to the same shared, read-only data. Thus, the majority of the memory usage is shared. When a service is closed, its chunk of memory is de-allocated. Other connections that point to the same shared data stay valid.

Thus, the normal mode of operation is to:
1. Open a service with a given locale.
2. Use the service as long as needed. However, do not keep opening and closing a service within a tight loop.
3. Clone a service if it needs to be used in parallel in another thread.
4. Close any clones that you open, as well as any instances of the services that you own.

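
As a rough Java illustration of the open, use, clone pattern, the sketch below uses the JDK's `java.text.Collator` as a stand-in for an ICU collator (an assumption made so the example runs without ICU4J). The class and method names are hypothetical. Java has no explicit close step; the garbage collector reclaims the clones.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class OpenOnceSketch {
    // Sorts every input list on its own thread. The collator is opened once,
    // outside the loop, and each thread works with its own clone.
    static List<List<String>> sortInParallel(List<List<String>> inputs) {
        Collator prototype = Collator.getInstance(Locale.ENGLISH); // "open" once
        List<List<String>> results = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (List<String> input : inputs) {
            List<String> copy = new ArrayList<>(input);
            results.add(copy);
            Collator own = (Collator) prototype.clone(); // one clone per thread
            Thread t = new Thread(() -> copy.sort(own));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // wait until the thread's sort is complete and visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while sorting", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(sortInParallel(Arrays.asList(
                Arrays.asList("cherry", "Apple", "banana"),
                Arrays.asList("b", "C", "a"))));
    }
}
```
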
Architecture Overview 3

Data Driven Services
- Customize at build-time or run-time
- Interchange with other platforms; same results on each
- Rule-based: Collation, Word-breaks, Transforms
- Pattern-based: Formats, UnicodeSet
- Table-based: Character Conversion

The ICU services are data driven, so you can customize any ICU service either at build-time or at run-time. For example, you can add a user-defined collation sequence for en_US to override the system collation behavior for the en_US locale. You can do this at build-time by providing a user package and running it through the packaging tools, or at run-time by creating a collator instance from a collation sequence that you specify.

Since ICU is ported to many platforms, any ICU service gives the same results on every platform it has been ported to.

The data format of each service varies depending on its functionality. Some are rule based, e.g. collation and break iteration. Some are pattern based, e.g. number/date formatting. The rule-based objects typically contain a larger and more complicated textual description of the service behavior, whereas the pattern-based objects are usually more light-weight. In addition, there are table-based objects such as character set converters.

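
To make the run-time customization concrete, here is a small sketch that builds a collator from a rule string. It assumes the JDK's `java.text.RuleBasedCollator` as a stand-in for the ICU class of the same name; the deliberately odd rule `< c < b < a`, the class name and the helper method are illustrative only.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RuntimeRulesSketch {
    // Builds a collator from a rule string supplied at run-time and sorts with it.
    static List<String> sortWithRules(String rules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // The rule string orders c before b before a, the reverse of the usual order.
        System.out.println(sortWithRules("< c < b < a", Arrays.asList("a", "b", "c")));
    }
}
```

Because the ordering lives in the rule string rather than in code, changing the sort order needs no source change, which is the point of the data-driven design described above.
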
Architecture Overview: ICU4C
- Simple Error Handling
  - C++ subset for portability
  - Support for multi-threaded environment
- Version Management
  - Multiple versions at the same time
  - Data and library versioning
- String Buffer Management
  - Preflighting and overflow protection
- Misc: Load/Unload ICU
- Recent Additions: Runtime-settable memory allocation and mutex functions

There are additional architecture features in ICU4C:

Error handling: To maximize portability, ICU uses only a subset of the C++ language, so that ICU4C compiles correctly on older C++ compilers and can provide a usable C interface. Thus, the C++ exception mechanism is not used in the code or in the Application Programming Interface (API). To communicate errors reliably and support multi-threading, ICU uses an error code parameter mechanism. Every function that can fail takes an error-code parameter by reference. This parameter is always the last parameter listed for the function. ICU enables you to call several functions (that take error codes) successively without having to check the error code after each function. Each function usually checks the incoming error code before doing any other processing, since it is supposed to stop immediately when it receives an error code.

Version management: Starting with ICU 2.0, a version management mechanism is in place so that an application can use multiple versions of ICU on the same computer at the same time. A program can keep using ICU 2.0 collation, for example, while using ICU 2.2 for other services. This allows applications to use older versions of services for exact bit-for-bit fidelity, while linking to newer versions to access new features. In addition, both the ICU data and the libraries are versioned to ease configuration and release management.

String buffer management: In ICU, strings are either terminated with a NUL character (code point 0, U+0000) or their length is specified. In the latter case, it is possible to have one or more NUL characters inside the string. The C APIs are designed so that the user can retrieve preflighting information: typical use cases can use a stack buffer for performance, and can allocate additional space to recover from an operation that previously failed because of buffer overflow.

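
The preflighting idea is a C idiom, but its control flow can be modeled in Java for readers following the Java examples. The sketch below is only a model under stated assumptions: `toUpper` is a hypothetical function, not an ICU API, that always reports the length the full result needs and writes only when the destination is large enough. The caller tries a small fixed buffer first and retries with the reported size on overflow.

```java
import java.util.Locale;

final class PreflightSketch {
    // Hypothetical model of a C-style preflighting call: returns the length
    // that the complete result needs; writes only if dest can hold all of it.
    static int toUpper(char[] dest, String src) {
        String result = src.toUpperCase(Locale.ROOT);
        if (dest != null && dest.length >= result.length()) {
            result.getChars(0, result.length(), dest, 0);
        }
        return result.length();
    }

    static String convert(String src) {
        char[] buffer = new char[8];          // small fixed buffer, like a stack buffer in C
        int needed = toUpper(buffer, src);
        if (needed > buffer.length) {         // overflow: allocate the reported size and retry
            buffer = new char[needed];
            toUpper(buffer, src);
        }
        return new String(buffer, 0, needed);
    }

    public static void main(String[] args) {
        System.out.println(toUpper(null, "stra\u00DFe")); // pure preflight: length only
        System.out.println(convert("icu"));
        System.out.println(convert("a longer string that overflows"));
    }
}
```

The common case costs one call and no heap allocation for the buffer, and the overflow case costs exactly one retry because the first call already reported the required size.
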
ICU4J: Supplement for Java
- Core globalization (no char. conversion, no GUI components)
- We do supply complex text support for Sun
- Modularized: products may add just needed functionality
- CLDR 1.1 (Common Locale Data Repository)
- Up-to-date globalization: standards-compliant; latest Unicode
- Supplementary character support (GB 18030, JIS X 0213, HKSCS)
- Full properties: the JDK has only a fraction
- Local calendars (Thailand, Japan, …); ISO dates
- Currencies, String Search, Int'l Domain Names
- Transforms: Case, Scripts, Normalization
- Much faster turn-around on bug-fixes, enhancements

Unicode Text Handling
- C: UChar*, null-terminated or with length
- C++: UnicodeString, a full featured string class
- Java: uses the normal JDK String, adds utilities
- All handle supplementary characters
  - Required for GB 18030 and JIS X 0213 repertoire

Unicode Text Handling 2
- All Unicode 4.0 properties
  - direct API
  - values, names, enumerations
- UnicodeSet
  - fast, compact set operations
  - all properties:
    [\p{lowercase}-[a-z]]
    [\p{greek} & \p{uppercase}]
- Non-han properties

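
For readers without ICU at hand, the two set expressions on the slide can be approximated with JDK regular-expression character classes. This is a sketch under the assumption that `java.util.regex` is an acceptable stand-in; it is not the UnicodeSet API, the syntax differs (`&&` for intersection, `&&[^...]` for difference), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {
    // JDK regex approximation of [\p{lowercase}-[a-z]]: lowercase minus ASCII a-z.
    static final Pattern LOWER_NOT_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");
    // JDK regex approximation of [\p{greek} & \p{uppercase}]: Greek script and uppercase.
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // Tests whether a single code point belongs to the set.
    static boolean contains(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(LOWER_NOT_ASCII, 0x00E9)); // e with acute
        System.out.println(contains(LOWER_NOT_ASCII, 'a'));
        System.out.println(contains(GREEK_UPPER, 0x03A9));     // Greek capital omega
        System.out.println(contains(GREEK_UPPER, 0x03C9));     // Greek small omega
    }
}
```
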
Data: Recent Additions
- Conforms to CLDR 1.1
  - 50% more data than CLDR 1.0: adds many translated terms for languages, scripts, countries, currencies, and time zones
  - Improved collation for Eastern Europe, Chinese pinyin
- Reduced multiplatform install image size
- Improved XLIFF-ICU conversion tools
- Locale canonicalization spec defined and implemented (C+J)
  - Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
- Precise alias information: when you ask for "SJIS", you can request the precise definition by platform: windows, ibm, solaris, …
- Buffer management automatically handles characters that cross buffers
- Customizations allowed for:
  - illegal sequences
  - undefined characters
- Unicode Text Compression: SCSU, BOCU

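
The two converter behaviors named above, characters that straddle buffer boundaries and a configurable reaction to illegal input, can be demonstrated with the JDK's `java.nio.charset.CharsetDecoder`. This is a sketch that assumes the JDK decoder as a stand-in for an ICU converter; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ChunkedDecodeSketch {
    // Decodes UTF-8 bytes that arrive in chunks of chunkSize bytes. A character
    // split across two chunks stays in the input buffer until its remaining
    // bytes arrive; illegal sequences are replaced with U+FFFD.
    static String decodeInChunks(byte[] bytes, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);   // room for a leftover partial character
        CharBuffer out = CharBuffer.allocate(bytes.length + 1); // UTF-8 never yields more chars than bytes
        int pos = 0;
        while (pos < bytes.length) {
            int n = Math.min(chunkSize, bytes.length - pos);
            in.put(bytes, pos, n);
            pos += n;
            in.flip();
            decoder.decode(in, out, false); // more input may follow
            in.compact();                   // keep any incomplete trailing bytes
        }
        in.flip();
        decoder.decode(in, out, true);      // signal end of input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00E9llo \u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeInChunks(utf8, 1).equals("h\u00E9llo \u20AC"));
    }
}
```

Switching `CodingErrorAction.REPLACE` to `REPORT` or `IGNORE` changes how illegal input is handled without touching the decoding loop, which corresponds to the customization point on the slide.
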
Collation and Searching
- Fast international comparison and string search; fully UCA compliant
- Compressed sort keys, optimized string comparison, sublinear string search
  - incremental sortkeys for radix-sort
- Precise binary sortkey stability over time
- Fully data driven
- API / rule customizations
  - strength, normalization, upper vs. lowercase first, ignore punctuation, …

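
Of the customizations listed, strength is the easiest to show briefly. The sketch below assumes the JDK's `java.text.Collator` as a stand-in for the ICU collator; the helper name `english` is hypothetical. At primary strength only base letters count, secondary adds accents, and tertiary adds case.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthSketch {
    // Returns an English collator configured with the given comparison strength.
    static Collator english(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Collator primary = english(Collator.PRIMARY);
        Collator secondary = english(Collator.SECONDARY);
        Collator tertiary = english(Collator.TERTIARY);
        // Case difference: ignored until tertiary strength.
        System.out.println(primary.compare("resume", "RESUME") == 0);
        System.out.println(tertiary.compare("resume", "RESUME") != 0);
        // Accent difference: ignored at primary, significant at secondary.
        System.out.println(primary.compare("resume", "r\u00E9sum\u00E9") == 0);
        System.out.println(secondary.compare("resume", "r\u00E9sum\u00E9") != 0);
    }
}
```
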
Collation and Searching: Recent Additions
- Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  - e.g., filenames sort "ab-2" < "ab-10"
  - no material performance cost, and reduced sortkey length
- Significantly improved sorting orders for many other languages
- Data in separate tree, for easier modularization and maintenance
- getFunctionalEquivalent API allows for better caching and UI support

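
The effect of numeric sorting can be illustrated with a hand-written comparator. This is not how ICU implements the feature (ICU handles it inside the collator and its sort keys); it is a plain-Java sketch with a hypothetical class name that treats runs of ASCII digits as numbers and compares everything else by code unit.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NumericOrderSketch {
    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Compares two strings, treating each run of ASCII digits as one number.
    static int compareNumeric(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isAsciiDigit(ca) && isAsciiDigit(cb)) {
                int si = i, sj = j;
                while (i < a.length() && isAsciiDigit(a.charAt(i))) i++;
                while (j < b.length() && isAsciiDigit(b.charAt(j))) j++;
                int c = new BigInteger(a.substring(si, i)).compareTo(new BigInteger(b.substring(sj, j)));
                if (c != 0) return c;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        int rest = Integer.compare(a.length() - i, b.length() - j); // shorter string first
        return rest != 0 ? rest : a.compareTo(b);                   // tie-break, e.g. "ab-02" vs "ab-2"
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("ab-10", "ab-2", "ab-1"));
        names.sort(NumericOrderSketch::compareNumeric);
        System.out.println(names);
        System.out.println("ab-2".compareTo("ab-10") > 0); // plain comparison puts ab-10 first
    }
}
```
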
Calendar & Time Zones
- International Calendars: Arabic, Buddhist, Hebrew, Japanese
  - Required for correct presentation of dates in some countries
- Olson timezone support, with localizations
- Recent Additions: RFC822 time zone format support in DateFormat (C+J) for compatibility

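
A short Java sketch shows what these features look like in practice. It assumes the JDK's `java.time` classes as a stand-in for the ICU calendar and time zone services; the class and method names are hypothetical. It converts an ISO date to Buddhist and Japanese calendar years and formats an Olson zone's offset in the RFC 822 style (for example `-0400`).

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;
import java.util.Locale;

final class CalendarSketch {
    // Year of the same day in the Thai Buddhist calendar.
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era for the same day.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    // UTC offset at noon on the given date in an Olson zone, RFC 822 style.
    static String rfc822Offset(LocalDate date, String olsonId) {
        ZonedDateTime noon = date.atTime(12, 0).atZone(ZoneId.of(olsonId));
        return noon.format(DateTimeFormatter.ofPattern("Z", Locale.ROOT));
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2004, 9, 1);
        System.out.println(buddhistYear(day));
        System.out.println(japaneseYearOfEra(day));
        System.out.println(rfc822Offset(day, "America/New_York"));
    }
}
```
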
Formatting
- Date & time: 8 formats per locale
- Messages
  - Completely localizable, Plural support
- Numbers & currencies
  - Scientific Notation, Spelled-out (checks, etc.)
- Full Orthogonal Currency support
  - INR in Hindi:
  - INR in English: Rs. 1,234.57
  - INR in German: Rs. 1.234,57
- Recent Additions
  - POSIX migration library
  - Allows parsing multiple currencies with one formatter
  - Short and stand-alone month/day names

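
The locale-dependent number layout and the plural-aware messages listed above can be tried with the JDK's `java.text` classes, used here as a stand-in for the ICU formatters (an assumption made so the example runs without ICU4J). The class name, method names and message text are illustrative.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

final class FormattingSketch {
    // Formats the same value with the grouping and decimal separators of the locale.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Picks the wording that matches the count (a simple form of plural support).
    static String fileCount(int n) {
        MessageFormat message = new MessageFormat(
                "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.",
                Locale.ENGLISH);
        return message.format(new Object[] { n });
    }

    public static void main(String[] args) {
        System.out.println(number(1234.57, Locale.US));
        System.out.println(number(1234.57, Locale.GERMANY));
        System.out.println(fileCount(0));
        System.out.println(fileCount(1));
        System.out.println(fileCount(3));
    }
}
```

Because the whole sentence, including the plural variants, is one pattern string, a translator can reorder or reword it without any code change.
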
Transforms
- Unicode Normalization
  - Highly optimized for performance
  - performance utilities: concatenation, detection, comparison
- Casing (upper, lower, title, folding)
- General Transforms
  - Script transliterations
  - Half-width/Full-width, Hex, etc.
  - Chain transforms together, filter source characters
  - Rule-based, customizable at runtime
- IDNA: International Domain Names

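
Normalization and locale-sensitive casing are easy to try with JDK classes, used here as a stand-in for the ICU transforms (an assumption; ICU adds further utilities such as normalized concatenation). The class and helper names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Two strings are canonically equivalent when their NFD forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    // Uppercases with the casing rules of the given language tag.
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // a + combining acute accent
        String composed = "\u00E1";    // precomposed a with acute
        System.out.println(nfc(decomposed).equals(composed));
        System.out.println(canonicallyEqual(decomposed, composed));
        System.out.println(upper("i", "tr").equals("\u0130")); // Turkish dotted capital I
        System.out.println(upper("stra\u00DFe", "de"));
    }
}
```
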
Segmentation: word, line & sentence
- Fast state-table implementation
- Customizable
  - Rule-based, customizable at runtime
  - Special customizations, e.g. Thai
- Recent Additions:
  - Greatly improved performance when going backwards (common case when doing line break)
  - Java:
    - The rules syntax has been extended. Rules can now return information about the types of characters they encountered.
    - Common compiled (binary) rule format with ICU4C

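
A word-boundary walk looks roughly like this in Java. The sketch assumes the JDK's `java.text.BreakIterator` as a stand-in for the ICU break iterator (the two share the same iteration model), and the class and method names are hypothetical. Segments that start with a letter or digit are kept as words; spaces and punctuation are skipped.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    // Returns the word segments of the text, skipping spaces and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! How are you?", Locale.ENGLISH));
    }
}
```
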
Unicode Regular Expressions
- Full Regex Implementation
  - ICU4C only: Java 1.4 has its own package (though not as powerful)
- All Unicode 4.0 Properties supported through UnicodeSet
- Good performance, competitive with non-Unicode regex
- Recent Additions
  - Now features a C API, instead of just C++

Complex-text layout engine
- Glyph processing, positioning & adjustment
  - ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
- Support for:
  - Drawing
  - Caret Display
  - Hit Testing
  - Selection Highlighting
  - Caret Movement
  - Layout Metrics
  - Line Break
- ICU 3.0: Canonical Equivalence: a + ´ or á

References
- ICU main site: http://oss.software.ibm.com/icu/
  - Links to Download ICU
  - User Guide, Technical FAQ, Support, Bug Reports
- Unicode Consortium: http://www.unicode.org
  - Unicode glossary, Unicode character database

Questions and Answers