
1 ICU Overview: The Open Source Unicode Library
Markus Scherer, IBM Globalization Center of Competency
29th Internationalization and Unicode Conference
© 2006 IBM Corporation

2 ICU Overview: The Open Source Unicode Library, San Francisco, California, March 2006, 29th Internationalization and Unicode Conference
Agenda
• Background Information
• What is ICU?
• Architecture Overview
  - Significant New ICU Features
• References
• Q and A

3 Why Globalization?

4 Unicode
• Handles all modern world languages
• Efficient and effective processing
• Lossless data exchange
• Enables single-binary global software
• But… all languages ⇒ large, complex standard
  - 1,400 pages + Annexes + additional standards
  - Almost 100,000 characters
  - Major update every 3 years; minor update about once a year
  - 70 character properties, many multi-valued
  - Affects many processes: display, line-break, regular expressions, …

5 Internationalization, Localization & Locales
• Requirements vary widely across languages & countries
  - Sorting
  - Text searching
  - Line breaks
  - Date/time/number/currency formatting
  - Codepage conversion
  - …and so on
• Performance is key
  - It is easy to do the right thing
  - It is hard to do it fast

6 What is ICU?
• International Components for Unicode
• Globalization / Unicode / Locales
• Mature, widely used set of C/C++ and Java libraries
  - Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
• Very portable: identical results on all platforms / programming languages
  - C/C++ (ICU4C): 30+ platforms/compilers
  - Java (ICU4J): IBM & Sun JDK
  - C/C++ with Java (ICU4JNI)
• Full threading model
• Customizable & Modular
• Open source, but non-restrictive

7 Who uses ICU?
• Products Within IBM
  - All 5 major software brands
  - Many other related software applications
  - Used on all IBM operating systems
• Other Companies and Organizations
  - Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo!, and many more

8 ICU Features
• Unicode text handling
• Charset conversions (700+)
• Collation & Searching
• Locales from CLDR (250+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
  - Date & time
  - Messages
  - Numbers & currencies
• Transforms
  - Normalization
  - Casing
  - Transliterations

9 Architecture Overview 1
• Locale Based Services
  - Locale is an identifier, not a container
  - Keywords for variants: de@collation=phonebook
• Resource inheritance: shared resources
[Figure: resource inheritance tree with three levels below root. Language: en, de, zh. Script: Hant, Hans (under zh). Region: US, IE (under en); DE, CH (under de); TW, CN (under the zh scripts).]
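
The resource inheritance described on this slide can be pictured as a lookup chain that runs from the most specific locale ID up to root. The following is only a minimal sketch of that idea, not ICU's actual lookup code: the class and the helper `fallbackChain` are hypothetical names, and the sketch simply sets keywords such as `@collation=phonebook` aside before building the chain.

```java
import java.util.ArrayList;
import java.util.List;

class LocaleFallback {
    // Hypothetical helper: lists the locale IDs consulted, most specific first.
    static List<String> fallbackChain(String localeId) {
        // Assumption for this sketch: keywords after '@' are set aside.
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;

        List<String> chain = new ArrayList<>();
        while (!id.isEmpty()) {
            chain.add(id);
            // Drop the last subtag (region, then script) to reach the parent.
            int cut = id.lastIndexOf('_');
            id = cut >= 0 ? id.substring(0, cut) : "";
        }
        chain.add("root"); // every chain ends at the shared root resources
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("zh_Hant_TW"));
        System.out.println(fallbackChain("de@collation=phonebook"));
    }
}
```

A resource missing from the first bundle in the chain would be taken from the next one, so shared data only has to be stored once.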

10 Architecture Overview 2
• Open and Close Service Model
  - Open a service object, use it many times, close it when done
  - Better performance by avoiding setup costs per operation
• ICU Threading Model
  - Multiple service objects in use simultaneously with same or different attributes
  - Large resources shared in read-only cache
  - Compatible with Java threading model
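
As a rough illustration of "open once, use many times", the sketch below creates one formatter and reuses it for a whole batch. It uses the JDK's `java.text.NumberFormat` as a stand-in so that it runs without ICU; the class and the method `formatAll` are made up for this example.

```java
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ReuseService {
    // Formats every value with a single service object instead of
    // creating (and paying the setup cost for) one object per value.
    static List<String> formatAll(double[] values, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale); // "open" once
        List<String> out = new ArrayList<>();
        for (double v : values) {
            out.add(nf.format(v)); // use many times
        }
        // Java has no explicit close; the object is simply garbage collected.
        return out;
    }

    public static void main(String[] args) {
        System.out.println(formatAll(new double[] {1234.5, 0.25}, Locale.US));
    }
}
```

JDK format objects are generally not synchronized, so this sketch assumes each thread opens its own instance.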

11 Architecture Overview 3
• Data Driven Services
  - Customize at build-time or run-time
  - Interchange with other platforms; same results on each
  - Rule-based Collation, Word-breaks, Transforms
  - Pattern-based Date/Time/Number/Message formatting
  - Table-based Character Conversion

12 Architecture Overview: ICU4C
• Simple Error Handling
  - Thread safe
  - Works in C and C++
• C/C++ subset for portability
• Version Management
  - Multiple versions of ICU4C in the same process memory space
  - Data and library versioning
• String Buffer Management
  - Preflighting and overflow protection
• Flexible
  - Allows Loading and Unloading ICU4C libraries
  - Runtime settable memory allocation and mutex functions

13 Architecture Overview: ICU4J
• Supplement for Java
• Core globalization (no character conversion or regular expressions)
  - We do supply complex text support for Sun
• Modularized: products may add just needed functionality
• Usually drop-in replacement for JDK functionality
  - Changing the import statements is usually all that is needed
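
To make the "change the import statements" point concrete, here is a small sketch written against the JDK's `java.text.Collator` so that it runs without ICU4J on the classpath. The commented-out import names the ICU4J class with the same simple name; the class `DropIn` and the method `sortGerman` are hypothetical.

```java
import java.text.Collator;            // JDK class, used so the sketch runs as-is
// import com.ibm.icu.text.Collator;  // ICU4J counterpart: swap the import to switch
import java.util.Arrays;
import java.util.Locale;

class DropIn {
    // Sorts a copy of the input using German collation rules.
    static String[] sortGerman(String[] words) {
        String[] copy = words.clone();
        Collator collator = Collator.getInstance(Locale.GERMAN);
        Arrays.sort(copy, collator); // a Collator is also a Comparator
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortGerman(new String[] {"Zebra", "apfel", "Äpfel"});
        System.out.println(Arrays.toString(sorted));
    }
}
```

The rest of the code refers only to the simple name `Collator`, which is why switching libraries would normally touch just the import line.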

14 ICU4J: Supplement for Java
• CLDR (Common Locale Data Repository)
  - More fully supported locales than Java
• Up-to-date globalization: standards-compliant; latest Unicode
  - Supplementary characters (GB 18030, JIS X 0213, HKSCS); Java 5 adds handling of supplementary characters
  - Full properties; the JDK has only a fraction
  - Unicode Collation Algorithm
  - Local calendars (Islamic, Japanese, …); more time zone localizations
  - Currencies, String Search, Internationalized Domain Names
  - Transforms: Case, Scripts, Normalization
• Much shorter release cycle and quicker support for the Unicode standard

15 Recent Additions for ICU4J
• ICU4J in Eclipse Rich Client Platform
  - Take advantage of advanced globalization and support for more locales and more recent Unicode
  - Consistent behavior between application code and framework code
  - ICU4J 3.4+ in Eclipse 3.2
  - No-op jar for JDK behavior and low-memory devices
• Improved portability & JRE independence
  - Back-ported to Java 1.3 (via preprocessor)
  - Carries own Time Zone data

16 Unicode Text Handling
• All UTF-16 processing
• C
  - UChar*: null-terminated or with length
• C++
  - UnicodeString: full featured string class
• Java
  - Uses java.lang.String and adds utilities
• All handle supplementary characters
  - Required for GB 18030, HKSCS and JIS X 0213 repertoires
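
Because all processing is UTF-16, a supplementary character occupies two code units (a surrogate pair). The sketch below uses only `java.lang.String` and `Character` methods to show the difference between counting code units and counting code points; the class name and the sample character U+20000 are choices made for this illustration.

```java
class Supplementary {
    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20000 lies outside the BMP, so it is stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + countCodePoints(s));

        // Iterate by code point rather than by char.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
    }
}
```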

17 Abstract Text Access
• Work with non-UTF-16 text
  - UTF-16 text in discontinuous buffers
  - UTF-8/32 APIs (char/wchar_t compatibility)
  - Non-Unicode APIs, SBCS or MBCS
• UText API: Efficient interface
  - Provider implementation converts to UTF-16
  - Fast inline code for text access within blocks
  - Storage-native indexes avoid index translation
  - Writable
• For now only C/C++, used in few services
[Figure: a service (Line Break) reading UTF-8, SBCS, … text through UText]

18 Unicode Character Handling
• All Unicode properties (except Unihan)
  - Direct API: values, names, enumerations
  - UnicodeSet
    Fast, compact set operations (union, intersection, …)
    Pattern-based (both Perl & POSIX syntax for properties): \p{greek} vs. [:greek:]
    All properties: [\p{lowercase}-[a-z]], [\p{greek} & \p{uppercase}]
[Figure: overlapping property sets Alphabetic, Ideographic and Uppercase, with sample characters a, ξ, ँ, ై, A, Ξ, 不, 与]
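
The two set patterns on this slide can be approximated with the JDK's `java.util.regex` character classes, which is what the sketch below does so that it runs without ICU. Note the syntax differs from UnicodeSet: Java writes intersection as `&&` and expresses set difference as intersection with a negated class. The class and the helper `inSet` are hypothetical names.

```java
import java.util.regex.Pattern;

class PropertySets {
    // Roughly [\p{lowercase}-[a-z]]: lowercase characters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Roughly [\p{greek} & \p{uppercase}]: uppercase characters in Greek script.
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // True if the single code point is a member of the set.
    static boolean inSet(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(inSet(GREEK_UPPER, 0x039E));     // Ξ
        System.out.println(inSet(GREEK_UPPER, 0x03BE));     // ξ
        System.out.println(inSet(LOWER_NOT_ASCII, 'a'));
        System.out.println(inSet(LOWER_NOT_ASCII, 0x03BE)); // ξ
    }
}
```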

19 Recent Additions
• Conforms to CLDR 1.3/1.4
  - Adds and confirms many translated terms for languages, scripts, regions, currencies, and time zones
  - Access to more CLDR items
• Support for Unicode interpretation of POSIX properties
• Charset detection API (ICU4J only)
• Load Java resource bundles from ULocale, support zh_Hant etc.
• Better modularization for memory constrained environments (ICU4C only)
• New icupkg tool for data customization
  - Easily update pre-built or installed data packages (.dat files)

20 Character Set Conversion
• Precise alias information:
  - When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, …)
• Buffer management
  - API automatically handles characters that cross buffers
  - Can provide offset mappings between byte buffer and UChar buffer
• Runtime customizations allowed for:
  - Illegal sequences
  - Undefined characters
• Unicode Text Compression: SCSU, BOCU-1
• Consistent conversion results across platforms
• You can use more character sets at runtime or build time
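
The runtime customization for illegal sequences and undefined characters can be illustrated with the JDK's `java.nio.charset` decoder, which exposes a comparable choice between substituting and reporting. This is a sketch using JDK classes only, not the ICU converter API; `decodeLenient` and `isValid` are hypothetical helpers, and UTF-8 is used because every JVM supports it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Conversion {
    // Policy 1: substitute a replacement character for bad input.
    static String decodeLenient(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected with REPLACE
        }
    }

    // Policy 2: report bad input so the caller can reject the data.
    static boolean isValid(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x41, (byte) 0xFF, 0x42}; // 0xFF never occurs in UTF-8
        System.out.println(decodeLenient(bad, StandardCharsets.UTF_8));
        System.out.println(isValid(bad, StandardCharsets.UTF_8));
    }
}
```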

21 Collation: Sorting, Searching and Matching
• Fast international comparison for string search; fully UCA compliant
  - Compressed sort keys, optimized string comparison, sublinear string search
  - Incremental sortkeys used for radix sorting
• Precise binary sortkey stability over time (library versioning)
• Fully data driven
  - Many common rules provided
• Runtime and build time rule customizations
  - Strength, normalization, upper vs. lowercase first, ignore punctuation, numeric, …
  - Only delta from UCA is needed for rule customization
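
The strength and normalization attributes mentioned above decide which differences count in a comparison. The sketch below shows the effect with the JDK's `java.text.Collator`, used here as a stand-in so the code runs without ICU4J; the class name and the helper `equalAt` are hypothetical, and canonical decomposition is switched on explicitly so that accented letters compare as base letter plus accent.

```java
import java.text.Collator;
import java.util.Locale;

class Strength {
    // True if the two strings are considered equal at the given strength.
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY ignores accents and case; SECONDARY ignores case only;
        // TERTIARY distinguishes both.
        System.out.println(equalAt(Collator.PRIMARY, "resume", "Résumé"));
        System.out.println(equalAt(Collator.SECONDARY, "resume", "Resume"));
        System.out.println(equalAt(Collator.SECONDARY, "resume", "résumé"));
        System.out.println(equalAt(Collator.TERTIARY, "resume", "Resume"));
    }
}
```

A search feature would typically pick a weaker strength than a sort does, which is one reason these attributes are adjustable at runtime.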

22 Calendar & Time Zones
• International Calendars: Islamic, Buddhist, Hebrew, Japanese
  - Required for correct presentation of dates in some countries
• Olson timezone support with localizations
• Recent Additions:
  - Many more time zone localizations
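
For readers who want to see what a non-Gregorian calendar and an Olson time zone ID look like in code, here is a sketch based on the JDK's `java.time` package (Java 8 or later), used as a stand-in for the ICU calendar classes. The class and its two helper methods are hypothetical names chosen for this example.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class Calendars {
    // Year of the same day in the Thai Buddhist calendar.
    static int buddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR_OF_ERA);
    }

    // UTC offset in effect at the start (UTC) of the given day,
    // for a zone named by its Olson ID.
    static ZoneOffset offsetOn(String olsonId, LocalDate date) {
        return ZoneId.of(olsonId).getRules()
                .getOffset(date.atStartOfDay(ZoneOffset.UTC).toInstant());
    }

    public static void main(String[] args) {
        System.out.println(buddhistYear(LocalDate.of(2006, 3, 1)));
        // Daylight saving time applies in January but not in July.
        System.out.println(offsetOn("Australia/Melbourne", LocalDate.of(2006, 1, 15)));
        System.out.println(offsetOn("Australia/Melbourne", LocalDate.of(2006, 7, 15)));
    }
}
```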

23 Formatting
• Date & time: 8 formats per locale by default
• Messages
  - Completely localizable, plural support
• Numbers & currencies
  - Scientific Notation, Spelled-out (checks, etc.)
  - Full Orthogonal Currency support (any currency × any language)
    INR in Hindi:   रु१,२३४.५७
    INR in English: Rs. 1,234.57
    INR in German:  Rs. 1.234,57
• Recent Additions
  - Flexible date/time formatting: Select format for desired set of fields
  - List available currencies API
  - Short and stand-alone month/day names
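
"Any currency × any language" means the currency and the locale are chosen independently: the locale controls digits, separators and symbol placement, while the currency is set separately. The sketch below demonstrates this with the JDK's `java.text.NumberFormat` as a stand-in for the ICU formatter; the class and the method `format` are hypothetical, and the exact currency symbol printed depends on the locale data shipped with the Java runtime.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class CurrencyDemo {
    // Formats an amount in the given ISO 4217 currency using the
    // conventions (separators, symbol placement) of the given locale.
    static String format(double amount, String isoCurrency, Locale locale) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(locale);
        nf.setCurrency(Currency.getInstance(isoCurrency));
        return nf.format(amount);
    }

    public static void main(String[] args) {
        // Same amount and currency, different locale conventions.
        System.out.println(format(1234.57, "INR", Locale.US));
        System.out.println(format(1234.57, "INR", Locale.GERMANY));
    }
}
```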

24 Globalization Preferences
• ICU: Convenient container for globalization preference values
• Provides defaults for missing values
• Convenient instantiation of related formatters

Preference          Example                          Standard
Language            en_US (or en-US)                 RFC 3066 (or successor)
Region              AU                               ISO 3166
Currency            EUR                              ISO 4217
Time zone           Australia/Melbourne              TZ DB
Calendar            islamic-civil                    CLDR Calendar ID
Custom Date Format  yyyy-mmm-dd                      CLDR Pattern
VAT                 08.23% (books), 15.73% (food)    App/Country-Specific
…

Exact Composition Depends on System Requirements!

25 Transforms
• Unicode Normalization
  - Highly optimized for performance
  - Performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)
• General Transforms
  - Script transliterations
  - Half-width/Full-width, Hex, etc.
  - Chain transforms together, filter source characters
  - Rule-based, customizable at runtime
• StringPrep: Internationalized Domain Names (IDN), NFS
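
Normalization and locale-sensitive casing are easy to demonstrate with the JDK's `java.text.Normalizer` and `String.toUpperCase(Locale)`, used here as stand-ins so that the sketch runs without ICU. The class and the helper `canonicallyEqual` are hypothetical names.

```java
import java.text.Normalizer;
import java.util.Locale;

class Transforms {
    // Compares two strings after bringing both into NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E1"; // á as one code point
        String decomposed = "a\u0301"; // a followed by combining acute accent

        System.out.println(precomposed.equals(decomposed));           // raw comparison
        System.out.println(canonicallyEqual(precomposed, decomposed)); // after NFC

        // Casing depends on the locale: Turkish maps i to dotted capital İ,
        // and German ß uppercases to two letters.
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr")));
        System.out.println("stra\u00DFe".toUpperCase(Locale.GERMAN));
    }
}
```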

26 Segmentation: word, line & sentence
• Fast state-table implementation
• Customizable
  - Rule-based, customizable at runtime
  - Special customizations, e.g. Thai
• Recent Additions:
  - Uses new UText API
  - ICU4J rule syntax aligned with ICU4C
  - Tracks Unicode Line Break changes
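
Word segmentation is usually consumed through a boundary iterator. The sketch below walks the boundaries reported by the JDK's `java.text.BreakIterator`, which is used here so the code runs without ICU4J, and keeps only the segments that start with a letter or digit. The class and the method `words` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class Words {
    // Returns the word segments of the text, skipping spaces and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; filter those out.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US));
    }
}
```

Line and sentence iterators follow the same pattern through `getLineInstance` and `getSentenceInstance`.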

27 Unicode Regular Expressions
• Full Regex Implementation
  - C/C++ only: Java 1.4 has own package (though not as powerful)
• All Unicode Properties
  - Supported through UnicodeSet
• Good performance
  - Competitive with non-Unicode regex

28 Complex-text layout engine
• Glyph processing, positioning & adjustment
  - Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.
• Support for:
  - Information for drawing
  - Caret Display
  - Hit Testing
  - Selection Highlighting
  - Caret Movement
  - Layout Metrics
  - Line Break
  - Canonical Equivalence: a + ´ or á
• Recent Additions:
  - Tibetan, Sinhala, Indic ZWJ/ZWNJ, Kerning

29 References
• ICU main site:
  - http://www.ibm.com/software/globalization/icu/
  - Links to Download ICU, User Guide, Technical FAQ, Support, Bug Reports, Demonstrations
• ICU support site:
  - http://icu.sourceforge.net/
• Unicode Consortium
  - http://www.unicode.org/
  - Unicode glossary, Unicode character database

30 Questions and Answers

