ICU Overview: The Open-Source Unicode Library, v3.0. 26th Internationalization and Unicode Conference, San José, CA, September 2004. Markus Scherer, ICU Manager.

26th Internationalization and Unicode Conference, San José, CA, September 2004
ICU Overview: The Open-Source Unicode Library, v3.0
Markus Scherer, ICU Manager, IBM Globalization Center of Competency

Agenda
• Background
• What is ICU?
• Architecture Overview
• ICU Features and recent additions
• References
• Q and A

Why Globalization?

Unicode
• All world languages
• Efficient and effective processing
• Lossless data exchange
• Enables single-binary global software
• But… all languages ⇒ large, complex standard
  – 1,400 pages + Annexes + additional standards
  – 90,000+ characters
  – Major update every 3 years
  – 70 character properties, many multi-valued
  – Affects many processes: display, line-break, regex, …

Locales
• Features vary widely across languages & countries
  – Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
  – Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
• Globalization / Unicode / Locales
• Mature, widely used set of C/C++ and Java libraries
  – Basis for Java 1.1 internationalization – but goes far beyond
• Very portable – identical results on all platforms / programming languages
  – C/C++: 30+ platforms/compilers
  – Java: IBM & Sun JDK
• Full threading model; customizable; modular
• Open source – but not viral
• ICU 3.0: 78 languages; 118 countries; 870 codepages

Who uses ICU?
• Products Within IBM
  – PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMA-Q, OTI, Pervasive Computing WECMS, SS&S Websphere Banking Solutions, Tivoli Presentation Services, WBI Adapter/Connect/Modeler and Monitor/Solution Technology Development/WBI-Financial TePI, Websphere Application Server/Studio Workload Simulator/Transcoding Publisher, XML Parser
• Other Companies and Organizations
  – Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping, LLC.

ICU Features
• Unicode text handling
• Charset conversions (870+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
  – Date & time
  – Messages
  – Numbers & currencies
• Transforms
  – Normalization
  – Casing
  – Transliterations

Architecture Overview 1
• Locale Based Services
  – Locale is an identifier, not a container
  – Keywords for variants:
• Resource inheritance: shared resources
  [Figure: locale inheritance tree rooted at "root", with levels Language, Script, Country. Nodes shown: en (US, IE), de (DE, CH), zh (Hant, Hans; TW, CN).]
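The resource inheritance idea on this slide can be sketched as a lookup that walks from the most specific locale ID toward root. This is a minimal illustration in plain Python, not ICU's API: the function names and the bundle contents below are hypothetical, and real lookup rules may include further steps that this sketch omits.

```python
def fallback_chain(locale_id):
    """Return the lookup order for an ID such as 'zh_Hant_TW' (most specific first)."""
    parts = locale_id.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


# Hypothetical resource data: each bundle stores only what differs from its parent.
BUNDLES = {
    "root": {"greeting": "Hello", "color": "color"},
    "en": {},
    "en_IE": {"color": "colour"},
    "de": {"greeting": "Hallo", "color": "Farbe"},
    "de_CH": {"greeting": "Grüezi"},
}


def lookup(locale_id, key):
    """Find key in the first bundle along the fallback chain that defines it."""
    for name in fallback_chain(locale_id):
        bundle = BUNDLES.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)
```

With this data, lookup("de_CH", "color") finds nothing in de_CH and inherits the value from de, which is the sense in which the locale is an identifier into shared data rather than a container of its own.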

Architecture Overview 2
• Open and Close Service Model
  – Better performance by avoiding setup costs per operation
  – Warning: use properly for maximum performance
• ICU Threading Model
  – Multiple versions in use simultaneously
  – Large resources shared in read-only cache

Architecture Overview 3
• Data Driven Services
  – Customize at build-time or run-time
  – Interchange with other platforms; same results on each
  – Rule-based Collation, Word-breaks, Transforms
  – Pattern-based Formats, UnicodeSet
  – Table-based Character Conversion

Architecture Overview – ICU4C
• Simple Error Handling
  – C++ subset for portability
  – Support for multi-threaded environment
• Version Management
  – Multiple versions at the same time
  – Data and library versioning
• String Buffer Management
  – Preflighting and overflow protection
• Misc: Load/Unload ICU
• Recent Additions:
  – Runtime-settable memory allocation and mutex functions

Architecture Overview – ICU4J
• Supplement for Java
• Core globalization (no char. conversion, no GUI components)
  – We do supply complex text support for Sun
• Modularized: products may add just needed functionality

ICU4J vs. JDK
• CLDR 1.1 (Common Locale Data Repository)
• Up-to-date globalization: standards-compliant; latest Unicode
  – Supplementary characters (GB 18030, JIS X 0213, HKSCS)
  – Full properties – JDK has only a fraction
  – Local calendars (Thailand, Japan, …); ISO dates
  – Currencies, String Search, Int'l Domain Names
  – Transforms: Case, Scripts, Normalization
• Much faster turn-around on bug-fixes, enhancements

Unicode Text Handling
• C
  – UChar*: null-terminated or with length
• C++
  – UnicodeString: full featured string class
• Java
  – Uses normal JDK String, adds utilities
• All handle supplementary characters
  – Required for GB 18030/JIS X 0213/HKSCS repertoires
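Supplementary characters matter for string handling because, in a UTF-16 representation, each one occupies two 16-bit code units (a surrogate pair). The sketch below shows this in plain Python rather than with ICU's own types; the helper names are hypothetical and chosen only for illustration.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_lead_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def count_code_points(units):
    """Count code points, treating a lead+trail surrogate pair as one character."""
    count = 0
    i = 0
    while i < len(units):
        if (is_lead_surrogate(units[i]) and i + 1 < len(units)
                and is_trail_surrogate(units[i + 1])):
            i += 2  # one supplementary character, two code units
        else:
            i += 1
        count += 1
    return count


# U+20000 is a supplementary CJK ideograph (outside the 16-bit range).
units = utf16_code_units("a\U00020000b")
```

Here units holds four code units (0x0061, 0xD840, 0xDC00, 0x0062) while count_code_points(units) is 3, so code that indexes by code unit has to avoid splitting the pair.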

Unicode Text Handling 2
• All Unicode 4.0 properties
  – Direct API
      Values, names, enumerations
  – UnicodeSet
      Fast, compact set operations
      Pattern-based (both Perl & POSIX syntax for properties): \p{greek} vs. [:greek:]
      All properties: [\p{lowercase}-[a-z]], [\p{greek} & \p{uppercase}]
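The two patterns on this slide are a set difference and a set intersection over character properties. The following sketch mimics them with ordinary Python sets and the standard unicodedata module. It is an approximation under stated assumptions, not UnicodeSet itself: it covers only the Basic Multilingual Plane, uses the general categories Ll and Lu in place of the Lowercase and Uppercase properties, and approximates the Greek script by character names that start with "GREEK ".

```python
import string
import unicodedata

# Assumption: restrict to the BMP to keep the sketch small and fast.
BMP = [chr(cp) for cp in range(0x10000)]

# Approximations of the properties named on the slide (see lead-in).
lowercase = {c for c in BMP if unicodedata.category(c) == "Ll"}
uppercase = {c for c in BMP if unicodedata.category(c) == "Lu"}
greek = {c for c in BMP if unicodedata.name(c, "").startswith("GREEK ")}

# Analogue of [\p{lowercase}-[a-z]]: lowercase letters other than ASCII a-z.
lower_non_ascii = lowercase - set(string.ascii_lowercase)

# Analogue of [\p{greek} & \p{uppercase}]: uppercase Greek letters.
greek_upper = greek & uppercase
```

A real set implementation would store ranges rather than individual characters, which is what makes the operations compact; the sketch only shows the semantics of the two operators.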

Data: Recent Additions
• Conforms to CLDR 1.1
  – 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones
  – Improved collation for Eastern Europe, Chinese pinyin
• Reduced multiplatform install image size
• Improved XLIFF-ICU conversion tools
• Locale canonicalization spec defined and implemented (C+J)
  – Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
• Precise alias information:
  – When you ask for "SJIS", you can request the precise definition by platform: windows, ibm, solaris, …
• Buffer management
  – Automatically handles characters that cross buffers
• Customizations allowed for:
  – illegal sequences
  – undefined characters
• Unicode Text Compression
  – SCSU, BOCU
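Three of the points above (platform-specific meanings of one charset name, characters that straddle buffer boundaries, and configurable handling of bad input) can be illustrated with Python's standard codecs module. This is a stand-in for the concepts, not ICU's converter API; decode_chunks is a hypothetical helper.

```python
import codecs


def decode_chunks(chunks, charset, errors="strict"):
    """Decode a stream delivered in arbitrary chunks; the incremental decoder
    keeps the leading bytes of a character that is split across two chunks."""
    decoder = codecs.getincrementaldecoder(charset)(errors=errors)
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# Buffer management: split six Shift_JIS bytes in the middle of the second character.
sjis = "日本語".encode("shift_jis")
streamed = decode_chunks([sjis[:3], sjis[3:]], "shift_jis")

# Precise definitions: the same two bytes map differently in the JIS-based
# table and in the Windows variant (code page 932).
jis_mapping = b"\x81\x60".decode("shift_jis")  # U+301C WAVE DASH
windows_mapping = b"\x81\x60".decode("cp932")  # U+FF5E FULLWIDTH TILDE

# Customization: choose what happens to illegal input and unmappable characters.
illegal = b"a\xffb".decode("utf-8", errors="replace")         # illegal byte sequence
undefined = "\u20ac".encode("ascii", errors="backslashreplace")  # no ASCII mapping
```

The difference between jis_mapping and windows_mapping is the kind of ambiguity that makes an unqualified name such as "SJIS" risky for lossless data exchange.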

Collation and Searching
• Fast international comparison and string search; fully UCA compliant
  – Compressed sort keys, optimized string comparison, sublinear string search
  – Incremental sortkeys for radix-sort
• Precise binary sortkey stability over time
• Fully data driven
• API / rule customizations
  – strength, normalization, upper vs. lowercase first, ignore punctuation, …

Collation and Searching: Recent Additions
• Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  – e.g., filenames would sort "ab-2" < "ab-10"
  – without material performance cost
  – with reduced sortkey length
• Significantly improved sorting orders for many other languages
• Data in separate tree, for easier modularization and maintenance
• getFunctionalEquivalent API allows for better caching and UI support
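The effect of numeric sorting can be shown with a small sort key that compares digit runs by value. This is a plain-Python sketch of the behavior described above, not the collator's implementation (which works on sort keys and collation weights); numeric_key is a hypothetical name.

```python
import re

_DIGIT_RUN = re.compile(r"(\d+)")


def numeric_key(text):
    """Split text into alternating non-digit and digit runs; digit runs compare
    by numeric value. Index parity keeps str and int from being compared."""
    parts = _DIGIT_RUN.split(text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = ["ab-10", "ab-2", "ab-1"]
alphabetic = sorted(names)                 # digit characters compared one by one
numeric = sorted(names, key=numeric_key)   # digit runs compared as numbers
```

Sorted alphabetically, "ab-10" lands before "ab-2" because "1" precedes "2"; with the numeric key the order becomes "ab-1", "ab-2", "ab-10".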

Calendar & Time Zones
• International Calendars
  – Arabic, Buddhist, Hebrew, Japanese
  – Required for correct presentation of dates in some countries
• Olson timezone support, with localizations
• Recent Additions:
  – RFC822 time zone format support in DateFormat (C+J) for compatibility

Formatting
• Date & time: 8 formats per locale
• Messages
  – Completely localizable, Plural support
• Numbers & currencies
  – Scientific Notation, Spelled-out (checks, etc.)
  – Full Orthogonal Currency support
    [Example table: an INR amount as formatted in Hindi, in English ("Rs. 1,") and in German ("Rs ,57"); the figures are incomplete in the transcript.]
• Recent Additions
  – POSIX migration library
  – Allows parsing multiple currencies with one formatter
  – Short and stand-alone month/day names

Transforms
• Unicode Normalization
  – Highly optimized for performance
  – Performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)
• General Transforms
  – Script transliterations
  – Half-width/Full-width, Hex, etc.
  – Chain transforms together, filter source characters
  – Rule-based, customizable at runtime
• IDNA: International Domain Names
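Normalization and case folding both exist so that strings which differ only in representation can be compared as equal. The sketch below uses Python's standard unicodedata module and str.casefold as stand-ins for the corresponding library services; the two helper functions are hypothetical names.

```python
import unicodedata

decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Composition (NFC) and decomposition (NFD) convert between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a, b):
    """Compare two strings after case folding (folding, not lowercasing)."""
    return a.casefold() == b.casefold()
```

The raw strings differ (two code points versus one), yet canonically_equal(decomposed, precomposed) is true. Folding also goes beyond simple lowercasing: caseless_equal("Straße", "STRASSE") is true because ß folds to "ss".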

Segmentation: word, line & sentence
• Fast state-table implementation
• Customizable
  – Rule-based – customizable at runtime
  – Special customizations, e.g. Thai
• Recent Additions:
  – Greatly improved performance when going backwards (common case when doing line break)
  – Java
      The rules syntax has been extended.
      Rules can now return information about the types of characters they encountered.
      Common compiled (binary) rule format with ICU4C

Unicode Regular Expressions
• Full Regex Implementation
  – C only: Java 1.4 has own package (though not as powerful)
• All Unicode 4.0 Properties
  – Supported through UnicodeSet
• Good performance
  – Competitive with non-Unicode regex
• Recent Additions
  – Now features a C API, instead of just C++

Complex-text layout engine
• Glyph processing, positioning & adjustment
  – Ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
• Support for:
  – Drawing
  – Caret Display
  – Hit Testing
  – Selection Highlighting
  – Caret Movement
  – Layout Metrics
  – Line Break
• ICU 3.0: Canonical Equivalence: a + ´ or á

References
• ICU main site
  – Links to Download, ICU User Guide, Technical FAQ, Support, Bug Reports
• Unicode Consortium
  – Unicode glossary, Unicode character database

Questions and Answers