27th Internationalization and Unicode Conference, Berlin, Germany, April 2005. ICU Overview: The Open-Source Unicode Library, v3.2. Markus Scherer, ICU Manager.

Presentation transcript:

ICU Overview: The Open-Source Unicode Library, v3.2
Markus Scherer, ICU Manager, IBM Globalization Center of Competency
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

Agenda
• Background
• What is ICU?
• Architecture Overview
• ICU Features and recent additions
• References
• Q and A

Why Globalization?

Unicode
• All world languages
• Efficient and effective processing
• Lossless data exchange
• Enables single-binary global software
• But… all languages ⇒ large, complex standard
  – 1,400 pages + Annexes + additional standards
  – 90,000+ characters
  – Major update every 3 years
  – 70 character properties, many multi-valued
  – Affects many processes: display, line-break, regex, …

Locales
• Features vary widely across languages & countries
  – Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
  – Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
• Globalization / Unicode / Locales
• Mature, widely used set of C/C++ and Java libraries
  – Basis for Java 1.1 internationalization – but goes far beyond
  – “ICU4C”: C/C++ libraries; “ICU4J”: Java library
• Very portable – identical results on all platforms / programming languages
  – C/C++: 30+ platforms/compilers
  – Java: IBM & Sun JDK
• Full threading model; customizable; modular
• Open source – but not viral
• ICU 3.2: 78 languages; 118 countries; 870 codepages

Who uses ICU? (Examples)
• Products within IBM
  – DB2, COBOL, InfoPrint Manager, Lotus Notes, Lotus Workplace, Tivoli Presentation Services, WebSphere, XML Parser, …
• Other companies and organizations
  – Adobe, Apple (Mac OS X), BEA, CERN, Cognos, Debian, HP, Inktomi, JD Edwards, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, PayPal, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, webMethods, …

ICU Features
• Unicode text handling
• Charset conversions (870+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
  – Date & time
  – Messages
  – Numbers & currencies
• Transforms
  – Normalization
  – Casing
  – Transliterations

Architecture Overview 1
• Locale Based Services
  – Locale is an identifier, not a container
  – Keywords for variants:
  – Recent addition: accept-language support
• Resource inheritance: shared resources
  [Diagram: locale tree with levels Language, Script, Country. root → en (US, IE), de (DE, CH), zh (Hant, Hans → TW, CN)]
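
The resource inheritance shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not ICU's API: the function names `fallback_chain` and `lookup` and the `BUNDLES` data are hypothetical, and the locale stays a plain identifier string, as the slide describes.

```python
def fallback_chain(locale_id):
    """Return the lookup order for a locale ID, most specific first, ending at root."""
    parts = locale_id.split("_")
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + ["root"]


# Hypothetical resource data: each bundle stores only what differs from its parent.
BUNDLES = {
    "root": {"ok": "OK", "color": "color"},
    "en": {},
    "en_IE": {"color": "colour"},
    "de": {"color": "Farbe"},
}


def lookup(locale_id, key):
    """Find a resource by walking from the requested locale up to root."""
    for name in fallback_chain(locale_id):
        bundle = BUNDLES.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


print(fallback_chain("zh_Hant_TW"))
print(lookup("en_IE", "color"), lookup("de_CH", "ok"))
```

Because each level stores only its differences, a locale with no bundle of its own (here `de_CH`) still resolves every key through its parents.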

Architecture Overview 2
• Open and Close Service Model
  – Open a service object, use it many times, close it when done
  – Better performance by avoiding setup costs per operation
  – Warning: use properly for maximum performance
• ICU Threading Model
  – Multiple service objects in use simultaneously, with same or different attributes
  – Large resources shared in read-only cache

Architecture Overview 3
• Data Driven Services
  – Customize at build-time or run-time
  – Interchange with other platforms; same results on each
  – Rule-based Collation, Word-breaks, Transforms
  – Pattern-based Formats, UnicodeSet
  – Table-based Character Conversion

Architecture Overview – ICU4C
• Simple Error Handling
  – C++ subset for portability
  – Support for multi-threaded environment
• Version Management
  – Multiple versions at the same time
  – Data and library versioning
• String Buffer Management
  – Preflighting and overflow protection
• Misc: Load/Unload ICU
• Recent Additions:
  – Runtime-settable memory allocation and mutex functions

Architecture Overview – ICU4J
• Supplement for Java
• Core globalization (no character conversion or regular expressions, no GUI components)
  – We do supply complex text support for Sun
• Modularized: products may add just needed functionality

ICU4J vs. JDK
• CLDR 1.2 (Common Locale Data Repository)
• Up-to-date globalization: standards-compliant; latest Unicode
  – Supplementary characters (GB 18030, JIS X 0213, HKSCS); Java 5 adds handling of supplementary characters
  – Full properties – JDK has only a fraction
  – Unicode Collation Algorithm
  – Local calendars (Thailand, Japan, …); ISO dates
  – Currencies, String Search, Int’l Domain Names
  – Transforms: Case, Scripts, Normalization
• Much faster turn-around on bug fixes, enhancements

Unicode Text Handling
• C
  – UChar*: null-terminated or with length
• C++
  – UnicodeString: full featured string class
• Java
  – Uses normal JDK String, adds utilities
• All handle supplementary characters
  – Required for GB 18030/JIS X 0213/HKSCS repertoires
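
Supplementary characters matter for the UTF-16 string types listed above because each one occupies two 16-bit code units (a surrogate pair). The Python sketch below shows the arithmetic using only the standard library. It is not ICU code, and the helper names `to_surrogates` and `utf16_units` are made up for this illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # lead surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # trail surrogate carries the low 10 bits
    return high, low


def utf16_units(text):
    """Return the 16-bit code units of a string when stored as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


han = "\U00020000"  # a CJK ideograph outside the Basic Multilingual Plane
print([hex(u) for u in utf16_units(han)])
print([hex(u) for u in to_surrogates(0x20000)])
```

Code that counts or indexes by 16-bit unit therefore has to treat such a pair as one character.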

Unicode Text Handling 2
• All Unicode properties
  – Direct API: values, names, enumerations
  – UnicodeSet
    – Fast, compact set operations
    – Pattern-based (both Perl & POSIX syntax for properties): \p{greek} vs. [:greek:]
    – All properties: [\p{lowercase}-[a-z]], [\p{greek} & \p{uppercase}]
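
The two set patterns on this slide are a difference and an intersection of property sets. A rough Python analogue with ordinary sets is shown below. It rests on loud assumptions: it is not UnicodeSet, it covers only the Basic Multilingual Plane, it approximates `\p{lowercase}` and `\p{uppercase}` by the general categories Ll and Lu, and it approximates `\p{greek}` by character names starting with "GREEK", since the Python standard library exposes no Script property.

```python
import unicodedata

# All BMP code points except surrogates (a deliberate restriction for this sketch).
BMP = [chr(cp) for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF]


def is_greek(ch):
    # Approximation of \p{greek}: the character name begins with "GREEK".
    return unicodedata.name(ch, "").startswith("GREEK")


lowercase = {ch for ch in BMP if unicodedata.category(ch) == "Ll"}
uppercase = {ch for ch in BMP if unicodedata.category(ch) == "Lu"}
greek = {ch for ch in BMP if is_greek(ch)}
ascii_lower = set("abcdefghijklmnopqrstuvwxyz")

lower_not_ascii = lowercase - ascii_lower   # like [\p{lowercase}-[a-z]]
greek_upper = greek & uppercase             # like [\p{greek} & \p{uppercase}]

print("\u00e9" in lower_not_ascii, "a" in lower_not_ascii)
print("\u03a9" in greek_upper, "\u03c9" in greek_upper)
```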

Data: Recent Additions
• Conforms to CLDR 1.2
  – 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones
  – Added data for new languages: Malayalam, Oriya, Welsh
• Reduced multiplatform install image size
• Improved XLIFF-ICU conversion tools
• Locale canonicalization spec defined and implemented (C+J)
  – Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
• Precise alias information:
  – When you ask for “SJIS”, you can request the precise definition by platform: Windows, IBM, Solaris, …
• Buffer management
  – Automatically handles characters that cross buffers
• Customizations allowed for:
  – illegal sequences
  – undefined characters
• Unicode Text Compression
  – SCSU, BOCU-1
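
Two of the points above, characters that cross buffer boundaries and configurable handling of illegal or unmappable input, can be seen in miniature with Python's standard `codecs` machinery. This is an analogy rather than ICU's converter API. The helper name `decode_chunks` is hypothetical, and Python's named error handlers stand in for the customization hooks the slide mentions.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a stream of byte buffers, keeping partial characters between calls."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush; fails if input ended mid-character
    return pieces


# The three bytes of the euro sign arrive split across two buffers.
print(decode_chunks([b"\xe2\x82", b"\xac"]))

# Illegal input sequence: substitute U+FFFD instead of raising an error.
print(b"abc\xff".decode("utf-8", errors="replace"))

# Character with no mapping in the target charset: choose the substitution policy.
print("\u20ac".encode("ascii", errors="replace"))
print("\u20ac".encode("ascii", errors="backslashreplace"))
```

The stateful decoder returns nothing for the first buffer and emits the complete character once the remaining byte arrives.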

Collation and Searching
• Fast international comparison and string search; fully UCA compliant
  – Compressed sort keys, optimized string comparison, sublinear string search
  – Incremental sortkeys for radix-sort
• Precise binary sortkey stability over time
• Fully data driven
• API / rule customizations
  – strength, normalization, upper vs. lowercase first, ignore punctuation, sort digits as numbers, …

Collation and Searching: Recent Additions
• Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  – e.g., filenames would sort "ab-2" < "ab-10"
  – without material performance cost
  – with reduced sortkey length
• Significantly improved sorting orders for many other languages
• Data in separate tree, for easier modularization and maintenance
• getFunctionalEquivalent API allows for better caching and UI support
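
The effect of numeric sorting can be imitated with a small sort key in Python. This sketch only reproduces the ordering from the slide's example. It is not how ICU builds collation sort keys, it compares the non-digit parts by plain code point order rather than by a collation algorithm, and the function name `numeric_key` is hypothetical.

```python
import re


def numeric_key(text):
    """Sort key that compares runs of digits by numeric value, other text as-is."""
    # re.split with a capturing group alternates: text, digits, text, digits, ...
    tokens = re.split(r"(\d+)", text)
    return [int(tok) if i % 2 else tok for i, tok in enumerate(tokens)]


names = ["ab-10", "ab-2", "ab-1"]
print(sorted(names))                    # plain string order puts "ab-10" before "ab-2"
print(sorted(names, key=numeric_key))   # numeric order puts "ab-2" before "ab-10"
```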

Calendar & Time Zones
• International Calendars
  – Arabic, Buddhist, Hebrew, Japanese
  – Required for correct presentation of dates in some countries
• Olson timezone support, with localizations
• Recent Additions:
  – RFC 822 time zone format support in DateFormat (C+J) for compatibility
  – “Universal Time” conversions for high-precision date/time computations
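
For readers unfamiliar with the RFC 822 time zone format mentioned above, it writes the zone as a signed four-digit UTC offset such as +0200. The Python standard-library sketch below prints one. It does not use ICU's DateFormat, and the fixed +02:00 offset and the sample date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumed fixed offset of +02:00; a real application would use a named time zone.
plus_two = timezone(timedelta(hours=2))
moment = datetime(2005, 4, 6, 9, 30, tzinfo=plus_two)

rfc822_zone = moment.strftime("%z")   # numeric zone only
rfc822_date = format_datetime(moment) # full RFC 822 style date string

print(rfc822_zone)
print(rfc822_date)
```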

Formatting
• Date & time: 8 formats per locale
• Messages
  – Completely localizable, Plural support
• Numbers & currencies
  – Scientific Notation, Spelled-out (checks, etc.)
  – Full Orthogonal Currency support
    [Example partly lost in extraction: the same INR amount formatted in Hindi, in English ("Rs. 1,…") and in German ("Rs …,57")]
• Recent Additions
  – POSIX migration library
  – Allows parsing multiple currencies with one formatter
  – Short and stand-alone month/day names

Transforms
• Unicode Normalization
  – Highly optimized for performance
  – Performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)
• General Transforms
  – Script transliterations
  – Half-width/Full-width, Hex, etc.
  – Chain transforms together, filter source characters
  – Rule-based, customizable at runtime
• IDNA: International Domain Names
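
Normalization and case folding are easy to demonstrate with Python's standard library, which implements the same Unicode algorithms. The snippet below is a sketch and not ICU's API: the helper names `canonically_equal` and `caseless_equal` are hypothetical, and the comparison is the straightforward normalize-then-compare approach rather than the optimized utilities the slide refers to.

```python
import unicodedata

decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT
composed = "\u00e1"      # precomposed LATIN SMALL LETTER A WITH ACUTE

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a, b):
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


print(decomposed == composed, canonically_equal(decomposed, composed))
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

The two spellings of the accented letter differ as code point sequences but compare equal once normalized, and case folding maps the sharp s to "ss" so that the two spellings of the German word match.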

Segmentation: word, line & sentence
• Fast state-table implementation
• Customizable
  – Rule-based – customizable at runtime
  – Special customizations, e.g. Thai
• Recent Additions:
  – Greatly improved performance when going backwards (common case when doing line break)
  – Java:
    – The rules syntax has been extended
    – Rules can now return information about the types of characters they encountered
    – Common compiled (binary) rule format with ICU4C

Unicode Regular Expressions
• Full Regex Implementation
  – C only: Java 1.4 has own package (though not as powerful)
• All Unicode Properties
  – supported through UnicodeSet
• Good performance
  – competitive with non-Unicode regex
• Recent Additions
  – Now features a C API, instead of just C++

Complex-text layout engine
• Glyph processing, positioning & adjustment
  – ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
• Support for:
  – Drawing
  – Caret Display
  – Hit Testing
  – Selection Highlighting
  – Caret Movement
  – Layout Metrics
  – Line Break
• Recent addition: Canonical Equivalence: a + ´ or á

References
• ICU main site:
  – New URL
  – Links to Download ICU, User Guide, Technical FAQ, Support, Bug Reports
• Unicode Consortium
  – Unicode glossary, Unicode character database

Questions and Answers