Collation in ICU 1.8 Mark Davis Chief SW Globalization Architect IBM.



Agenda What is Collation? Features Mechanisms Warnings ICU 1.8 Collation Note: Slides differ from printouts

Collation = Sorting Order How hard can it be? A < B < C < … Complications Languages are complex and varied Unicode is a big set of characters Performance is crucial

Varies By:
  Language: Swedish: z < ö; German: ö < z
  Usage: Dictionary: öf < of; Telephone: of < öf
  Customizations: A < a; a < A
  Versioning: Fixes; New Gov. Stds; New Characters

Levels
1. Base characters: a < b
2. Accents: as < às < at (ignored if there is an L1 character difference)
3. Case: ao < Ao < aò (ignored if there is an L1 or L2 difference)
4. Punctuation: ab < a-b < aB (ignored* if there is an L1, L2, or L3 difference)
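
The first three levels can be illustrated with a small Python sketch. It is a toy model, not ICU code and not real UCA weights: the name level_key is hypothetical, base letters are compared by their lowercase form, accents by the code points of their combining marks, and lowercase is assumed to sort before uppercase as in the slide's example. Level 4 (punctuation) is left out.

```python
import unicodedata

def level_key(s):
    """Toy three-level key: (base letters, accents, case). Not real UCA weights."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and secondary:
            # Level 2: attach the accent to the preceding base letter.
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())                  # Level 1: base letter
            secondary.append(())                        # no accent seen yet
            tertiary.append(1 if ch.isupper() else 0)   # Level 3: lowercase first
    # Tuples compare left to right, so a lower level only matters
    # when every higher level is equal.
    return (primary, secondary, tertiary)

print(sorted(["at", "às", "as"], key=level_key))
print(sorted(["aò", "Ao", "ao"], key=level_key))
```

Because the key is a tuple of whole levels, an accent difference anywhere in the string outranks a case difference anywhere else, which is the "ignored if there is a difference at a higher level" rule.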

Context Sensitivity Contractions H < Z, but CZ < CH Expansions OE < Œ < OF Both カー < カイ キー > キイ
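
A minimal Python sketch of contractions and expansions, under stated assumptions: the tables CONTRACTIONS and EXPANSIONS and the function context_key are hypothetical, "ch" is assumed to sort as a single letter just after "h" (which is what makes CZ < CH), and "œ" is assumed to expand to "o" + "e" with a lower-level marker that places it after plain "oe". The Japanese length-mark case is not modeled.

```python
# Hypothetical toy tables; a real collator loads these from locale data.
CONTRACTIONS = {"ch": ("h", 1)}     # two characters, one element after "h"
EXPANSIONS = {"œ": ["o", "e"]}      # one character, two elements

def context_key(s):
    """Toy key: (primary elements, expansion markers)."""
    s = s.lower()
    elements, marks = [], []
    i = 0
    while i < len(s):
        if s[i:i + 2] in CONTRACTIONS:
            elements.append(CONTRACTIONS[s[i:i + 2]])
            marks.append(0)
            i += 2
        elif s[i] in EXPANSIONS:
            elements.extend((c, 0) for c in EXPANSIONS[s[i]])
            marks.append(1)         # lower-level difference from plain "oe"
            i += 1
        else:
            elements.append((s[i], 0))
            marks.append(0)
            i += 1
    return (elements, marks)

print(sorted(["CH", "CZ", "Z", "H"], key=context_key))
print(sorted(["OF", "Œ", "OE"], key=context_key))
```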

Canonical Equivalence
  Å ≡ Å ≡ A + º
  x + . + ^ ≡ x + ^ + .
  ự ≡ u + ’ ≡ ư + . ≡ ụ + ’ ≡ u + . + ’ ≡ u + ̛ + .
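
Canonical equivalence can be checked with Python's standard unicodedata module by comparing NFD forms. This is only a sketch of the equivalence test itself; the helper name canonically_equivalent is hypothetical, and the code points below are my reading of the slide's examples (A with ring, x with dot below and circumflex, u with horn and dot below).

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A with ring: precomposed letter, Angstrom sign, and A + combining ring above.
ring_forms = ["\u00C5", "\u212B", "A\u030A"]

# x with dot below (U+0323) and circumflex (U+0302), marks in either order.
x_forms = ["x\u0323\u0302", "x\u0302\u0323"]

# u with horn (U+031B) and dot below (U+0323), in five spellings.
u_forms = ["\u1EF1", "\u01B0\u0323", "\u1EE5\u031B", "u\u0323\u031B", "u\u031B\u0323"]

for forms in (ring_forms, x_forms, u_forms):
    print(all(canonically_equivalent(forms[0], f) for f in forms))
```

A collator that honors canonical equivalence must give all spellings in each group the same weights, whichever form the text arrives in.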

Oddities
  Normal accents: cote < coté < côte < côté (first accent difference determines order)
  French accents: cote < côte < coté < côté (last accent difference determines order)
  Il-logical Order (Thai, Lao): เ ก sorts like ก เ
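
The difference between normal and French accent ordering can be sketched by comparing the accent level left to right or right to left. This is a toy model with hypothetical names (accent_key, the french flag), not ICU's implementation; accents are weighted by the code point of their combining mark.

```python
import unicodedata

def accent_key(s, french=False):
    """Toy two-level key: base letters, then accents (reversed for French)."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and accents:
            accents[-1] += (ord(ch),)   # accent belongs to the previous letter
        else:
            base.append(ch.lower())
            accents.append(())
    if french:
        accents.reverse()               # the last accent difference decides
    return (base, accents)

words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=accent_key))
print(sorted(words, key=lambda w: accent_key(w, french=True)))
```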

Merging Database Fields
F1 = LastName, F2 = FirstName
Sequential        Weak 1st          Merged
F1, then F2       F1 (L1), F2       L1, L2, L3
diSilva, John     diSilva, John     diSilva, John
diSilva, Fred     dísilva, John     di Silva, John
di Silva, John    di Silva, John    dísilva, John
di Silva, Fred    di Silva, Fred    diSilva, Fred
dísilva, John     diSilva, Fred     di Silva, Fred
dísilva, Fred     dísilva, Fred     dísilva, Fred

Customizations Parameters that change collation behavior Choice of language (locale) Runtime choices Examples to follow

Parametric Customizations Strength Base Base + Accent Base + Accent + Case Case: A < a a < A Punctuation: di Silva < diSilva

Punctuation (Alternates)
Base Character    Ignorable
di silva          Dickens
di Silva          di silva
Di silva          disilva
Di Silva          di Silva
Dickens           diSilva
disilva           Di silva
diSilva           Disilva
Disilva           Di Silva
DiSilva           DiSilva
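
Both orderings on this slide can be reproduced with a small Python sketch. It assumes a toy model in which the space is the only punctuation character, a space treated as a base character sorts before every letter, and an ignorable space contributes only to a final fourth level. The name punct_key and its flag are hypothetical, and accents are omitted.

```python
def punct_key(s, ignore_punctuation):
    """Toy key: (base letters, case, punctuation level)."""
    primary, case, quaternary = [], [], []
    for ch in s:
        if ch == " " and ignore_punctuation:
            quaternary.append(0)    # seen only at the last level, low weight
            continue
        primary.append(ch.lower())  # a non-ignored space sorts before letters
        case.append(1 if ch.isupper() else 0)
        quaternary.append(1)
    return (primary, case, quaternary)

names = ["DiSilva", "Dickens", "di Silva", "disilva", "Di silva",
         "diSilva", "di silva", "Disilva", "Di Silva"]
print(sorted(names, key=lambda n: punct_key(n, False)))   # "Base Character"
print(sorted(names, key=lambda n: punct_key(n, True)))    # "Ignorable"
```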

Extended Customizations User-defined “&” ≡ “ampersand” Merging tailorings Iranian + French Script Order b < ב < β < б β < b < б < ב Numbers A-1 < A-234 A-234 < A-1

Collation also used for: Searching ignore case, accent options Selection Return all records where Jones ≤ name < Smith Graphemes What a user considers a “character” Regular expressions (Level 3) UTR #18

UCA UTS #10: Unicode Collation Algorithm Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc. Default ordering: all Unicode code points Provides for tailoring to given languages Also see: The Unicode Standard, §5.17: Sorting and Searching Aligned with ISO 14651

APIs String Compare Sort Keys String Search

Sort Keys Transform string into series of bytes which will binary-compare a: 06 C A: 06 C á: 06 C ab: 06 C3 06 D b: 06 D Level 1 Level 2 Level 3

String Compare vs. Sort Keys Same results in either case SC faster for single comparisons average 5 to 10 times! SK faster for multiple comparisons index once binary compare many times
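
The trade-off can be sketched in Python with a toy collation limited to ASCII letters (base letter, then case). The functions compare and sort_key are hypothetical and the byte values are invented for illustration; they are not ICU's weights. The point is only that a pairwise comparison and a precomputed byte key yield the same order.

```python
from functools import cmp_to_key

def compare(a, b):
    """Pairwise comparison: can stop at the first level that differs."""
    pa, pb = a.lower(), b.lower()
    if pa != pb:
        return -1 if pa < pb else 1
    ca = [c.isupper() for c in a]
    cb = [c.isupper() for c in b]
    return (ca > cb) - (ca < cb)

def sort_key(s):
    """Byte key: primary weights, a 0x01 level separator, then case weights."""
    primary = bytes(ord(c.lower()) for c in s)
    case = bytes(3 if c.isupper() else 2 for c in s)
    return primary + b"\x01" + case

words = ["b", "ab", "A", "a", "B", "Ab"]
print(sorted(words, key=cmp_to_key(compare)))   # string compare
print(sorted(words, key=sort_key))              # binary compare of keys
```

The separator byte is smaller than any primary weight, so a string that is a prefix of another sorts first. Building a key walks the whole string at every level, which is why a single comparison is cheaper done directly, while an index that is compared many times benefits from storing the keys.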

String Search Naïve Approach key matches in target at iff target.substring(x, y) ≡ key Boundary Complications Ignorables: “a” matches in “(a)”? at & & & ? Contractions: “c” matches in “churo”? Normalization: “å” matches in “a¸ ˚ ”?

WARNING 1: Basics
  Not aligned with character set or repertoire
    Latin-1: Swedish and German sorting differs
  Not code point (binary) order
    Binary: Z < a < v < w
    English: Z > a
    Swedish: v ≡ w
  Not a property of strings
    With same database: Swedish user: view/select; German user: view/select

WARNING 2: Operations
  Order not preserved under concatenation / substringing
    x < y ↛ xz < yz
    x < y ↛ zx < zy
    xz < yz ↛ x < y
    zx < zy ↛ x < y
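
One way to see all four failures is a toy collation with a single contraction. The Python sketch below assumes "ch" sorts as one letter just after "h"; the function contraction_key and the helper less are hypothetical names, not an ICU API.

```python
def contraction_key(s):
    """Toy key where "ch" is one collation element sorting just after "h"."""
    s = s.lower()
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] == "ch":
            out.append(("h", 1))
            i += 2
        else:
            out.append((s[i], 0))
            i += 1
    return out

def less(a, b):
    return contraction_key(a) < contraction_key(b)

# x < y does not imply xz < yz: c < d, yet "ch" sorts after "dh".
print(less("c", "d"), less("ch", "dh"))
# x < y does not imply zx < zy: h < i, yet "ch" sorts after "ci".
print(less("h", "i"), less("ch", "ci"))
# xz < yz does not imply x < y: "dh" < "ch", yet d sorts after c.
print(less("dh", "ch"), less("d", "c"))
# zx < zy does not imply x < y: "ci" < "ch", yet i sorts after h.
print(less("ci", "ch"), less("i", "h"))
```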

WARNING 3: Dependence Collation is a relation over strings Sort keys embody part of that relation Thus, comparing sort keys from different tailorings (or parameters) gives undefined results. C < CH < D May move binary value for D

WARNING 4: Stability
  Stable Sort
    Records with equal comparison come out in original order
    Property of algorithm, not comparison
  Semi-Stable Comparison
    x ≠ y → x ≢ y
    Property of comparison, not algorithm
    Degrades performance
    Doesn’t do what people think (or really want)!
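
The two notions can be contrasted in a short Python sketch. Python's sorted is a stable sort; the case-insensitive key stands in for a weak collation strength, and the code-point tie-break stands in for a semi-stable comparison. The record data is invented for illustration.

```python
# (name, original position) records; all names are hypothetical sample data.
records = [("smith", 1), ("Smith", 2), ("SMITH", 3), ("jones", 4)]

# Stable sort with a case-insensitive comparison: the three equal names
# keep their original relative order. This is a property of the algorithm.
stable = sorted(records, key=lambda r: r[0].lower())

# Semi-stable comparison: ties are broken by raw code points, so no two
# different strings compare equal. This is a property of the comparison.
semi_stable = sorted(records, key=lambda r: (r[0].lower(), r[0]))

print(stable)
print(semi_stable)
```

The semi-stable result is deterministic, but the tied records come out in code point order rather than input order, which is usually not what someone asking for "stability" wanted.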

ICU (Int’l Components for Unicode) Open-source: C, C++, Java, JNI Charset Conversions, Locales, Resources, Collation, Calendars, Time zones (daylight), Transliteration, Normalization, Boundaries (grapheme, word, line, sentence), Format/Parse (numbers, currencies, dates, times, messages) Cross-Platform: Windows, Unix, 390, … Architecture ≡ Java

ICU/Java Collation Architecture L1-3, contractions, expansions, … Locale tailorings Fully rule-based specification Arbitrary runtime user customizations & ‘?’ = ‘question mark’ & ‘$’ = ‘dollar sign’ & z < ‘george’

ICU Collation Revision full UCA compliance full supplementary character support much better performance much smaller sort-keys smaller memory footprint smaller disk footprint additional parametric control additional tailoring control

Coding Style for Performance Avoided unnecessary function calls. Example: strlen too expensive! Avoided use of objects Rewrote core code in C C++ API wraps the C core code. Fast-pathed common cases Used stack memory buffers (with expansion if necessary) Made inner loops as tight as possible

Fractional UCA Fractional weights for compression Gaps for tailoring, future UCA additions Only stores differences in tailoring file Reduces memory footprint

Flat File I Flat-file (memory mapped) speeds initialization reduces memory footprint (next slide)

Flat-File II Old: separate allocations New: offsets within mem-map

Delta Tailoring II “a” FR found UCA not found code not synthesized

Processing Overview Checks for identical prefixes Tolerant of most unnormalized text invokes normalization rarely Uses “exceptional values” Compresses sort keys Incremental length/normalization

Identical Prefixes Sorting / Searching Databases Many comparisons to “close” strings Check initial prefixes with binary compare Drop into collation loop at first difference Complication…

Initial Prefix Complication Need to back up if in “bad” position:
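
A minimal Python sketch of the prefix check with backup. The transcript does not say what counts as a "bad" position, so this sketch assumes only one case: the first difference falls on a combining mark, which belongs with the base character before it. A real collator would presumably also back out of the middle of contractions. The name first_difference is hypothetical.

```python
import unicodedata

def first_difference(a, b):
    """Index at which a full collation comparison of a and b should start."""
    n = min(len(a), len(b))
    i = 0
    # Fast path: skip the binary-identical prefix.
    while i < n and a[i] == b[i]:
        i += 1

    def unsafe(s, k):
        # Assumption: a position is "bad" if it holds a combining mark.
        return k < len(s) and unicodedata.combining(s[k]) != 0

    # Back up until the comparison can start on a base character.
    while i > 0 and (unsafe(a, i) or unsafe(b, i)):
        i -= 1
    return i

print(first_difference("abcd", "abxd"))
print(first_difference("resume\u0301", "resume"))
```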

Fast C or D (FCD) Accepts all NFD, most NFC, without normalization
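
The FCD condition can be sketched in Python with the standard unicodedata module. The definition used here is my understanding of FCD and is not stated in the slides: for each pair of adjacent characters, the combining class of the last character in the first one's canonical decomposition must not exceed the combining class of the first character in the second one's decomposition, unless the latter is zero. The function name is_fcd is hypothetical.

```python
import unicodedata

def is_fcd(s):
    """Return True if s passes the FCD check described above."""
    prev_trail = 0
    for ch in s:
        decomp = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomp[0])     # leading combining class
        trail = unicodedata.combining(decomp[-1])   # trailing combining class
        if lead != 0 and prev_trail > lead:
            return False    # marks would need reordering: normalize first
        prev_trail = trail
    return True

print(is_fcd("A\u030A"))         # decomposed A + ring
print(is_fcd("\u00C5"))          # precomposed A with ring
print(is_fcd("x\u0302\u0323"))   # circumflex before dot below
```

Text that passes can be fed to the collation loop without a normalization pass, because decomposing it character by character already yields marks in canonical order.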

Exceptional Values Normal weight storage Special Weight Storage NOT_FOUND, EXPANSION, CONTRACTION, THAI, …

Sort Key Compression Common weights are 1-byte Primary, secondary, tertiary, quaternary Sequences are compressed UTF-16 Values for “Märk Davis” (22 bytes) 004D 00E B Sort Key (L3, ignorable punctuation - 19 bytes) 2F B 1D B A 01 8F 80 8F 07 00

ICU 1.8 vs. Windows, glibc Full UCA Warning: perf. comparisons approx. Depends on data, parameters, features glibc - UTF-8 locales String comparison: comparable ≈ 20% worse to 400% better Sort keys: shorter ≈ half as long

More Information ICU Design Document ml/design/collation/ These Slides Q & A

Backup Slides

WARNING 5: Math. Relation
  S = {Unicode Strings}
  Reflexive: ∀ a ∊ S: a ≤ a
  Antisymmetric: ∀ a, b ∊ S: a ≤ b & b ≤ a → a = b
  Transitive: ∀ a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
  Total: ∀ a, b ∊ S: a ≤ b ∨ b ≤ a