21st International Unicode Conference, Dublin, Ireland, May 2002: Optimizing the Usage of Normalization, Vladimir Weinstein, Globalization.

Presentation transcript:

Optimizing the Usage of Normalization
Vladimir Weinstein
Globalization Center of Competency, San Jose, CA

Introduction
1. The Unicode standard has multiple ways to encode equivalent strings. For example, résumé can be written with a precomposed é or with e followed by a combining acute accent; NFD yields the decomposed spelling and NFC the composed one.
2. Accents that don't interact are put into a unique order.

Introduction (contd.)
- Normalization provides a way to transform a string to a unique form (NFD, NFC)
- Strings that can be transformed to the same form are called canonically equivalent
- Time-critical applications need to minimize the number of passes over the text
- ICU gives a number of tools to deal with this problem
- We will use collation (language-sensitive string comparison) as an example
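The equivalence described in the introduction can be seen with Python's standard `unicodedata` module. This is only an illustrative sketch; the talk itself is about ICU, and the sample strings are chosen here for demonstration.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # résumé with precomposed é (U+00E9)
decomposed = "re\u0301sume\u0301"    # résumé with e + combining acute (U+0301)

# The two spellings differ code point by code point,
# but each normalizes to the other's form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Accents that do not interact get a unique order: dot below (class 220)
# is placed before acute (class 230), whichever way they were typed.
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
```

Two strings are canonically equivalent exactly when their NFD (or NFC) forms are identical.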

Avoiding Normalization
- Force users to provide already normalized data
- The performance problem does not go away
- When the strings are processed many times, it could be beneficial to normalize them beforehand
- Forcing users to provide a specific form can be unpopular

Check for Normalized Text
- Most strings are already in normalized form
- Quick Check is significantly faster than full normalization
- Needs canonical class data and additional data for checking the relation between a code point and a normalization form
- The algorithm is described in UAX #15, Annex 8 (#Annex8)
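A minimal sketch of the check-first pattern, assuming Python 3.8 or later, where `unicodedata.is_normalized` is available. The helper name `ensure_nfc` is made up for this example and is not an ICU function.

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    """Return text in NFC, running full normalization only when needed."""
    if unicodedata.is_normalized("NFC", text):
        # Common case: the check passes and the input is returned untouched.
        return text
    # Rare case: pay for the full normalization pass.
    return unicodedata.normalize("NFC", text)
```

When most input is already normalized, the function usually returns the original string object without allocating a new one.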

Normalize Incrementally
- Instead of normalizing the whole string at once, normalize one piece at a time
- This technique is usually combined with an incremental Quick Check
- Useful for procedures with early exit, such as string comparing or scanning
- Normalizes up to the next safe point

Incremental Normalization: Example
[Figure: the initial string résumé, only partly normalized, handled two ways. Non-incremental normalization: if normalized regularly, the whole string is processed by normalization. Quick check with incremental normalization: normalize just the parts that fail quick check.]
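The incremental idea can be sketched as a generator that yields one normalized piece at a time. Everything here is an assumption made for illustration: the sketch targets NFD (where safe points are easy to find), treats a character as a safe point when its decomposition begins with a combining class of 0, and uses the hypothetical names `lead_class`, `nfd_segments`, and `canonically_equal`.

```python
import unicodedata
from itertools import zip_longest

def lead_class(ch):
    """Combining class of the first code point of ch's decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def nfd_segments(text):
    """Yield text in NFD, one safe-point-delimited piece at a time."""
    start = 0
    for i in range(1, len(text) + 1):
        # A piece ends at the end of the text or just before a safe point.
        if i == len(text) or lead_class(text[i]) == 0:
            piece = text[start:i]
            # Only pieces that fail the check are actually normalized.
            if not unicodedata.is_normalized("NFD", piece):
                piece = unicodedata.normalize("NFD", piece)
            yield piece
            start = i

def canonically_equal(a, b):
    """Compare lazily, stopping at the first differing code point."""
    stream_a = (ch for piece in nfd_segments(a) for ch in piece)
    stream_b = (ch for piece in nfd_segments(b) for ch in piece)
    return all(x == y for x, y in zip_longest(stream_a, stream_b))
```

Because the generators are lazy, a comparison that differs in the first few characters never touches the rest of either string.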

Optimized Concatenation
- Simple concatenation of two normalized strings can yield a string that is not normalized
- One option is to normalize the result
- Unnecessarily duplicates normalization

Optimized Concatenation: Example
[Figure: two résumé strings joined two ways: "concatenate, then normalize" versus "find boundaries, then concatenate and normalize up to the boundaries".]
- It is enough to normalize the boundary parts
- Incremental normalization is used
- Much faster than redoing the whole resulting string
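A sketch of boundary-only normalization during concatenation. It assumes both inputs are already in NFD, so a character with combining class 0 can serve as the boundary; the slide's example appears to use composed text, where finding boundaries takes more care. The function name `concat_nfd` is hypothetical.

```python
import unicodedata

def concat_nfd(left, right):
    """Concatenate two NFD strings, renormalizing only around the join."""
    # Advance in `right` to its first starter (combining class 0).
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    if j == 0:
        # `right` begins with a starter: plain concatenation is already NFD.
        return left + right
    # Back up in `left` to its last starter.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break
    # Only the characters between the two boundaries need reordering.
    middle = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + middle + right[j:]
```

For example, joining `"a\u0301"` and `"\u0323x"` reorders just the three characters around the join and leaves the rest of both strings untouched.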

Accepting the FCD Form
- Fast Composed or Decomposed form is a partially normalized form
- Not unique
- More lenient than NFD or NFC form
- It requires that the procedure has support for all the canonically equivalent strings on input
- It is possible to quick check the FCD format

FCD Form: Examples

SEQUENCE             FCD  NFC  NFD
A-ring               Y    Y    -
Angstrom             Y    -    -
A + ring             Y    -    Y
A + grave            Y    -    Y
A-ring + grave       Y    Y    -
A + cedilla + ring   Y    -    Y
A + ring + cedilla   -    -    -
A-ring + cedilla     -    Y    -
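An FCD quick check can be sketched as follows. The rule assumed here is the usual one: a string is FCD if, for every adjacent pair of characters, the last combining class of the first character's decomposition does not exceed the first combining class of the second character's decomposition, unless the latter is 0. The function name `is_fcd` is made up for this example; ICU uses precomputed data rather than decomposing each character on the fly.

```python
import unicodedata

def is_fcd(text):
    """Return True if text is in FCD form (sketch, not ICU's implementation)."""
    prev_trail = 0
    for ch in text:
        decomposition = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposition[0])
        # A non-zero lead class must not be lower than the previous trail class.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomposition[-1])
    return True
```

Applied to the sequences on this slide, the check accepts everything except the two sequences in which a cedilla (class 202) follows a ring (class 230).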

Canonical Closure
- Preprocessing data to support the FCD form
- Ensures that if data is assigned to a sequence (or a code point) it will also be assigned to all canonically equivalent FCD sequences
- Example: if Å = X is assigned for one of the following forms, the same value X is also assigned to the others:
  – A-ring (U+00C5)
  – Angstrom sign (U+212B)
  – A + combining ring above (U+0041 U+030A)
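A deliberately simplified sketch of canonical closure over a small mapping table. It only adds single code points and the fully decomposed form of each key; a real closure would also cover partially composed multi-character FCD sequences. The name `canonical_closure` and the brute-force scan over all code points are assumptions for illustration, acceptable only because this runs once at build time.

```python
import sys
import unicodedata

def canonical_closure(table):
    """Copy each value to canonically equivalent keys (simplified sketch)."""
    # Index the existing entries by their NFD form.
    by_nfd = {unicodedata.normalize("NFD", k): v for k, v in table.items()}
    closed = dict(table)
    # The fully decomposed spelling of every key gets the same value.
    closed.update(by_nfd)
    # So does every single code point that decomposes to the same sequence.
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd in by_nfd:
            closed[ch] = by_nfd[nfd]
    return closed

closed = canonical_closure({"\u00c5": "X"})
```

Starting from the single entry for A-ring, the result also holds entries for the Angstrom sign and for A followed by a combining ring above.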

Collation
- Locale-specific sorting of strings
- Relation between code points and collation elements
- Context sensitive:
  – Contractions: H < Z, but CZ < CH
  – Expansions: OE < Œ < OF
  – Both: カー キイ
- See “Collation in ICU” by Mark Davis

Collation Implementation in ICU
- Two modes of operation:
  – Normalization OFF: expects the users to pass in FCD strings
  – Normalization ON: accepts any strings
- Some locales require normalization to be turned on
- Canonical closure done for contractions and regular mappings
- Two important services:
  – Sort key generation
  – String compare function
- More about ICU at the end of the presentation

FCD Support in Collation
- Much higher performance
- Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences
- This process is time consuming, but it is done at build time
- May increase the size of the data set

Sort Key Generation
- Whole strings are processed
- Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
- Two modes of operation:
  – Normalization ON: strings are quick checked and normalization is performed, if required
  – Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%
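The two modes can be illustrated with a toy sort key. This is not real collation: the key is just the UTF-8 bytes of the NFD form, and the `normalization_on` flag and `sort_key` name are hypothetical. In ICU, the OFF mode still handles FCD input correctly thanks to canonical closure, which this toy does not model; the sketch shows only the control flow and the reuse of precomputed keys.

```python
import unicodedata

def sort_key(text, normalization_on=True):
    """Toy binary sort key: UTF-8 bytes of the NFD form (not real collation)."""
    if normalization_on and not unicodedata.is_normalized("NFD", text):
        # ON mode: check first, normalize only if the check fails.
        text = unicodedata.normalize("NFD", text)
    # OFF mode trusts the caller and skips the check entirely.
    return text.encode("utf-8")

words = ["re\u0301sume\u0301", "resume", "r\u00e9sum\u00e9"]
# Keys are computed once and then reused for every comparison in the sort.
keys = {w: sort_key(w) for w in words}
ordered = sorted(words, key=keys.__getitem__)
```

With normalization on, the two spellings of résumé produce identical keys, so they sort together regardless of how they were encoded.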

String Compare
- Very time critical
- Result is usually determined before fully processing both strings
- First step is binary comparison for equality
- When it fails, comparison continues from a safe spot
[Figure: three cases after the binary comparison fails. A versus Å: no need to back up, the normal situation. c h versus c z: must back up to the start of the contraction. Difference inside a combining sequence: must back up to the normalization safe spot.]

String Compare Continued
- Normalization ON: incremental FCD check and incremental FCD normalization if required
- Normalization OFF: assumes that the source strings are FCD
- Most locales don't require normalization to be on, and are thus 20% faster by using FCD
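The compare strategy (binary prefix match, back up to a safe spot, then continue) can be sketched as below. The sketch compares NFD code points rather than collation elements, so it shows only the normalization side; a real collator would also back up to the start of a contraction. The names `lead_class`, `at_safe_spot`, and `compare` are assumptions for this example.

```python
import unicodedata

def lead_class(ch):
    """Combining class of the first code point of ch's decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def at_safe_spot(text, i):
    """True at the end of text or where a new combining sequence starts."""
    return i >= len(text) or lead_class(text[i]) == 0

def compare(a, b):
    """Return -1, 0, or 1, treating canonically equivalent strings as equal."""
    # Step 1: binary comparison skips the identical prefix.
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    if i == len(a) == len(b):
        return 0
    # Step 2: back up until both strings are at a normalization safe spot.
    while i > 0 and not (at_safe_spot(a, i) and at_safe_spot(b, i)):
        i -= 1
    # Step 3: normalize and compare only the remainders.
    rest_a = unicodedata.normalize("NFD", a[i:])
    rest_b = unicodedata.normalize("NFD", b[i:])
    return (rest_a > rest_b) - (rest_a < rest_b)
```

Identical strings never reach the normalization step at all, and strings with a long common prefix normalize only their tails.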

International Components for Unicode
- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- The ICU normalization engine supports the optimizations mentioned here
- Library services accept FCD strings as input
- Wide variety of supported platforms
- Open source (X license – non-viral)
- C/C++ and Java versions

Conclusion
- The presented techniques allow much faster string processing
- In the case of collation, sort key generation gets up to 50% faster than if normalizing beforehand
- The string compare function becomes up to 3 times faster
- May increase data size
- Canonical closure preprocessing takes more time to build, but pays off at runtime

Q & A

Summary
- Introduction
- Avoiding normalization
- Check for normalized text
- Normalize incrementally
- Concatenation of normalized strings
- Accepting the FCD form
- Implementation of collation in ICU