32nd Internationalization and Unicode Conference, San José, CA, September 2008

Matching, Sorting, and Searching with Unicode Text
Eric R. Mader, Michael Ow
IBM Globalization Architecture and Technology

Topics
➢ Matching
  – Are two characters the same?
➢ Sorting
  – Proper ordering of characters
➢ Searching
  – Finding the desired pattern in a given text

Different Ways to Implement
➢ “Non-Globalized Way”
  – Concerned with an individual language only
  – Uses codepages that are language-specific or limited
➢ “Globalized Way”
  – Uses Unicode
  – Uses the Unicode Collation Algorithm
  – Considers all languages

Non-Globalized Way
  – Matching using the individual code points
      a (0x61) = a (0x61)
  * Using US-ASCII

Non-Globalized Way, cont.
  – Sorting can also be done by comparing code points or a simple mapping of the code points (e.g. in EBCDIC)

  Unsorted                    Sorted
  Character  Code Point       Character  Code Point
  b          0x62             a          0x61
  a          0x61             b          0x62
  d          0x64             c          0x63
  c          0x63             d          0x64

  * Using US-ASCII
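As a minimal sketch of code-point sorting, the snippet below sorts the four letters from the table and then applies the same rule to a mixed list. The variable names are illustrative only; Python's built-in string comparison is used because it compares by code point.

```python
# Python compares str values code point by code point,
# so sorted() reproduces the "Sorted" column of the table.
words = ["b", "a", "d", "c"]
sorted_words = sorted(words)

# Pair each character with its code point in hexadecimal.
code_points = [(w, hex(ord(w))) for w in sorted_words]

# The same rule outside lowercase ASCII: uppercase "B" (0x42) sorts
# before "a" (0x61), and "ä" (U+00E4) sorts after every ASCII letter.
mixed = sorted(["b", "a", "B", "\u00e4"])
```

The second list shows why ordering by code point alone stops being useful once case and accented letters are involved.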

Non-Globalized Way, cont.
➢ Searching
  – Many different algorithms for string search*
  – Searching of text is done through analysis of character code points only
  * Algorithms can be globalized

Basics of String Search Algorithms
  – There are many ways to search through text:
      Linear Search
      Boyer-Moore
      Quick Search
  * There are many more, but because of time constraints we discuss only the common ones listed above.

Simple String Search Algorithm
➢ Linear Search
  – Brute force search
  – Check every character against pattern
  – Very slow
  – No preprocessing time
  – Performance: O(m * n) in the worst case
  * m is the size of the pattern
  * n is the size of the text

Example of Linear Search
  – Text: “Cloverfield”
  – Pattern: “field”
  – The pattern is tried at each text position in turn until it matches

  Cloverfield
        field      Match Found
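The linear search walked through above can be sketched as follows. This is an illustrative implementation over plain code points, not code from the presentation; the function name `linear_search` is an assumption.

```python
def linear_search(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting position in the text.
    for start in range(n - m + 1):
        j = 0
        # Compare pattern and text character by character.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


position = linear_search("Cloverfield", "field")
```

Each starting position may cost up to m comparisons, which is where the O(m * n) worst case comes from.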

Better String Search Algorithm
➢ Boyer-Moore
  – Intelligently skips characters in the text based on the pattern
  – Very fast
  – Preprocessing time: O(m + |∑|)
  – Performance: O(n/m) in the best case
  * m is the size of the pattern
  * n is the size of the text
  * |∑| is the size of the alphabet

Example of Boyer-Moore Search
  – Text: “Cloveldfield”
  – Pattern: “field”

Preprocessing of Pattern:

  Bad Character Shift Table      Good Suffix Shift Table
  Letter  Shift                  # of Matches  Pattern  Shift
  f       4                      1             d        5
  i       3                      2             ld       5
  e       2                      3             eld      5
  l       1                      4             ield     5
  other   5

Example of Boyer-Moore Search
  – Text: “Cloveldfield”
  – Pattern: “field”

  Cloveldfield
  field            mismatch on “e”: bad character shift 2
    field          “eld” matches, then mismatch: good suffix shift 5
         field     Match Found
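The bad character table from the example can be sketched in a few lines. Note the swap: the code below is the Boyer-Moore-Horspool simplification, which uses only the bad character table and omits the good suffix table, so it is not the full algorithm from the slides. The function names are assumptions.

```python
def bad_character_table(pattern: str) -> dict:
    """Shift for each character in all but the last pattern position."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, keeping the smaller shift.
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first match, or -1 (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    table = bad_character_table(pattern)
    pos = 0
    while pos + m <= n:
        # Compare right to left inside the current window.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Shift by the table entry for the last character of the window;
        # characters not in the table get the full pattern length.
        pos += table.get(text[pos + m - 1], m)
    return -1


table = bad_character_table("field")
position = horspool_search("Cloveldfield", "field")
```

For “field” the table reproduces the f/i/e/l shifts of 4, 3, 2, and 1, with every other character falling back to 5.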

Another Fast String Search Algorithm
➢ Quick Search
  – Skips based on character after pattern
  – Can compare in any order
  – Very fast
  – Preprocessing time: O(m + |∑|)
  – Performance: O(n) for average case
  * m is the size of the pattern
  * n is the size of the text

Example of Quick Search
  – Text: “Cloveldfield”
  – Pattern: “field”

Preprocessing of Pattern:

  Bad Character Shift Table
  Letter  Shift
  f       5
  i       4
  e       3
  l       2
  d       1
  other   6

Example of Quick Search
  – Text: “Cloveldfield”
  – Pattern: “field”

  Cloveldfield
  field            no match; character after pattern is “l”: shift 2
    field          no match; character after pattern is “f”: shift 5
         field     Match Found
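A minimal sketch of Quick Search, following the example above: the shift is looked up from the text character just past the current window. The function names are assumptions, and the window comparison uses a plain slice comparison since the order of comparison is free.

```python
def quick_search_table(pattern: str) -> dict:
    """Shift for each pattern character: distance from its last
    occurrence to one position past the end of the pattern."""
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text: str, pattern: str) -> int:
    """Return the index of the first match, or -1."""
    n, m = len(text), len(pattern)
    table = quick_search_table(pattern)
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:
            break  # no character after the window is left to inspect
        # Characters absent from the pattern allow a shift of m + 1.
        pos += table.get(text[pos + m], m + 1)
    return -1


table = quick_search_table("field")
position = quick_search("Cloveldfield", "field")
```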

The Unicode Standard
➢ Features:
  – Single encoding for all languages
  – Encodes over 90,000 characters
➢ Issues:
  – Canonical equivalence
  – More than one sort order
  – Sorting is context sensitive
  – Sorting strength levels
  – Other sorting issues

Canonical Equivalence
  Å (U+00C5) ≡ Å (U+212B) ≡ A + ring above
  x + dot below + circumflex ≡ x + circumflex + dot below
  ự ≡ ư + dot below
    ≡ ụ + horn
    ≡ u + dot below + horn
    ≡ u + horn + dot below
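Canonical equivalence can be checked by normalizing both strings to the same form. The sketch below uses Python's standard `unicodedata.normalize`; the helper name `canonically_equivalent` is an assumption.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A WITH RING ABOVE vs. ANGSTROM SIGN vs. A + COMBINING RING ABOVE
ring_pair = canonically_equivalent("\u00c5", "\u212b")
ring_decomposed = canonically_equivalent("\u00c5", "A\u030a")

# Marks with different combining classes may appear in either order.
reordered = canonically_equivalent("x\u0323\u0302", "x\u0302\u0323")

# U WITH HORN AND DOT BELOW vs. u + dot below + horn
horn_dot = canonically_equivalent("\u1ef1", "u\u0323\u031b")
```

Plain code-point comparison treats all of these pairs as different strings, which is why matching has to account for equivalence.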

Sort Order Varies By:
➢ Language
  – Swedish: z < ö
  – German: ö < z
➢ Usage
  – German dictionary: of < öf
  – German telephone book: öf < of (ö is treated as oe)
➢ Customizations
  – A < a
  – a < A
➢ Versioning
  – Fixes
  – New government standards
  – New characters

Sorting Is Context Sensitive
➢ Contractions
  – H < Z, but CZ < CH
➢ Expansions
  – OE < Œ < OF
➢ Both
  – カー < カイ
  – キー > キイ

Sorting Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
   – ignored if there is an L1 character difference
3. Case: ao < Ao < aò
   – ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
   – ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order
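The first three levels can be illustrated with a toy key. This is a sketch under strong assumptions: it is not the UCA and uses no real collation weights. The base letter stands in for the primary weight, combining marks for the secondary weight, and an uppercase flag for the tertiary weight; punctuation (level 4) is not modeled. The name `toy_collation_key` is hypothetical.

```python
import unicodedata


def toy_collation_key(s: str):
    """Toy three-level key: (base letters, accents, case flags)."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # Attach the accent to the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())
        secondary.append(())
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare level by level: a lower level is consulted
    # only when every higher level is equal.
    return (primary, secondary, tertiary)


accent_order = sorted(["at", "às", "as"], key=toy_collation_key)
case_order = sorted(["aò", "Ao", "ao"], key=toy_collation_key)
```

Because the key is compared level by level, an accent difference matters only when the base letters are identical, and a case difference only when base letters and accents are identical.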

Other Sorting Issues
➢ Normal accents
  – cote < coté < côte < côté
  – first accent difference determines order
➢ French accents
  – cote < côte < coté < côté
  – last accent difference determines order
➢ Logical Order Exception (Thai, Lao)
  – เ ก sorts like ก เ
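The difference between normal and French accent ordering can be sketched by comparing the accent level forwards or backwards. As before, this is a toy model rather than the UCA: accents are represented by their code points, and `accent_key` and its `french` flag are hypothetical names.

```python
import unicodedata


def accent_key(word: str, french: bool = False):
    """Toy key: base letters first, then accents per letter.
    With french=True the accent level is compared from the end."""
    letters, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            letters.append(ch)
            accents.append(())
    if french:
        accents.reverse()  # last accent difference decides
    return (letters, accents)


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
normal_order = sorted(words, key=accent_key)
french_order = sorted(words, key=lambda w: accent_key(w, french=True))
```

Only the second level is reversed; the base letters are still compared from the start of the word.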

Unicode Collation Algorithm (UCA)
➢ UTS #10: Unicode Collation Algorithm
  – Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  – Default ordering: all Unicode code points
  – Provides for tailoring to given languages
  – Also see: The Unicode Standard, §5.17: Sorting and Searching
➢ Aligned with ISO 14651

Collation Elements (CEs)
➢ Ordering established by weights
➢ Weights encoded in Collation Elements
  – [ primary weight ] (e.g. base character)
  – [ secondary weight ] (e.g. accents)
  – [ tertiary weight ] (e.g. case-level)
  – [ quaternary weight ] (e.g. punctuation)
➢ Must be accessed sequentially
  – Characters to CEs not 1:1
➢ Canonically equivalent characters have same CEs

Sort Keys
➢ Transform string into series of bytes which will binary-compare
  – a: 06 C
  – A: 06 C
  – á: 06 C
  – ab: 06 C3 06 D
  – b: 06 D
  (Slide callouts mark the Level 1, Level 2, and Level 3 portions of each key; the remaining key bytes are truncated in this transcript.)

Matching With UCA
➢ Compare CEs for both strings
➢ Adjust for strength
➢ Access sequentially
➢ Stop on first mismatch
➢ Number of characters, CEs may differ
➢ No match if number of CEs different

Comparing And Sorting With UCA
➢ Compare CEs
  – Sequential
  – Best performance for single compare
  – Must have same number of CEs
➢ Compare sort keys
  – Best performance for multiple compares

String Search And Unicode Text
➢ Canonical equivalence
  – Find “à” in “a + ◌̀” and “a + ◌̀” in “à”
➢ Expansions & contractions
  – “ß” = “ss”, “å” = “aa”
  – “ch” is one character
➢ “Whole character” match
  – Don’t find “a” in “a + ◌̀” or “c” in “ch”
➢ Length of pattern, match may differ
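Two of the requirements above, canonical equivalence and “whole character” matching for combining marks, can be sketched with normalization and the combining-class property. This sketch does not handle expansions or contractions such as “ß” = “ss” or “ch”; the function name `find_whole_character` is an assumption, and the returned index refers to the NFD form of the text.

```python
import unicodedata


def find_whole_character(text: str, pattern: str) -> int:
    """Find pattern in text under canonical equivalence, rejecting
    matches that would split a base letter from its combining marks.
    Returns an index into the NFD form of text, or -1."""
    t = unicodedata.normalize("NFD", text)
    p = unicodedata.normalize("NFD", pattern)
    if not p:
        return 0
    start = t.find(p)
    while start != -1:
        end = start + len(p)
        # The match must begin on a base character ...
        starts_clean = unicodedata.combining(t[start]) == 0
        # ... and must not be followed by a combining mark.
        ends_clean = end == len(t) or unicodedata.combining(t[end]) == 0
        if starts_clean and ends_clean:
            return start
        start = t.find(p, start + 1)
    return -1


precomposed_in_decomposed = find_whole_character("voila\u0300", "\u00e0")
decomposed_in_precomposed = find_whole_character("voil\u00e0", "a\u0300")
bare_letter = find_whole_character("voila\u0300", "a")
```

The last call returns -1: the bare “a” is present as a code point but is only part of the character “à”.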

String Search And Collation
➢ Solves some problems:
  – Same CEs for “à” and “a + ◌̀”
  – Same CEs for “ß” and “ss” (at level 1)
  – Same CEs for “å” and “aa” (at level 1)
  – Won’t find “c” in “ch”
  – Same length of pattern, match
➢ Doesn’t solve others:
  – CE for “a” also 1st CE for “à” and “a + ◌̀”

String Search And Collation, cont.
➢ CEs:
  – Expensive to generate
  – Cheap to compare
  – Sequential access
  – Mapping to character index approximate
➢ “Whole character” match:
  – Not enough information
  – Use character properties

Linear Search And Collation
➢ Convert pattern to CEs up front
➢ Sequential access a good fit
  – Can search forwards or backwards
➢ May read a given CE more than once
  – Use circular buffer for performance
➢ Easy to find match bounds
  – Validate for “whole character” match
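A linear search over collation elements can be sketched with a toy CE table. Everything here is an assumption made for illustration: each character is its own CE except “ß”, which expands to the two CEs of “ss”, and each CE remembers the index of the character that produced it so match bounds can be mapped back to the text. This is not ICU's implementation.

```python
# Hypothetical expansion table: "ß" generates the same CEs as "ss".
TOY_EXPANSIONS = {"ß": "ss"}


def collation_elements(s: str):
    """Return a list of (toy CE, index of the source character)."""
    ces = []
    for index, ch in enumerate(s):
        for ce in TOY_EXPANSIONS.get(ch, ch):
            ces.append((ce, index))
    return ces


def ce_linear_search(text: str, pattern: str):
    """Return (start, end) character bounds of the first match, or None."""
    target = collation_elements(text)
    wanted = [ce for ce, _ in collation_elements(pattern)]  # converted up front
    m = len(wanted)
    if m == 0:
        return None
    for start in range(len(target) - m + 1):
        if [ce for ce, _ in target[start:start + m]] != wanted:
            continue
        first, last = target[start][1], target[start + m - 1][1]
        # "Whole character" validation: the match must not begin or end
        # in the middle of the CEs generated by a single character.
        begins_clean = start == 0 or target[start - 1][1] != first
        ends_clean = start + m == len(target) or target[start + m][1] != last
        if begins_clean and ends_clean:
            return (first, last + 1)
    return None


bounds = ce_linear_search("My fußball table", "fuss")
```

The four-character pattern “fuss” matches the three characters “fuß”, so the length of the pattern and the length of the match differ.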

Boyer-Moore Search And Collation
➢ Use pattern CEs to compute skip tables
  – CE “alphabet” large
  – Use a hash function
➢ Fetch target CEs backwards
  – Even for backwards search
➢ Access pattern makes buffering difficult

Boyer-Moore Search And Collation
  – Text: “My fußball table”
  – Pattern: “fuss”

Preprocessing of Pattern:

  Bad Character Shift Table      Good Suffix Shift Table
  Letter  Shift                  # of Matches  Pattern  Shift
  f       3                      1             s        1
  u       2                      2             ss       4
  s       1                      3             uss      4
  other   4

Boyer-Moore And Collation Example
  – Text: “My fußball table”
  – Pattern: “fuss”

  My fußball table
  fuss

  OOPS: the pattern “fuss” is four characters long, but the matching text “fuß” is only three, so shifts computed from the pattern length can skip past the match.

Solving The Skipping Problem
➢ New function: minLengthInChars
  – Shortest string generating same CEs
  – Always treat pattern as shortest
  – May not always skip as far as it could
  – Will never skip too far

minLengthInChars Example
  – Text: “My fußball table”
  – Pattern: “fuss” (treat as “fuß”)

Preprocessing of Pattern:

  Bad Character Shift Table      Good Suffix Shift Table
  Letter  Shift                  # of Matches  Pattern  Shift
  f       2                      1             ß        3
  u       1                      2             uß       3
  other   3

minLengthInChars Example
  – Text: “My fußball table”
  – Pattern: “fuss”

  My fußball table
  fuß                mismatch: shift 3
     fuß             Match Found
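The idea behind minLengthInChars, finding the shortest string that generates the same CEs as the pattern, can be sketched with a small dynamic program. The CE table below is a toy assumption (only “ß” expands, to the CEs of “ss”), and `min_length_in_chars` is a hypothetical Python stand-in, not the function from the presentation.

```python
# Hypothetical CE table: "ß" generates the two CEs of "ss";
# every other character generates itself as a single CE.
TOY_CES = {"ß": ("s", "s")}


def to_ces(s: str):
    ces = []
    for ch in s:
        ces.extend(TOY_CES.get(ch, (ch,)))
    return ces


def min_length_in_chars(pattern: str) -> int:
    """Length of the shortest string generating the pattern's CEs."""
    ces = to_ces(pattern)
    # best[i] = fewest characters needed to generate the first i CEs.
    best = [0] * (len(ces) + 1)
    for i in range(1, len(ces) + 1):
        best[i] = best[i - 1] + 1  # one character producing one CE
        for expansion in TOY_CES.values():
            k = len(expansion)
            # One character producing the last k CEs, if they match.
            if i >= k and tuple(ces[i - k:i]) == expansion:
                best[i] = min(best[i], best[i - k] + 1)
    return best[-1]


shortest = min_length_in_chars("fuss")
```

Using this length as the upper bound for shifts means a skip can be shorter than necessary but never longer than the shortest possible match.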

Quick Search And Collation
➢ Search pattern forward
  – Fastest way to get CEs
➢ Can also search backward
  – Slower than forward
➢ Other search orders:
  – More expensive pre-processing
  – Non-sequential access expensive

Quick Search And Collation Example
  – Text: “My fussball table”
  – Pattern: “fussball” (treated as “fußball”)

Preprocessing of Pattern:

  Bad Character Shift Table
  Letter  Shift
  f       7
  u       6
  s       5
  b       4
  a       3
  l       1
  other   8

Quick Search And Collation Example
  – Text: “My fussball table”
  – Pattern: “fussball” (treated as “fußball”)

  My fussball table
  fußball

  OOPS: the shift taken from the character after the pattern skips past the match.

Quick Search And Collation Conclusion
➢ Must compare backwards
➢ More expensive to fetch CE after pattern
  – Non-sequential access
  – Character after might generate multiple CEs
➢ Boyer-Moore seems like a better fit
  – No need to fetch extra CE
  – Sequential access between skips

Summary
  – Matching, sorting, and searching are essential text handling tools
  – Using the character code points is not sufficient
  – Implementing the Unicode Standard and the Unicode Collation Algorithm is the way to go
  – Special considerations during implementation

Questions?

More Information
➢ Unicode Collation Algorithm –
➢ ICU –