32nd Internationalization and Unicode Conference
San José, CA, September 2008

Matching, Sorting, and Searching with Unicode Text
Eric R. Mader, Michael Ow
IBM Globalization Architecture and Technology

Topics
➢ Matching
  – Are two characters the same?
➢ Sorting
  – Proper ordering of characters
➢ Searching
  – Finding the desired pattern in a given text

Different Ways to Implement
➢ “Non-Globalized Way”
  – Concerned with an individual language only
  – Uses codepages that are language-specific or limited
➢ “Globalized Way”
  – Uses Unicode
  – Uses the Unicode Collation Algorithm
  – Considers all languages

Non-Globalized Way
  – Matching using the individual code points
      a (0x61) = a (0x61)
  * Using US-ASCII

Non-Globalized Way, cont.
  – Sorting can also be done by comparing code points, or by a simple mapping of the code points (e.g. in EBCDIC)

Unsorted                  Sorted
Character  Code Point     Character  Code Point
b          0x62           a          0x61
a          0x61           b          0x62
d          0x64           c          0x63
c          0x63           d          0x64

  * Using US-ASCII

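A minimal sketch of the code-point sort shown in the table above. The sample lists are illustrative and not from the slides; Python's built-in `sorted` compares strings by code point, which is the "non-globalized" behaviour being described.

```python
# Sorting by raw code point, as in the table above.
letters = ["b", "a", "d", "c"]
by_code_point = sorted(letters)                      # compares ord() values
code_points = [hex(ord(ch)) for ch in by_code_point]

# The same approach breaks down outside plain lowercase ASCII:
# 'Z' (0x5A) sorts before 'a' (0x61), and 'ö' (0xF6) sorts after 'z' (0x7A).
mixed = sorted(["a", "Z", "ö", "z"])
```

The second list shows why code-point order alone is not a usable ordering for most languages.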
Non-Globalized Way, cont.
➢ Searching
  – Many different algorithms in string search*
  – Searching of text done through analysis of character code points only
  * Algorithms can be globalized

Basics of String Search Algorithms
  – There are many ways to search through text:
      Linear Search
      Boyer-Moore
      Quick Search
  * There are many more, but given time constraints we discuss only the common ones listed above.

Simple String Search Algorithm
➢ Linear Search
  – Brute force search
  – Check every character against pattern
  – Very slow
  – No preprocessing time
  – Performance: O(m * n)
  * m is the size of the pattern
  * n is the size of the text

Example of Linear Search
  – Text: “Cloverfield”
  – Pattern: “field”

Cloverfield
field

Example of Linear Search, cont.
  – Text: “Cloverfield”
  – Pattern: “field”

Cloverfield
      field
Match Found

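The walkthrough above can be sketched in a few lines of Python. This is a plain brute-force search on code points, written for illustration; the function name is ours, not something from the slides.

```python
def linear_search(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):          # try every possible alignment
        matched = 0
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:                    # every pattern character agreed
            return start
    return -1


position = linear_search("Cloverfield", "field")
```

Each alignment may compare up to m characters, which is where the O(m * n) bound comes from.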
Better String Search Algorithm
➢ Boyer-Moore
  – Intelligently skips characters in the text based on the pattern
  – Very fast
  – Preprocessing time: O(m + |Σ|)
  – Performance: O(n/m) in the best case
  * m is the size of the pattern
  * n is the size of the text
  * |Σ| is the size of the alphabet

Example of Boyer-Moore Search
  – Text: “Cloveldfield”
  – Pattern: “field”

Preprocessing of Pattern:

Bad Character Shift Table
Letter  Shift
f       4
i       3
e       2
l       1
other   5

Good Suffix Shift Table
# of Matches  Pattern  Shift
1             d        5
2             ld       5
3             eld      5
4             ield     5

Example of Boyer-Moore Search, cont.
  – Text: “Cloveldfield”
  – Pattern: “field”

Cloveldfield
field

Bad character shifts: f 4, i 3, e 2, l 1, other 5

Example of Boyer-Moore Search, cont.
  – Text: “Cloveldfield”
  – Pattern: “field”

Cloveldfield
  field

Good suffix shifts: 1 match (d) 5, 2 matches (ld) 5, 3 matches (eld) 5, 4 matches (ield) 5

Example of Boyer-Moore Search, cont.
  – Text: “Cloveldfield”
  – Pattern: “field”

Cloveldfield
       field
Match Found

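A short sketch of the skipping idea, using only the bad character table. This is the Horspool simplification of Boyer-Moore, not the full algorithm: the good suffix table from the slides is deliberately left out to keep the example small. Function names are illustrative.

```python
def bad_character_table(pattern: str) -> dict:
    """Shift for each pattern character except the last one."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text: str, pattern: str) -> int:
    """Return the first match index, or -1 (bad-character rule only)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    table = bad_character_table(pattern)
    start = 0
    while start + m <= n:
        k = m - 1
        while k >= 0 and text[start + k] == pattern[k]:   # compare right to left
            k -= 1
        if k < 0:
            return start
        # Skip according to the text character under the pattern's last position.
        start += table.get(text[start + m - 1], m)
    return -1


table = bad_character_table("field")
position = horspool_search("Cloveldfield", "field")
```

For “field” the computed table agrees with the slide (f 4, i 3, e 2, l 1, anything else 5), and the search visits offsets 0, 2, and 7.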
Another Fast String Search Algorithm
➢ Quick Search
  – Skips based on character after pattern
  – Can compare in any order
  – Very fast
  – Preprocessing time: O(m + |Σ|)
  – Performance: O(n) for average case
  * m is the size of the pattern
  * n is the size of the text

Example of Quick Search
  – Text: “Cloveldfield”
  – Pattern: “field”

Preprocessing of Pattern:

Bad Character Shift Table
Letter  Shift
f       5
i       4
e       3
l       2
d       1
other   6

Example of Quick Search, cont.
  – Text: “Cloveldfield”
  – Pattern: “field”

Cloveldfield
field

Bad character shifts: f 5, i 4, e 3, l 2, d 1, other 6

Example of Quick Search, cont.
  – Text: “Cloveldfield”
  – Pattern: “field”

Cloveldfield
       field
Match Found

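A minimal sketch of Quick Search as described above: the shift is chosen from the text character just after the current window. Function names are illustrative, and the window comparison uses a simple slice compare since the algorithm allows any comparison order.

```python
def quick_search_table(pattern: str) -> dict:
    """Shift for each pattern character; the rightmost occurrence wins."""
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text: str, pattern: str) -> int:
    """Return the first match index, or -1."""
    m, n = len(pattern), len(text)
    table = quick_search_table(pattern)
    start = 0
    while start + m <= n:
        if text[start:start + m] == pattern:      # any comparison order works
            return start
        if start + m == n:                        # no character after the window
            break
        start += table.get(text[start + m], m + 1)
    return -1


table = quick_search_table("field")
position = quick_search("Cloveldfield", "field")
```

For “field” the table agrees with the slide (f 5, i 4, e 3, l 2, d 1, anything else 6).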
The Unicode Standard
➢ Features:
  – Single encoding for all languages
  – Encodes over 90,000 characters
➢ Issues:
  – Canonical equivalence
  – More than one sort order
  – Sorting is context sensitive
  – Sorting strength levels
  – Other sorting issues

Canonical Equivalence
  Å (U+00C5) ≡ Å (U+212B) ≡ A + ◌̊
  x + ◌̣ + ◌̂ ≡ x + ◌̂ + ◌̣
  ự ≡ ư + ◌̣ ≡ ụ + ◌̛ ≡ u + ◌̣ + ◌̛ ≡ u + ◌̛ + ◌̣
  (◌̊ ring above, ◌̣ dot below, ◌̂ circumflex, ◌̛ horn)

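Canonical equivalence can be checked by normalizing both strings to the same form. The sketch below uses Python's standard `unicodedata.normalize` with NFD; the helper name `canonically_equivalent` is ours.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


a_ring = "\u00C5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"       # ANGSTROM SIGN
decomposed = "A\u030A"    # A + COMBINING RING ABOVE

u_horn_dot = "\u1EF1"     # LATIN SMALL LETTER U WITH HORN AND DOT BELOW
variants = [
    "\u01B0\u0323",       # u-with-horn + combining dot below
    "\u1EE5\u031B",       # u-with-dot-below + combining horn
    "u\u0323\u031B",      # u + dot below + horn
    "u\u031B\u0323",      # u + horn + dot below
]
```

A plain code point comparison (`a_ring == angstrom`) reports these strings as different, which is the problem the slide illustrates.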
Sort Order Varies By:
➢ Language
  – Swedish: z < ö
  – German: ö < z
➢ Usage
  – Dictionary: of < öf
  – Telephone: öf < of
➢ Customizations
  – A < a
  – a < A
➢ Versioning
  – Fixes
  – New government standards
  – New characters

Sorting Is Context Sensitive
➢ Contractions
  – H < Z, but CZ < CH
➢ Expansions
  – OE < Œ < OF
➢ Both
  – カー < カイ
  – キー > キイ

Sorting Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
   – ignored if there is an L1 character difference
3. Case: ao < Ao < aò
   – ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
   – ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order

Other Sorting Issues
➢ Normal accents
  – cote < coté < côte < côté
  – first accent difference determines order
➢ French accents
  – cote < côte < coté < côté
  – last accent difference determines order
➢ Logical Order Exception (Thai, Lao)
  – เ ก sorts like ก เ

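The difference between normal and French accent ordering can be illustrated with a toy key function. This is a sketch under simplifying assumptions, not the UCA: each base letter gets one accent weight (here just the combining mark's code point, a made-up weighting), and the French variant compares those weights from the end of the word.

```python
import unicodedata


def toy_key(word: str, french: bool = False):
    """Toy two-level key: base letters first, then one accent weight per letter."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] = ord(ch)      # toy accent weight: the mark's code point
        else:
            primary.append(ch)
            secondary.append(0)          # 0 means "no accent"
    if french:
        secondary.reverse()              # last accent difference decides
    return (primary, secondary)


words = ["côté", "coté", "cote", "côte"]
normal_order = sorted(words, key=toy_key)
french_order = sorted(words, key=lambda w: toy_key(w, french=True))
```

All four words share the same base letters, so only the accent level decides, and reversing it is enough to reproduce both orders from the slide.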
Unicode Collation Algorithm (UCA)
➢ UTS #10: Unicode Collation Algorithm
  – Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  – Default ordering: all Unicode code points
  – Provides for tailoring to given languages
  – Also see: The Unicode Standard, §5.17: Sorting and Searching
➢ Aligned with ISO 14651

Collation Elements (CEs)
➢ Ordering established by weights
➢ Weights encoded in Collation Elements
  – [ primary weight ] (e.g. base character)
  – [ secondary weight ] (e.g. accents)
  – [ tertiary weight ] (e.g. case-level)
  – [ quaternary weight ] (e.g. punctuation)
➢ Must be accessed sequentially
  – Characters to CEs not 1:1
➢ Canonically equivalent characters have same CEs

Sort Keys
➢ Transform a string into a series of bytes that can be compared bytewise (binary comparison)
  – a: 06 C
  – A: 06 C
  – á: 06 C
  – ab: 06 C3 06 D
  – b: 06 D
  Key segments: Level 1, Level 2, Level 3

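To make the sort key idea concrete, here is a toy key builder. The weights are invented for illustration and are not UCA or ICU values, and the function only handles the letters a to z with optional accents. It writes all primary weights, then all secondary weights, then all tertiary weights, separated by a zero byte, so that a plain byte comparison respects the strength levels.

```python
import unicodedata


def toy_sort_key(text: str) -> bytes:
    """Toy sort key for a-z/A-Z with accents: level 1, 0, level 2, 0, level 3."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] = 2                           # accented beats unaccented
            continue
        primary.append(ord(ch.lower()) - ord("a") + 2)  # a=2, b=3, ... (made up)
        secondary.append(1)
        tertiary.append(2 if ch.isupper() else 1)       # lowercase before uppercase
    # 0 separates the levels; every real weight is >= 1.
    return bytes(primary + [0] + secondary + [0] + tertiary)


accents = sorted(["at", "às", "as"], key=toy_sort_key)
case = sorted(["aò", "Ao", "ao"], key=toy_sort_key)
slide_strings = sorted(["b", "ab", "á", "A", "a"], key=toy_sort_key)
```

Because the key is computed once per string, sorting a large list costs one key generation per string plus cheap byte comparisons.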
Matching With UCA
➢ Compare CEs for both strings
➢ Adjust for strength
➢ Access sequentially
➢ Stop on first mismatch
➢ Number of characters, CEs may differ
➢ No match if the number of CEs differs

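The steps above can be sketched with toy collation elements. The CE layout below (base letter, accent count, uppercase flag) is a made-up simplification rather than real UCA data; the point is only to show how truncating each CE to the requested strength changes what counts as a match.

```python
import unicodedata


def toy_ces(text: str, strength: int = 3) -> list:
    """Toy CEs: (base letter, accent count, is-uppercase), cut to `strength` levels."""
    ces = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and ces:
            base, accents, upper = ces[-1]
            ces[-1] = (base, accents + 1, upper)     # fold the mark into its base
        else:
            ces.append((ch.lower(), 0, int(ch.isupper())))
    return [ce[:strength] for ce in ces]


def matches(a: str, b: str, strength: int = 3) -> bool:
    """Strings match when their CE sequences are equal at the given strength."""
    return toy_ces(a, strength) == toy_ces(b, strength)
```

For example, `matches("resume", "Résumé", strength=1)` is true, while the same call at strength 2 is false because the accents now count.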
Comparing And Sorting With UCA
➢ Compare CEs
  – Sequential
  – Best performance for single compare
  – Must have same number of CEs
➢ Compare sort keys
  – Best performance for multiple compares

String Search And Unicode Text
➢ Canonical equivalence
  – Find “à” in “a + ◌̀” and “a + ◌̀” in “à”
➢ Expansions & contractions
  – “ß” = “ss”, “å” = “aa”
  – “ch” is one character
➢ “Whole character” match
  – Don’t find “a” in “a + ◌̀” or “c” in “ch”
➢ Length of pattern, match may differ

String Search And Collation
➢ Solves some problems:
  – Same CEs for “à” and “a + ◌̀”
  – Same CEs for “ß” and “ss” (at level 1)
  – Same CEs for “å” and “aa” (at level 1)
  – Won’t find “c” in “ch”
  – Same length of pattern, match
➢ Doesn’t solve others:
  – CE for “a” also 1st CE for “à” and “a + ◌̀”

String Search And Collation, cont.
➢ CEs:
  – Expensive to generate
  – Cheap to compare
  – Sequential access
  – Mapping to character index approximate
➢ “Whole character” match:
  – Not enough information
  – Use character properties

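One way to use character properties for the “whole character” check is to reject a candidate match whose boundaries fall next to a combining mark. The sketch below uses the canonical combining class from Python's `unicodedata`; the helper name is ours, and it only covers marks with a non-zero combining class, not contractions such as “ch”.

```python
import unicodedata


def is_whole_character_match(text: str, start: int, end: int) -> bool:
    """Reject text[start:end] if it starts on, or is followed by, a combining mark."""
    starts_on_mark = start < len(text) and unicodedata.combining(text[start]) != 0
    followed_by_mark = end < len(text) and unicodedata.combining(text[end]) != 0
    return not starts_on_mark and not followed_by_mark


decomposed = "a\u0300 la carte"        # 'a' + COMBINING GRAVE ACCENT, then plain text
naive_hit = decomposed.find("a")       # finds the bare 'a' inside "a + grave"
second_hit = decomposed.find("a", naive_hit + 1)
```

The first hit is rejected because a combining grave follows it; the second hit, the “a” in “la”, is accepted.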
Linear Search And Collation
➢ Convert pattern to CEs up front
➢ Sequential access a good fit
  – Can search forwards or backwards
➢ May read a given CE more than once
  – Use circular buffer for performance
➢ Easy to find match bounds
  – Validate for “whole character” match

Boyer-Moore Search And Collation
➢ Use pattern CEs to compute skip tables
  – CE “alphabet” large
  – Use a hash function
➢ Fetch target CEs backwards
  – Even for backwards search
➢ Access pattern makes buffering difficult

Boyer-Moore Search And Collation
  – Text: “My fußball table”
  – Pattern: “fuss”

Preprocessing of Pattern:

Bad Character Shift Table
Letter  Shift
f       3
u       2
s       1
other   4

Good Suffix Shift Table
# of Matches  Pattern  Shift
1             s        1
2             ss       4
3             uss      4

Boyer-Moore And Collation Example
  – Text: “My fußball table”
  – Pattern: “fuss”

My fußball table
fuss

Bad character shifts: f 3, u 2, s 1, other 4

Boyer-Moore And Collation Example, cont.
  – Text: “My fußball table”
  – Pattern: “fuss”

My fußball table
       fuss
OOPS: shifts computed from the four-character pattern (3 for “f”, then 4 for “b”) carry the pattern past “fuß”, which is only three characters long.

Bad character shifts: f 3, u 2, s 1, other 4

Solving The Skipping Problem
➢ New function: minLengthInChars
  – Shortest string generating same CEs
  – Always treat pattern as shortest
  – May not always skip as far as it could
  – Will never skip too far

minLengthInChars Example
  – Text: “My fußball table”
  – Pattern: “fuss” (treat as “fuß”)

Preprocessing of Pattern:

Bad Character Shift Table
Letter  Shift
f       2
u       1
other   3

Good Suffix Shift Table
# of Matches  Pattern  Shift
1             ß        3
2             uß       3

minLengthInChars Example, cont.
  – Text: “My fußball table”
  – Pattern: “fuss”

My fußball table
fuß

Bad character shifts: f 2, u 1, other 3

minLengthInChars Example, cont.
  – Text: “My fußball table”
  – Pattern: “fuss”

My fußball table
   fuß
Match Found

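The idea behind minLengthInChars can be sketched with a toy model. Everything here is an assumption made for illustration: the expansion table, the one-letter-per-CE mapping, and all function names are hypothetical, and this is not ICU's implementation. The search window is sized by the shortest string that generates the pattern's CEs, and shifts are looked up by the first CE of the text character at the end of the window, so a skip can be shorter than ideal but never too long. The sketch assumes every character produces at least one CE.

```python
TOY_EXPANSIONS = {"ß": "ss"}   # hypothetical data: 'ß' yields the same two CEs as "ss"


def to_ces(text: str) -> str:
    """Toy CE sequence: one letter per CE, with 'ß' expanding to two."""
    return "".join(TOY_EXPANSIONS.get(ch, ch) for ch in text)


def min_length_in_chars(ces: str) -> int:
    """Length of the shortest string that generates the given toy CEs."""
    count = i = 0
    while i < len(ces):
        hit = next((e for e in TOY_EXPANSIONS.values() if ces.startswith(e, i)), None)
        i += len(hit) if hit else 1
        count += 1
    return count


def match_end(text: str, start: int, pattern_ces: str) -> int:
    """End index of a CE-level match beginning at start, or -1."""
    ces, pos = "", start
    while len(ces) < len(pattern_ces) and pos < len(text):
        ces += to_ces(text[pos])
        pos += 1
    return pos if ces == pattern_ces else -1


def collation_search(text: str, pattern: str):
    """Return (start, end) of the first match, or None."""
    pattern_ces = to_ces(pattern)
    window = min_length_in_chars(pattern_ces)      # always treat pattern as shortest
    # Horspool-style shifts keyed by CE; conservative, never larger than the window.
    shifts = {ce: max(1, window - 1 - q) for q, ce in enumerate(pattern_ces[:-1])}
    start = 0
    while start + window <= len(text):
        end = match_end(text, start, pattern_ces)
        if end != -1:
            return start, end
        first_ce = to_ces(text[start + window - 1])[0]
        start += shifts.get(first_ce, window)
    return None
```

With the slide's inputs, `collation_search("My fußball table", "fuss")` returns the bounds of “fuß”, and the same pattern also finds “fuss” in “My fussball table”; the match length differs from the pattern length in the first case.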
Quick Search And Collation
➢ Search pattern forward
  – Fastest way to get CEs
➢ Can also search backward
  – Slower than forward
➢ Other search orders:
  – More expensive pre-processing
  – Non-sequential access expensive

Quick Search And Collation Example
  – Text: “My fussball table”
  – Pattern: “fussball” (treated as “fußball”)

Preprocessing of Pattern:

Bad Character Shift Table
Letter  Shift
f       7
u       6
s       5
b       4
a       3
l       1
other   8

Quick Search And Collation Example, cont.
  – Text: “My fussball table”
  – Pattern: “fussball”

My fussball table
fußball

Bad character shifts: f 7, u 6, s 5, b 4, a 3, l 1, other 8

Quick Search And Collation Example, cont.
  – Text: “My fussball table”
  – Pattern: “fussball”

My fussball table
    fußball
OOPS: the character after the seven-character window is “b”, whose shift of 4 moves the pattern past the start of “fussball”.

Bad character shifts: f 7, u 6, s 5, b 4, a 3, l 1, other 8

Quick Search And Collation Conclusion
➢ Must compare backwards
➢ More expensive to fetch CE after pattern
  – Non-sequential access
  – Character after might generate multiple CEs
➢ Boyer-Moore seems like a better fit
  – No need to fetch extra CE
  – Sequential access between skips

Summary
➢ Matching, sorting, and searching are essential text handling tools
➢ Using the character code points is not sufficient
➢ Implementing the Unicode Standard and the Unicode Collation Algorithm is the way to go
➢ Special considerations during implementation

Questions?

More Information
➢ Unicode Collation Algorithm –
➢ ICU –