32nd Internationalization and Unicode Conference, San José, CA, September 2008


1 32nd Internationalization and Unicode Conference, San José, CA, September 2008
Matching, Sorting, and Searching with Unicode Text
Eric R. Mader, Michael Ow
IBM Globalization Architecture and Technology

2 Topics
➢ Matching
  – Are two characters the same?
➢ Sorting
  – Proper ordering of characters
➢ Searching
  – Finding the desired pattern in a given text

3 Different Ways to Implement
➢ “Non-Globalized Way”
  – Concerned with an individual language only
  – Uses code pages that are language-specific or limited
➢ “Globalized Way”
  – Uses Unicode
  – Uses the Unicode Collation Algorithm
  – Considers all languages

4 Non-Globalized Way
– Matching using the individual code points: a (0x61) = a (0x61)*
* Using US-ASCII

5 Non-Globalized Way, cont.
– Sorting can also be done by comparing code points, or by a simple mapping of the code points (e.g. in EBCDIC)
Unsorted                  Sorted
Character  Code Point     Character  Code Point
b          0x62           a          0x61
a          0x61           b          0x62
d          0x64           c          0x63
c          0x63           d          0x64
* Using US-ASCII
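The code point approach from the two slides above can be sketched in a few lines of Python. This is only an illustration; the variable names are made up for the example, and the last line hints at why the approach stops working once case or other languages are involved.

```python
# Matching and sorting by raw code point, as on the US-ASCII slides.
letters = ["b", "a", "d", "c"]

# Matching: two characters are "the same" only if their code points are equal.
same = ord("a") == 0x61

# Sorting: Python compares str values code point by code point.
by_code_point = sorted(letters)
table = [(ch, hex(ord(ch))) for ch in by_code_point]

# The same rule puts every uppercase ASCII letter before every lowercase one,
# because 'Z' (0x5A) is a smaller code point than 'a' (0x61).
mixed = sorted(["apple", "Zebra"])
```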

6 Non-Globalized Way, cont.
➢ Searching
  – Many different string search algorithms exist*
  – Text is searched by analyzing character code points only
* These algorithms can be globalized

7 Basics of String Search Algorithms
– There are many ways to search through text:
  Linear Search
  Boyer-Moore
  Quick Search
* There are many more; because of time constraints we discuss only the common ones listed above.

8 Simple String Search Algorithm
➢ Linear Search
  – Brute-force search
  – Check every character against the pattern
  – Very slow
  – No preprocessing time
  – Performance: O(m * n)
* m is the size of the pattern
* n is the size of the text

9-19 Example of Linear Search
– Text: “Cloverfield”
– Pattern: “field”
Cloverfield
      field
Match Found
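The linear search walked through above can be sketched as follows. This is a minimal brute-force version written for illustration (the function name is an assumption, not something from the talk); it compares code points only.

```python
def linear_search(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        # Compare the pattern against the window that begins at `start`.
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


position = linear_search("Cloverfield", "field")
```

In the worst case every window is compared almost to its end, which is where the O(m * n) bound comes from.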

20 Better String Search Algorithm
➢ Boyer-Moore
  – Intelligently skips characters in the text based on the pattern
  – Very fast
  – Preprocessing time: O(m + |∑|)
  – Performance: O(n/m) in the best case
* m is the size of the pattern
* n is the size of the text

21-31 Example of Boyer-Moore Search
– Text: “Cloveldfield”
– Pattern: “field”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       4
i       3
e       2
l       1
other   5
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -d       5
2             -ld      5
3             -eld     5
4             -ield    5
Cloveldfield
       field
Match Found
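The bad character table from this example can be reproduced with a short sketch. To stay small, the search below uses only the bad character rule (the Horspool simplification of Boyer-Moore) and omits the good suffix table; the function names are illustrative assumptions.

```python
def bad_character_shifts(pattern: str) -> dict:
    """Shift for each pattern character: distance from its last occurrence
    (ignoring the final position) to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first match, or -1 (bad character rule only)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    shifts = bad_character_shifts(pattern)
    pos = 0
    while pos <= n - m:
        # Compare right to left inside the current window.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip according to the text character under the pattern's last position;
        # characters not in the table get the "other" shift, m.
        pos += shifts.get(text[pos + m - 1], m)
    return -1


shifts = bad_character_shifts("field")
position = horspool_search("Cloveldfield", "field")
```

On this text the search inspects only three windows (at offsets 0, 2, and 7) before it reports the match.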

32 Another Fast String Search Algorithm
➢ Quick Search
  – Skips based on the character after the pattern
  – Can compare in any order
  – Very fast
  – Preprocessing time: O(m + |∑|)
  – Performance: O(n) for average case
* m is the size of the pattern
* n is the size of the text

33-40 Example of Quick Search
– Text: “Cloveldfield”
– Pattern: “field”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       5
i       4
e       3
l       2
d       1
other   6
Cloveldfield
       field
Match Found
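A minimal sketch of Quick Search, assuming plain code point comparison, reproduces the shift table above. The function names are illustrative; the window is compared with a simple slice because Quick Search allows any comparison order.

```python
def quick_search_shifts(pattern: str) -> dict:
    """Shift for each pattern character: distance from its last occurrence
    to the position just past the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the last occurrence wins.
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    shifts = quick_search_shifts(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:
            break  # there is no character after the window to consult
        # Skip according to the character just after the window;
        # characters not in the table get the "other" shift, m + 1.
        pos += shifts.get(text[pos + m], m + 1)
    return -1


shifts = quick_search_shifts("field")
position = quick_search("Cloveldfield", "field")
```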

41 The Unicode Standard
➢ Features:
  – Single encoding for all languages
  – Encodes over 90,000 characters
➢ Issues:
  – Canonical equivalence
  – More than one sort order
  – Sorting is context sensitive
  – Sorting strength levels
  – Other sorting issues

42 Canonical Equivalence
Å (U+00C5) ≡ Å (U+212B) ≡ A + ◌̊
x + ◌̣ + ◌̂ ≡ x + ◌̂ + ◌̣
ự ≡ ư + ◌̣ ≡ ụ + ◌̛ ≡ u + ◌̣ + ◌̛ ≡ u + ◌̛ + ◌̣
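Canonical equivalence can be checked with Python's standard `unicodedata` module: two strings are canonically equivalent when their NFD forms are identical. The helper name below is an assumption made for this sketch.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A-ring, the Angstrom sign, and A + combining ring above.
ring = canonically_equivalent("\u00C5", "\u212B") and canonically_equivalent("\u00C5", "A\u030A")

# Dot below and circumflex have different combining classes, so their order is not significant.
reordered = canonically_equivalent("x\u0323\u0302", "x\u0302\u0323")

# u with horn and dot below, spelled four different ways.
spellings = ["\u01B0\u0323", "\u1EE5\u031B", "u\u0323\u031B", "u\u031B\u0323"]
horn_dot = all(canonically_equivalent("\u1EF1", s) for s in spellings)
```

A plain `==` on the unnormalized strings reports all of these pairs as different, which is the problem code point matching runs into.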

43 Sort Order Varies By:
➢ Language
  – Swedish: z < ö
  – German: ö < z
➢ Usage
  – Dictionary: öf < of
  – Telephone: of < öf
➢ Customizations
  – A < a
  – a < A
➢ Versioning
  – Fixes
  – New government standards
  – New characters

44 Sorting Is Context Sensitive
➢ Contractions
  – H < Z, but CZ < CH
➢ Expansions
  – OE < Œ < OF
➢ Both
  – カー < カイ
  – キー > キイ

45 Sorting Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
  – ignored if there is an L1 character difference
3. Case: ao < Ao < aò
  – ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
  – ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order

46 Other Sorting Issues
➢ Normal accents
  – cote < coté < côte < côté
  – first accent difference determines order
➢ French accents
  – cote < côte < coté < côté
  – last accent difference determines order
➢ Logical Order Exception (Thai, Lao)
  – เก sorts like กเ
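The difference between normal and French accent ordering can be illustrated with a toy two-level key. This is a sketch under simplifying assumptions, not the UCA: base letters are compared first, then the per-letter combining marks, read forwards for normal ordering and backwards for French ordering. All names are hypothetical.

```python
import unicodedata


def base_letters(word: str) -> str:
    """Level 1 of the toy key: the word with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accents(word: str) -> tuple:
    """Level 2 of the toy key: the combining marks attached to each base letter."""
    marks = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch      # attach the mark to the preceding base letter
        elif not unicodedata.combining(ch):
            marks.append("")     # a new base letter with no marks yet
    return tuple(marks)


def sort_words(words, french=False):
    """Sort by base letters, then by accents (scanned from the end if french)."""
    def key(word):
        level2 = accents(word)
        return (base_letters(word), level2[::-1] if french else level2)
    return sorted(words, key=key)


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
normal_order = sort_words(words)
french_order = sort_words(words, french=True)
```

Reversing the accent tuple is what makes the last accent difference decide the order.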

47 Unicode Collation Algorithm (UCA)
➢ UTS #10: Unicode Collation Algorithm
  – Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  – Default ordering: all Unicode code points
  – Provides for tailoring to given languages
  – Also see: The Unicode Standard, §5.17: Sorting and Searching
➢ Aligned with ISO 14651

48 Collation Elements (CEs)
➢ Ordering established by weights
➢ Weights encoded in Collation Elements
  – [ primary weight ] (e.g. base character)
  – [ secondary weight ] (e.g. accents)
  – [ tertiary weight ] (e.g. case-level)
  – [ quaternary weight ] (e.g. punctuation)
➢ Must be accessed sequentially
  – Characters to CEs not 1:1
➢ Canonically equivalent characters have same CEs

49 Sort Keys
➢ Transform a string into a series of bytes that can be binary-compared
  – a:  06 C3 01 20 01 02 00
  – A:  06 C3 01 20 01 08 00
  – á:  06 C3 01 20 32 01 02 02 00
  – ab: 06 C3 06 D7 01 20 20 01 02 02 00
  – b:  06 D7 01 20 01 02 00
Layout of each key: Level 1 bytes, 01, Level 2 bytes, 01, Level 3 bytes, 00
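The idea of a sort key, where the levels are laid out one after another so that a plain comparison gives the right order, can be sketched with a toy three-level key. The weights here are invented for illustration and are not the UCA or ICU weights shown on the slide; the function name is an assumption.

```python
import unicodedata


def toy_sort_key(s: str) -> tuple:
    """Toy key: (base letters, accents, case), compared level by level."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch  # accent belongs to the previous base letter
            continue
        primary.append(ch.lower())               # level 1: base character
        secondary.append("")                     # level 2: accents, filled in above
        tertiary.append(1 if ch.isupper() else 0)  # level 3: lowercase before uppercase
    return (tuple(primary), tuple(secondary), tuple(tertiary))


slide_order = sorted(["b", "ab", "\u00e1", "A", "a"], key=toy_sort_key)
accent_order = sorted(["at", "\u00e0s", "as"], key=toy_sort_key)
case_order = sorted(["a\u00f2", "Ao", "ao"], key=toy_sort_key)
```

Because Python compares tuples element by element, a lower level is consulted only when all higher levels are equal, which mirrors how the byte keys above compare. Passing the function as `key=` also computes each key once per string, the situation where sort keys pay off.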

50 Matching With UCA
➢ Compare CEs for both strings
➢ Adjust for strength
➢ Access sequentially
➢ Stop on first mismatch
➢ Number of characters, CEs may differ
➢ No match if number of CEs different

51 Comparing And Sorting With UCA
➢ Compare CEs
  – Sequential
  – Best performance for single compare
  – Must have same number of CEs
➢ Compare sort keys
  – Best performance for multiple compares

52 String Search And Unicode Text
➢ Canonical equivalence
  – Find “à” in “a + ◌̀” and “a + ◌̀” in “à”
➢ Expansions & contractions
  – “ß” = “ss”, “å” = “aa”
  – “ch” is one character
➢ “Whole character” match
  – Don’t find “a” in “a + ◌̀” or “c” in “ch”
➢ Length of pattern, match may differ

53 String Search And Collation
➢ Solves some problems:
  – Same CEs for “à” and “a + ◌̀”
  – Same CEs for “ß” and “ss” (at level 1)
  – Same CEs for “å” and “aa” (at level 1)
  – Won’t find “c” in “ch”
  – Pattern and match have the same length in CEs
➢ Doesn’t solve others:
  – CE for “a” is also the 1st CE for “à” and “a + ◌̀”

54 String Search And Collation, cont.
➢ CEs:
  – Expensive to generate
  – Cheap to compare
  – Sequential access
  – Mapping to character index approximate
➢ “Whole character” match:
  – Not enough information
  – Use character properties
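The “whole character” check can be illustrated without collation at all. The sketch below is an assumption-laden simplification: it normalizes both strings to NFD, searches by code point, and then uses a character property (the combining class from `unicodedata`) to reject a match that is immediately followed by a combining mark. It does not handle contractions such as “ch”, and the function name is hypothetical.

```python
import unicodedata


def find_whole_character(text: str, pattern: str) -> int:
    """Return the index (in the NFD form of text) of the first match that is
    not followed by a combining mark, or -1."""
    text = unicodedata.normalize("NFD", text)
    pattern = unicodedata.normalize("NFD", pattern)
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        # Reject a match that would split a base letter from its combining marks.
        if end == len(text) or not unicodedata.combining(text[end]):
            return start
        start = text.find(pattern, start + 1)
    return -1


composed_in_decomposed = find_whole_character("a\u0300", "\u00e0")
decomposed_in_composed = find_whole_character("\u00e0", "a\u0300")
bare_a_in_a_grave = find_whole_character("\u00e0", "a")
```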

55 Linear Search And Collation
➢ Convert pattern to CEs up front
➢ Sequential access a good fit
  – Can search forwards or backwards
➢ May read a given CE more than once
  – Use circular buffer for performance
➢ Easy to find match bounds
  – Validate for “whole character” match
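A linear search over collation elements can be sketched with a toy CE table. Everything here is a hypothetical stand-in for a real collator: `TOY_EXPANSIONS` gives “ß” two level-1 CEs, every other character maps to one CE (its lowercase form), and each CE remembers which character produced it so that match bounds can be mapped back to character indices.

```python
TOY_EXPANSIONS = {"ß": ["s", "s"]}  # hypothetical: one character, two primary CEs


def toy_ces(text: str):
    """Return (ces, owners): toy level-1 CEs and the index of the character each came from."""
    ces, owners = [], []
    for index, ch in enumerate(text):
        for ce in TOY_EXPANSIONS.get(ch, [ch.lower()]):
            ces.append(ce)
            owners.append(index)
    return ces, owners


def ce_linear_search(text: str, pattern: str):
    """Return (start, end) character indices of the first match, or None."""
    pattern_ces, _ = toy_ces(pattern)  # convert the pattern once, up front
    text_ces, owners = toy_ces(text)
    m = len(pattern_ces)
    if m == 0:
        return None
    for i in range(len(text_ces) - m + 1):
        if text_ces[i:i + m] != pattern_ces:
            continue
        # "Whole character" check: the match must not begin or end inside an expansion.
        starts_clean = i == 0 or owners[i - 1] != owners[i]
        ends_clean = i + m == len(text_ces) or owners[i + m - 1] != owners[i + m]
        if starts_clean and ends_clean:
            return owners[i], owners[i + m - 1] + 1
    return None


bounds = ce_linear_search("My fußball table", "fuss")
```

The four-character pattern matches a three-character span of the text, which is the “length of pattern, match may differ” situation.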

56 Boyer-Moore Search And Collation
➢ Use pattern CEs to compute skip tables
  – CE “alphabet” large
  – Use a hash function
➢ Fetch target CEs backwards
  – Even for backwards search
➢ Access pattern makes buffering difficult

57-59 Boyer-Moore And Collation Example
– Text: “My fußball table”
– Pattern: “fuss”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       3
u       2
s       1
other   4
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -s       1
2             -ss      4
3             -uss     4
My fußball table
fuss
OOPS: a shift computed for the four-character pattern skips past the three-character match “fuß”.

60 Solving The Skipping Problem
➢ New function: minLengthInChars
  – Shortest string generating same CEs
  – Always treat pattern as shortest
  – May not always skip as far as it could
  – Will never skip too far
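The idea behind minLengthInChars can be sketched with a toy expansion table. This is not ICU's implementation; the table, the helper names, and the dynamic-programming approach are assumptions chosen to show what "shortest string generating the same CEs" means.

```python
TOY_EXPANSIONS = {"ß": ("s", "s")}  # hypothetical: one character that yields two CEs


def toy_ces(text: str) -> list:
    """Toy level-1 CEs: expansions from the table, otherwise one CE per character."""
    ces = []
    for ch in text:
        ces.extend(TOY_EXPANSIONS.get(ch, (ch.lower(),)))
    return ces


def min_length_in_chars(pattern: str) -> int:
    """Length of the shortest string that generates the same toy CEs as pattern."""
    ces = toy_ces(pattern)
    # best[i] is the fewest characters that can generate the first i CEs.
    best = [0] * (len(ces) + 1)
    for i in range(1, len(ces) + 1):
        best[i] = best[i - 1] + 1  # one ordinary character for this CE
        for expansion in TOY_EXPANSIONS.values():
            k = len(expansion)
            if k <= i and tuple(ces[i - k:i]) == expansion:
                # A single expanding character could have produced the last k CEs.
                best[i] = min(best[i], best[i - k] + 1)
    return best[-1]


shortest = min_length_in_chars("fuss")
```

Building the skip tables for a pattern of this shorter length keeps every shift conservative: the search may skip less than it could, but it cannot jump over a match that is shorter than the pattern as typed.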

61-65 minLengthInChars Example
– Text: “My fußball table”
– Pattern: “fuss” (treat as “fuß”)
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       2
u       1
other   3
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -ß       3
2             -uß      3
My fußball table
   fuß
Match Found

66 Quick Search And Collation
➢ Search pattern forward
  – Fastest way to get CEs
➢ Can also search backward
  – Slower than forward
➢ Other search orders:
  – More expensive pre-processing
  – Non-sequential access expensive

67-69 Quick Search And Collation Example
– Text: “My fussball table”
– Pattern: “fussball” (treated as “fußball”)
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       7
u       6
s       5
b       4
a       3
l       1
other   8
My fussball table
fußball
OOPS: the pattern and the matching text differ in length, so the shift taken from the character after the window skips past the match.

70 Quick Search And Collation Conclusion
➢ Must compare backwards
➢ More expensive to fetch CE after pattern
  – Non-sequential access
  – Character after might generate multiple CEs
➢ Boyer-Moore seems like a better fit
  – No need to fetch extra CE
  – Sequential access between skips

71 Summary
➢ Matching, sorting, and searching are essential text handling tools
➢ Using the character code points is not sufficient
➢ Implementing the Unicode Standard and the Unicode Collation Algorithm is the way to go
➢ Implementation requires special considerations

72 Questions?

73 More Information
➢ Unicode Collation Algorithm
  – http://unicode.org/reports/tr10
➢ ICU
  – http://www.icu-project.org

