Sorting and Searching
Helena Shih, GCoC Manager, IBM
First ICU Developer Workshop, Cupertino, CA, USA, September 2000

Agenda
What is language-sensitive collation?
An overview of the ICU collation components.
How do you add customized collation rules?
What is the collation versioning mechanism?
How do you search with the ICU collation APIs?
What's the difference between ICU, the JDK and ICU4J?
Examples and exercises.

Introduction
How hard can this be? Isn't Unicode just another character set?
Accented characters:
  – minor variants: e vs. é vs. e´
  – distinct letters: Å sorts after Z and Æ in Danish
  – two letters: ä is ae in traditional German

Introduction
How hard can this be? Isn't Unicode just another character set?
Accented characters
Expanding and contracting characters:
  – German ß is treated as ss
  – Spanish ch is treated as a single letter after c

Introduction
How hard can this be? Isn't Unicode just another character set?
Accented characters
Expanding and contracting characters
Ignorable characters:
  – blackbird vs. black–bird
  – The "–" is ignorable

Collation in ICU
Simple, data-driven, rule-based collation
Rule support for more than 35 languages
Correct handling of accents, expansion, contraction and so on
Easily customizable for your needs
Offers both incremental comparison for one-off comparisons and collation keys for batch processing

Examples
C++:
    UErrorCode status = U_ZERO_ERROR;
    Collator *myCollator = Collator::createInstance(Locale::US, status);
    if (U_FAILURE(status)) {
        cout << "Failed to create a US collator.\n";
        return;
    }
    delete myCollator;
C:
    UErrorCode status = U_ZERO_ERROR;
    UCollator *myCollator = ucol_open(ULOC_US, &status);
    if (U_FAILURE(status)) {
        printf("Failed to create a US collator.\n");
        return;
    }
    ucol_close(myCollator);

Extended Example
C++:
    UErrorCode status = U_ZERO_ERROR;
    Collator *myCollator = Collator::createInstance(Locale::US, status);
    if (U_FAILURE(status)) {
        cout << "Failed to create a US collator.\n";
        return;
    }
    // At PRIMARY strength, accent differences are not significant.
    myCollator->setStrength(Collator::PRIMARY);
    if (myCollator->compare("café", "cafe") == 0) {
        cout << "Success!! Strings compare as equal.\n";
    }
    delete myCollator;

Collation Options
Collation strength:
  – PRIMARY: a letter difference, e.g. 'a' and 'b'.
  – SECONDARY: an accent difference, e.g. 'ä' and 'å'.
  – TERTIARY: a case difference, e.g. 'a' and 'A'.
  – IDENTICAL: bitwise equality, e.g. 'a' and 'a'.
Normalization mode:
  – NO_OP: no normalization.
  – COMPOSE: UTR 15 form C.
  – COMPOSE_COMPAT: UTR 15 form KC.
  – DECOMP: UTR 15 form D.
  – DECOMP_COMPAT: UTR 15 form KD.

Secrets Behind the Scene
The string is converted to a list of "collation elements".
Each element contains 3 components: primary, secondary and tertiary.
Example: [figure not preserved]

CollationElementIterator
Direct access to collation elements:
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedCollator *myCollator =
        (RuleBasedCollator*)Collator::createInstance(Locale::US, status);
    if (U_FAILURE(status)) return;
    CollationElementIterator *iter =
        myCollator->createCollationElementIterator("café");
    int32_t elem;
    while ((elem = iter->next(status)) != CollationElementIterator::NULLORDER) {
        if (U_FAILURE(status)) break;   // fall through to the cleanup below
        // Print the whole element, then its primary weight, in hexadecimal.
        cout << "Element is: " << hex << elem << '\n';
        cout << "  primary: " << hex
             << CollationElementIterator::primaryOrder(elem) << '\n';
    }
    delete iter;
    delete myCollator;

The rule symbols and their usage
French secondary
  '<' : Greater, as a primary difference
  ';' : Greater, as a secondary difference
  ',' : Greater, as a tertiary difference
  '=' : Equal, no difference
  '&' : Reset
All punctuation symbols in the ASCII range are reserved.

Examples
Rules:       "a, A < b, B < c, C < ch, cH, Ch, CH < d, D < e, E"
Comparisons: "abc" <<< "ABC"
             "achb" < "adb"

Rules:       "a, A < b, B < c, C < d, D < e, E & AE; ä"
Comparisons: "aeb" << "äb"
             "acb" < "äb"

Rules:       ".... q, Q & Question'-'mark = '?'...."
Comparisons: "?" == "Question-mark"

Rules:       ".... & aa ; a- & ee ; e- & ii ; i- & oo ; o- & uu ; u-...."
Comparisons: "baab" << "ba-b"

Note:
  '<<<' : tertiary difference
  '<<'  : secondary difference
  '<'   : primary difference
  '=='  : no difference

Collation and ResourceBundle
Collation rules can be overwritten completely (or not).
Two sets of version information are provided:
  – Data: the "CollationElements":"Version" tag from the ResourceBundle
  – Code: Collator::getVersion() or ucol_getVersion()

ResourceBundle Example
    {
        CollationElements {
            Version { "1.0" }
            Override { "FALSE" }
            Sequence { "& A < \u00E6\u0301, \u00C6\u0301& Z < \u00E6, \u00C6;"
                " a\u0308, A\u0308 < \u00F8, \u00D8 ; o\u0308, O\u0308 ; o\u030B, O\u030B< a\u030A"
                ", A\u030A, aa, aA, Aa, AA & V, w, W & Y ; u\u0308, U\u0308" }
        }
    }

Searching in ICU
Compare "collation elements", not characters.
Brute-force search works.

Comparing Collation Elements
    // The iterators are advanced by next(), so they cannot be const.
    UBool match(CollationElementIterator *text, CollationElementIterator *pattern)
    {
        UErrorCode status = U_ZERO_ERROR;
        while (TRUE) {
            int32_t patElem = pattern->next(status);
            int32_t textElem = text->next(status);
            if (U_FAILURE(status)) return FALSE;
            if (patElem == CollationElementIterator::NULLORDER) {
                return TRUE;    // End of the pattern
            } else if (patElem != textElem) {
                return FALSE;   // Mismatch
            }
        }
    }

Simple Search Example
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString text("Now is the time for all good women");
    UnicodeString pattern("for");
    CollationElementIterator *patternIter =
        myCollator->createCollationElementIterator(pattern);
    CollationElementIterator *textIter =
        myCollator->createCollationElementIterator(text);
    for (int32_t i = 0; i < text.length(); i++) {
        textIter->setOffset(i, status);   // start comparing at position i
        patternIter->reset();             // rewind the pattern for each attempt
        if (match(textIter, patternIter)) {
            // Found a match at position i
        }
    }
    delete patternIter;
    delete textIter;

What's Wrong?
match() treats any difference as significant:
  – Won't find résumé if searching for resume
  – Won't find ß if searching for ss

What's Wrong?
match() treats any difference as significant
Ignorable characters aren't ignored:
  – Won't find black–bird if searching for blackbird

Collation Element
An ICU collation element is a 32-bit integer:
  – 16 high bits for the primary portion
  – 8 bits for the secondary portion
  – 8 low bits for the tertiary portion
Use bitmasks to implement weights:
    int32_t getMask(Collator::ECollationStrength weight)
    {
        switch (weight) {
        case Collator::PRIMARY:   return 0xFFFF0000;   // primary bits only
        case Collator::SECONDARY: return 0xFFFFFF00;   // primary + secondary
        default:                  return 0xFFFFFFFF;   // all three levels
        }
    }
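The masking idea can be tried without ICU. The sketch below is a minimal illustration that assumes only the 16/8/8-bit layout described on this slide; the `Strength` enum, `makeElement`, `sameAt` and the weight values are hypothetical stand-ins, not ICU types or real collation data.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the collator strength constants (not an ICU type).
enum class Strength { Primary, Secondary, Tertiary };

// Mask that keeps only the weights significant at the given strength.
inline std::uint32_t maskFor(Strength s) {
    switch (s) {
        case Strength::Primary:   return 0xFFFF0000u;  // 16 high bits
        case Strength::Secondary: return 0xFFFFFF00u;  // primary + secondary
        default:                  return 0xFFFFFFFFu;  // all three levels
    }
}

// Pack hypothetical weights into one 32-bit element (16/8/8 bits).
inline std::uint32_t makeElement(std::uint32_t primary, std::uint32_t secondary,
                                 std::uint32_t tertiary) {
    return ((primary & 0xFFFFu) << 16) | ((secondary & 0xFFu) << 8) | (tertiary & 0xFFu);
}

// Two elements are equal at a strength if they agree after masking.
inline bool sameAt(std::uint32_t a, std::uint32_t b, Strength s) {
    return (a & maskFor(s)) == (b & maskFor(s));
}

// Usage with made-up weights: same letter, different accent or case.
inline void maskExample() {
    const std::uint32_t plainE = makeElement(0x0065, 0x01, 0x01);
    const std::uint32_t acuteE = makeElement(0x0065, 0x02, 0x01);  // accent differs
    const std::uint32_t upperE = makeElement(0x0065, 0x01, 0x02);  // case differs
    assert(sameAt(plainE, acuteE, Strength::Primary));
    assert(!sameAt(plainE, acuteE, Strength::Secondary));
    assert(sameAt(plainE, upperE, Strength::Secondary));
    assert(!sameAt(plainE, upperE, Strength::Tertiary));
}
```

Unsigned integers are used here so that the masks are ordinary constants; the slide's version returns `int32_t`.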

Updated Match() I
    UBool match(CollationElementIterator *text, CollationElementIterator *pattern,
                Collator::ECollationStrength weight)
    {
        UErrorCode status = U_ZERO_ERROR;
        int32_t mask = getMask(weight);   // keep only the significant weights
        while (TRUE) {
            int32_t patElem = pattern->next(status);
            int32_t textElem = text->next(status);
            if (U_FAILURE(status)) return FALSE;
            if (patElem == CollationElementIterator::NULLORDER) {
                return TRUE;    // End of the pattern
            } else if ((patElem & mask) != (textElem & mask)) {
                return FALSE;   // Mismatch
            }
        }
    }

Ignorable Characters
We still don't handle ignorable characters, e.g. the '–' in "black–bird".
Accented letters can be represented in two different ways:
  – Precomposed character: é (00E9)
  – Combining sequence: e + ´ (0065 0301)

Ignorable Elements
Accents have collation elements too:
  – e
  – e ´
For primary weight, mask with FFFF0000:
  – e
  – e ´
The hyphen works the same way.
[element values not preserved]

Update match() to Ignore Elements
Pattern: b l a c k b i r d
Target:  b l a c k – b i r d
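One way to picture the update: after masking, an element whose remaining bits are all zero is ignorable and is skipped on both sides before comparing. The sketch below works on plain vectors of 32-bit values rather than ICU iterators; `matchAt`, `skipIgnorable`, `toyElements` and the weights they produce are hypothetical, chosen only so that the hyphen has a zero primary weight.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Elements = std::vector<std::uint32_t>;

// Advance i past elements that vanish under the mask (ignorable at this strength).
inline std::size_t skipIgnorable(const Elements& e, std::size_t i, std::uint32_t mask) {
    while (i < e.size() && (e[i] & mask) == 0) ++i;
    return i;
}

// True if the pattern matches the text starting at textStart,
// comparing masked elements and skipping ignorable ones.
inline bool matchAt(const Elements& text, std::size_t textStart,
                    const Elements& pattern, std::uint32_t mask) {
    std::size_t t = textStart;
    std::size_t p = 0;
    for (;;) {
        p = skipIgnorable(pattern, p, mask);
        if (p == pattern.size()) return true;    // end of the pattern
        t = skipIgnorable(text, t, mask);
        if (t == text.size()) return false;      // text ran out first
        if ((pattern[p] & mask) != (text[t] & mask)) return false;  // mismatch
        ++p;
        ++t;
    }
}

// Toy element builder (made-up weights): letters get their character code as
// the primary weight; '-' gets a zero primary weight so it is primary-ignorable.
inline Elements toyElements(const char* s) {
    Elements out;
    for (; *s != '\0'; ++s) {
        if (*s == '-') {
            out.push_back(0x00000100u);
        } else {
            const auto code = static_cast<std::uint32_t>(static_cast<unsigned char>(*s));
            out.push_back((code << 16) | 0x0101u);
        }
    }
    return out;
}

inline void ignorableExample() {
    const Elements target = toyElements("black-bird");
    const Elements pattern = toyElements("blackbird");
    assert(matchAt(target, 0, pattern, 0xFFFF0000u));   // hyphen ignored at primary strength
    assert(!matchAt(target, 0, pattern, 0xFFFFFFFFu));  // hyphen significant with the full mask
}
```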

Boyer-Moore searching
    silly_spring_string
    string
Start comparing at the end of the pattern.
The space in the text doesn't match the "g".
There is no space anywhere in the pattern, so we can advance by six characters rather than just one.

Boyer-Moore searching
    silly_spring_string
          string
"p" and "t" do not match.
There is no "p" in the pattern, so we can advance by two.

Boyer-Moore searching
    silly_spring_string
            string
"s" and "g" do not match.
We know there is an "s" at the start of the pattern, so we can advance until that "s" lines up with the "s" in the text.

Boyer-Moore searching
    silly_spring_string
                 string
We found a match!
There were 13 comparisons, vs. 21 for the brute-force approach.
A less contrived example would do even better, because it would have fewer spurious partial matches.

Shifting
How do you know how far to shift?
Precomputed shift tables: [table not preserved]
Computing the tables:
  – The value is how far from the end of the pattern a character occurs
  – If it occurs twice, take the lower number

Shifting Issues
If you shift the pattern too far, you can miss a valid match in the text.
Shifting too little only hurts performance.
When in doubt, less is better:
  – If a character occurs twice in the pattern, use the lesser of the two shift distances.
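The table rule above can be sketched for single-byte characters as follows. This is an illustration over plain `std::string` input, not ICU code; `buildShiftTable` is a hypothetical name, and assigning the full pattern length to characters that never occur in the pattern is the usual Boyer-Moore convention, assumed here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

using ShiftTable = std::array<std::size_t, 256>;

// For each byte value, record how far from the end of the pattern it occurs.
// Bytes that do not occur in the pattern get the full pattern length.
inline ShiftTable buildShiftTable(const std::string& pattern) {
    ShiftTable table;
    table.fill(pattern.size());
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const auto c = static_cast<unsigned char>(pattern[i]);
        // Later occurrences are closer to the end, so overwriting keeps
        // the smaller (safer) distance when a character occurs twice.
        table[c] = pattern.size() - 1 - i;
    }
    return table;
}

inline void shiftTableExample() {
    const ShiftTable t = buildShiftTable("string");
    assert(t['s'] == 5);  // first character: five from the end
    assert(t['g'] == 0);  // last character
    assert(t['_'] == 6);  // not in the pattern: whole pattern length
}
```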

Boyer-Moore and Unicode
Traditional Boyer-Moore shift table indices are 1-byte characters (256 entries).
Unicode is too big: a table indexed directly by character would have far too many entries.
Collation elements have over 4 billion possible values.

Hashing
Large character sets can be collapsed to a manageable size with hashing.
Shift table indices are hash values, not characters.
Collision? Use the smaller shift distance!
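A hashed shift table along these lines might look like the sketch below. It assumes a simple hash of the primary weight modulo 255; `hashElement`, `buildHashedShiftTable` and the element values are hypothetical, and the only rule taken from the slide is that colliding entries keep the smaller distance.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTableSize = 255;
using HashedShiftTable = std::array<std::size_t, kTableSize>;

// Collapse a 32-bit collation element to a small table index
// by hashing its 16-bit primary weight.
inline std::size_t hashElement(std::uint32_t element) {
    return (element >> 16) % kTableSize;
}

// Shift table indexed by hash value instead of by character.
inline HashedShiftTable buildHashedShiftTable(const std::vector<std::uint32_t>& pattern) {
    HashedShiftTable table;
    table.fill(pattern.size());  // default: shift by the whole pattern
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const std::size_t slot = hashElement(pattern[i]);
        const std::size_t distance = pattern.size() - 1 - i;
        // On a collision (or a repeated element) keep the smaller distance,
        // which can cost speed but never skips over a valid match.
        table[slot] = std::min(table[slot], distance);
    }
    return table;
}

inline void hashedTableExample() {
    // Primary weights 0x0001 and 0x0100 both hash to slot 1 (256 % 255 == 1).
    const std::vector<std::uint32_t> pattern = {0x00010000u, 0x01000000u, 0x00020000u};
    const HashedShiftTable t = buildHashedShiftTable(pattern);
    assert(t[1] == 1);  // collision: min(2, 1)
    assert(t[2] == 0);  // last element of the pattern
    assert(t[3] == 3);  // unused slot: pattern length
}
```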

Hash Functions
Simple hash functions are fine:
    static int hash(int element) {
        return (element >>> 16) % 0x00FF;
    }
Complicated ones may be slightly better:
    static int hash(int element) {
        return ((element >>> 16) * ) % 251;
    }

Searching Mechanism
1. Build the shift table.
2. Set textIndex = patLength.
3. textIndex < length? If no: Not Found.
4. If yes: set patIndex = patLength.
5. Pattern matches? If yes: Found! If no: textIndex += shift, then return to step 3.
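The loop can be sketched end to end on plain bytes. This is a simplified illustration using only the shift table described earlier in the deck (no hashing, no collation elements); `boyerMooreSearch` and `SearchResult` are hypothetical names, and the sketch tracks the alignment of the pattern's first character rather than the `textIndex` and `patIndex` variables named in the flowchart.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

struct SearchResult {
    std::size_t position;     // std::string::npos when not found
    std::size_t comparisons;  // character comparisons performed
};

inline SearchResult boyerMooreSearch(const std::string& text, const std::string& pattern) {
    const std::size_t m = pattern.size();
    SearchResult result{std::string::npos, 0};
    if (m == 0 || m > text.size()) return result;

    // Build the shift table: distance of each byte from the end of the pattern.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i < m; ++i)
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t start = 0;  // text index aligned with pattern[0]
    while (start + m <= text.size()) {
        std::size_t j = m;  // compare right to left
        while (j > 0) {
            ++result.comparisons;
            if (pattern[j - 1] != text[start + j - 1]) break;
            --j;
        }
        if (j == 0) {  // every character matched
            result.position = start;
            return result;
        }
        // Mismatch at pattern index j - 1: line the mismatched text character up
        // with its occurrence in the pattern, always moving at least one step.
        const std::size_t fromEnd = m - j;
        const std::size_t distance = shift[static_cast<unsigned char>(text[start + j - 1])];
        start += (distance > fromEnd) ? distance - fromEnd : 1;
    }
    return result;
}

inline void searchExample() {
    const SearchResult r = boyerMooreSearch("silly_spring_string", "string");
    assert(r.position == 13);     // match begins at the final "string"
    assert(r.comparisons == 13);  // the count quoted on the Boyer-Moore slides
}
```

On the deck's example the shifts are 6, 2 and 5, the same advances the Boyer-Moore slides walk through.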

ICU and Java
The collation service is built into Sun's JDK.
The collation services for ICU and Java share a parallel design, architecture and set of resources.
Additional enhancements may be available in ICU4C and IBM's JVM only.
The searching service is available via ICU4J.

Summary
Language-sensitive Unicode collation in ICU
Why Unicode collation is important
What the collation options are
Simple and extended example usage
Collation element iterator usage
Simple brute-force searching example
Efficient Boyer-Moore searching and Unicode
ICU and Java comparison

References
ICU4C website
ICU4J website
ICU Workshop Information

Future Directions
Further collation performance enhancements
Upgrade to the full Unicode collation algorithm
Miscellaneous collation features
Boyer-Moore searching APIs

Collation Exercises (C and C++)
Exercise 1:
  – Open a collator with a locale.
  – Compare two strings with the collator.
  – Set the strength to tertiary and compare the strings again.
  – Get the keys for the strings and compare them.
Exercise 2:
  – Open a Collator with customized rules and attributes.
  – Compare two strings with the collator.
  – Open a CollationElementIterator.
  – Walk through the text with the element iterator.