1 Notes 06: Efficient Fuzzy Search
Professor Chen Li, Department of Computer Science, UC Irvine
CS122B: Projects in Databases and Web Applications, Spring 2015

2 Example: a movie database
Star             Title            Year  Genre
Keanu Reeves     The Matrix       1999  Sci-Fi
Samuel Jackson   Iron man         2008  Sci-Fi
Schwarzenegger   The Terminator   1984  Sci-Fi
Samuel Jackson   The man          2006  Crime
Query: find movies starring "Schwarrzenger" (a misspelled name, so an exact match finds nothing).

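3 Problem definition: approximate string search
Collection of strings s (the Star column): Schwarzenger, Samuel Jackson, Keanu Reeves, …
Query q: Schwarrzenger
Output: strings s that satisfy Sim(q, s) ≤ δ, where Sim is a distance such as edit distance (for a similarity measure such as Jaccard or cosine, the condition is Sim(q, s) ≥ δ)
Sim functions: edit distance, Jaccard coefficient, and cosine similarity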

4 Similarity Functions
"Similar to": a domain-specific function that returns a similarity value between two strings.
Examples:
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE

5 Edit Distance
A widely used metric to define string similarity.
ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2.
Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2 (substitute m with n, delete the final s).
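The definition above can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to do it, not necessarily the implementation used in the course; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the first i-1 letters of s1 and the first j letters of s2.
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning i letters into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept, so memory is proportional to the length of s2 rather than to the product of both lengths.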

6 State-of-the-art: Oracle 10g and older versions
Supported by Oracle Text.
    CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /
    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters, for example Katherine versus Catherine.

7 Microsoft SQL Server
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses a data flow pipeline of transformations
- Similarity function: tokens with TF/IDF scores

8 Lucene
- Uses Levenshtein distance (edit distance)
- Example: roam~0.8
- Prefix pruning followed by a scan (efficiency?)

9 Outline
- Gram-based approaches
- Trie-based approaches

10 String → Grams
q-grams: all substrings of length q.
For example, the 2-grams of "universal": (un), (ni), (iv), (ve), (er), (rs), (sa), (al)
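A minimal sketch of gram extraction, assuming plain overlapping q-grams with no padding characters (as in the slide's example). The function name `qgrams` is an illustrative choice.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, and a string shorter than q yields none.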

11 Inverted lists
Convert strings to gram inverted lists.
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
2-gram  ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
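The index for these five strings can be built as sketched below. This is an illustrative construction, assuming each string contributes its id at most once per gram; `build_gram_index` is a name chosen here.

```python
from collections import defaultdict


def build_gram_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of the strings that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps one entry per gram even if the gram repeats inside the string.
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}
        for g in grams:
            index[g].append(sid)  # ids arrive in increasing order, so lists stay sorted
    return dict(index)


index = build_gram_index(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in scan order, every inverted list comes out sorted, which the merge algorithms rely on.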

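12 Main Example
Data (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
Query: stick, with grams (st, ti, ic, ck) and condition ed(s, q) ≤ 1
Inverted lists (excerpt):
ck   1, 3
ic   0, 1, 2, 4
st   1, 2, 3, 4
ta   4
ti   1, 2, 4
…
Merge the lists of the query grams and keep the ids with count ≥ 2.
Candidates: 1, 2, 3, 4 (each is then verified against ed(s, q) ≤ 1)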
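The count ≥ 2 threshold in this example is consistent with the standard count-filter bound T = (|q| - n + 1) - k·n for n-grams and edit-distance threshold k, here 4 - 1·2 = 2. That formula does not appear on the slide and is stated as an assumption. A rough sketch of the candidate step, with all function names chosen for illustration:

```python
from collections import Counter


def grams_of(s, n=2):
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def count_filter_candidates(query, strings, k=1, n=2):
    """Return (candidate ids, threshold) using the count filter on n-gram inverted lists."""
    index = {}
    for sid, s in enumerate(strings):
        for g in set(grams_of(s, n)):
            index.setdefault(g, []).append(sid)

    query_grams = grams_of(query, n)
    threshold = len(query_grams) - k * n  # assumed bound, see the lead-in
    if threshold <= 0:
        # The filter cannot prune anything; every string must be verified.
        return list(range(len(strings))), threshold

    counts = Counter()
    for g in query_grams:
        for sid in index.get(g, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold), threshold


data = ["rich", "stick", "stich", "stuck", "static"]
print(count_filter_candidates("stick", data))  # ([1, 2, 3, 4], 2)
```

The candidates still need an edit-distance check: "static" (id 4) shares three grams with the query but is at edit distance 3.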

13 Problem definition: find elements whose number of occurrences ≥ T
Merge the inverted lists, each sorted in ascending order.

14 Example
T = 4
Result: 13
[Figure: sorted id lists being merged; the list boundaries were lost in extraction.]

15 Five Merge Algorithms
- HeapMerger [Sarawagi, SIGMOD 2004]
- MergeOpt [Sarawagi, SIGMOD 2004]
- ScanCount
- MergeSkip
- DivideSkip

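16 Heap-based Algorithm
Keep the current element of every list in a min-heap. Count the number of occurrences of each element as its copies are popped, then push the next element of each popped list onto the heap.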
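A minimal sketch of the heap-based merge, assuming every list is sorted ascending and free of duplicates. The example lists are hypothetical, chosen so that the answer for T = 4 is 13; they are not necessarily the lists in the slide's figure.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    # Heap entries are (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every copy of the smallest id and advance the lists it came from.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]  # hypothetical
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.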

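17 MergeOpt Algorithm
Treat the T-1 longest lists as long lists and the rest as short lists. Merge the short lists, then look up each candidate in the long lists by binary search.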

18 Example of MergeOpt [Sarawagi et al 2004]
Count threshold: T = 4
Long lists: 3
Short lists: 2
[Figure: the five sorted id lists; the list boundaries were lost in extraction.]
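The idea can be sketched as follows. An id with at least T occurrences must appear on some short list, because only T-1 lists are long. This sketch counts the short lists with a `Counter` instead of the heap used in the original algorithm, and the example lists are hypothetical (five lists, so T = 4 gives 3 long and 2 short lists, as on the slide).

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return the ids on at least T lists, probing the T-1 longest lists by binary search."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    counts = Counter()
    for lst in short_lists:  # simplification: a Counter stands in for the heap merge
        counts.update(lst)

    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]  # hypothetical
print(merge_opt(lists, 4))  # [13]
```

The long lists are never scanned in full; each candidate costs one binary search per long list.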

19 Five Merge Algorithms
- HeapMerger
- MergeOpt
- ScanCount
- MergeSkip
- DivideSkip

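20 ScanCount Example
Count threshold: T = 4
Keep an array with the number of occurrences of each string id, initialized to 0. Scan every list and increment the entry of each id by 1. An id whose count reaches T is a result.
[Figure: the count array next to the id lists, with the entry for id 13 reaching 4.]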
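ScanCount as described can be sketched in a few lines. The parameter `num_strings` (the size of the id space) and the example lists are assumptions made for illustration.

```python
def scan_count(lists, T, num_strings):
    """Return the ids that occur on at least T lists, using one counter per string id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, the moment it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]  # hypothetical
print(scan_count(lists, 4, 16))  # [13]
```

No heap and no sorted order are needed, at the price of an array as large as the string collection.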

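22 MergeSkip algorithm
Keep the current element of every list in a min-heap. If the top element occurs fewer than T times, pop T-1 elements in total and jump each popped list to its smallest element greater than or equal to the new top of the heap.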

23 Example of MergeSkip
Count threshold: T = 4
[Figure: min-heap over the id lists, with popped lists jumping forward; the numbers were garbled in extraction.]

24 Skip is safe
Each skipped element can occur only on the popped lists, so its number of occurrences is ≤ T-1 and it cannot be an answer.
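A sketch of MergeSkip under the same assumptions as before (ascending lists without duplicates, hypothetical example data). It follows the pop-and-jump description above and is not taken from the authors' code.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the ids on at least T lists, skipping elements that cannot reach T."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists can never produce an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop until T-1 lists are out, then jump them to the new top of the heap.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]  # hypothetical
print(merge_skip(lists, 4))  # [13]
```

On these lists the first round pops 1, 5, and 5, sees 10 on top of the heap, and jumps the three popped lists straight to their first element ≥ 10.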

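26 DivideSkip Algorithm
Split the lists into long lists and short lists. Run MergeSkip on the short lists, then look up each candidate in the long lists by binary search.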

27 How many lists are treated as long lists?
Trade-off: short lists are merged, long lists are only looked up.
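A sketch of DivideSkip with the number of long lists left as an explicit parameter, since the slide poses that choice as an open question. `num_long`, the helper names, and the example lists are assumptions made for illustration.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (id, count) for every id on at least T lists."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def divide_skip(lists, T, num_long):
    """MergeSkip on the short lists, binary-search lookups on the num_long longest lists."""
    if not 0 <= num_long < T:
        raise ValueError("num_long must satisfy 0 <= num_long < T")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]
    result = []
    # An answer needs at least T - num_long occurrences on the short lists.
    for x, count in merge_skip_counts(short_lists, T - num_long):
        for lst in long_lists:
            j = bisect_left(lst, x)
            if j < len(lst) and lst[j] == x:
                count += 1
        if count >= T:
            result.append(x)
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]  # hypothetical
print(divide_skip(lists, 4, 2))  # [13]
```

With `num_long = 0` this reduces to MergeSkip, and with `num_long = T - 1` every id on a short list becomes a candidate, which is the MergeOpt split.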

28 Performance (DBLP)
DivideSkip performs best.

29 Trie-Based Approach

30 Trie Indexing
Strings: exam, example, exemplar, exempt, sample
[Figure: trie over these strings; $ marks the end of a string.]

31 Active nodes on Trie
Query: "example", edit-distance threshold = 2
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

32 Initialization
Q = ε
Initial active nodes: all nodes within depth δ (here δ = 2)
Prefix  Distance
ε       0
e       1
ex      2
s       1
sa      2

33 Incremental Algorithm
Update the set of active nodes for each new query letter. Return leaf nodes as answers.
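The initialization and the per-letter update can be sketched as below. This is an illustrative reading of the active-node idea, not the authors' code: each active node is updated for a deleted query letter, a substituted letter, or a matching letter (whose descendants are then reachable by insertions). All names are chosen here.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix  # string spelled by the path from the root
        self.children = {}    # letter -> TrieNode


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
    return root


def offer(active, node, dist):
    """Keep the smallest distance seen for a node."""
    if dist < active.get(node, float("inf")):
        active[node] = dist


def expand(active, node, dist, delta):
    """Offer node and its descendants; every extra trie letter costs one insertion."""
    if dist > delta:
        return
    offer(active, node, dist)
    for child in node.children.values():
        expand(active, child, dist + 1, delta)


def step(active, ch, delta):
    """Active nodes for the query extended by the letter ch."""
    new = {}
    for node, d in active.items():
        if d + 1 <= delta:
            offer(new, node, d + 1)            # ch is deleted from the query
        for letter, child in node.children.items():
            if letter == ch:
                expand(new, child, d, delta)   # match, plus insertions below it
            elif d + 1 <= delta:
                offer(new, child, d + 1)       # substitution
    return new


def active_nodes(root, query, delta):
    active = {}
    expand(active, root, 0, delta)  # initialization: all nodes within depth delta
    for ch in query:
        active = step(active, ch, delta)
    return {node.prefix: d for node, d in active.items()}


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(active_nodes(trie, "example", 2).items()))
```

Because each step only needs the previous active-node set, the set can be kept between keystrokes and updated as the user types.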

34 Good and bad
Advantages:
- Trie size is small
- Can do search as the user types
Disadvantages:
- Works for edit distance only

35 References
1. Chen Li, Jiaheng Lu, and Yiming Lu. Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008.
2. Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng. Efficient Interactive Fuzzy Keyword Search. WWW 2009.
