1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring.

Slides:



Advertisements
Similar presentations
1 Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming.
Advertisements

Jiaheng Lu, University of California, Irvine
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Space-Constrained Gram-Based Indexing for Efficient.
Indexing DNA Sequences Using q-Grams
Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.
Chen Li ( 李晨 ) Chen Li Search As You Type Joint work with colleagues at UCI and Tsinghua.
Efficient Interactive Fuzzy Keyword Search Shengyue Ji 1, Guoliang Li 2, Chen Li 1, Jianhua Feng 2 1 University of California, Irvine 2 Tsinghua University.
Chen Li ( 李晨 ) Chen Li Scalable Interactive Search NFIC August 14, 2010, San Jose, CA Joint work with colleagues at UC Irvine and Tsinghua University.
1 Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003.
Efficient Top-k Algorithms for Approximate Substring Matching Presented by Jagadeesh Potluri Shiva Krishna Imminni.
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.
A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg.
The Flamingo Software Package on Approximate String Queries Chen Li UC Irvine and Bimaple
Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity.
An Overview of Similarity Query Processing 김종익 전북대학교 컴퓨터공학부.
Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search Alexander Behm 1, Shengyue Ji 1, Chen Li 1, Jiaheng.
Speaker: Sattam Alsubaiee Supporting Location-Based Approximate-Keyword Queries Sattam Alsubaiee, Alexander Behm, and Chen Li University of California,
Efficient Type-Ahead Search on Relational Data: a TASTIER Approach Guoliang Li 1, Shengyue Ji 2, Chen Li 2, Jianhua Feng 1 1 Tsinghua University, Beijing,
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore VLDB’2005 * Liang Jin and.
Liang Jin and Chen Li VLDB’2005 Supported by NSF CAREER Award IIS Selectivity Estimation for Fuzzy String Predicates in Large Data Sets.
Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently Xiaochun Yang, Bin Wang Chen Li Northeastern.
Review of Claremont Report on Database Research Jiaheng Lu Renmin University of China.
Text Search and Fuzzy Matching
Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,
DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.
Click to edit Present’s Name Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2 HmSearch: An Efficient Hamming Distance Query Processing.
Record Linkage: A 10-Year Retrospective Chen Li and Sharad Mehrotra UC Irvine 1.
An Effective Approach for Searching Closest Sentence Translations from The Web Ju Fan, Guoliang Li, and Lizhu Zhou Database Research Group, Tsinghua University.
Experiments An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Entity Extraction A Document An Efficient Filter.
Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.
Experiments Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction Entity Extraction A Document An Efficient Filter.
Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1.
VGRAM:Improving Performance of Approximate Queries on String Collections Using Variable- Length Grams VLDB 2007 Chen Li (UC, Irvine) Bin Wang (Northeastern.
Presented by: Aneeta Kolhe. Named Entity Recognition finds approximate matches in text. Important task for information extraction and integration, text.
Oracle 8i interMedia Text Presented by Jorge Rimblas 4-Feb-2002 SSI Worldwide.
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore * Liang Jin and Chen Li:
Ravello, Settembre 2003Indexing Structures for Approximate String Matching Alessandra Gabriele Filippo Mignosi Antonio Restivo Marinella Sciortino.
Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently.
Fast Indexes and Algorithms For Set Similarity Selection Queries M. Hadjieleftheriou A.Chandel N. Koudas D. Srivastava.
Improving Search for Emerging Applications * Some techniques current being licensed to Bimaple Chen Li UC Irvine.
Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.
Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG.
Efficient Approximate Search on String Collections Part I
Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China)
CS122B: Projects in Databases and Web Applications Spring 2017
CS122B: Projects in Databases and Web Applications Winter 2017
Top-k String Similarity Search with Edit-Distance Constraints
Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China)
Text Joins in an RDBMS for Web Data Integration
Weighted Exact Set Similarity Join
CS122B: Projects in Databases and Web Applications Winter 2018
Supporting of search-as-you-type using sql in databases
Efficient Record Linkage in Large Data Sets
Jongik Kim1, Dong-Hoon Choi2, and Chen Li3
CS122B: Projects in Databases and Web Applications Winter 2019
CS122B: Projects in Databases and Web Applications Spring 2018
CS122B: Projects in Databases and Web Applications Spring 2018
Minwise Hashing and Efficient Search
CS122B: Projects in Databases and Web Applications Winter 2018
CS122B: Projects in Databases and Web Applications Winter 2019
CS122B: Projects in Databases and Web Applications Winter 2019
CS122B: Projects in Databases and Web Applications Winter 2018
An Efficient Partition Based Method for Exact Set Similarity Joins
CS122B: Projects in Databases and Web Applications Spring 2018
CS122B: Projects in Databases and Web Applications Winter 2018
Presentation transcript:

1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring 2015

22 Example: a movie database StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonIron man2008Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime Find movies starred Schwarrzenger.

33 Problem definition: approximate string searches … Schwarzenger Samuel Jackson Keanu Reeves Star Query q: Collection of strings s Search Output: strings s that satisfy Sim(q,s)≤ δ Sim functions: edit distance, Jaccard Coefficient and Cosine similarity Schwarrzenger

Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE 4

5 A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 5

State-of-the-art: Oracle 10g and older versions Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; Limitation: cannot handle errors in the first letters: Katherine versus Catherine 6

7 Microsoft SQL Server Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores 7

Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix pruning followed by a scan (Efficiency?) 8

99 Outline  Gram-based approaches  Trie-based approaches

10 String  Grams  q-grams (un),(ni),(iv),(ve),(er),(rs),(sa),(al) For example: 2-gram universal

11 Inverted lists  Convert strings to gram inverted lists id strings rich stick stich stuck static grams at ch ck ic ri st ta ti tu uc

12 Main Example Query Merge DataGrams stick (st,ti,ic,ck) count >=2 idstrings 0rich 1stick 2stich 3stuck 4static ck ic st ta ti … 1,3 1,2,3,4 4 1,2,4 ed(s,q)≤1 0,1,2,4 Candidates

13 Problem definition: Find elements whose occurrences ≥ T Ascending order Merge

14 Example  T = 4 Result:

Five Merge Algorithms HeapMerger [Sarawagi,SIGMOD 2004] MergeOpt [Sarawagi,SIGMOD 2004] ScanCount MergeSkipDivideSkip

16 Heap-based Algorithm Min-heap Count # of the occurrences of each element by a heap Push to heap ……

17 MergeOpt Algorithm Long Lists: T-1Short Lists Binary search

18 Example of MergeOpt [Sarawagi et al 2004] Count threshold T≥ 4 Long Lists: 3 Short Lists: 2

19 Five Merge Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip

20 ScanCount Example 123…123… Count threshold T≥ 4 # of occurrences Increment by 1 1 String ids Result!

21 Five Merge Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip

22 MergeSkip algorithm Min-heap …… Pop T-1 T-1 Jump Greater or equals

23 Example of MergeSkip Count threshold T≥ 4 minHeap Jump

24 Skip is safe Min-heap …… # of occurrences of skipped elements ≤ T-1 Skip

25 Five Merge Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip

26 DivideSkip Algorithm Long ListsShort Lists Binary search MergeSkip

27 How many lists are treated as long lists? ? Short Lists Merge Long Lists Lookup

28 Performance (DBLP) DivideSkip is the best one

29 Trie-Based Approach

Trie Indexing e x a m p l $ $ e m p l a r $ t $ s a m p l e $e Strings exam example exemplar exempt sample 30

Active nodes on Trie e x a m p l $ $ e m p l a r $ t $ s a m p l e $e PrefixDistance examp2 exampl1 example0 exempl2 exempla2 sample2 Query: “example” Edit-distance threshold =

Initialization e x a m p l $ $ e m p l a r $ t $ s a m p l e $e Q = ε PrefixDistance PrefixDistance 0 e1 ex2 s1 sa2 PrefixDistance ε0 Initial active nodes: all nodes within depth δ 32

Incremental Algorithm Return leaf nodes as answers. 33

34 Advantages: Trie size is small Can do search as the user types Disadvantages Works for edit distance only Good and bad 34

35 References 1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu. ICDE Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng, WWW 2009