Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.

Slides:

Advertisements

Similar presentations

Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.

Advertisements

Scenario: EOT/EOT-R/COT Resident admitted March 10th Admitted for PT and OT following knee replacement for patient with CHF, COPD, shortness of breath.

Simplifications of Context-Free Grammars

Variations of the Turing Machine

AP STUDY SESSION 2.

Select from the most commonly used minutes below.

Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.

Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.

1 Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming.

Jiaheng Lu, University of California, Irvine

David Burdett May 11, 2004 Package Binding for WS CDL.

Local Customization Chapter 2. Local Customization 2-2 Objectives Customization Considerations Types of Data Elements Location for Locally Defined Data.

Process a Customer Chapter 2. Process a Customer 2-2 Objectives Understand what defines a Customer Learn how to check for an existing Customer Learn how.

1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.

1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.

Media-Monitoring Final Report April - May 2010 News.

Break Time Remaining 10:00.

Turing Machines.

Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.

PP Test Review Sections 6-1 to 6-6

Outline Minimum Spanning Tree Maximal Flow Algorithm LP formulation 1.

Regression with Panel Data

1 The Royal Doulton Company The Royal Doulton Company is an English company producing tableware and collectables, dating to Operating originally.

Association Rule Mining

Operating Systems Operating Systems - Winter 2010 Chapter 3 – Input/Output Vrije Universiteit Amsterdam.

Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.

BEEF & VEAL MARKET SITUATION "Single CMO" Management Committee 22 November 2012.

TESOL International Convention Presentation- ESL Instruction: Developing Your Skills to Become a Master Conductor by Beth Clifton Crumpler by.

Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.

Adding Up In Chunks.

SLP – Endless Possibilities What can SLP do for your school? Everything you need to know about SLP – past, present and future.

MaK_Full ahead loaded 1 Alarm Page Directory (F11)

1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.

Artificial Intelligence

Before Between After.

1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.

1 Let’s Recapitulate. 2 Regular Languages DFAs NFAs Regular Expressions Regular Grammars.

Speak Up for Safety Dr. Susan Strauss Harassment & Bullying Consultant November 9, 2012.

1 Titre de la diapositive SDMO Industries – Training Département MICS KERYS 09- MICS KERYS – WEBSITE.

Essential Cell Biology

Converting a Fraction to %

Numerical Analysis 1 EE, NCKU Tien-Hao Chang (Darby Chang)

CSE20 Lecture 15 Karnaugh Maps Professor CK Cheng CSE Dept. UC San Diego 1.

Clock will move after 1 minute

PSSA Preparation.

Immunobiology: The Immune System in Health & Disease Sixth Edition

Physics for Scientists & Engineers, 3rd Edition

Energy Generation in Mitochondria and Chlorplasts

Chen Li ( 李晨 ) Chen Li Search As You Type Joint work with colleagues at UCI and Tsinghua.

Efficient Interactive Fuzzy Keyword Search Shengyue Ji 1, Guoliang Li 2, Chen Li 1, Jianhua Feng 2 1 University of California, Irvine 2 Tsinghua University.

Chen Li ( 李晨 ) Chen Li Scalable Interactive Search NFIC August 14, 2010, San Jose, CA Joint work with colleagues at UC Irvine and Tsinghua University.

Select a time to count down from the clock above

Copyright Tim Morris/St Stephen's School

1 Dr. Scott Schaefer Least Squares Curves, Rational Representations, Splines and Continuity.

1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.

The Flamingo Software Package on Approximate String Queries Chen Li UC Irvine and Bimaple

1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring.

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,

Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1.

VGRAM:Improving Performance of Approximate Queries on String Collections Using Variable- Length Grams VLDB 2007 Chen Li (UC, Irvine) Bin Wang (Northeastern.

Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.

Efficient Approximate Search on String Collections Part I

Presentation transcript:

Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

DBLP Author Search 2

Try their names (good luck!) 3 Yannis PapakonstantinouMeral OzsoyogluMarios Hadjieleftheriou UCSD Case Western AT&T--Research

 4

Better system? 5

People Search at UC Irvine 6

Web Search  Errors in queries  Errors in data  Bring query and meaningful results closer together Actual queries gathered by Google 7

Data Cleaning R informix microsoft … … S infromix … mcrosoft … 8

Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS 9

Outline Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion 10 Part I Part II

Preliminaries 11 Next…

Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey 12

13 A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 13

State-of-the-art: Oracle 10g and older versions Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; Limitation: cannot handle errors in the first letters: Katherine versus Catherine 14

15 Microsoft SQL Server [CGG+05] Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores 15

Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix pruning followed by a scan (Efficiency?) 16

Trie-based approach [JLL+09] 17 Next…

Trie Indexing e x a m p l $ $ e m p l a r $ t $ s a m p l e $e Strings exam example exemplar exempt sample 18

Active nodes on Trie e x a m p l $ $ e m p l a r $ t $ s a m p l e $e PrefixDistance examp2 exampl1 example0 exempl2 exempla2 sample2 Query: “example” Edit-distance threshold =

Initialization e x a m p l $ $ e m p l a r $ t $ s a m p l e $e Q = ε PrefixDistance PrefixDistance 0 e1 ex2 s1 sa2 PrefixDistance ε0 Initial active nodes: all nodes within depth δ 20

Incremental Algorithm Return leaf nodes as answers. 21

e Q = e x a m p l e e x a m p l $ $ e m p l a r $ t $ s a m p l e $ PrefixDistance ε0 e1 ex2 s1 sa2 Prefix# OpBaseOp ε1εdel e s1εsub e/s e0εmat e ex1εins x exa2εIns xa exe2εIns xe Prefix# OpBaseOp Prefix# OpBaseOp ε1εdel e Prefix# OpBaseOp ε1εdel e s1εsub e/s Prefix# OpBaseOp ε1εdel e s1εsub e/s e0εmat e e2edel e ex2esub e/xex3 del e exa3exsub e/a exe2exmat e s2sdel e sa2ssub e/asa3 del e Active nodes for Q = ε Active nodes for Q = e 2 22

23 Advantages: Trie size is small Can do search as the user types Disadvantages Works for edit distance only Good and bad 23

Gram-based algorithms List-merging algorithms [LLL08] Variable-length grams (VGRAM) [LWY07,YWL08] 24 Next…

“ q-grams ” of strings u n i v e r s a l 2-grams 25

Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 26 If ed(s1,s2) = (|s 1 |- q + 1) – k * q

q-gram inverted lists id strings rich stick stich stuck static grams at ch ck ic ri st ta ti tu uc

# of common grams >= 3 Searching using inverted lists Query: “ shtick ”, ED(shtick, ?)≤1 id strings rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc ti iccksh ht ti ic ck 28

T-occurrence Problem Find elements whose occurrences ≥ T Ascending order Merge 29

Example T = 4 Result:

List-Merging Algorithms HeapMergerMergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkipDivideSkip 31

Heap-based Algorithm Min-heap Count # of occurrences of each element using a heap Push to heap …… 32

MergeOpt Algorithm [SK04] Long Lists: T-1Short Lists Binary search 33

Example of MergeOpt Count threshold T≥ 4 Long Lists: 3 Short Lists: 2 34

35 ScanCount 123…123… Count threshold T≥ 4 # of occurrences Increment by 1 1 String ids Result!

List-Merging Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip [SK04] [LLL08, BK02] 36

MergeSkip algorithm [BK02, LLL08] Min-heap …… Pop T-1 T-1 Jump Greater or equals 37

Example of MergeSkip Count threshold T≥ 4 minHeap Jump

DivideSkip Algorithm [LLL08] Long ListsShort Lists Binary search MergeSkip 39

How many lists are treated as long lists? 40

Length Filtering Ed(s,t) ≤ 2 s: t: Length: 19 Length: 10 By length only! 41

Positional Filtering ab ab Ed(s,t) ≤ 2 s t (ab,1) (ab,12) 42

Normalized weights [HKS08] Compute a weight for each string L 0 : length of the string L 1, L 2 : Depend on q-gram frequencies Similar strings have similar weights A very strong pruning condition 43

Pruning using normalized weights Sort inverted lists based on string weights Search within a small weight range Shown to be effective (> 90% candidates pruned) 44

Variable-length grams (VGRAM) [LWY07,YWL08] 45 Next…

# of common grams >= 1 2-grams -> 3-grams? Query: “ shtick ”, ED(shtick, ?)≤1 id strings rich stick stich stuck static 3-grams ati ich ick ric sta sti stu tat tic tuc uck tic icksht hti tic ick id strings rich stick stich stuck static id strings rich stick stich stuck static 46

Observation 1: dilemma of choosing “q” Increasing “q” causing: Longer grams  Shorter lists Smaller # of common grams of similar strings id strings rich stick stich stuck static grams at ch ck ic ri st ta ti tu uc

Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio 48

VGRAM: Main idea Grams with variable lengths (between q min and q max ) zebra - ze(123) corrasion - co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms 49

Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 50

Challenge 1: String  Variable-length grams? Fixed-length 2-grams Variable-length grams u n i v e r s a l ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l 51

Representing gram dictionary as a trie ni ivr sal uni vers 52

53 Step 2: Constructing a gram dictionary q min =2 q max =4 Frequency-based [LYW07] Cost-based [YLW08]

Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 54

Deletion affects variable-length grams i-q max +1i+q max - 1 Deletion Not affected Affected i 55

Main idea For a string, for each position, compute the number of grams that could be destroyed by an operation at this position Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index 56 Vector of s = With 2 edit operations, at most 4 grams can be affected Use this number to do count filtering

Summary of VGRAM index 57

Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s  grams String s1, s2 such that ed(s1,s2) <= k  min # of their common grams 58

Lower bound on # of common grams If ed(s1,s2) =: (|s 1 |- q + 1) – k * q u n i v e r s a l Fixed length (q) Variable lengths: # of grams of s1 – NAG(s1,k) 59

Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤ … ck ic … ti … Lower bound = 3 Lower bound = 1 sh ht tick … ck ic ich … tic tick … 2-4 grams2-grams tick id strings rich stick stich stuck static id strings rich stick stich stuck static id strings rich stick stich stuck static 60

End of part I Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion 61 Part I Part II