Efficient Approximate Search on String Collections, Part I
Marios Hadjieleftheriou and Chen Li
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!)
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research)
Better system?
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
http://psearch.ics.uci.edu/
Web Search
http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Goal: bring the query and meaningful results closer together
(The misspellings shown are actual queries gathered by Google.)
Data Cleaning
Relation R: informix, microsoft, …
Relation S: infromix, …, mcrosoft, …
Problem Formulation
Find strings similar to a given string: dist(Q, D) <= δ
Example: find strings similar to "hadjeleftheriou"
Performance is important!
- 10 ms/query: 100 queries per second (QPS)
- 5 ms/query: 200 QPS
Outline
Part I: Motivation, Preliminaries, Trie-based approach, Gram-based algorithms
Part II: Sketch-based algorithms, Compression, Selectivity estimation, Transformations/Synonyms, Conclusion
Preliminaries
Similarity Functions
"Similar to": a domain-specific function that returns a similarity value between two strings.
Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE.
See [KSS06] for an excellent survey.
Edit Distance
A widely used metric for string similarity.
ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) needed to change s1 into s2.
Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2.
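The definition above maps directly onto the classic dynamic program. This is a minimal illustrative sketch (the function name is mine, not from the tutorial):

```python
def edit_distance(s1, s2):
    """Classic DP for edit distance (insert/delete/substitute), O(|s1|*|s2|) time."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))          # distances from s1[:0] to every prefix of s2
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete s1[i-1]
                         cur[j - 1] + 1,       # insert s2[j-1]
                         prev[j - 1] + cost)   # substitute (or match for free)
        prev = cur
    return prev[n]
```

For the slide's example, edit_distance("Tom Hanks", "Ton Hank") returns 2 (one substitution, one deletion).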
State of the art: Oracle 10g and older versions
Supported by Oracle Text.
CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
end;
/
CREATE INDEX fuzzy_stem_subst_idx ON engdict(word)
  INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage (note the intentionally misspelled query term):
SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine.
Microsoft SQL Server [CGG+05]
Data-cleaning tools available in SQL Server 2005, as part of Integration Services.
Supports fuzzy lookups using a data-flow pipeline of transformations.
Similarity function: tokens weighted with TF/IDF scores.
Lucene
Fuzzy matching uses Levenshtein (edit) distance; example query: roam~0.8.
Implementation: prefix pruning followed by a scan (efficiency?).
Trie-Based Approach [JLL+09]
Trie Indexing
Strings: exam, example, exemplar, exempt, sample
(Figure: a trie over these strings; each string ends at a leaf marked with $.)
Active Nodes on the Trie
Query: "example", edit-distance threshold δ = 2.
Active prefixes and their distances:
examp → 2, exampl → 1, example → 0, exempl → 2, exempla → 2, sample → 2
Initialization
Q = ε. The initial active nodes are all trie nodes within depth δ of the root; a node at depth d has distance d.
For δ = 2: ε → 0, e → 1, ex → 2, s → 1, sa → 2
Incremental Algorithm
As each query character arrives, compute the active nodes of the extended query from the active nodes of the previous query. Return the leaf nodes below active nodes as answers.
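To make "active nodes" concrete, here is a direct (non-incremental) DP walk over the trie that computes the same set; the [JLL+09] algorithm obtains it incrementally per keystroke instead. All names are illustrative, and '$' marks string ends as in the slides:

```python
def build_trie(strings):
    """Nested-dict trie; '$' marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + '$':
            node = node.setdefault(ch, {})
    return root

def active_nodes(root, query, delta):
    """All trie prefixes p with ed(p, query) <= delta, mapped to that distance."""
    result = {}

    def walk(node, prefix, row):
        if row[-1] <= delta:
            result[prefix] = row[-1]
        if min(row) > delta:              # no descendant can come back under delta
            return
        for ch, child in node.items():
            if ch == '$':
                continue
            new_row = [row[0] + 1]        # extend the DP row by one trie edge
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
            walk(child, prefix + ch, new_row)

    walk(root, '', list(range(len(query) + 1)))
    return result
```

On the deck's example (strings exam, example, exemplar, exempt, sample; query "example", δ = 2) this reproduces the active-node table: example → 0, exampl → 1, sample → 2, and so on.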
Example: from Q = ε to Q = "e"
Each active node for Q = ε spawns candidate active nodes for Q = "e" via deletion, substitution, match, and insertion; candidates whose distance exceeds δ = 2 are pruned.
Active nodes for Q = ε: ε → 0, e → 1, ex → 2, s → 1, sa → 2
Active nodes for Q = "e": ε → 1 (delete e), e → 0 (match), ex → 1 (insert x), exa → 2 (insert xa), exe → 2 (insert xe), s → 1 (substitute e/s), sa → 2
Good and Bad
Advantages: the trie is small; supports search as the user types.
Disadvantage: works for edit distance only.
Gram-Based Algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]
"q-grams" of Strings
The 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al.
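Generating the overlapping q-grams of a string is a one-liner (an illustrative sketch; the function name is mine):

```python
def qgrams(s, q):
    """All overlapping q-grams of s; a string of length n has n - q + 1 of them."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]
```

qgrams("universal", 2) yields the eight 2-grams listed above.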
Edit Operations' Effect on Grams
With fixed gram length q, k edit operations can destroy at most k·q grams.
Hence if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k·q common q-grams.
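The resulting count-filter lower bound can be written down directly (a sketch; the helper name is my own):

```python
def count_lower_bound(s, k, q):
    """Minimum number of q-grams any string within edit distance k of s must
    share with s: s has len(s) - q + 1 grams, and each edit destroys at most q."""
    return (len(s) - q + 1) - k * q
```

For the query "shtick" with q = 2 and k = 1 this gives (6 - 2 + 1) - 2 = 3, the threshold used in the search example below.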
q-gram Inverted Lists
Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
2-gram inverted lists:
at → [4], ch → [0, 2], ck → [1, 3], ic → [0, 1, 2, 4], ri → [0], st → [1, 2, 3, 4], ta → [4], ti → [1, 2, 4], tu → [3], uc → [3]
Searching Using Inverted Lists
Query: "shtick", ed(shtick, ?) <= 1. Query 2-grams: sh, ht, ti, ic, ck.
Count filter: an answer must share at least (6 - 2 + 1) - 1·2 = 3 grams with the query, so merge the lists of the query's grams and keep the string ids that occur at least 3 times.
T-occurrence Problem
Given inverted lists sorted in ascending order, merge them to find all elements whose number of occurrences is >= T.
Example (T = 4)
Lists include [1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13, 15], …
Result: 13
List-Merging Algorithms
HeapMerger, MergeOpt [SK04], ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
Heap-Based Algorithm (HeapMerger)
Maintain a min-heap over the frontier elements of all lists; repeatedly pop equal elements together and count each element's occurrences.
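A minimal sketch of the heap-based merge (illustrative names; assumes sorted, duplicate-free lists):

```python
import heapq

def heap_merge_count(lists, T):
    """HeapMerger sketch: a min-heap of (value, list index, position) frontiers;
    pop all equal minima together, count them, and advance those lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        count = 0
        while heap and heap[0][0] == top:   # pop every frontier equal to the minimum
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            results.append(top)
    return results
```

On the slides' lists with T = 3 (the three recoverable lists), only id 13 appears often enough.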
MergeOpt Algorithm [SK04]
Split the lists into the T - 1 longest ("long") lists and the remaining "short" lists. Merge the short lists, then verify each candidate against the long lists using binary search.
Example of MergeOpt (T = 4)
T - 1 = 3 lists are treated as long and the remaining 2 as short; candidates from merging the short lists are probed in the long lists by binary search.
ScanCount
Keep an array of counters indexed by string id. Scan the lists once, incrementing the counter of every id seen; any id whose counter reaches the threshold T (here, id 13 reaches 4) is a result.
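ScanCount is the simplest of the algorithms and fits in a few lines (a sketch with my own names; num_strings is the size of the string collection):

```python
def scan_count(lists, num_strings, T):
    """ScanCount: one pass over all inverted lists, incrementing a per-string
    counter; a string id is reported the moment its counter reaches T."""
    count = [0] * num_strings
    results = []
    for lst in lists:
        for string_id in lst:
            count[string_id] += 1
            if count[string_id] == T:   # == so each result is reported once
                results.append(string_id)
    return results
```

Unlike the heap-based merge, it needs O(num_strings) memory but no sorting or heap operations.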
List-Merging Algorithms (continued)
Next: MergeSkip and DivideSkip [LLL08, BK02]
MergeSkip Algorithm [BK02, LLL08]
Maintain a min-heap of list frontiers. When the top element cannot reach T occurrences, pop the T - 1 smallest frontier elements and jump each of their lists forward (by binary search) to the first element greater than or equal to the new heap top.
Example of MergeSkip (T = 4)
The smallest frontier elements (e.g., 1 and 5) are popped from the min-heap and their lists jump directly to the current heap top (e.g., 10), skipping elements that cannot reach the threshold.
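A compact sketch of MergeSkip (illustrative names; assumes sorted lists). The key step is the binary-search "jump" of the popped lists to the new heap top:

```python
import heapq
from bisect import bisect_left

def merge_skip(lists, T):
    """MergeSkip sketch: when the heap top cannot reach T occurrences, pop the
    T-1 smallest frontiers and jump their lists to the new top via bisect."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:        # all frontiers equal to the minimum
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            results.append(top)
            for _, i, pos in popped:             # advance the matching lists by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while heap and len(popped) < T - 1:  # pop up to T-1 smallest in total
                popped.append(heapq.heappop(heap))
            if not heap:                         # remaining lists can't reach T
                break
            target = heap[0][0]
            for _, i, pos in popped:             # jump each list to first >= target
                nxt = bisect_left(lists[i], target, pos)
                if nxt < len(lists[i]):
                    heapq.heappush(heap, (lists[i][nxt], i, nxt))
    return results
```

On the three recoverable example lists with T = 3, the 1 and 7 frontiers are skipped over and only 13 survives.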
DivideSkip Algorithm [LLL08]
Combines MergeOpt and MergeSkip: probe the long lists with binary search, and run MergeSkip on the short lists.
How many lists should be treated as long lists?
Length Filtering
If ed(s, t) <= 2, then the lengths of s and t can differ by at most 2; e.g., strings of lengths 19 and 10 are pruned by length alone.
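The length filter is a one-line predicate (an illustrative sketch):

```python
def length_filter(candidates, query, delta):
    """Keep only strings whose length is within delta of the query's length:
    every edit operation changes the length by at most one."""
    return [s for s in candidates if abs(len(s) - len(query)) <= delta]
```

A string of length 19 can never be within edit distance 2 of a query of length 10, so it is dropped without any gram comparison.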
Positional Filtering
Use positional grams (gram, position). If ed(s, t) <= 2, the gram (ab, 1) of s cannot correspond to the gram (ab, 12) of t: matching positional grams must have positions within δ of each other.
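Positional grams and the position check can be sketched as follows (names are mine; 1-based positions as in the slide's (ab, 1) / (ab, 12) example):

```python
def positional_qgrams(s, q):
    """Overlapping q-grams of s paired with their 1-based start positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]

def positional_match(g1, g2, delta):
    """Two positional grams can correspond under edit distance delta only if
    they are the same gram and their positions differ by at most delta."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= delta
```

So (ab, 1) and (ab, 12) fail the check for δ = 2 even though the grams themselves are equal.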
Normalized Weights [HKS08]
Compute a weight for each string from L0 (the string's length) and L1, L2 (which depend on q-gram frequencies). Similar strings have similar weights, giving a very strong pruning condition.
Pruning Using Normalized Weights
Sort the inverted lists by string weight and search only within a small weight range. Shown to be effective (> 90% of candidates pruned).
Variable-Length Grams (VGRAM) [LWY07, YWL08]
From 2-grams to 3-grams?
Query: "shtick", ed(shtick, ?) <= 1. Query 3-grams: sht, hti, tic, ick.
3-gram inverted lists for {0: rich, 1: stick, 2: stich, 3: stuck, 4: static}:
ati → [4], ich → [0, 2], ick → [1], ric → [0], sta → [4], sti → [1, 2], stu → [3], tat → [4], tic → [1, 2, 4], tuc → [3], uck → [3]
Count-filter lower bound: (6 - 3 + 1) - 1·3 = 1 common gram.
Observation 1: The Dilemma of Choosing q
Increasing q gives longer grams and shorter inverted lists, but also a smaller number of common grams between similar strings.
Observation 2: Skewed Gram-Frequency Distributions
DBLP: 276,699 article titles. Popular 5-grams: ation (> 114K occurrences), tions, ystem, catio.
VGRAM: Main Idea
Use grams with variable lengths (between qmin and qmax):
- zebra → ze (123 occurrences)
- corrasion → co (5213), cor (859), corr (171)
Advantages: reduces index size, reduces running time, and is adoptable by many algorithms.
Challenges
1. How to generate variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: Generating Variable-Length Grams
Fixed-length 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al.
With a [2,4]-gram dictionary containing {ni, ivr, sal, uni, vers}, "universal" is instead decomposed into variable-length grams.
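A simplified greedy sketch of gram generation: at each position emit the longest dictionary gram that matches there, falling back to the qmin-gram. The real VGEN algorithm of [LWY07] additionally skips positions whose grams are subsumed by longer ones, so its output differs; this version only illustrates the idea (all names are mine):

```python
def vgen(s, gram_dict, qmin):
    """Greedy variable-length gram generation: longest dictionary gram at each
    position, else the qmin-gram (dictionary grams lie in [qmin, qmax])."""
    grams = []
    for i in range(len(s) - qmin + 1):
        best = s[i:i + qmin]                      # fallback: fixed qmin-gram
        for g in gram_dict:
            if len(g) > len(best) and s.startswith(g, i):
                best = g                          # prefer the longest match here
        grams.append(best)
    return grams
```

With the slide's dictionary {ni, ivr, sal, uni, vers}, "universal" yields grams such as uni, vers, and sal where the dictionary matches, and plain 2-grams elsewhere.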
Representing the Gram Dictionary as a Trie
The dictionary {ni, ivr, sal, uni, vers} is stored as a trie.
Challenge 2: Constructing a Gram Dictionary
Choose grams with lengths between qmin = 2 and qmax = 4, either frequency-based [LWY07] or cost-based [YWL08].
Challenge 3: Edit Operations' Effect on Grams
With fixed-length grams, k operations affect at most k·q grams. How many variable-length grams can k operations affect?
Deletion's Effect on Variable-Length Grams
A deletion at position i can only affect grams that overlap positions i - qmax + 1 through i + qmax - 1; grams entirely outside this window are unaffected.
Main Idea
For each string and each position, compute the number of grams that could be destroyed by an edit operation at that position; from these, compute the number of grams possibly destroyed by k operations, and store these numbers (for all data strings) as part of the index.
Example: from the vector of s, at most 4 grams can be affected by 2 edit operations; use this number for count filtering.
Summary of the VGRAM Index
Challenge 4: Adopting VGRAM
VGRAM is easily adoptable by many algorithms through two basic interfaces:
- Given a string s, return its grams.
- Given strings s1, s2 with ed(s1, s2) <= k, return the minimum number of their common grams.
Lower Bound on # of Common Grams
Fixed length q: if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k·q grams.
Variable lengths: they share at least (# of grams of s1) - NAG(s1, k) grams.
Example: Algorithm Using Inverted Lists
Query: "shtick", ed(shtick, ?) <= 1.
With fixed 2-grams the count-filter lower bound is 3; with [2,4]-grams (query grams sh, ht, tick) the lower bound is 1, but the inverted list of the longer gram "tick" is much shorter than that of "ti".
End of Part I
Part I: Motivation, Preliminaries, Trie-based approach, Gram-based algorithms
Part II: Sketch-based algorithms, Compression, Selectivity estimation, Transformations/Synonyms, Conclusion