Efficient Approximate Search on String Collections, Part I
Marios Hadjieleftheriou and Chen Li
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!)
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research)
Better system?
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
http://psearch.ics.uci.edu/
Web Search
http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Goal: bring the query and meaningful results closer together
(The misspellings shown are actual queries gathered by Google.)
Data Cleaning
Relation R: informix, microsoft, …
Relation S: infromix, …, mcrosoft, …
Problem Formulation
Find strings similar to a given string: dist(Q, D) <= δ
Example: find strings similar to "hadjeleftheriou"
Performance is important!
- 10 ms/query: 100 queries per second (QPS)
- 5 ms/query: 200 QPS
Outline
Part I: Motivation, Preliminaries, Trie-based approach, Gram-based algorithms
Part II: Sketch-based algorithms, Compression, Selectivity estimation, Transformations/Synonyms, Conclusion
Preliminaries
Similarity Functions
"Similar to": a domain-specific function that returns a similarity value between two strings.
Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE.
See [KSS06] for an excellent survey.
Edit Distance
A widely used metric for string similarity.
ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) needed to change s1 into s2.
Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2.
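The definition above maps directly onto the classic dynamic program. This is a minimal illustrative sketch (the function name is mine, not from the tutorial):

```python
def edit_distance(s1, s2):
    """Classic DP for edit distance (insert/delete/substitute), O(|s1|*|s2|) time."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))          # distances from s1[:0] to every prefix of s2
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete s1[i-1]
                         cur[j - 1] + 1,       # insert s2[j-1]
                         prev[j - 1] + cost)   # substitute (or match for free)
        prev = cur
    return prev[n]
```

For the slide's example, edit_distance("Tom Hanks", "Ton Hank") returns 2 (one substitution, one deletion).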
State of the art: Oracle 10g and older versions
Supported by Oracle Text.
CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
end;
/
CREATE INDEX fuzzy_stem_subst_idx ON engdict(word)
  INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage (note the intentionally misspelled query term):
SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine.
Microsoft SQL Server [CGG+05]
Data-cleaning tools available in SQL Server 2005, as part of Integration Services.
Supports fuzzy lookups using a data-flow pipeline of transformations.
Similarity function: tokens weighted with TF/IDF scores.
Lucene
Fuzzy matching uses Levenshtein (edit) distance; example query: roam~0.8.
Implementation: prefix pruning followed by a scan (efficiency?).
Trie-Based Approach [JLL+09]
Trie Indexing
Strings: exam, example, exemplar, exempt, sample
(Figure: a trie over these strings; each string ends at a leaf marked with $.)
Active Nodes on the Trie
Query: "example", edit-distance threshold δ = 2.
Active prefixes and their distances:
examp → 2, exampl → 1, example → 0, exempl → 2, exempla → 2, sample → 2
Initialization
Q = ε. The initial active nodes are all trie nodes within depth δ of the root; a node at depth d has distance d.
For δ = 2: ε → 0, e → 1, ex → 2, s → 1, sa → 2
Incremental Algorithm
As each query character arrives, compute the active nodes of the extended query from the active nodes of the previous query. Return the leaf nodes below active nodes as answers.
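To make "active nodes" concrete, here is a direct (non-incremental) DP walk over the trie that computes the same set; the [JLL+09] algorithm obtains it incrementally per keystroke instead. All names are illustrative, and '$' marks string ends as in the slides:

```python
def build_trie(strings):
    """Nested-dict trie; '$' marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + '$':
            node = node.setdefault(ch, {})
    return root

def active_nodes(root, query, delta):
    """All trie prefixes p with ed(p, query) <= delta, mapped to that distance."""
    result = {}

    def walk(node, prefix, row):
        if row[-1] <= delta:
            result[prefix] = row[-1]
        if min(row) > delta:              # no descendant can come back under delta
            return
        for ch, child in node.items():
            if ch == '$':
                continue
            new_row = [row[0] + 1]        # extend the DP row by one trie edge
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
            walk(child, prefix + ch, new_row)

    walk(root, '', list(range(len(query) + 1)))
    return result
```

On the deck's example (strings exam, example, exemplar, exempt, sample; query "example", δ = 2) this reproduces the active-node table: example → 0, exampl → 1, sample → 2, and so on.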
Example: from Q = ε to Q = "e"
Each active node for Q = ε spawns candidate active nodes for Q = "e" via deletion, substitution, match, and insertion; candidates whose distance exceeds δ = 2 are pruned.
Active nodes for Q = ε: ε → 0, e → 1, ex → 2, s → 1, sa → 2
Active nodes for Q = "e": ε → 1 (delete e), e → 0 (match), ex → 1 (insert x), exa → 2 (insert xa), exe → 2 (insert xe), s → 1 (substitute e/s), sa → 2
Good and Bad
Advantages: the trie is small; supports search as the user types.
Disadvantage: works for edit distance only.
Gram-Based Algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]
"q-grams" of Strings
The 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al.
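Generating the overlapping q-grams of a string is a one-liner (an illustrative sketch; the function name is mine):

```python
def qgrams(s, q):
    """All overlapping q-grams of s; a string of length n has n - q + 1 of them."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]
```

qgrams("universal", 2) yields the eight 2-grams listed above.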
Edit Operations' Effect on Grams
With fixed gram length q, k edit operations can destroy at most k·q grams.
Hence if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k·q common q-grams.
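The resulting count-filter lower bound can be written down directly (a sketch; the helper name is my own):

```python
def count_lower_bound(s, k, q):
    """Minimum number of q-grams any string within edit distance k of s must
    share with s: s has len(s) - q + 1 grams, and each edit destroys at most q."""
    return (len(s) - q + 1) - k * q
```

For the query "shtick" with q = 2 and k = 1 this gives (6 - 2 + 1) - 2 = 3, the threshold used in the search example below.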
q-gram Inverted Lists
Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
2-gram inverted lists:
at → [4], ch → [0, 2], ck → [1, 3], ic → [0, 1, 2, 4], ri → [0], st → [1, 2, 3, 4], ta → [4], ti → [1, 2, 4], tu → [3], uc → [3]
Searching Using Inverted Lists
Query: "shtick", ed(shtick, ?) <= 1. Query 2-grams: sh, ht, ti, ic, ck.
Count filter: an answer must share at least (6 - 2 + 1) - 1·2 = 3 grams with the query, so merge the lists of the query's grams and keep the string ids that occur at least 3 times.
T-occurrence Problem
Given inverted lists sorted in ascending order, merge them to find all elements whose number of occurrences is >= T.
Example (T = 4)
Lists include [1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13, 15], …
Result: 13
List-Merging Algorithms
HeapMerger, MergeOpt [SK04], ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
Heap-Based Algorithm (HeapMerger)
Maintain a min-heap over the frontier elements of all lists; repeatedly pop equal elements together and count each element's occurrences.
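A minimal sketch of the heap-based merge (illustrative names; assumes sorted, duplicate-free lists):

```python
import heapq

def heap_merge_count(lists, T):
    """HeapMerger sketch: a min-heap of (value, list index, position) frontiers;
    pop all equal minima together, count them, and advance those lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        count = 0
        while heap and heap[0][0] == top:   # pop every frontier equal to the minimum
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            results.append(top)
    return results
```

On the slides' lists with T = 3 (the three recoverable lists), only id 13 appears often enough.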
MergeOpt Algorithm [SK04]
Split the lists into the T - 1 longest ("long") lists and the remaining "short" lists. Merge the short lists, then verify each candidate against the long lists using binary search.
Example of MergeOpt (T = 4)
T - 1 = 3 lists are treated as long and the remaining 2 as short; candidates from merging the short lists are probed in the long lists by binary search.
ScanCount
Keep an array of counters indexed by string id. Scan the lists once, incrementing the counter of every id seen; any id whose counter reaches the threshold T (here, id 13 reaches 4) is a result.
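ScanCount is the simplest of the algorithms and fits in a few lines (a sketch with my own names; num_strings is the size of the string collection):

```python
def scan_count(lists, num_strings, T):
    """ScanCount: one pass over all inverted lists, incrementing a per-string
    counter; a string id is reported the moment its counter reaches T."""
    count = [0] * num_strings
    results = []
    for lst in lists:
        for string_id in lst:
            count[string_id] += 1
            if count[string_id] == T:   # == so each result is reported once
                results.append(string_id)
    return results
```

Unlike the heap-based merge, it needs O(num_strings) memory but no sorting or heap operations.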
List-Merging Algorithms (continued)
Next: MergeSkip and DivideSkip [LLL08, BK02]
MergeSkip Algorithm [BK02, LLL08]
Maintain a min-heap of list frontiers. When the top element cannot reach T occurrences, pop the T - 1 smallest frontier elements and jump each of their lists forward (by binary search) to the first element greater than or equal to the new heap top.
Example of MergeSkip (T = 4)
The smallest frontier elements (e.g., 1 and 5) are popped from the min-heap and their lists jump directly to the current heap top (e.g., 10), skipping elements that cannot reach the threshold.
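A compact sketch of MergeSkip (illustrative names; assumes sorted lists). The key step is the binary-search "jump" of the popped lists to the new heap top:

```python
import heapq
from bisect import bisect_left

def merge_skip(lists, T):
    """MergeSkip sketch: when the heap top cannot reach T occurrences, pop the
    T-1 smallest frontiers and jump their lists to the new top via bisect."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:        # all frontiers equal to the minimum
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            results.append(top)
            for _, i, pos in popped:             # advance the matching lists by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while heap and len(popped) < T - 1:  # pop up to T-1 smallest in total
                popped.append(heapq.heappop(heap))
            if not heap:                         # remaining lists can't reach T
                break
            target = heap[0][0]
            for _, i, pos in popped:             # jump each list to first >= target
                nxt = bisect_left(lists[i], target, pos)
                if nxt < len(lists[i]):
                    heapq.heappush(heap, (lists[i][nxt], i, nxt))
    return results
```

On the three recoverable example lists with T = 3, the 1 and 7 frontiers are skipped over and only 13 survives.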
DivideSkip Algorithm [LLL08]
Combines MergeOpt and MergeSkip: probe the long lists with binary search, and run MergeSkip on the short lists.
How many lists should be treated as long lists?
Length Filtering
If ed(s, t) <= 2, then the lengths of s and t can differ by at most 2; e.g., strings of lengths 19 and 10 are pruned by length alone.
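The length filter is a one-line predicate (an illustrative sketch):

```python
def length_filter(candidates, query, delta):
    """Keep only strings whose length is within delta of the query's length:
    every edit operation changes the length by at most one."""
    return [s for s in candidates if abs(len(s) - len(query)) <= delta]
```

A string of length 19 can never be within edit distance 2 of a query of length 10, so it is dropped without any gram comparison.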
Positional Filtering
Use positional grams (gram, position). If ed(s, t) <= 2, the gram (ab, 1) of s cannot correspond to the gram (ab, 12) of t: matching positional grams must have positions within δ of each other.
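Positional grams and the position check can be sketched as follows (names are mine; 1-based positions as in the slide's (ab, 1) / (ab, 12) example):

```python
def positional_qgrams(s, q):
    """Overlapping q-grams of s paired with their 1-based start positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]

def positional_match(g1, g2, delta):
    """Two positional grams can correspond under edit distance delta only if
    they are the same gram and their positions differ by at most delta."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= delta
```

So (ab, 1) and (ab, 12) fail the check for δ = 2 even though the grams themselves are equal.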
Normalized Weights [HKS08]
Compute a weight for each string from L0 (the string's length) and L1, L2 (which depend on q-gram frequencies). Similar strings have similar weights, giving a very strong pruning condition.
Pruning Using Normalized Weights
Sort the inverted lists by string weight and search only within a small weight range. Shown to be effective (> 90% of candidates pruned).
Variable-Length Grams (VGRAM) [LWY07, YWL08]
From 2-grams to 3-grams?
Query: "shtick", ed(shtick, ?) <= 1. Query 3-grams: sht, hti, tic, ick.
3-gram inverted lists for {0: rich, 1: stick, 2: stich, 3: stuck, 4: static}:
ati → [4], ich → [0, 2], ick → [1], ric → [0], sta → [4], sti → [1, 2], stu → [3], tat → [4], tic → [1, 2, 4], tuc → [3], uck → [3]
Count-filter lower bound: (6 - 3 + 1) - 1·3 = 1 common gram.
Observation 1: The Dilemma of Choosing q
Increasing q gives longer grams and shorter inverted lists, but also a smaller number of common grams between similar strings.
Observation 2: Skewed Gram-Frequency Distributions
DBLP: 276,699 article titles. Popular 5-grams: ation (> 114K occurrences), tions, ystem, catio.
VGRAM: Main Idea
Use grams with variable lengths (between qmin and qmax):
- zebra → ze (123 occurrences)
- corrasion → co (5213), cor (859), corr (171)
Advantages: reduces index size, reduces running time, and is adoptable by many algorithms.
Challenges
1. How to generate variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: Generating Variable-Length Grams
Fixed-length 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al.
With a [2,4]-gram dictionary containing {ni, ivr, sal, uni, vers}, "universal" is instead decomposed into variable-length grams.
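A simplified greedy sketch of gram generation: at each position emit the longest dictionary gram that matches there, falling back to the qmin-gram. The real VGEN algorithm of [LWY07] additionally skips positions whose grams are subsumed by longer ones, so its output differs; this version only illustrates the idea (all names are mine):

```python
def vgen(s, gram_dict, qmin):
    """Greedy variable-length gram generation: longest dictionary gram at each
    position, else the qmin-gram (dictionary grams lie in [qmin, qmax])."""
    grams = []
    for i in range(len(s) - qmin + 1):
        best = s[i:i + qmin]                      # fallback: fixed qmin-gram
        for g in gram_dict:
            if len(g) > len(best) and s.startswith(g, i):
                best = g                          # prefer the longest match here
        grams.append(best)
    return grams
```

With the slide's dictionary {ni, ivr, sal, uni, vers}, "universal" yields grams such as uni, vers, and sal where the dictionary matches, and plain 2-grams elsewhere.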
Representing the Gram Dictionary as a Trie
The dictionary {ni, ivr, sal, uni, vers} is stored as a trie.
Challenge 2: Constructing a Gram Dictionary
Choose grams with lengths between qmin = 2 and qmax = 4, either frequency-based [LWY07] or cost-based [YWL08].
Challenge 3: Edit Operations' Effect on Grams
With fixed-length grams, k operations affect at most k·q grams. How many variable-length grams can k operations affect?
Deletion's Effect on Variable-Length Grams
A deletion at position i can only affect grams that overlap positions i - qmax + 1 through i + qmax - 1; grams entirely outside this window are unaffected.
Main Idea
For each string and each position, compute the number of grams that could be destroyed by an edit operation at that position; from these, compute the number of grams possibly destroyed by k operations, and store these numbers (for all data strings) as part of the index.
Example: from the vector of s, at most 4 grams can be affected by 2 edit operations; use this number for count filtering.
Summary of the VGRAM Index
Challenge 4: Adopting VGRAM
VGRAM is easily adoptable by many algorithms through two basic interfaces:
- Given a string s, return its grams.
- Given strings s1, s2 with ed(s1, s2) <= k, return the minimum number of their common grams.
Lower Bound on # of Common Grams
Fixed length q: if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k·q grams.
Variable lengths: they share at least (# of grams of s1) - NAG(s1, k) grams.
Example: Algorithm Using Inverted Lists
Query: "shtick", ed(shtick, ?) <= 1.
With fixed 2-grams the count-filter lower bound is 3; with [2,4]-grams (query grams sh, ht, tick) the lower bound is 1, but the inverted list of the longer gram "tick" is much shorter than that of "ti".
End of Part I
Part I: Motivation, Preliminaries, Trie-based approach, Gram-based algorithms
Part II: Sketch-based algorithms, Compression, Selectivity estimation, Transformations/Synonyms, Conclusion