Download presentation
Presentation is loading. Please wait.
Published byCathleen Watts Modified over 8 years ago
1
Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1
2
Outline Part 1: Motivation and preliminaries Inverted list based algorithms Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 2
3
Web Search http://www.google.com/jobs/britney.html Errors in queries Errors in data Bring query and meaningful results closer together Actual queries gathered by Google 3
4
Record Linkage R informix microsoft … … S infromix … mcrosoft … Record linkage Edit distance Jaccard Cosine … 4
5
Document Cleaning Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope Should be “Niels Bohr” 5
6
Demos http://directory.uci.edu/ http://psearch.ics.uci.edu/advanced/ http://psearch.ics.uci.edu/ 6
7
State-of-the-art: Oracle 10g and older Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; Limitation: cannot handle errors in the first letters: Katherine versus Catherine 7
8
8 Microsoft SQL Server [CGG+05] Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores 8
9
Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix filtering followed by a scan (Efficiency?) 9
10
Problem Formulation Find strings similar to a given string Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS
11
Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey 11
12
12 A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 12
13
Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM List-compression techniques Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 13
14
“ q-grams ” of strings u n i v e r s a l 2-grams 14
15
Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 15
16
q-gram inverted lists id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 16
17
# of common grams >= 3 Searching using inverted lists Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 23 0 1 4 2 0 13 0124 4 124 3 3 ti iccksh ht ti ic ck 17
18
T-occurrence Problem Find elements whose occurrences ≥ T Ascending order Merge 18
19
Example T = 4 Result: 13 1 3 5 10 13 10 13 15 5 7 13 15 19
20
List-Merging Algorithms HeapMergerMergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkipDivideSkip 20
21
Heap-based Algorithm Min-heap Count # of occurrences of each element using a heap Push to heap …… 21
22
MergeOpt Algorithm [SK04] Long Lists: T-1Short Lists Binary search 22
23
Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 Long Lists: 3 Short Lists: 2 23
24
24 ScanCount 123…123… 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 # of occurrences 0 0 0 4 1 Increment by 1 1 String ids 13 14 15 0 2 0 0 Result!
25
List-Merging Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip [SK04] [LLL08, BK02] 25
26
MergeSkip algorithm [BK02, LLL08] Min-heap …… Pop T-1 T-1 Jump Greater or equals 26
27
Example of MergeSkip 1 3 5 10 15 5757 1315 Count threshold T≥ 4 minHeap 10 1315 1 5 Jump 15 13 17 27
28
DivideSkip Algorithm [LLL08] Long ListsShort Lists Binary search MergeSkip 28
29
How many lists are treated as long lists? ? Short Lists Merge Long Lists Lookup A good balance in the tradeoff: # of long lists = T / ( μ logM +1) 29
30
Length Filtering Ed(s,t) ≤ 2 s: t: Length: 19 Length: 10 By length only! 30
31
Positional Filtering ab ab Ed(s,t) ≤ 2 s t (ab,1) (ab,12) 31
32
Filter tree [LLL08] … Length level Gram level Position level Inverted list 5 12 17 28 44 root 2 n1 3 … zyzz abaa 12m … 32
33
Surprising experimental results (DBLP) No filter (ms) Length (ms) Length+Pos (ms) DivideSkip2.230.76 1.96 Adding position filter could increase running time 33
34
Filters fragment inverts lists Applying filters Merge Saving: reduce list size. Cost: - Tree traversal, - More merging 34
35
Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM [LWY07,YWL08] List-compression techniques Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 35
36
# of common grams >= 1 2-grams -> 3-grams? Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 3-grams ati ich ick ric sta sti stu tat tic tuc uck tic icksht hti tic ick 4 2 4 1 2 0 1 0 3 4 1 3 4 2 3 id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 36
37
Observation 1: dilemma of choosing “q” Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 37
38
Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio 38
39
VGRAM: Main idea Grams with variable lengths (between q min and q max ) zebra - ze(123) corrasion - co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms 39
40
Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 40
41
Challenge 1: String Variable-length grams? Fixed-length 2-grams Variable-length grams u n i v e r s a l ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l 41
42
Representing gram dictionary as a trie ni ivr sal uni vers 42
43
Challenge 2: Constructing gram dictionary st 0, 1, 3 sti 0, 1 stu 3 stic 0, 1 stuc 3 Step 1: Collecting frequencies of grams with length in [qmin, qmax] Gram trie with frequencies 43
44
Step 2: selecting grams Pruning trie using a frequency threshold F (e.g., 2) 44
45
Step 2: selecting grams (cont) Threshold T = 2 45
46
Final gram dictionary A cost-based approach to choosing a gram dictionary [YWL08] 46
47
Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 47
48
Deletion affects variable-length grams i-q max +1i+q max - 1 Deletion Not affected Affected i 48
49
Grams affected by a deletion ni ivr sal uni vers [2,4]-grams u n i v e r s a l i-q max +1i+q max - 1 Deletion i Affected? Deletion Affected? 49
50
Grams affected by a deletion (cont) i-q max +1i+q max - 1 Deletion i Affected? Trie of gramsTrie of reversed grams 50
51
# of grams affected by each operation _ u _ n _ i _ v _ e _ r _ s _ a _ l _ 01 11 12 12 22 11 12 11 1 10 Deletion/substitutionInsertion 51
52
Max # of grams affected by k operations Vector of s = With 2 edit operations, at most 4 grams can be affected Called NAG vector (# of affected grams) Precomputed and stored Dynamic programming to compute tight bounds [YWL08] 52
53
Summary of VGRAM index 53
54
Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min # of their common grams 54
55
Lower bound on # of common grams If ed(s1,s2) =: (|s 1 |- q + 1) – k * q u n i v e r s a l Fixed length (q) Variable lengths: # of grams of s1 – NAG(s1,k) 55
56
Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤1 124 12 1 0 4 3 … ck ic … ti … Lower bound = 3 Lower bound = 1 sh ht tick 24 1 4 0 1 1 2 3 … ck ic ich … tic tick … 2-4 grams2-grams tick id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 56
57
Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM List-compression techniques [BJL+09] Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 57
58
Motivation Inverted index: very large IR : lossless compression (delta encoding, mostly disk-based) Decompression overhead Difficult to tune compression ratio 58
59
Solution Two lossy-compression techniques Queries become faster Flexibility to choose space / time tradeoff 59
60
Approach 1: Discarding Lists tfviirefrvneun in …… 2-grams Inverted Lists 5959 1515 7979 569569 1245612456 Discarded 60
61
Approach 2: Combining Lists 2-grams Inverted Lists tfviirefrvneun in …… 134579134579 569569 12391239 7979 6969 1245612456 Combined 61
62
Technical challenges Effect on list-merging algorithms How to choose lists to discard/merge 62
63
Outline (end of part 1) Part 1: Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM [LWY07,YWL08] List-compression techniques Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 63
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.