Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li
DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!): Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research) http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Better system? http://dblp.ics.uci.edu/authors/
People Search at UC Irvine http://psearch.ics.uci.edu/
Web Search: errors in queries, errors in data. Bring query and meaningful results closer together. Actual queries gathered by Google: http://www.google.com/jobs/britney.html
Data Cleaning: relation R holds informix, microsoft, …; relation S holds the misspelled variants infromix, …, mcrosoft
Problem Formulation: find strings similar to a given string, i.e., all D with dist(Q,D) <= δ. Example: find strings similar to “hadjeleftheriou”. Performance is important! At 10 ms per query: 100 queries per second (QPS); at 5 ms per query: 200 QPS
Outline Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II
Next… Preliminaries
Similarity Functions. “Similar to”: a domain-specific function returns a similarity value between two strings. Examples: Edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE. See [KSS06] for an excellent survey
Edit Distance: a widely used metric to define string similarity. ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2. Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2 (substitute m with n, delete the final s)
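The definition above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch with unit costs for all three operations; the function name `edit_distance` is chosen here for illustration and is not from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))
```

The cost is O(|s1| * |s2|) per pair, which is why the rest of the deck focuses on filtering candidates before verification.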
Next… Gram-based algorithms: List-merging algorithms [LLL08], Variable-length grams (VGRAM) [LWY07,YWL08]
“q-grams” of strings: the 2-grams of universal are un, ni, iv, ve, er, rs, sa, al
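A minimal sketch of q-gram extraction by sliding a window of length q over the string. The helper name `qgrams` is an assumption for illustration, and no padding characters are added at the string boundaries.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All substrings of length q, in order of position (no boundary padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n yields n - q + 1 grams, the quantity that appears in the count bound later in the deck.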
Edit operation’s effect on grams. Fixed length q (example string: universal): k operations could affect k * q grams. If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
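The bound can be evaluated directly. The sketch below computes it and compares it with the actual number of shared grams for one pair; the names `min_common_grams` and `common_grams` are hypothetical, and shared grams are counted as a multiset intersection.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(length: int, q: int, k: int) -> int:
    """Lower bound (|s1| - q + 1) - k * q on common grams when ed(s1, s2) <= k."""
    return (length - q + 1) - k * q


def common_grams(s1: str, s2: str, q: int = 2) -> int:
    """Size of the multiset intersection of the two gram collections."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


bound = min_common_grams(len("shtick"), q=2, k=1)
print(bound, common_grams("shtick", "stick"))
```

When the bound is zero or negative, the filter prunes nothing and every string remains a candidate.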
q-gram inverted lists: one list per 2-gram (at, ch, ck, ic, ri, st, ta, ti, tu, uc), each holding the ids of the strings that contain the gram. Strings: rich, stick, stich, stuck, static
Searching using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. 2-grams of the query: sh ht ti ic ck. # of common grams >= 3, i.e., (6 - 2 + 1) - 1 * 2. Merge the inverted lists of the query grams (ti, ic, ck have lists in the index) over the strings rich, stick, stich, stuck, static
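The search above can be sketched end to end: build the 2-gram inverted lists, count how many query grams each string shares, and keep the ids that reach the threshold. The function names and the choice to assign ids starting from 0 are assumptions made here; the sketch also stores each gram once per string, which is enough for this example but ignores repeated grams.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # one posting per (gram, string)
            index[gram].append(sid)
    return index


def candidates(query: str, index: dict[str, list[int]], q: int, k: int) -> list[int]:
    """Ids sharing at least (|query| - q + 1) - k * q grams with the query."""
    threshold = (len(query) - q + 1) - k * q
    counts = Counter()
    for gram in qgrams(query, q):
        for sid in index.get(gram, ()):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, q=2)
print([strings[sid] for sid in candidates("shtick", index, q=2, k=1)])
```

Candidates still have to be verified with the real edit distance, since the count condition is necessary but not sufficient.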
T-occurrence Problem: merge the inverted lists (each sorted in ascending order) and find elements whose occurrences ≥ T
Example: T = 4. List elements as flattened from the figure: 1 3 5 10 13 10 13 15 5 7 13 13 15. Result: 13 (the only id occurring at least 4 times)
List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
Heap-based Algorithm: push the frontier of each list to a min-heap, then count # of occurrences of each element as it is popped
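A minimal sketch of the heap-based merge for the T-occurrence problem. The function name `heap_merge` is hypothetical, and the five example lists are one plausible reading of the flattened figure numbers, not a confirmed reconstruction. Each list is assumed sorted and free of duplicates.

```python
import heapq


def heap_merge(lists: list[list[int]], T: int) -> list[int]:
    """Return the elements that occur in at least T of the sorted lists."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # advance the list we just popped from
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


# Assumed split of the example numbers into five lists.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(heap_merge(lists, T=4))
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.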
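MergeOpt Algorithm [SK04]: treat the T-1 longest lists as long lists and the rest as short lists. Merge the short lists to get candidates, then binary search each long list for every candidate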
Example of MergeOpt: count threshold T = 4 (occurrences ≥ 4), long lists: 3, short lists: 2. List elements as flattened from the figure: 1 3 5 10 13 10 13 15 5 7 13 13 15
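A sketch of the MergeOpt idea under stated assumptions: the T-1 longest lists are set aside, the remaining short lists are counted (with a plain counter here rather than a heap, for brevity), and each candidate is then looked up in the long lists by binary search. The name `merge_opt` and the five-list split of the example numbers are assumptions. The approach is safe because an element occurring T times must appear in at least one short list.

```python
import bisect
from collections import Counter


def merge_opt(lists: list[list[int]], T: int) -> list[int]:
    """Elements occurring in at least T sorted lists, probing the T-1 longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Every answer must occur at least once among the short lists.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)

    result = []
    for value in sorted(counts):
        count = counts[value]
        for lst in long_lists:
            pos = bisect.bisect_left(lst, value)  # binary search in a long list
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


# Assumed split of the example numbers into five lists (3 long, 2 short for T = 4).
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_opt(lists, T=4))
```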
ScanCount: count threshold T = 4 (occurrences ≥ 4). Keep an array of # of occurrences indexed by string id; scan every list and increment the count of each id by 1. In the example, id 13 reaches 4 occurrences (result!), while id 15 reaches only 2
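ScanCount can be sketched in a few lines: one counter per string id, one pass over all lists. The name `scan_count`, the parameter `num_ids`, and the five-list split of the example numbers are assumptions for illustration.

```python
def scan_count(lists: list[list[int]], T: int, num_ids: int) -> list[int]:
    """Elements occurring in at least T lists, using an array of counters."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1       # increment by 1
            if counts[sid] == T:   # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result)


# Assumed split of the example numbers into five lists; ids range over 0..15.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(scan_count(lists, T=4, num_ids=16))
```

There is no heap and no comparison between lists, at the price of an array sized by the number of strings in the collection.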
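MergeSkip algorithm [BK02, LLL08]: pop the top of the min-heap together with its duplicates. If it occurs fewer than T times, pop until T-1 elements in total are popped, then jump each popped list to its smallest element greater than or equal to the new top of the heap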
Example of MergeSkip (figure): count threshold T = 4 (occurrences ≥ 4); popped lists jump over elements that cannot reach the threshold before re-entering the min-heap
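A sketch of MergeSkip as described above, assuming sorted lists without duplicates. The function name `merge_skip` is hypothetical, and the lists are an assumed five-list split of the numbers used in the earlier examples, not the lists from this figure.

```python
import bisect
import heapq


def merge_skip(lists: list[list[int]], T: int) -> list[int]:
    """Elements occurring in at least T sorted lists, skipping hopeless elements."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T active lists cannot produce a result
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:  # advance each popped list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop more entries until T-1 lists are out of the heap.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]  # anything smaller occurs in at most T-1 lists
            for _, i, pos in popped:
                new_pos = bisect.bisect_left(lists[i], target, pos)  # jump
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_skip(lists, T=4))
```

On these lists the elements 3 and 7 are never pushed onto the heap, which is the saving relative to the plain heap merge.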
DivideSkip Algorithm [LLL08]: run MergeSkip on the short lists to get candidates, then binary search the long lists for each candidate
How many lists are treated as long lists?
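Length Filtering: if Ed(s,t) ≤ 2, the lengths of s and t differ by at most 2. Example: s has length 10, so any t whose length falls outside [8, 12] is pruned by length only!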
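The length filter is a one-line check, sketched here with a hypothetical helper name and made-up example strings. Each edit operation changes the length by at most 1, so k operations change it by at most k.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k is already certain from the lengths alone."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # hypothetical string of length 10
print(passes_length_filter(s, "a" * 12, k=2), passes_length_filter(s, "a" * 13, k=2))
```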
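Positional Filtering: Ed(s,t) ≤ 2. s contains the gram ab at position 1, written (ab,1); t contains ab at position 12, written (ab,12). The positions differ by more than 2, so this pair does not count as a common gram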
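A sketch of the positional check on a single pair of grams. The helper names are hypothetical, and positions are 0-based here, which does not change the difference being tested.

```python
def positional_qgrams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """Grams paired with their starting position in s."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def grams_can_match(g1: str, p1: int, g2: str, p2: int, k: int) -> bool:
    """Equal grams count as common only if their positions differ by at most k."""
    return g1 == g2 and abs(p1 - p2) <= k


print(grams_can_match("ab", 1, "ab", 12, k=2))
print(positional_qgrams("abc"))
```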
A filter tree Combine filters with list-merging algorithms [LLL08]
Next… Variable-length grams (VGRAM) [LWY07,YWL08]
2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?) ≤ 1. 3-grams of the query: sht hti tic ick. # of common grams >= 1, i.e., (6 - 3 + 1) - 1 * 3. 3-gram inverted lists: ati ich ick ric sta sti stu tat tic tuc uck, over the strings rich, stick, stich, stuck, static
Observation 1: dilemma of choosing “q”. Increasing “q” causes: longer grams, shorter lists, smaller # of common grams of similar strings
Observation 2: skewed distribution of gram frequencies. DBLP: 276,699 article titles. Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea. Grams with variable lengths (between qmin and qmax). Examples (number in parentheses: gram frequency): zebra: ze(123); corrasion: co(5213), cor(859), corr(171). Advantages: reduce index size, reduce running time, adoptable by many algorithms
Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms?
Challenge 1: String -> variable-length grams? Fixed-length 2-grams of universal: un ni iv ve er rs sa al. Variable-length grams are generated from a [2,4]-gram dictionary: ni, ivr, sal, uni, vers
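One way to decompose a string with such a dictionary is a greedy left-to-right pass, in the spirit of [LWY07]: at each position take the longest dictionary gram starting there, fall back to the qmin-gram if none matches, and drop any gram that is fully covered by a gram already emitted. The sketch below is an illustration under those assumptions, not the authors' exact procedure; the name `vgen` and the use of a plain set instead of a trie are choices made here.

```python
def vgen(s: str, dictionary: set[str], qmin: int, qmax: int) -> list[tuple[int, str]]:
    """Greedy decomposition of s into positional variable-length grams."""
    grams = []
    covered_end = -1  # last position covered by any gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = None
        # Longest dictionary gram starting at p, between qmin and qmax characters.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]  # fall back to the shortest allowed gram
        end = p + len(gram) - 1
        # Earlier grams start at or before p, so the gram is subsumed
        # exactly when it does not extend past covered_end.
        if end > covered_end:
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, qmin=2, qmax=4))
```

Under these assumptions universal is split into four grams instead of the eight fixed-length 2-grams.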
Representing gram dictionary as a trie ni ivr sal uni vers
Challenge 2: Constructing a gram dictionary. qmin=2, qmax=4. Frequency-based [LWY07]; cost-based [YWL08]
Challenge 3: Edit operation’s effect on grams. Fixed length q (example string: universal): k operations could affect k * q grams
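Deletion affects variable-length grams: a deletion at position i can affect only the grams that overlap the window [i - qmax + 1, i + qmax - 1]; grams entirely to the left or right of this window are not affected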
Main idea: for a string, for each position, compute the number of grams that could be destroyed by an operation at this position. Compute the number of grams possibly destroyed by k operations. Store these numbers (for all data strings) as part of the index. Example: vector of s = <2,4,6,8,9>; with 2 edit operations, at most 4 grams can be affected. Use this number to do count filtering
Summary of VGRAM index
Challenge 4: adopting VGRAM. Easily adoptable by many algorithms. Basic interfaces: (1) string s -> grams; (2) strings s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
Lower bound on # of common grams. Fixed length (q): if ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q. Variable lengths: # of grams of s1 - NAG(s1,k), where NAG(s1,k) is the number of grams of s1 that k edit operations can affect
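The two bounds can be put side by side. In this sketch the NAG values are read from a precomputed vector such as <2,4,6,8,9>, where entry k-1 is the number of grams that k operations can affect; the function names and the gram count of 9 used in the example are assumptions for illustration.

```python
def lower_bound_fixed(length: int, q: int, k: int) -> int:
    """Fixed-length grams: (|s1| - q + 1) - k * q."""
    return (length - q + 1) - k * q


def lower_bound_vgram(num_grams: int, nag: list[int], k: int) -> int:
    """Variable-length grams: # of grams of s1 minus NAG(s1, k)."""
    return num_grams - nag[k - 1]


nag_vector = [2, 4, 6, 8, 9]  # NAG(s, 1) .. NAG(s, 5)
print(lower_bound_fixed(6, q=2, k=1))
print(lower_bound_vgram(num_grams=9, nag=nag_vector, k=2))  # 9 grams is a hypothetical count
```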
Example: algorithm using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. With 2-grams (inverted lists … ck ic ti): lower bound = 3. With [2,4]-grams (query grams sh ht tick; inverted lists … ck ic ich tic tick): lower bound = 1. Strings: rich, stick, stich, stuck, static
End of part I Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II