Efficient Approximate Search on String Collections, Part I. Marios Hadjieleftheriou, Chen Li 1
DBLP Author Search 2
Try their names (good luck!): Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research) 3
4
Better system? 5
People Search at UC Irvine 6
Web Search: errors in queries; errors in data; bring query and meaningful results closer together. Figure: actual queries gathered by Google. 7
Data Cleaning: relation R: informix, microsoft, …, …; relation S: infromix, …, mcrosoft, … 8
Problem Formulation: find strings similar to a given string: dist(Q,D) <= δ. Example: find strings similar to “hadjeleftheriou”. Performance is important: 10 ms per query corresponds to 100 queries per second (QPS); 5 ms corresponds to 200 QPS. 9
Outline (split across Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion 10
Preliminaries 11 Next…
Similarity Functions. “Similar to”: a domain-specific function that returns a similarity value between two strings. Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE. See [KSS06] for an excellent survey. 12
Edit Distance: a widely used metric to define string similarity. ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2. Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1,s2) = 2. 13
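The definition above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch, not code from the tutorial; the function name `edit_distance` is an assumption made here for illustration.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,         # delete c1
                            curr[j - 1] + 1,     # insert c2
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

The slide's example needs one substitution (m to n) and one deletion (the final s), which is why the result is 2.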
Gram-based algorithms List-merging algorithms [LLL08] Variable-length grams (VGRAM) [LWY07,YWL08] 14 Next…
“q-grams” of strings. Example: the 2-grams of "universal" are un, ni, iv, ve, er, rs, sa, al. 15
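Extracting the q-grams of a string is a sliding window of width q. A minimal sketch follows; the helper name `qgrams` is an assumption, and no padding characters are added at the string boundaries.

```python
def qgrams(s, q=2):
    """Return the list of all substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, so "universal" (length 9) has 8 2-grams.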
Edit operation’s effect on grams: with fixed-length q-grams, k operations could affect k * q grams (example string: universal). If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) – k * q. 16
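The count bound can be tried on a small pair of strings. The sketch below assumes the bound reads: if ed(s1,s2) <= k, the two strings share at least (|s1| - q + 1) - k * q grams. The helper names are illustrative assumptions, and common grams are counted as a bag (multiset) intersection.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_grams(s1, s2, q):
    """Size of the bag intersection of the two gram multisets."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


def count_lower_bound(length, q, k):
    """Minimum # of common grams for a string of this length within k edits."""
    return (length - q + 1) - k * q


# "shtick" and "stick" are 1 edit apart (delete the 'h').
bound = count_lower_bound(len("shtick"), q=2, k=1)
print(bound, common_grams("shtick", "stick", 2))  # 3 3
```

Here the bound is tight: the single deletion destroys exactly the two grams that contain 'h'.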
q-gram inverted lists. Strings (one id each): rich, stick, stich, stuck, static. Grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
Searching using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. Query 2-grams: sh, ht, ti, ic, ck. Strings: rich, stick, stich, stuck, static. 2-gram lists: at, ch, ck, ic, ri, st, ta, ti, tu, uc. Count filter: # of common grams >= 3. 18
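The search on this slide can be sketched end to end: build the inverted lists, count how often each string id appears on the query's lists, keep ids that reach the count threshold, and verify the survivors with edit distance. This is an illustrative sketch, not the tutorial's implementation; string ids 0 to 4 (in slide order) and all function names are assumptions.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # ids 0..4 assumed


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def build_index(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def search(query, strings, index, k, q=2):
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = range(len(strings))     # count filter cannot prune
    else:
        counts = defaultdict(int)
        for g in qgrams(query, q):
            for sid in index.get(g, []):
                counts[sid] += 1             # may overcount, never undercounts
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    # The count filter only yields candidates; verify each one.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


index = build_index(STRINGS, q=2)
print(search("shtick", STRINGS, index, k=1))  # ['stick']
```

Only "stick" appears on all three matching lists (ti, ic, ck), so it is the only candidate that reaches the threshold of 3.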
T-occurrence Problem: given inverted lists sorted in ascending order, merge them to find the elements whose # of occurrences is ≥ T. 19
Example T = 4 Result:
List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02] 21
Heap-based Algorithm: push the frontier of each list onto a min-heap; pop elements in ascending order and count the # of occurrences of each element using the heap. 22
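A heap-based merger for the T-occurrence problem can be sketched as follows. The function name `heap_merge` is an assumption; the three input lists are the ti, ic and ck lists of the running example, assuming string ids 0 to 4 in slide order.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result, current, count = [], None, 0
    while heap:
        value, li, pos = heapq.heappop(heap)
        if value != current:                 # a new element starts
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 0
        count += 1
        if pos + 1 < len(lists[li]):         # push the list's next frontier
            heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(heap_merge(lists, T=3))  # [1]
```

Every element of every list passes through the heap once, which is the cost the skipping algorithms later in the deck try to avoid.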
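MergeOpt Algorithm [SK04]: treat the T-1 longest lists as long lists and the rest as short lists; merge the short lists, then look up each candidate in the long lists with binary search. 23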
Example of MergeOpt: count threshold T = 4 (occurrences ≥ 4); long lists: 3; short lists: 2. 24
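The long/short split can be sketched as below. This is a simplified illustration of the idea rather than the algorithm of [SK04] as published: the short lists are counted with a `Counter` instead of a heap, each list is assumed to hold distinct ids in ascending order, and the function name is an assumption.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Elements occurring on at least T lists, using T-1 long lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # An answer must occur at least once on the short lists, because the
    # T-1 long lists alone can contribute at most T-1 occurrences.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)
    result = []
    for elem in sorted(counts):
        total = counts[elem]
        for lst in long_lists:               # binary search in each long list
            i = bisect_left(lst, elem)
            if i < len(lst) and lst[i] == elem:
                total += 1
        if total >= T:
            result.append(elem)
    return result


print(merge_opt([[1, 2, 4], [0, 1, 2, 4], [1, 3]], T=3))  # [1]
```

The long lists are never scanned in full; they are only probed for candidates that already appear on a short list.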
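ScanCount: keep an array of # of occurrences indexed by string id (1, 2, 3, …); scan every list and increment the count of each id by 1; an id whose count reaches the count threshold (T = 4 in the example) is a result. 25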
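ScanCount needs no heap at all, only a counter per string id. A minimal sketch, assuming ids are the integers 0 to num_ids - 1 and that each list holds distinct ids; the function name is an assumption.

```python
def scan_count(lists, T, num_ids):
    """Elements occurring on at least T lists, via one counter per id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:             # reported exactly once
                result.append(sid)
    return sorted(result)


print(scan_count([[1, 2, 4], [0, 1, 2, 4], [1, 3]], T=3, num_ids=5))  # [1]
```

The trade-off is memory proportional to the number of strings, in exchange for a plain sequential scan of the lists.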
List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02] 26
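MergeSkip algorithm [BK02, LLL08]: keep the list frontiers in a min-heap; when the top element occurs fewer than T times, pop T-1 frontiers and jump each popped list to its smallest element greater than or equal to the new heap top. 27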
Example of MergeSkip Count threshold T≥ 4 minHeap Jump
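The pop-and-jump step can be sketched as follows. This is an illustrative reading of MergeSkip, not the authors' code; the function name is an assumption, and each list is assumed to hold distinct ids in ascending order.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements occurring on at least T lists, skipping hopeless elements."""
    pos = [0] * len(lists)                   # current frontier of each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:                    # fewer than T live lists: done
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:    # pop every frontier equal to top
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:                 # advance each list by one
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < T - 1:   # pop T-1 frontiers in total
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            # Anything below target lies only on the T-1 popped lists,
            # so it cannot reach T occurrences: jump past it.
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


print(merge_skip([[1, 2, 4], [0, 1, 2, 4], [1, 3]], T=3))  # [1]
```

The jump is a binary search within the remaining part of each popped list, so long runs of elements below the target are never pushed onto the heap.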
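DivideSkip Algorithm [LLL08]: split the lists into long lists and short lists; run MergeSkip on the short lists, then look up each candidate in the long lists with binary search. 29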
How many lists are treated as long lists? 30
Length Filtering: for ed(s,t) ≤ 2, a string s of length 19 and a string t of length 10 cannot match; the pair is pruned by length only. 31
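The length filter follows from the fact that each edit operation changes the length of a string by at most 1. A minimal sketch, with an assumed function name and placeholder strings of the two lengths on the slide:

```python
def passes_length_filter(s, t, k):
    """False means ed(s, t) > k is certain; True means the pair must be checked."""
    return abs(len(s) - len(t)) <= k


s = "x" * 19   # placeholder strings with the slide's lengths
t = "y" * 10
print(passes_length_filter(s, t, k=2))  # False: lengths differ by 9 > 2
```

Passing the filter is only a necessary condition, so surviving pairs still go through the other filters and a final verification.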
Positional Filtering: for ed(s,t) ≤ 2, the gram (ab,1) in s and the gram (ab,12) in t cannot be matched, because their positions differ by more than 2. 32
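Positional filtering can be sketched by attaching a position to every gram and matching two equal grams only when their positions differ by at most k. The helper names and the two example strings are assumptions chosen so that "ab" sits at position 1 in one string and position 12 in the other; positions are 1-based to match the slide.

```python
from collections import defaultdict


def positional_qgrams(s, q=2):
    """List of (gram, 1-based start position)."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def common_positional_grams(s, t, k, q=2):
    """Count grams of s that match an unused equal gram of t within k positions."""
    t_positions = defaultdict(list)
    for gram, p in positional_qgrams(t, q):
        t_positions[gram].append(p)
    used, common = set(), 0
    for gram, p in positional_qgrams(s, q):
        for tp in t_positions.get(gram, []):     # ascending positions
            if (gram, tp) not in used and abs(p - tp) <= k:
                used.add((gram, tp))
                common += 1
                break
    return common


s = "abcdefghijklm"   # (ab, 1)
t = "cdefghijklmab"   # (ab, 12)
print(common_positional_grams(s, t, k=2))  # 10: "ab" is not counted
```

Without the position check the two strings would share 11 grams; the filter removes the "ab" match and so tightens the count filter.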
Combine filters with list-merging algorithms [LLL08]: a filter tree 33
Variable-length grams (VGRAM) [LWY07,YWL08] 34 Next…
2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?) ≤ 1. Query 3-grams: sht, hti, tic, ick. Strings: rich, stick, stich, stuck, static. 3-gram lists: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck. Count filter: # of common grams >= 1. 35
Observation 1: dilemma of choosing “q”. Increasing “q” causes: longer grams; shorter lists; a smaller # of common grams of similar strings. Example index: strings rich, stick, stich, stuck, static; grams at, ch, ck, ic, ri, st, ta, ti, tu, uc
Observation 2: skewed distribution of gram frequencies. DBLP: 276,699 article titles. Popular 5-grams: ation (>114K times), tions, ystem, catio. 37
VGRAM: Main idea. Grams with variable lengths (between q_min and q_max). zebra: ze (123). corrasion: co (5213), cor (859), corr (171). Advantages: reduce index size; reduce running time; adoptable by many algorithms. 38
Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 39
Challenge 1: String → variable-length grams? Fixed-length 2-grams vs. variable-length grams of "universal", using the [2,4]-gram dictionary: ni, ivr, sal, uni, vers. 40
Representing the gram dictionary as a trie: ni, ivr, sal, uni, vers 41
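Decomposing a string with a gram dictionary can be sketched as follows, assuming the rule is: at each position take the longest dictionary gram that starts there (falling back to the q_min-gram when none matches), and drop any gram that is fully covered by a gram already produced. This is an illustrative reading of the VGRAM approach [LWY07]; a Python `set` stands in for the trie, and the function name is an assumption.

```python
def vgram_decompose(s, dictionary, q_min, q_max):
    """Return (1-based position, gram) pairs for string s."""
    grams = []
    max_end = -1                              # right-most index covered so far
    for p in range(len(s) - q_min + 1):
        gram = s[p:p + q_min]                 # fallback: the q_min-gram
        for length in range(min(q_max, len(s) - p), q_min - 1, -1):
            if s[p:p + length] in dictionary: # longest match first
                gram = s[p:p + length]
                break
        end = p + len(gram) - 1
        # Earlier grams start before p, so this gram is covered by one of
        # them exactly when it does not extend past max_end.
        if end > max_end:
            grams.append((p + 1, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, q_min=2, q_max=4))
# [(1, 'uni'), (3, 'iv'), (4, 'vers'), (7, 'sal')]
```

The string yields 4 variable-length grams instead of 8 fixed-length 2-grams, which is where the index-size reduction comes from. With a trie, the inner loop becomes a single walk down from the root instead of repeated set lookups.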
Challenge 2: Constructing a gram dictionary (q_min = 2, q_max = 4). Frequency-based [LWY07]; cost-based [YWL08]. 42
Challenge 3: Edit operation’s effect on grams. With fixed-length q-grams, k operations could affect k * q grams (example string: universal). 43
Deletion affects variable-length grams: a deletion at position i can affect only grams that overlap the range [i - q_max + 1, i + q_max - 1]; grams outside this range are not affected. 44
Main idea: for a string, for each position, compute the number of grams that could be destroyed by an operation at this position. Compute the number of grams possibly destroyed by k operations. Store these numbers (for all data strings) as part of the index. Example (vector of s): with 2 edit operations, at most 4 grams can be affected. Use this number to do count filtering. 45
Summary of VGRAM index 46
Challenge 4: adopting VGRAM. Easily adoptable by many algorithms. Basic interfaces: (1) string s → grams; (2) strings s1, s2 such that ed(s1,s2) <= k → min # of their common grams. 47
Lower bound on # of common grams if ed(s1,s2) <= k. Fixed length (q): (|s1| - q + 1) – k * q (example string: universal). Variable lengths: # of grams of s1 – NAG(s1,k). 48
Example: algorithm using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. Strings: rich, stick, stich, stuck, static. With 2-grams: lists …, ck, ic, …, ti, …; lower bound = 3. With 2-4 grams: query grams sh, ht, tick; lists …, ck, ic, ich, …, tic, tick, …; lower bound = 1. 49
End of Part I. Outline (split across Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion 50
References
[AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. VLDB 2006
[ACGK08] Incorporating string transformations in record matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008
[BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire Kenyon. SODA 2002
[BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. ICDE 2009
[BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998
[CGG+05] Data cleaning in Microsoft SQL Server. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE
[HYK+08] Hashed samples: selectivity estimators for set similarity selection queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB
[JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB
[JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, Jianhua Feng. WWW 2009
[JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ 2008
[KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD
[LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, Yiming Lu. ICDE
[LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007
[LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li, Bin Wang, Xiaochun Yang. VLDB 2007
[MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS
[SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004
[XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008
[XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008
[YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. SIGMOD