Download presentation
Presentation is loading. Please wait.
Published byColeen Holmes Modified over 9 years ago
1
Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1
2
DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2
3
Try their names (good luck!) http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3 Yannis PapakonstantinouMeral OzsoyogluMarios Hadjieleftheriou UCSD Case Western AT&T--Research
4
4
5
Better system? 5 http://dblp.ics.uci.edu/authors/
6
People Search at UC Irvine 6 http://psearch.ics.uci.edu/
7
Web Search http://www.google.com/jobs/britney.html Errors in queries Errors in data Bring query and meaningful results closer together Actual queries gathered by Google 7
8
Data Cleaning R informix microsoft … … S infromix … mcrosoft … 8
9
Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS 9
10
Outline Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion 10 Part I Part II
11
Preliminaries 11 Next…
12
Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey 12
13
13 A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 13
14
Gram-based algorithms List-merging algorithms [LLL08] Variable-length grams (VGRAM) [LWY07,YWL08] 14 Next…
15
“ q-grams ” of strings u n i v e r s a l 2-grams 15
16
Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 16 If ed(s1,s2) = (|s 1 |- q + 1) – k * q
17
q-gram inverted lists id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 17
18
# of common grams >= 3 Searching using inverted lists Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 23 0 1 4 2 0 13 0124 4 124 3 3 ti iccksh ht ti ic ck 18
19
T-occurrence Problem Find elements whose occurrences ≥ T Ascending order Merge 19
20
Example T = 4 Result: 13 1 3 5 10 13 10 13 15 5 7 13 15 20
21
List-Merging Algorithms HeapMergerMergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkipDivideSkip 21
22
Heap-based Algorithm Min-heap Count # of occurrences of each element using a heap Push to heap …… 22
23
MergeOpt Algorithm [SK04] Long Lists: T-1Short Lists Binary search 23
24
Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 Long Lists: 3 Short Lists: 2 24
25
25 ScanCount 123…123… 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 # of occurrences 0 0 0 4 1 Increment by 1 1 String ids 13 14 15 0 2 0 0 Result!
26
List-Merging Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip [SK04] [LLL08, BK02] 26
27
MergeSkip algorithm [BK02, LLL08] Min-heap …… Pop T-1 T-1 Jump Greater or equals 27
28
Example of MergeSkip 1 3 5 10 15 5757 1315 Count threshold T≥ 4 minHeap 10 1315 1 5 Jump 15 13 17 28
29
DivideSkip Algorithm [LLL08] Long ListsShort Lists Binary search MergeSkip 29
30
How many lists are treated as long lists? 30
31
Length Filtering Ed(s,t) ≤ 2 s: t: Length: 19 Length: 10 By length only! 31
32
Positional Filtering ab ab Ed(s,t) ≤ 2 s t (ab,1) (ab,12) 32
33
Combine filters with list-merging algorithms [LLL08] 33 A filter tree
34
Variable-length grams (VGRAM) [LWY07,YWL08] 34 Next…
35
# of common grams >= 1 2-grams -> 3-grams? Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 3-grams ati ich ick ric sta sti stu tat tic tuc uck tic icksht hti tic ick 4 2 4 1 2 0 1 0 3 4 1 3 4 2 3 id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 35
36
Observation 1: dilemma of choosing “q” Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 36
37
Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio 37
38
VGRAM: Main idea Grams with variable lengths (between q min and q max ) zebra - ze(123) corrasion - co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms 38
39
Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 39
40
Challenge 1: String Variable-length grams? Fixed-length 2-grams Variable-length grams u n i v e r s a l ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l 40
41
Representing gram dictionary as a trie ni ivr sal uni vers 41
42
42 Step 2: Constructing a gram dictionary q min =2 q max =4 Frequency-based [LYW07] Cost-based [YLW08]
43
Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 43
44
Deletion affects variable-length grams i-q max +1i+q max - 1 Deletion Not affected Affected i 44
45
Main idea For a string, for each position, compute the number of grams that could be destroyed by an operation at this position Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index 45 Vector of s = With 2 edit operations, at most 4 grams can be affected Use this number to do count filtering
46
Summary of VGRAM index 46
47
Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min # of their common grams 47
48
Lower bound on # of common grams If ed(s1,s2) =: (|s 1 |- q + 1) – k * q u n i v e r s a l Fixed length (q) Variable lengths: # of grams of s1 – NAG(s1,k) 48
49
Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤1 124 12 1 0 4 3 … ck ic … ti … Lower bound = 3 Lower bound = 1 sh ht tick 24 1 4 0 1 1 2 3 … ck ic ich … tic tick … 2-4 grams2-grams tick id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 49
50
End of part I Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion 50 Part I Part II
51
References [AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik.VLDB 2006 [ACGK08] Incorporating string transformations in record matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008 [BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire Kenyon. SODA 2002 [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. ICDE 2009 [BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998 [CGG+05]Data cleaning in microsoft SQL server 2005. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005 [CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE06 [CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD08 [HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008 51
52
References [HYK+08] Hashed samples: selectivity estimators for set similarity selection queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008. [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB 2005. [JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng. WWW 2009 [JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ08 [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD 2006. [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008. [LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007 [LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007 [MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007 52
53
References [SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004 [XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008 [XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008 [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, and Chen Li. SIGMOD 2008 53
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.