Published by Jesus Duffy. Modified over 11 years ago.
1
Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li, Yiming Lu
2
Example: a movie database

Star             Title            Year  Genre
Keanu Reeves     The Matrix       1999  Sci-Fi
Samuel Jackson   Iron man         2008  Sci-Fi
Schwarzenegger   The Terminator   1984  Sci-Fi
Samuel Jackson   The man          2006  Crime

Find movies starring Schwarrzenger.
3
Data may not be clean
Data integration and cleaning:
Relation R (Star): Keanu Reeves; Samuel Jackson; Schwarzenegger
Relation S (Star): Keanu Reeves; Samuel L. Jackson; Schwarzenegger
4
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search output: strings s whose similarity to q, Sim(q,s), satisfies the threshold δ
Sim functions: edit distance, Jaccard coefficient, and cosine similarity
5
Outline
  Problem motivation
  Preliminaries
    Grams
    Inverted lists
  Merge algorithms
  Filtering techniques
  Conclusion
6
String Grams
q-grams
For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
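The gram decomposition on this slide can be sketched in a few lines of Python. The function name qgrams is an illustrative choice, not something from the talk, and no padding characters are added at the string boundaries.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, which is where the count bounds used later come from.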
7
Inverted lists
Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  string ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
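A minimal sketch of how such gram inverted lists could be built, assuming string ids are simply positions in the input collection. The names qgrams and build_index are hypothetical helpers, not part of the talk.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # set() so that a string is listed once per gram, even if the gram repeats
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return dict(index)


index = build_index(["rich", "stick", "stich", "stuck", "static"])
print(index["ic"])  # [0, 1, 2, 4]
print(index["st"])  # [1, 2, 3, 4]
```

Because ids are visited in increasing order, every list comes out sorted, which the merge algorithms later in the talk rely on.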
8
Main Example
Query: stick, with 2-grams (st, ti, ic, ck)
Search condition: ed(s,q) ≤ 1

Data:
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams:
ck  1, 3
ic  0, 1, 2, 4
st  1, 2, 3, 4
ta  4
ti  1, 2, 4
…

Merge: ids with count >= 2 become candidates
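The example can be replayed with a short sketch. It assumes the standard count bound for q-grams, T = (number of query grams) - k*q, which gives the slide's threshold of 2 for q = 2 and k = 1. It counts shared grams by set intersection rather than by merging lists, and verifies candidates with a textbook dynamic-programming edit distance. All function names are illustrative.

```python
def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic row-by-row dynamic program for Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def search(strings, query, k, q=2):
    query_grams = set(qgrams(query, q))
    T = len(qgrams(query, q)) - k * q  # count threshold (assumed bound)
    candidates = [sid for sid, s in enumerate(strings)
                  if len(query_grams & set(qgrams(s, q))) >= T]
    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return T, candidates, answers


print(search(["rich", "stick", "stich", "stuck", "static"], "stick", k=1))
# (2, [1, 2, 3, 4], [1, 2, 3])
```

The candidate "static" shares three grams with the query but is at edit distance 3, so it is removed in the verification step.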
9
Problem definition: find elements whose number of occurrences is ≥ T
Merge inverted lists sorted in ascending order
10
Example: T = 4
List elements: 1 3 5 10 13 10 13 15 5 7 13 15
Result: 13
11
Contributions
  Three new merge algorithms
  New finding: use filters wisely
12
Outline
  Problem motivation
  Preliminaries
  Merge algorithms
    Two previous algorithms
    Our proposed three algorithms
  Filtering techniques
  Conclusion
13
Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip
14
Heap-based Algorithm
Min-heap: push the frontier element of each list onto the heap
Count the number of occurrences of each element using the heap
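A minimal sketch of a heap-based merge, assuming every list is sorted ascending and holds no duplicate ids. The function name heap_merge and the sample lists are illustrative and do not reproduce the slide's original list layout.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Heap entries are (element, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier equal to the smallest element and count them.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [13, 15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.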
15
MergeOpt Algorithm
Long lists: the T-1 longest lists, probed by binary search
Short lists: the remaining lists, merged
16
Example of MergeOpt [Sarawagi et al 2004]
List elements: 1 3 5 10 13 10 13 15 5 7 13 15
Count threshold T = 4
Long lists: 3
Short lists: 2
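The long/short split can be sketched as follows. This is a simplified illustration, not the published implementation: it counts occurrences on the short lists with a Counter instead of a heap, then probes the T-1 longest lists with binary search. The name merge_opt and the sample lists are assumptions.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Elements on at least T sorted lists, probing the T-1 longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # A result must appear on at least one short list, because only
    # T-1 long lists exist. Count occurrences over the short lists.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)
    result = []
    for elem in sorted(counts):
        total = counts[elem]
        for lst in long_lists:
            pos = bisect_left(lst, elem)  # binary search in a long list
            if pos < len(lst) and lst[pos] == elem:
                total += 1
        if total >= T:
            result.append(elem)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [13, 15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 and five lists, three lists are treated as long and two as short, matching the counts on the slide.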
17
Can we run faster?
18
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
19
ScanCount Example
List elements: 1 3 5 10 13 10 13 15 5 7 13 15
Count threshold T = 4
Keep an array of occurrence counts indexed by string id (1, 2, 3, …, 13, 14, 15)
Scan the lists and increment an id's count by 1 each time the id appears
String id 13 reaches count 4: result!
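A sketch of the counting idea, assuming string ids are integers in the range 0 to num_strings - 1 so they can index an array directly. The name scan_count, the num_strings parameter, and the sample lists are illustrative.

```python
def scan_count(lists, T, num_strings):
    """Elements on at least T lists, found with one counter per string id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report an id exactly once, at the moment it reaches T.
            if counts[sid] == T:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [13, 15]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

No heap is needed and the lists do not even have to be sorted; the price is an array as large as the string collection.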
20
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
21
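MergeSkip algorithm
Min-heap over the frontier elements of the lists
Pop T-1 elements from the heap
Jump: move each popped list to its smallest element greater than or equal to the new top of the heap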
22
Example of MergeSkip
List elements: 1 3 5 10 15 5 7 13 15
Count threshold T = 4
minHeap: 10 13 15 1 5
Jump: 15 13 17
23
Skip is safe
The number of occurrences of each skipped element is ≤ T-1, so no skipped element can be a result
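The pop-and-jump steps can be sketched like this, assuming sorted lists without duplicate ids and T ≥ 1. The function name merge_skip and the sample lists are illustrative; the jump uses bisect_left as the binary search.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements on at least T sorted lists, skipping hopeless elements."""
    pos = [0] * len(lists)  # current frontier position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than T unfinished lists, nothing can reach T occurrences.
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:  # advance each popped list by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop more frontiers until T-1 lists are out of the heap.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            # Anything below target lives only on the T-1 popped lists,
            # so it cannot occur T times: jump straight to >= target.
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [13, 15]]
print(merge_skip(lists, 4))  # [13]
```

On these sample lists with T = 4, the first round pops the frontiers 1, 5 and 10 and jumps all three lists directly to 13, so elements 3 and 7 are never touched.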
24
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
25
DivideSkip Algorithm
Long lists: binary search
Short lists: MergeSkip
26
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
27
Decide L value
A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)
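A sketch of how the pieces could fit together. Several details are assumptions rather than facts from the slides: M is taken to be the length of the longest list, the logarithm is base 2, the default mu = 0.01 is an arbitrary placeholder for the tunable coefficient μ, and L is capped at T-1 so the short lists keep a threshold of at least 1. The function names and sample lists are illustrative.

```python
import heapq
from bisect import bisect_left
from math import log2


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (element, count) pairs with count >= T."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def divide_skip(lists, T, mu=0.01):
    """Elements on at least T sorted lists (requires T >= 1)."""
    nonempty = sorted((l for l in lists if l), key=len, reverse=True)
    if len(nonempty) < T:
        return []
    M = len(nonempty[0])                 # assumed: length of the longest list
    L = int(T / (mu * log2(M) + 1))      # number of long lists
    L = max(0, min(L, T - 1))            # keep a short-list threshold >= 1
    long_lists, short_lists = nonempty[:L], nonempty[L:]
    result = []
    for elem, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:           # look the candidate up in long lists
            p = bisect_left(lst, elem)
            if p < len(lst) and lst[p] == elem:
                count += 1
        if count >= T:
            result.append(elem)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [13, 15]]
print(divide_skip(lists, 4))  # [13]
```

A larger mu shrinks L, sending more lists through MergeSkip and fewer through binary-search lookups; a smaller mu does the opposite.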
28
Experimental data sets
  DBLP data
  IMDB data
  Google Web corpus
29
Performance (DBLP)
DivideSkip is the best one
30
# of accessed elements (DBLP)
DivideSkip is the best one
31
Outline
  Problem motivation
  Preliminaries
  Merge algorithms
  Filtering techniques
    Length, positional filters
    Filter tree
  Conclusion and future work
32
Length Filtering
Ed(s,t) ≤ 2
s: length 19
t: length 10
Pruned by length only!
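The length filter rests on a simple fact: each edit operation changes a string's length by at most 1, so two strings within edit distance k differ in length by at most k. A minimal sketch, with a hypothetical function name:

```python
def passes_length_filter(len_s, len_t, k):
    """False means the pair cannot be within edit distance k."""
    return abs(len_s - len_t) <= k


print(passes_length_filter(19, 10, 2))  # False: pruned without comparing characters
print(passes_length_filter(5, 6, 1))    # True: still needs verification
```

Passing the filter is necessary but not sufficient, so surviving pairs still go through the merge and verification steps.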
33
Positional Filtering
Ed(s,t) ≤ 2
s contains the positional gram (ab,1)
t contains the positional gram (ab,12)
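A sketch of the positional check, assuming 1-based gram positions as on the slide and the usual rule that two occurrences of a gram can correspond only if their positions differ by at most the edit-distance threshold k. The function names are illustrative.

```python
def positional_qgrams(s, q=2):
    """Grams paired with their 1-based start positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_correspond(pgram_s, pgram_t, k):
    """True if the two positional grams may be counted as a shared gram."""
    (gram_s, pos_s), (gram_t, pos_t) = pgram_s, pgram_t
    return gram_s == gram_t and abs(pos_s - pos_t) <= k


print(can_correspond(("ab", 1), ("ab", 12), 2))  # False: positions are 11 apart
print(positional_qgrams("stick"))
# [('st', 1), ('ti', 2), ('ic', 3), ('ck', 4)]
```

In the slide's example the gram ab appears in both strings, but it does not count toward the threshold because the two occurrences are too far apart.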
34
Filter tree
root
Length level: 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list: 5 12 17 28 44
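One way to picture the tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. The sketch below is an assumption about the layout, not the authors' data structure; build_filter_tree and lists_for_gram are hypothetical names, and positions are 1-based.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> ascending list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lists_for_gram(tree, gram, pos, query_len, k):
    """Collect the leaf lists that survive the length and positional filters."""
    out = []
    for length in range(query_len - k, query_len + k + 1):
        if length not in tree or gram not in tree[length]:
            continue
        for p, ids in tree[length][gram].items():
            if abs(p - pos) <= k:
                out.append(ids)
    return out


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
# Query "stick" (length 5), gram "ic" at position 3, edit distance k = 1.
print(lists_for_gram(tree, "ic", 3, 5, 1))  # [[0], [1, 2]]
```

Without filters the gram ic has the single list 0, 1, 2, 4. With both filters the query touches two smaller lists and skips string 4, which shows both the saving and the fragmentation.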
35
Surprising experimental results (DBLP)

            No filter (ms)  Length (ms)  Length+Pos (ms)
DivideSkip  2.23            0.76         1.96

Why does adding the positional filter increase the running time?
36
Filters fragment inverted lists
Applying filters, then merge
Cost: (1) tree traversal, (2) more merging
Saving: reduced total list size
37
Conclusion
  Three new merge algorithms
  We run faster
  Interesting finding: do not abuse filters!
38
Related work
  Approximate string matching [Navarro 2001]
  Varied length grams [Li et al 2007]
  Fuzzy lookup in
39
References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-similarity Joins. In VLDB 2006.
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In SIGMOD 2003.
3. [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate String Joins in a Database (Almost) for Free. In VLDB 2001.
40
References
4. [Li 2007] C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-length Grams. In VLDB 2007.
5. [Navarro 2001] G. Navarro. A Guided Tour to Approximate String Matching. In ACM Computing Surveys, 2001.
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal. Efficient Set Joins on Similarity Predicates. In ACM SIGMOD 2004.