1 Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming Lu
2 Example: a movie database

Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Find movies starring Schwarrzenger.
3 Data may not be clean. Data integration and cleaning across Relation R and Relation S: one Star column lists Keanu Reeves, Samuel Jackson, Schwarzenegger; the other lists Keanu Reeves, Samuel L. Jackson, Schwarzenegger.
4 Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search output: strings s whose similarity to q, Sim(q, s), meets the threshold δ
Sim functions: edit distance, Jaccard coefficient, and cosine similarity
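The problem definition can be sketched as a brute-force baseline. This assumes edit distance as the Sim function and a threshold of the form ed(q, s) ≤ δ; for Jaccard or cosine similarity the comparison would be a lower bound instead. The names `edit_distance` and `naive_search` are illustrative and not from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def naive_search(collection, query, delta):
    """Scan every string and keep those within edit distance delta."""
    return [s for s in collection if edit_distance(query, s) <= delta]


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenger"]
matches = naive_search(stars, "Schwarrzenger", 1)
```

This baseline compares the query with every string in the collection, which is the cost that the gram-based index in the rest of the talk tries to avoid.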
5 Outline: Problem motivation; Preliminaries (Grams, Inverted lists); Merge algorithms; Filtering techniques; Conclusion
6 String grams: q-grams. For example, the 2-grams of "universal" are (un), (ni), (iv), (ve), (er), (rs), (sa), (al).
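A minimal sketch of q-gram extraction, assuming plain sliding windows with no padding characters at the ends (the slide's example shows none). The function name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int = 2) -> list:
    """Return all substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


grams = qgrams("universal", 2)
```

A string of length n yields n - q + 1 grams, so "universal" (9 characters) yields 8 grams for q = 2.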
7 Inverted lists: convert strings to gram inverted lists.

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
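One way to build the gram inverted lists for this collection is sketched below. It assumes string ids are the positions in the input list (0 to 4, as in the main example) and that each id appears at most once per list; `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """All substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # set() keeps an id from being added twice for a repeated gram
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_index(strings)
```

Because ids are assigned in increasing order, every list comes out sorted, which the merge algorithms rely on.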
8 Main example
Query: stick, with grams (st, ti, ic, ck)
Data: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
Gram inverted lists:
ck: 1,3
ic: 0,1,2,4
st: 1,2,3,4
ta: 4
ti: 1,2,4
…
Merge: for ed(s,q) ≤ 1, the ids with count >= 2 are the candidates.
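The count bound in this example can be reproduced with a short sketch. It assumes the usual count-filter lower bound T = (|query| - q + 1) - k * q for edit distance k, which gives 2 for "stick" with q = 2 and k = 1. The helper name `count_threshold` is illustrative, and the lists are copied from the slide.

```python
from collections import Counter


def count_threshold(query, q, k):
    """Lower bound on grams shared with any string within edit distance k.

    Each edit operation can destroy at most q of the query's grams.
    A result <= 0 means the count filter cannot prune anything.
    """
    return len(query) - q + 1 - k * q


lists = {
    "ck": [1, 3],
    "ic": [0, 1, 2, 4],
    "st": [1, 2, 3, 4],
    "ta": [4],
    "ti": [1, 2, 4],
}
query_grams = ["st", "ti", "ic", "ck"]

T = count_threshold("stick", q=2, k=1)
counts = Counter(sid for g in query_grams for sid in lists.get(g, []))
candidates = sorted(sid for sid, c in counts.items() if c >= T)
```

Candidates still need a final edit-distance check, since sharing enough grams is necessary but not sufficient.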
9 Problem definition: given inverted lists sorted in ascending order, merge them to find the elements whose number of occurrences is ≥ T.
10 Example: T = 4 Result:
11 Contributions: three new merge algorithms; new finding: use filters wisely.
12 Outline: Problem motivation; Preliminaries; Merge algorithms (Two previous algorithms, Our proposed three algorithms); Filtering techniques; Conclusion
13 Five merge algorithms. Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]. New: ScanCount, MergeSkip, DivideSkip.
14 Heap-based algorithm: push the lists' elements onto a min-heap and count the # of occurrences of each element as it is popped.
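A minimal sketch of the heap-based merge, assuming each list is sorted ascending and free of duplicates. The function name `heap_merge` is illustrative; the test lists are the query's four lists from the main example.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur in at least T of the sorted lists."""
    # Heap entries are (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        # Push the next element of the list we just popped from.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


query_lists = [[1, 3], [0, 1, 2, 4], [1, 2, 3, 4], [1, 2, 4]]  # ck, ic, st, ti
```

Every element of every list passes through the heap, which is the cost the later algorithms try to cut.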
15 MergeOpt algorithm: the T-1 longest lists are the long lists and the others are the short lists; elements are looked up in the long lists by binary search.
16 Example of MergeOpt [Sarawagi et al 2004]: count threshold T = 4; long lists: 3; short lists: 2.
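The long/short split can be sketched as follows. This is a simplified illustration rather than the published algorithm: the short lists are counted with a `Counter` as a stand-in for a heap merge, and lists are assumed sorted and duplicate-free. The name `merge_opt` is illustrative.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return ids occurring in at least T lists, probing the T-1 longest lists."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # An id found only in the T-1 long lists can never reach T,
    # so every answer must appear in at least one short list.
    counts = Counter(sid for lst in short_lists for sid in lst)
    result = []
    for sid, count in counts.items():
        for lst in long_lists:
            k = bisect_left(lst, sid)  # binary search in a long list
            if k < len(lst) and lst[k] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return sorted(result)


query_lists = [[1, 3], [0, 1, 2, 4], [1, 2, 3, 4], [1, 2, 4]]  # ck, ic, st, ti
```

The long lists are never scanned in full; they are only probed for ids that already appear in some short list.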
17 Can we run faster?
18 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.
19 ScanCount example: count threshold T = 4. Keep a # of occurrences counter for each string id (1, 2, 3, …); scan the lists and increment the counter by 1 for each id seen; an id whose counter reaches T is a result.
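A minimal sketch of ScanCount, assuming string ids are integers in the range 0 to num_strings - 1 and that no id repeats within a list. The function name `scan_count` and its parameters are illustrative.

```python
def scan_count(lists, num_strings, T):
    """Return ids whose counter reaches T after scanning all lists once."""
    counts = [0] * num_strings  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


query_lists = [[1, 3], [0, 1, 2, 4], [1, 2, 3, 4], [1, 2, 4]]  # ck, ic, st, ti
```

There is no heap and no comparison between lists, at the price of an array as large as the string collection.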
20 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.
21 MergeSkip algorithm: min-heap; pop T-1 elements; in those T-1 lists, jump to the first element greater than or equal to the new top of the heap.
22 Example of MergeSkip: count threshold T = 4; minHeap; jump.
23 Skip is safe: the # of occurrences of each skipped element is ≤ T-1, so none of them can reach the threshold T.
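The pop-and-jump idea can be sketched as below. This is one reading of the slides, not the authors' code: when the smallest id occurs fewer than T times, pop until T-1 lists are out of the heap, then advance each of them by binary search to the first id ≥ the new heap top. Lists are assumed sorted and duplicate-free; `merge_skip` is an illustrative name.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids occurring in at least T lists, skipping hopeless ids."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:  # pop every occurrence of t
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(t)
            for i in popped:  # advance the popped lists by one
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < T - 1:  # pop up to T-1 in total
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break  # at most T-1 lists remain, so no id can reach T
            target = heap[0][0]
            for i in popped:  # jump to the first id >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


query_lists = [[1, 3], [0, 1, 2, 4], [1, 2, 3, 4], [1, 2, 4]]  # ck, ic, st, ti
```

Any id skipped by a jump is smaller than the heap top, so it can only occur in the T-1 popped lists, which is why skipping loses no answers.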
24 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.
25 DivideSkip algorithm: long lists: binary search; short lists: MergeSkip.
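A sketch of how the two pieces might fit together, under these assumptions: the caller chooses the number of long lists L with 0 ≤ L < T, the short lists are merged by a MergeSkip variant with threshold T - L that also reports counts, and each surviving id is probed in the long lists. The names `merge_skip_counts` and `divide_skip` are illustrative.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (id, count) pairs with count >= T."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append((t, len(popped)))
            for i in popped:
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


def divide_skip(lists, T, L):
    """Merge short lists with MergeSkip, then probe the L longest lists."""
    if not 0 <= L < T:
        raise ValueError("L must satisfy 0 <= L < T")
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    for sid, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:
            k = bisect_left(lst, sid)  # binary search in a long list
            if k < len(lst) and lst[k] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


query_lists = [[1, 3], [0, 1, 2, 4], [1, 2, 3, 4], [1, 2, 4]]  # ck, ic, st, ti
```

An id that occurs T times overall must occur at least T - L times in the short lists, so lowering the threshold for the short-list merge loses no answers.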
26 How many lists are treated as long lists? Short lists: merge. Long lists: lookup.
27 Deciding the L value (# of long lists): a good balance in the tradeoff is # of long lists = T / (μ log M + 1).
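The formula can be evaluated as in the sketch below. Several details are assumptions not stated on the slide: M is taken to be the length of the longest inverted list, the logarithm is taken base 2, μ is a tuning coefficient supplied by the caller, and the result is rounded down and clamped to the range 0 to T-1. The name `num_long_lists` is illustrative.

```python
import math


def num_long_lists(T, mu, M):
    """Evaluate L = T / (mu * log2(M) + 1), clamped to 0..T-1.

    Assumes M >= 1 (length of the longest list) and mu >= 0.
    """
    raw = T / (mu * math.log2(M) + 1)
    return max(0, min(T - 1, int(raw)))


example_L = num_long_lists(T=4, mu=1.0, M=8)  # 4 / (1.0 * 3 + 1)
```

A larger μ or a longer longest list makes each lookup relatively more expensive, so fewer lists are treated as long.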
28 Experimental data sets: DBLP data, IMDB data, Google Web corpus.
29 Performance (DBLP): DivideSkip is the best one.
30 # of accessed elements (DBLP): DivideSkip is the best one.
31 Outline: Problem motivation; Preliminaries; Merge algorithms; Filtering techniques (Length and positional filters, Filter tree); Conclusion and future work
32 Length filtering: Ed(s,t) ≤ 2; s has length 19 and t has length 10, so the pair is pruned by length only!
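The length filter amounts to one comparison, sketched here with a hypothetical helper `passes_length_filter`. It relies on the fact that each edit operation changes the length of a string by at most one.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k is certain; True means the pair needs more checks."""
    return abs(len(s) - len(t)) <= k


s = "a" * 19  # placeholder string of length 19, as on the slide
t = "b" * 10  # placeholder string of length 10
keep = passes_length_filter(s, t, 2)
```

The placeholder strings stand in for the slide's examples, whose text was not recovered; only their lengths matter to this filter.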
33 Positional filtering: Ed(s,t) ≤ 2; the gram ab occurs in s as (ab,1) and in t as (ab,12). The positions differ by more than 2, so the two grams cannot be matched.
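A small sketch of the positional test, assuming a gram is stored together with its start position and that two occurrences may correspond only if their positions differ by at most the edit-distance threshold k. The names `positional_qgrams` and `may_correspond` are illustrative, and positions here are 0-based.

```python
def positional_qgrams(s, q=2):
    """Return (gram, start position) pairs."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def may_correspond(pg_s, pg_t, k):
    """True if the two positional grams are equal and at most k positions apart."""
    (gram_s, pos_s), (gram_t, pos_t) = pg_s, pg_t
    return gram_s == gram_t and abs(pos_s - pos_t) <= k


slide_pair_ok = may_correspond(("ab", 1), ("ab", 12), 2)
```

For the slide's pair the positions are 11 apart, so the shared gram does not count toward the threshold.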
34 Filter tree: root; length level (1, 2, 3, …, n); gram level (aa, ab, …, zy, zz); position level (1, 2, …, m); inverted lists at the leaves.
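One possible in-memory shape for such a tree is nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. This is a sketch under that assumption, not the authors' data structure. Positions are 0-based, and `build_filter_tree` and `lists_for` are hypothetical names.

```python
def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> ascending list of string ids."""
    tree = {}
    for sid, s in enumerate(strings):
        by_gram = tree.setdefault(len(s), {})
        for pos in range(len(s) - q + 1):
            gram = s[pos:pos + q]
            by_gram.setdefault(gram, {}).setdefault(pos, []).append(sid)
    return tree


def lists_for(tree, query_len, gram, pos, k):
    """Collect the leaf lists that survive the length and positional filters."""
    out = []
    for length in range(query_len - k, query_len + k + 1):
        by_pos = tree.get(length, {}).get(gram, {})
        for p in range(pos - k, pos + k + 1):
            if p in by_pos:
                out.append(by_pos[p])
    return out


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
# Gram "ic" sits at position 2 of the query "stick" (length 5); k = 1.
ic_lists = lists_for(tree, 5, "ic", 2, 1)
```

Without filters the gram "ic" has a single list (0, 1, 2, 4); in the tree it is split across several leaves, so a query has to merge more, smaller lists.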
35 Surprising experimental results (DBLP): running time of DivideSkip with no filter (ms), with the length filter (ms), and with the length and positional filters (ms). Why does adding the positional filter increase the running time?
36 Filters fragment inverted lists. Applying filters, then merging. Cost: (1) tree traversal, (2) more merging. Saving: reduced total list size.
37 Conclusion: three new merge algorithms; we run faster; interesting finding: do not abuse filters!
38 Related work: approximate string matching [Navarro 2001]; variable-length grams [Li et al 2007]; fuzzy lookup in
39 References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database almost for free. In VLDB, 2001.
40 References (continued)
4. [Li 2007] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
5. [Navarro 2001] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2001.
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In ACM SIGMOD, 2004.