Published by Owen Wade. Modified over 8 years ago.
1
Efficient Merging and Filtering Algorithms for Approximate String Searches
Chen Li, Jiaheng Lu and Yiming Lu
Univ. of California, Irvine, USA
ICDE ’08
30 Mar 2011, Taewhi Lee
2
Outline
Motivation
Preliminaries
Merging Algorithms
– Previous Two Algorithms
– Proposed Three Algorithms
Experiments
Conclusion
3
Motivation: Data Cleaning and Integration

Name                   Phone  Addr
Brad Pitt              …      …
Samuel L. Jackson      …      …
Keanu Reeves           …      …
Angelina Jolie         …      …
Arnold Schwarzenegger  …      …

Birth  Age  Name
…      …    Brad Pitt
…      …    Arnold Schwarzeneger
…      …    Keanu Reeves
…      …    Angelina Jolie
…      …    Samuel Jackson

No exact match!
Real-world data is dirty
– Inconsistent representations
– Typos
4
Motivation: Query Relaxation

Star                   Title           Year  Genre
Keanu Reeves           The Matrix      1999  Sci-Fi
Samuel Jackson         Iron man        2008  Sci-Fi
Arnold Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson         The man         2005  Crime

Query: find movies starring Schwarrzenger
Allow errors in queries or data
5
Approximate String Searches
Query q: Schwarrzenger
Collection of strings s (Star: Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, …)
Search output: strings s that satisfy Sim(q,s) ≤ δ
Sim functions: edit distance, Jaccard coefficient, cosine similarity, …
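To make the search condition concrete, here is a minimal sketch that uses edit distance as the Sim function and scans the whole collection. The names `edit_distance` and `approximate_search` are illustrative and do not come from the slides; the linear scan is only a baseline, not the indexed method the talk goes on to describe.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def approximate_search(query, strings, delta):
    """Baseline: compare the query against every string in the collection."""
    return [s for s in strings if edit_distance(query, s) <= delta]


print(approximate_search("stick", ["rich", "stick", "stich", "stuck", "static"], 1))
# ['stick', 'stich', 'stuck']
```

Scanning every string costs one distance computation per record, which is what the gram-based index is meant to avoid.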
7
Approximate Query Answering
Main idea: use q-grams as signatures for a string
Example: the 2-grams of "stick" are {st, ti, ic, ck}
Intuition: similar strings share a certain number of grams
An inverted index on grams supports finding all data strings that share enough grams with a query
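A short sketch of gram extraction, assuming plain overlapping substrings with no padding characters at the string boundaries (the slide does not say whether padding is used). The function name `qgrams` is illustrative.

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("stick"))  # ['st', 'ti', 'ic', 'ck']
```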
8
Gram Inverted Index
Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0 2
ck      1 3
ic      0 1 2 4
ri      0
st      1 2 3 4
ta      4
ti      1 2 4
tu      3
uc      3
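The index on this slide can be rebuilt with a few lines. This is a sketch under two assumptions that the slide does not state: a string ID is recorded once per gram even if the gram repeats inside the string, and IDs are assigned in input order so each list comes out ascending. `build_index` is an illustrative name.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each q-gram to the ascending list of IDs of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):   # set(): record each ID once per gram
            index[gram].append(sid)
    return dict(index)


index = build_index(["rich", "stick", "stich", "stuck", "static"])
print(index["ic"])  # [0, 1, 2, 4]
print(index["st"])  # [1, 2, 3, 4]
```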
9
Approximate Query Example
Query: stick, with ed(s,q) ≤ 1
Grams of the query: {st, ti, ic, ck}
Merge the inverted lists of these grams; strings with count ≥ 2 are candidates

st  1 2 3 4
ti  1 2 4
ic  0 1 2 4
ck  1 3
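The threshold of 2 follows from the standard count-filter bound: one edit operation can destroy at most q grams, so a string within edit distance k of the query must share at least (number of query grams) - k·q grams with it, here 4 - 1·2 = 2. The sketch below applies that bound to the example index; `candidates` is an illustrative name and the index is typed in by hand from the slide.

```python
from collections import Counter

STRINGS = ["rich", "stick", "stich", "stuck", "static"]
INDEX = {
    "at": [4], "ch": [0, 2], "ck": [1, 3], "ic": [0, 1, 2, 4], "ri": [0],
    "st": [1, 2, 3, 4], "ta": [4], "ti": [1, 2, 4], "tu": [3], "uc": [3],
}


def candidates(query, k, q=2):
    """Return (threshold, IDs sharing at least `threshold` grams with the query)."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    threshold = len(grams) - k * q   # count-filter lower bound
    counts = Counter()
    for gram in grams:
        counts.update(INDEX.get(gram, []))
    return threshold, sorted(sid for sid, c in counts.items() if c >= threshold)


threshold, ids = candidates("stick", k=1)
print(threshold, [STRINGS[i] for i in ids])
# 2 ['stick', 'stich', 'stuck', 'static']
```

The filter only produces candidates. "static" shares three grams with the query yet is more than one edit away, so each candidate still has to be verified with the real distance function.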
10
T-Occurrence Problem
Find elements whose occurrences ≥ T among inverted lists
Each list is in ascending order
Merge the lists to count occurrences
11
T-Occurrence Problem Example
T=4
Lists (ascending):
1 3 5 10 13
10 13 15
5 7 13
13
15
Result: {13}
13
Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD ’04], MergeOpt [Sarawagi, SIGMOD ’04]
New: ScanCount, MergeSkip, DivideSkip
14
Heap-Based Algorithm (HeapMerger)
Maintain the frontiers of the lists as a min-heap
After a frontier is popped, push the next element of its list to the heap
Count the number of occurrences of each element as it is popped
15
Heap-Based Algorithm Example
T=4, same lists as in the T-occurrence example
Heap of frontiers: {1, 5, 10, 13, 15}
Pop 1: {5, 10, 13, 15}
Push 3, the next element of the same list: {3, 5, 10, 13, 15}
16
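Heap-Based Algorithm Example (continued)
T=4, same lists as in the T-occurrence example
Heap of frontiers: {3, 5, 10, 13, 15}
Pop 3: {5, 10, 13, 15}
Push 5, the next element of the same list: {5, 5, 10, 13, 15}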
17
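Heap-Based Algorithm Example (continued)
T=4, same lists as in the T-occurrence example
Heap of frontiers: {5, 5, 10, 13, 15}
Pop both 5s (count 2 < T): {10, 13, 15}
Push 10 and 7, the next elements of those two lists: {7, 10, 10, 13, 15}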
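The walkthrough above can be sketched with Python's `heapq`. This is an illustrative sketch rather than the code evaluated in the talk. `LISTS` is the example data as reconstructed from the slides (five lists, with 13 on four of them), and each list is assumed to be strictly ascending.

```python
import heapq

# Example lists reconstructed from the slides (an assumption).
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]


def heap_merge(lists, T):
    """Return the IDs that occur on at least T of the ascending lists."""
    # Heap entries are (current ID, list number, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier equal to the smallest ID and advance those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


print(heap_merge(LISTS, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to reduce.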
18
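MergeOpt Algorithm
Long lists: the T-1 longest lists
Short lists: the remaining lists, which are merged first
Binary search: look up each element found on the short lists in the long lists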
19
MergeOpt Algorithm Example
T=4, same lists as in the T-occurrence example
Long lists: 3 (= T-1)
Short lists: 2
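A sketch of the long/short split on the example data. One simplification is swapped in and should be read as such: the short lists are counted with a `Counter` instead of being merged with a heap. `LISTS` is the example data as reconstructed from the slides, and `merge_opt` and `contains` are illustrative names.

```python
from bisect import bisect_left
from collections import Counter

# Example lists reconstructed from the slides (an assumption).
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return IDs on at least T lists, probing the T-1 longest lists by binary search."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # The long lists can contribute at most T-1 occurrences, so every answer
    # must appear on at least one short list.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)   # simplification: Counter instead of a heap merge
    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


print(merge_opt(LISTS, 4))  # [13]
```

On the example, the short lists yield the candidates 13 and 15. The lookups find 13 on all three long lists (total 4) and 15 on only one (total 2).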
20
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
21
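ScanCount Algorithm
Keep an array of counts indexed by string ID, initialized to 0
Scan each list and increment the count of every ID on it by 1
Example with T=4, same lists as in the T-occurrence example: the count of ID 13 reaches 4, so 13 is a result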
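A minimal sketch of the counting array. `num_strings` (the size of the ID space) is a parameter introduced here for illustration, and `LISTS` is the example data as reconstructed from the slides.

```python
# Example lists reconstructed from the slides (an assumption).
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]


def scan_count(lists, T, num_strings):
    """Return IDs on at least T lists, using one counter per string ID."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:   # reported once, when the count first reaches T
                result.append(sid)
    return sorted(result)


print(scan_count(LISTS, 4, num_strings=16))  # [13]
```

No heap and no sorted order are needed, at the price of an array as large as the collection.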
22
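MergeSkip Algorithm
Keep the frontiers of the lists in a min-heap
If the top element occurs fewer than T times, pop T-1 elements from the heap
On each popped list, jump to the smallest element greater than or equal to the new top of the heap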
23
MergeSkip Algorithm Example
T=4
[Figure: inverted lists and the min-heap of their frontiers (1, 5, 10, 13, 15); the popped lists jump ahead to the new top of the heap]
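The skipping step can be sketched as follows. This is an illustration written from the slide's description, not the code from the talk. The jump is implemented with `bisect_left`, and `LISTS` is the example data as reconstructed from the earlier slides (the lists in this slide's figure may differ).

```python
import heapq
from bisect import bisect_left

# Example lists reconstructed from the earlier slides (an assumption).
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]


def merge_skip(lists, T):
    """Return IDs on at least T ascending lists, skipping IDs that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:   # fewer than T live lists cannot produce an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:   # advance each popped list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop further frontiers until T-1 are out in total.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]   # new top of the heap
            for _, i, pos in popped:
                # Jump to the smallest element >= target on each popped list.
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


print(merge_skip(LISTS, 4))  # [13]
```

Skipping is safe because any ID below the new top can appear only on the T-1 popped lists, which is one occurrence short of T. On the example, 3 and 7 are never pushed to the heap.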
24
DivideSkip Algorithm
MergeOpt + MergeSkip
Long lists: the L longest lists, probed by binary search
Short lists: the remaining lists, merged by MergeSkip with threshold T-L
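Combining the two ideas might look like the sketch below. It assumes 0 ≤ L < T and takes L as a caller-supplied parameter; how to choose L is not shown here. `merge_skip_counts` is an illustrative MergeSkip variant that also returns each candidate's count on the short lists, and `LISTS` is the example data as reconstructed from the slides.

```python
import heapq
from bisect import bisect_left

# Example lists reconstructed from the slides (an assumption).
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]


def merge_skip_counts(lists, T):
    """MergeSkip variant: {ID: occurrences} for IDs on at least T ascending lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = {}
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            found[top] = len(popped)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]
            for _, i, pos in popped:
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return found


def divide_skip(lists, T, L):
    """Merge the short lists with threshold T-L, then probe the L longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    for x, count in sorted(merge_skip_counts(short_lists, T - L).items()):
        for lst in long_lists:
            i = bisect_left(lst, x)   # binary search in a long list
            if i < len(lst) and lst[i] == x:
                count += 1
        if count >= T:
            result.append(x)
    return result


print(divide_skip(LISTS, T=4, L=2))  # [13]
```

With L=0 this degenerates to plain MergeSkip, and with L=T-1 it uses the same split as MergeOpt.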
26
Experiment Settings
3-gram, edit distance
Ubuntu OS, GNU C++
2.13GHz dual core CPU, 2GB RAM

Dataset                     # of records        avg. size of gram list  # of unique grams
DBLP (paper titles)         274,788 (17.8MB)    67                      59,940
IMDB (actor names)          1,199,299 (22MB)    19                      34,737
WEB Corpus (English words)  2,000,000 (48.3MB)  26                      81,620
27
Experiment Results: Query Time (DBLP)
DivideSkip is the best: it has the lowest query time
28
Experiment Results: # of Visited Strings (DBLP)
DivideSkip is the best: it visits the fewest strings
29
Experiment Results: Tradeoff in DivideSkip
Short lists: merge cost
Long lists: lookup cost
Moving more lists to the long side lowers the merge cost but raises the lookup cost
31
Conclusion
Three new algorithms: ScanCount, MergeSkip, DivideSkip
They run faster than the two previous algorithms
32
Thank You! Any questions or comments?