
Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.


1 Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu and Yiming Lu, Univ. of California, Irvine, USA. ICDE ’08. 30 Mar 2011, Taewhi Lee

2 Outline
• Motivation
• Preliminaries
• Merging Algorithms
  – Previous Two Algorithms
  – Proposed Three Algorithms
• Experiments
• Conclusion

3 Motivation: Data Cleaning and Integration
Name | Phone | Addr
Brad Pitt | … | …
Samuel L. Jackson | … | …
Keanu Reeves | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …
Birth | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | Keanu Reeves
… | … | Angelina Jolie
… | … | Samuel Jackson
No exact match!
• Real-world data is dirty
  – Inconsistent representations
  – Typos

4 Motivation: Query Relaxation
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Iron man | 2008 | Sci-Fi
Arnold Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | The man | 2005 | Crime
Query: find movies starring "Schwarrzenger"
• Allowing errors in queries or data

5 Approximate String Searches
• Query q: Schwarrzenger
• Collection of strings s (the Star column): Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, …
• Search output: strings s that satisfy Sim(q,s) ≤ δ
• Sim functions: edit distance, Jaccard coefficient, cosine similarity, …

7 Approximate Query Answering
• Main idea: use q-grams as signatures for a string, e.g. the 2-grams of "stick" = {st, ti, ic, ck}
• Intuition: similar strings share a certain number of grams
• An inverted index on grams supports finding all data strings that share enough grams with a query
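
The gram signature on this slide can be sketched in a few lines of Python. The function name `qgrams` is an illustrative choice, not something taken from the slides or the paper.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


# The slide's example: the 2-grams of "stick".
print(qgrams("stick"))
```

A string shorter than q yields an empty list here; implementations that pad the string with boundary characters would behave differently.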

8 Gram Inverted Index
• Convert strings to gram inverted lists
id | string
0 | rich
1 | stick
2 | stich
3 | stuck
4 | static
2-gram | inverted list (string ids)
at | 4
ch | 0, 2
ck | 1, 3
ic | 0, 1, 2, 4
ri | 0
st | 1, 2, 3, 4
ta | 4
ti | 1, 2, 4
tu | 3
uc | 3
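
A minimal sketch of how such a gram inverted index could be built, assuming string ids are simply positions in the input list. `build_gram_index` is a hypothetical helper name; each string contributes its id at most once per gram.

```python
from collections import defaultdict


def build_gram_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps a string from appearing twice in one list.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)  # ids arrive in increasing order
    return dict(index)


index = build_gram_index(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because strings are processed in id order, every inverted list comes out sorted, which is what the merge algorithms later in the talk rely on.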

9 Approximate Query Example
• Query: stick, with ed(s,q) ≤ 1
• Grams of the query: {st, ti, ic, ck}
• Inverted lists to merge: st → 1, 2, 3, 4; ti → 1, 2, 4; ic → 0, 1, 2, 4; ck → 1, 3
• Strings with merge count ≥ 2 are the candidates
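
The query example can be traced with a short sketch. It assumes the usual count-filter bound for edit distance, T = (number of query grams) − k·q, which gives 4 − 1·2 = 2 for this query and matches the slide's "merge count ≥ 2". The names `search` and `edit_distance` are illustrative, and the final verification step is an assumption about how candidates would be checked, not something shown on the slide.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (candidate ids, verified answers) for ed(s, query) <= k."""
    index = {}
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index.setdefault(g, []).append(sid)
    grams = qgrams(query, q)
    T = len(grams) - k * q  # assumed count-filter threshold
    if T <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune anything
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= T)
    answers = [strings[sid] for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search(strings, "stick", k=1))
```

On the slide's data the merge counts are 1, 4, 3, 2, 3 for ids 0 to 4, so ids 1 to 4 pass the filter and "static" is then rejected by verification.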

10 T-Occurrence Problem
• Find elements whose occurrences ≥ T among the inverted lists
• Each list is in ascending order
• The answer is found by merging the lists

11 T-Occurrence Problem Example
• T=4
• List elements: 1 3 5 10 13 10 13 15 5 7 13 15
• Result: {13}

13 Five Merge Algorithms
• Previous: HeapMerger [Sarawagi, SIGMOD ’04], MergeOpt [Sarawagi, SIGMOD ’04]
• New: ScanCount, MergeSkip, DivideSkip

14 Heap-Based Algorithm (HeapMerger)
• Count the number of occurrences of each element using a min-heap
• Maintain the frontiers of the lists as a heap; after popping a frontier, push the next element of that list to the heap

15 Heap-Based Algorithm Example (T=4): the heap holds the list frontiers {1, 5, 10, 13, 15}; popping 1 pushes the next element of its list, giving {3, 5, 10, 13, 15}.

16 Heap-Based Algorithm Example (T=4): popping 3 pushes the next element of the same list, giving {5, 5, 10, 13, 15}.

17 Heap-Based Algorithm Example (T=4): the two 5s are popped together (2 occurrences, fewer than T) and replaced by the next elements of their lists, giving {7, 10, 10, 13, 15}.
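
The heap-based merge walked through above can be sketched as follows. The five lists are a reconstruction inferred from the heap frontiers on the example slides (1, 10, 5, 13, 15) and should be read as an assumption, as should the function name `heap_merge`. Each list is assumed to be strictly ascending.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur in at least T of the sorted lists."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier equal to the current minimum and count them.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Assumed reconstruction of the slide's lists.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.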

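18 MergeOpt Algorithm
• Long lists: the T-1 longest lists; short lists: the rest
• Merge the short lists, then look up each candidate in the long lists by binary search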

19 MergeOpt Algorithm Example
• T=4
• Long Lists: 3, Short Lists: 2
• List elements: 1 3 5 10 13 10 13 15 5 7 13 15
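
A rough sketch of the MergeOpt idea, under two stated assumptions: the short lists are counted with a plain dictionary instead of the heap used in the original algorithm, and the five example lists are a reconstruction inferred from the slides. `merge_opt` and `contains` are illustrative names.

```python
from bisect import bisect_left


def contains(sorted_list, x):
    """Binary-search membership test on an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Elements occurring in >= T lists, probing the T-1 longest lists lazily."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # Any answer must occur at least once in the short lists, because
    # the long lists alone can contribute at most T-1 occurrences.
    counts = {}
    for lst in short_lists:
        for x in lst:
            counts[x] = counts.get(x, 0) + 1
    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


# Assumed reconstruction of the slide's lists: 3 long lists, 2 short lists.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))
```

Only elements that appear in a short list are ever looked up, so most of the long lists' contents are never touched.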

20 Five Merge Algorithms
• Previous: HeapMerger, MergeOpt
• New: ScanCount, MergeSkip, DivideSkip

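21 ScanCount Algorithm (T=4)
• Keep an array of counters indexed by string ID (1, 2, 3, …, 13, 14, 15), all initially 0
• Scan every list and increment the counter of each ID seen by 1
• An ID whose counter reaches T is a result: here ID 13 reaches count 4 (Result!), while ID 15 only reaches 2
• List elements: 1 3 5 10 13 10 13 15 5 7 13 15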
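
ScanCount is simple enough to sketch directly. The parameter `num_ids` (the size of the counter array) and the example lists are assumptions made for illustration; the lists are a reconstruction inferred from the slides.

```python
def scan_count(lists, T, num_ids):
    """Elements occurring in >= T lists, using one counter per string id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it first reaches T
                result.append(sid)
    result.sort()
    return result


# Assumed reconstruction of the slide's lists; ids range over 0..15.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_ids=16))
```

No heap and no sorted order are needed, at the price of a counter array as large as the string collection.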

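22 MergeSkip Algorithm
• Maintain the list frontiers in a min-heap
• When the top element occurs fewer than T times, pop T-1 elements from the heap in total
• Each popped list jumps to its smallest element greater than or equal to the new heap top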

23 MergeSkip Algorithm Example (T=4) [Figure: min-heap of list frontiers; the popped lists jump ahead to their next element greater than or equal to the new heap top.]
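
The skipping step can be sketched as below. This is a sketch under stated assumptions rather than the authors' implementation: the jump is done with `bisect_left`, the function name `merge_skip` is illustrative, and the sample lists are the ones reconstructed from the earlier heap example, not the lists drawn on this slide.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements occurring in >= T sorted lists, skipping hopeless elements."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than T live lists, no element can reach T occurrences.
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Pop further frontiers until T-1 lists have been removed in total.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        # Anything below target can occur only in the T-1 popped lists,
        # so each of them may jump straight to its first element >= target.
        for _, i, pos in popped:
            j = bisect_left(lists[i], target, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))
```

On these lists the first round pops 1, 5 and 10, sees 13 as the new heap top, and moves all three popped lists directly to 13, so 3, 7 and the second 10 are never pushed onto the heap.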

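24 DivideSkip Algorithm
• MergeOpt + MergeSkip
• Split the lists into L long lists and the remaining short lists
• Short lists: run MergeSkip with threshold T-L
• Long lists: look up each candidate by binary search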
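
Combining the two ideas might look like the sketch below. The number of long lists `L` is passed in as a plain parameter here, because the slides do not say how it is chosen; `merge_skip_counts` is a hypothetical variant of MergeSkip that also returns occurrence counts, and the sample lists are a reconstruction inferred from the earlier slides.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant: (element, count) for elements in >= T of the lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            out.append((top, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        for _, i, pos in popped:
            j = bisect_left(lists[i], target, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return out


def divide_skip(lists, T, L):
    """Elements in >= T lists: MergeSkip on short lists, lookups on L long lists."""
    if not 0 <= L < T:
        raise ValueError("L must satisfy 0 <= L < T")
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    # An answer needs at least T-L occurrences among the short lists.
    for x, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:
            j = bisect_left(lst, x)
            if j < len(lst) and lst[j] == x:
                count += 1
        if count >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, L=2))
```

A larger L leaves fewer lists to merge but adds a binary search per candidate in every long list, which is the tradeoff the experiments section refers to.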

26 Experiment Settings
• 3-gram, edit distance
• Ubuntu OS, GNU C++
• 2.13GHz dual core CPU, 2GB RAM
Dataset | # of records | avg. size of gram list, # of unique grams
DBLP (paper titles) | 274,788 (17.8MB) | 6759,940
IMDB (actor names) | 1,199,299 (22MB) | 1934,737
WEB Corpus (English words) | 2,000,000 (48.3MB) | 2681,620

27 Experiment Results: Query Time (DBLP). DivideSkip has the best (lowest) query time.

28 Experiment Results: # of Visited Strings (DBLP). DivideSkip is the best, visiting the fewest.

29 Experiment Results: Tradeoff in DivideSkip. Short lists are merged; long lists are handled by lookup.

31 Conclusion
• Three new algorithms: ScanCount, MergeSkip, DivideSkip
• The new algorithms run faster than the previous ones

32 Thank You! Any questions or comments?

