Efficient Merging and Filtering Algorithms for Approximate String Searches. Jiaheng Lu, University of California, Irvine. Joint work with Chen Li and Yiming Lu.

Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.
Presentation transcript:

1 Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li and Yiming Lu

2 Example: a movie database
Star             Title           Year  Genre
Keanu Reeves     The Matrix      1999  Sci-Fi
Samuel Jackson   Iron man        2008  Sci-Fi
Schwarzenegger   The Terminator  1984  Sci-Fi
Samuel Jackson   The man         2006  Crime
Find movies starring Schwarrzenger.
Chen Li, Jiaheng Lu, Yiming Lu

3 Data may not be clean
Data integration and cleaning:
Relation R (Star)   Relation S (Star)
Keanu Reeves        Keanu Reeves
Samuel Jackson      Samuel L. Jackson
Schwarzenegger      Schwarzenegger

4 Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search output: strings s for which Sim(q,s) satisfies the threshold δ
Sim functions: edit distance, Jaccard coefficient, and cosine similarity

5 Outline
Problem motivation
Preliminaries
  Grams
  Inverted lists
Merge algorithms
Filtering techniques
Conclusion

6 String Grams
q-grams: all substrings of length q
For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
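
The gram decomposition on this slide can be sketched in a few lines. This is a minimal illustration, assuming plain overlapping q-grams without any padding characters; the function name `qgrams` is hypothetical and not taken from the slides.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


# The slide's example: the 2-grams of "universal".
grams = qgrams("universal")
print(grams)
```

A string shorter than q yields no grams under this convention.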

7 Inverted lists
Convert strings to gram inverted lists.
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
Grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
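
One way to build such gram inverted lists is sketched below. It assumes string ids are assigned in input order starting at 0 and that each string contributes at most one posting per gram; `build_inverted_lists` is a hypothetical helper name, not one used in the talk.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each gram to the ascending list of ids of the strings containing it."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        # set() keeps one posting per string even if a gram repeats (assumption).
        for gram in set(qgrams(s, q)):
            lists[gram].append(sid)
    return dict(lists)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are visited in increasing order, every list comes out sorted, which the merge algorithms rely on.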

8 Main Example
Query: stick, with grams (st,ti,ic,ck), and ed(s,q) ≤ 1
Data:
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
Grams (inverted lists):
ck: 1,3
ic: 0,1,2,4
st: 1,2,3,4
ta: 4
ti: 1,2,4
…
Merge: string ids with count >= 2 are the candidates.
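
The merge step of this example can be reproduced with a short sketch. The slide only states "count >= 2"; the sketch derives that threshold from the standard count-filter bound for edit distance, T = (|query| - q + 1) - k·q, which is an assumption about where the slide's number comes from. Function names are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    """Overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_threshold(query, k, q=2):
    """Minimum number of shared grams for edit distance <= k (count filter)."""
    return (len(query) - q + 1) - k * q


def candidates(query, k, inverted_lists, q=2):
    """Ids appearing on at least T of the query's gram lists."""
    t = count_threshold(query, k, q)
    counts = Counter()
    for gram in qgrams(query, q):
        counts.update(inverted_lists.get(gram, []))
    return sorted(sid for sid, c in counts.items() if c >= t)


# Inverted lists as shown on the slide.
inverted_lists = {
    "ck": [1, 3],
    "ic": [0, 1, 2, 4],
    "st": [1, 2, 3, 4],
    "ta": [4],
    "ti": [1, 2, 4],
}
threshold = count_threshold("stick", k=1)
result = candidates("stick", 1, inverted_lists)
print(threshold, result)
```

The ids returned are only candidates: each one still has to be verified against the actual edit distance.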

9 Problem definition
Merge: find the elements whose number of occurrences on the lists is at least T.
Each list is sorted in ascending order.

10 Example
T = 4
Result: [figure not preserved in the transcript]

11 Contributions
Three new merge algorithms
New finding: filters must be used wisely

12 Outline
Problem motivation
Preliminaries
Merge algorithms
  Two previous algorithms
  Our three proposed algorithms
Filtering techniques
Conclusion

13 Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip

14 Heap-based Algorithm
Push the first element of every list onto a min-heap.
Count the number of occurrences of each element by repeatedly popping the heap and pushing the next element of the popped list.
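
A minimal sketch of this heap-based merge follows, assuming each list is sorted ascending and holds distinct ids. The name `heap_merge` and the sample lists (the four gram lists of the query "stick" from the main example) are chosen here for illustration.

```python
import heapq


def heap_merge(lists, t):
    """Return the ids that occur on at least t of the sorted lists."""
    # Heap entries: (id at the list's frontier, list index, position in list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every copy of the smallest id and advance those lists.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= t:
            result.append(current)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(heap_merge(lists, 2))
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.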

15 MergeOpt Algorithm
Long lists: the T-1 longest lists, probed with binary search
Short lists: the remaining lists, which are merged
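
The split into long and short lists can be sketched as below. This is a simplified reading of MergeOpt, not the original implementation: it merges the short lists with `heapq.merge`, then probes each long list by binary search, and omits any early-termination tricks. An id can occur at most T-1 times on the long lists, so every result must appear on at least one short list. All names are hypothetical.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, t):
    """Ids on at least t lists; the t-1 longest lists are only probed."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:t - 1], by_len[t - 1:]
    result = []
    # heapq.merge streams the short lists in ascending id order.
    for x, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, x) for lst in long_lists)
        if count >= t:
            result.append(x)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(merge_opt(lists, 4))
```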

16 Example of MergeOpt [Sarawagi et al 2004]
Count threshold T: 4
Long lists: 3
Short lists: 2

17 Can we run faster?

18 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

19 ScanCount Example
Count threshold T: 4
Keep an array with the number of occurrences of each string id (1, 2, 3, …).
Scan the lists and increment the count of each id by 1; an id whose count reaches T is a result.
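
ScanCount as described on this slide fits in a few lines. The sketch assumes ids are integers in the range 0 to num_strings - 1 and that an id appears at most once per list; `scan_count` and `num_strings` are illustrative names.

```python
def scan_count(lists, t, num_strings):
    """Return the ids that occur on at least t lists, using a count array."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report an id exactly once, at the moment it reaches the threshold.
            if counts[sid] == t:
                result.append(sid)
    return sorted(result)


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(scan_count(lists, 3, num_strings=5))
```

No heap and no ordering of the lists is needed; the price is an array as large as the string collection.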

20 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

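21 MergeSkip algorithm
Keep the frontier of every list in a min-heap.
If the top element occurs fewer than T times, pop until T-1 entries have been popped in total.
Jump: move each popped list forward to its smallest element greater than or equal to the new top of the heap.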
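
The pop-and-jump step can be sketched as follows. This is an interpretation of the slide under stated assumptions: lists are sorted ascending with distinct ids, t is at least 1, and the jump is done with `bisect_left`. The function name `merge_skip` is hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, t):
    """Return the ids on at least t of the sorted lists, skipping hopeless ids."""
    # Heap entries: (id at the list's frontier, list index, position in list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than t live lists, no id can reach the threshold any more.
    while len(heap) >= t:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= t:
            result.append(top)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop more entries so that t-1 lists have been popped in total.
            while len(popped) < t - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]
            # Jump each popped list to its first id >= the new heap top.
            for _, i, pos in popped:
                j = bisect_left(lists[i], target, pos)
                if j < len(lists[i]):
                    heapq.heappush(heap, (lists[i][j], i, j))
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(merge_skip(lists, 3))
```

Any id smaller than the new heap top can only occur on the T-1 popped lists, so jumping over it cannot lose a result.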

22 Example of MergeSkip
Count threshold T: 4
Figure labels: minHeap, Jump

23 Skip is safe
The skipped elements can occur only on the popped lists, so their number of occurrences is at most T-1 and they cannot be results.

24 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

25 DivideSkip Algorithm
Long lists: binary search
Short lists: MergeSkip
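
Combining the two ideas might look like the sketch below. It assumes the short lists are merged by a MergeSkip variant with the reduced threshold T - L, where L is the number of long lists, and that each surviving id is then looked up in the long lists by binary search. The parameter `num_long` and all function names are hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, t):
    """Return (id, count) pairs for ids on at least t of the sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while len(heap) >= t:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= t:
            found.append((top, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while len(popped) < t - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]
            for _, i, pos in popped:
                j = bisect_left(lists[i], target, pos)
                if j < len(lists[i]):
                    heapq.heappush(heap, (lists[i][j], i, j))
    return found


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def divide_skip(lists, t, num_long):
    """Ids on at least t lists: MergeSkip on short lists, lookups on long ones."""
    num_long = max(0, min(num_long, t - 1))  # keep the short-list threshold >= 1
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:num_long], by_len[num_long:]
    result = []
    for x, count in merge_skip(short_lists, t - num_long):
        count += sum(contains(lst, x) for lst in long_lists)
        if count >= t:
            result.append(x)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(divide_skip(lists, 3, num_long=1))
```

With `num_long` set to 0 this reduces to plain MergeSkip.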

26 How many lists are treated as long lists?
Short lists: merge
Long lists: lookup

27 Decide L value
A good balance in the tradeoff:
# of long lists L = T / (μ log M + 1)
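
The formula can be evaluated as sketched here. The slide does not define M or μ; the sketch assumes M is the length of the longest inverted list and μ is a tunable coefficient, takes the logarithm to base 2, rounds down, and clamps the result to at most T-1. All of these are assumptions, and the function name is hypothetical.

```python
import math


def num_long_lists(t, longest_len, mu):
    """L = T / (mu * log2(M) + 1), rounded down and kept within [0, T-1]."""
    if longest_len < 1:
        return 0  # no postings at all, so nothing to treat as a long list
    l = int(t / (mu * math.log2(longest_len) + 1))
    return max(0, min(l, t - 1))


# Hypothetical values: T = 4, longest list of 4 ids, mu = 0.5.
print(num_long_lists(4, 4, 0.5))
```

A larger μ or a longer list makes each lookup relatively more expensive and therefore lowers the number of long lists.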

28 Experimental data sets
DBLP data
IMDB data
Google Web corpus

29 Performance (DBLP)
DivideSkip performs best.

30 # of accessed elements (DBLP)
DivideSkip accesses the fewest elements.

31 Outline
Problem motivation
Preliminaries
Merge algorithms
Filtering techniques
  Length, positional filters
  Filter tree
Conclusion and future work

32 Length Filtering
Ed(s,t) ≤ 2
s: length 19
t: length 10
The lengths differ by 9, which is more than 2, so the pair is pruned by length only!
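
The length filter rests on the fact that each edit operation changes the length of a string by at most one. A minimal sketch, with a hypothetical function name and placeholder strings of the lengths shown on the slide:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len(s) - len(t)) <= k


# Placeholder strings with the slide's lengths (19 and 10); contents are made up.
s = "a" * 19
t = "b" * 10
print(passes_length_filter(s, t, 2))
```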

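33 Positional Filtering
Ed(s,t) ≤ 2
s contains the gram (ab,1); t contains the gram (ab,12).
The two positions differ by more than 2, so these occurrences of ab cannot be matched to each other.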
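
A positional gram pairs each gram with its start position, and two equal grams may correspond only if their positions differ by at most the edit-distance threshold. The sketch below uses 1-based positions to match the slide; the function names are hypothetical.

```python
def positional_qgrams(s, q=2):
    """(gram, position) pairs with 1-based positions, as on the slide."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(pg1, pg2, k):
    """Same gram, and positions no more than k apart."""
    (g1, p1), (g2, p2) = pg1, pg2
    return g1 == g2 and abs(p1 - p2) <= k


print(positional_qgrams("stick"))
print(grams_can_match(("ab", 1), ("ab", 12), 2))
```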

34 Filter tree
root
Length level: 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list (at each leaf)
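
The three levels of the tree can be mimicked with nested dictionaries. This is only an illustrative sketch, assuming the levels are ordered length, gram, position as on the slide, with 1-based positions; `build_filter_tree` and `lookup` are hypothetical names.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> list of string ids (a leaf inverted list)."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lookup(tree, gram, query_len, pos, k):
    """Collect ids from every leaf that passes the length and position filters."""
    ids = set()
    for length in range(query_len - k, query_len + k + 1):
        positions = tree.get(length, {}).get(gram, {})
        for p in range(pos - k, pos + k + 1):
            ids.update(positions.get(p, []))
    return sorted(ids)


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(lookup(tree, "ic", query_len=5, pos=3, k=0))
```

One gram of the query now touches several small leaf lists instead of a single long one, which is the fragmentation effect examined in the experiments.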

35 Surprising experimental results (DBLP)
Running time of DivideSkip with: No filter (ms), Length (ms), Length+Pos (ms)
Why does adding the position filter increase the running time?

36 Filters fragment inverted lists
Applying filters before the merge:
Cost: (1) tree traversal, (2) more merging
Saving: reduced total list size

37 Conclusion
Three new merge algorithms
We run faster
Interesting finding: Do not abuse filters!

38 Related work
Approximate string matching [Navarro 2001]
Variable-length grams [Li et al 2007]
Fuzzy lookup in

39 References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006.
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD 2003.
3. [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate String Joins in a Database (Almost) for Free. VLDB 2001.

40 References
4. [Li 2007] C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007.
5. [Navarro 2001] G. Navarro. A Guided Tour to Approximate String Matching. ACM Computing Surveys, 2001.
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal. Efficient Set Joins on Similarity Predicates. ACM SIGMOD 2004.