Published by Samantha Gill. Modified over 11 years ago.
1
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
ICDE 2009, Shanghai
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine
(2) Renmin University of China
2
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
3
Motivation: Data Cleaning
- Real-world data is dirty
  - Typos
  - Inconsistent representations (PO Box vs. P.O. Box)
- Approximately check against a clean dictionary
- Example (source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008): should clearly be Niels Bohr
4
Motivation: Record Linkage

Table 1:
Name                  | Hobbies | Address
Brad Pitt             | …       | …
Forest Whittacker     | …       | …
George Bush           | …       | …
Angelina Jolie        | …       | …
Arnold Schwarzenegger | …       | …

Table 2:
Phone | Age | Name
…     | …   | Brad Pitt
…     | …   | Arnold Schwarzeneger
…     | …   | George Bush
…     | …   | Angelina Jolie
…     | …   | Forrest Whittaker

- We want to link records belonging to the same entity
- No exact match!
- The same entity may have similar representations
  - Arnold Schwarzeneger versus Arnold Schwarzenegger
  - Forrest Whittaker versus Forest Whittacker
5
Motivation: Query Relaxation
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
- Example: actual queries gathered by Google (http://www.google.com/jobs/britney.html)
6
What is Approximate String Search?
String collection (people):
- Brad Pitt
- Forest Whittacker
- George Bush
- Angelina Jolie
- Arnold Schwarzeneger
- …
Queries against the collection:
- Find all entries similar to "Forrest Whitaker"
- Find all entries similar to "Arnold Schwarzenegger"
- Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
The "similar to" predicate can help our described applications!
How can we support these types of queries efficiently?
7
Approximate Query Answering
- Main idea: use q-grams as signatures for a string
- Example: sliding a window of width 2 over "irvine" gives the 2-grams {ir, rv, vi, in, ne}
- Intuition: similar strings share a certain number of grams
- An inverted index on grams supports finding all data strings sharing enough grams with a query
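The sliding-window decomposition can be sketched in a few lines of Python. This is a minimal illustration; the function name `qgrams` is an assumption made for this example and does not come from the slides.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the q-grams of s by sliding a window of width q over it."""
    # A string shorter than q yields no grams (the range is empty).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("irvine", 2))
```

A string of length |s| yields |s| - q + 1 grams, which is the quantity the merging-threshold and the index-size estimate are built on.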
8
Approximate Query Example
Query: irvine, Edit Distance 1
2-grams of the query: {ir, rv, vi, in, ne}
Look up the query grams in the inverted index:
- 2-grams in the index: tf, vi, ir, ef, rv, ne, un, in, …
- Inverted lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}
Filtering:
- Each edit operation can destroy at most q grams
- Answers must share at least T = #grams - ed * q = 5 - 1 * 2 = 3 grams with the query
- T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging; T is called the merging-threshold.
- Candidates = {1, 5, 9}
- Candidates may include false positives, so the real similarity still needs to be computed
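The filter-then-verify pipeline of this slide can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the string collection, the function names, and the choice of a simple counting merge are all assumptions made for the example. Gram occurrences are counted per string so that repeated grams do not break the count bound.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    # gram -> {string id: number of occurrences of the gram in that string}
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, n in Counter(qgrams(s, q)).items():
            index[gram][sid] = n
    return index


def edit_distance(a, b):
    # Standard dynamic program, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidates(index, num_strings, query, q, k):
    t = len(query) - q + 1 - k * q  # merging-threshold T = #grams - ed * q
    if t <= 0:
        return list(range(num_strings))  # "panic": the filter cannot prune
    counts = Counter()
    for gram, n_query in Counter(qgrams(query, q)).items():
        for sid, n_data in index.get(gram, {}).items():
            counts[sid] += min(n_query, n_data)
    return sorted(sid for sid, c in counts.items() if c >= t)


def search(strings, index, query, q, k):
    # Post-processing: remove false positives by computing the real distance.
    cands = candidates(index, len(strings), query, q, k)
    return [sid for sid in cands if edit_distance(query, strings[sid]) <= k]


strings = ["irvine", "irving", "ervine", "marvin", "shanghai"]  # hypothetical data
index = build_index(strings, 2)
print(candidates(index, len(strings), "irvine", 2, 1))
print(search(strings, index, "irvine", 2, 1))
```

In this made-up collection "marvin" shares three grams with the query and so passes the count filter, but its edit distance exceeds 1, which is the kind of false positive the verification step removes.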
9
Motivation: Compression
- The inverted index can be very large compared to the source data
- It may need to fit in memory for fast query processing
- Can we compress the index to fit into a space budget?
Index-Size Estimation
- Each string s produces |s| - q + 1 grams
- For each gram we add one element to its inverted list (a 4-byte uint)
- With ASCII encoding the index is ~4x as large as the original data!
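The estimate can be checked with a back-of-the-envelope calculation. The sketch below counts only inverted-list elements and ignores the gram dictionary itself; the function name and the sample strings are assumptions for illustration.

```python
def estimate_index_bytes(strings, q, id_bytes=4):
    """Estimate inverted-list size: one id_bytes-sized element per gram occurrence."""
    num_entries = sum(max(len(s) - q + 1, 0) for s in strings)
    return num_entries * id_bytes


strings = ["brad pitt", "george bush", "angelina jolie"]  # hypothetical data
data_bytes = sum(len(s) for s in strings)  # 1 byte per character with ASCII
index_bytes = estimate_index_bytes(strings, q=3)
print(data_bytes, index_bytes, round(index_bytes / data_bytes, 2))
```

Each character contributes roughly one gram and each gram costs 4 bytes, so the ratio approaches 4 as the strings get longer relative to q.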
10
Motivation: Related Work
- The IR community developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
- They mainly use delta representation + packing
- If inverted lists are in memory, these techniques always impose decompression overhead
- It is difficult to tune the compression ratio
- How to overcome these limitations in our setting?
11
This Paper
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space / time tradeoff
- Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
12
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
13
Approach 1: Discarding Lists
BEFORE
- 2-grams: tf, vi, ir, ef, rv, ne, un, in, …
- Inverted lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}
AFTER
- 2-grams: tf, vi, ir, ef, rv, ne, un, in, …
- Inverted lists (stringIDs): {5, 9}, {1, 5}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}
- The lists {1, 3, 4, 5, 7, 9}, {1, 2, 3, 9} and {3, 9} are discarded; the grams left without a list are "holes"
14
Effects on Queries
- Need to decrease the merging-threshold T
- Lower T means more false positives to post-process
- If T <= 0 we "panic": we need to scan the entire collection and compute true similarities
- Surprisingly, query processing time can decrease because there are fewer lists to consider
15
Query: shanghai, Edit Distance 1
3-grams of the query: {sha, han, ang, ngh, gha, hai}
3-grams in the index: sha, han, ang, ngh, gha, hai, ter, …, uni, ing (some are hole grams, the rest are regular grams)
- Merging-threshold without holes: T = #grams - ed * q = 6 - 1 * 3 = 3
- Basis: each edit operation can destroy at most q = 3 grams
- Naïve new merging-threshold: T = T - #holes = 0. Panic!
- Can we really destroy at most q = 3 non-hole grams with each edit operation?
- Example edits: delete "a", delete "g". Either can destroy at most 2 non-hole grams with 1 edit operation!
- New merging-threshold: T = 1
- We use dynamic programming to compute a tighter T
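One way to compute such a tighter threshold is sketched below. It is a simplified model and not necessarily the dynamic program used by the authors: each edit operation is treated as hitting a single character position and destroying every gram that covers it, which over-approximates insertions and therefore keeps the threshold safe. The set of hole grams used in the example is hypothetical, chosen to be consistent with the numbers on the slide (3 holes, new T = 1).

```python
def max_destroyed(n, q, is_regular, k):
    """Max number of regular (non-hole) grams hit by k single-position edits.

    Gram j covers character positions j .. j+q-1 of a string of length n.
    """
    num_grams = n - q + 1
    if num_grams <= 0 or k <= 0:
        return 0
    # prefix[j] = number of regular grams among grams 0 .. j-1
    prefix = [0]
    for j in range(num_grams):
        prefix.append(prefix[-1] + (1 if is_regular[j] else 0))

    def gain(prev, p):
        # Regular grams covering position p that do not also cover prev < p.
        lo = max(p - q + 1, prev + 1, 0)
        hi = min(p, num_grams - 1)
        return prefix[hi + 1] - prefix[lo] if hi >= lo else 0

    # best[p] = best total so far with the last edit placed at position p
    best = [gain(-1, p) for p in range(n)]
    answer = max(best)
    for _ in range(k - 1):
        best = [max((best[prev] + gain(prev, p) for prev in range(p)), default=0)
                for p in range(n)]
        answer = max(answer, max(best))
    return answer


def tighter_threshold(query, q, k, holes):
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    is_regular = [g not in holes for g in grams]
    return sum(is_regular) - max_destroyed(len(query), q, is_regular, k)


holes = {"han", "ngh", "hai"}  # hypothetical choice of hole grams
print(tighter_threshold("shanghai", 3, 1, holes))
```

With these holes the regular grams are sha, ang and gha, and no single position is covered by all three, so one edit destroys at most two of them and the threshold stays at 1 instead of dropping to the naïve 0.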
16
Choosing Lists to Discard
- One extreme: the query is entirely unaffected
- Other extreme: the query becomes a panic
- A good choice of lists depends on the query workload
- Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
- How can we make a reasonable choice efficiently?
17
Choosing Lists to Discard
Input: Memory Constraint, Inverted Lists L, Query Workload W
Output: Lists to Discard D

DiscardLists {
  While (Memory Constraint Not Satisfied) {
    For each list in L {
      t = estimateImpact(list, W)
      benefit = list.size()
    }
    discard = use ts and benefits to choose list
    add discard to D
    remove discard from L
  }
}

Open questions:
- How can we do this efficiently? Perhaps incrementally?
- Times needed: List-Merging Time, Post-Processing Time, Panic Time
- What exactly should we minimize? benefit / cost? cost only?
- We could ignore benefit…
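The greedy loop above could look roughly like this in Python. It is only a sketch under stated assumptions: the budget is measured in list elements, the cost estimator is passed in by the caller, and the selection rule (lowest estimated cost per element saved) is one of the options the slide leaves open. The names `discard_lists` and `estimate_cost`, the sample lists, and the workload-frequency cost are all hypothetical.

```python
def discard_lists(lists, budget, estimate_cost):
    """Greedily discard lists until the total number of elements fits the budget.

    lists: dict mapping gram -> inverted list (list of string ids)
    estimate_cost: callable(gram, remaining_lists) -> estimated query-time impact
    """
    remaining = dict(lists)
    discarded = []
    while remaining and sum(len(l) for l in remaining.values()) > budget:
        # Assumed rule: lowest estimated cost per element of space saved.
        gram = min(
            remaining,
            key=lambda g: estimate_cost(g, remaining) / max(len(remaining[g]), 1),
        )
        discarded.append(gram)
        del remaining[gram]
    return discarded, remaining


# Hypothetical data: cost = how many workload queries use the gram.
lists = {"ir": [1, 5], "rv": [3, 9], "tf": [1, 3, 4, 5, 7, 9], "un": [5, 6, 9]}
workload_uses = {"ir": 2, "rv": 2, "tf": 0, "un": 0}
discarded, remaining = discard_lists(
    lists, budget=4, estimate_cost=lambda g, rest: workload_uses[g]
)
print(discarded, sorted(remaining))
```

A realistic estimator would combine list-merging, post-processing, and panic time over the workload; the frequency count here only stands in for that.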
18
Choosing Lists to Discard: Estimating Query Times With Holes
- List-Merging Time: cost function, parameters decided offline with linear regression
- Post-Processing Time: #candidates * average compute-similarity time
- Panic Time: #strings * average compute-similarity time
- #candidates depends on T, the data distribution, and the number of holes
Incremental-ScanCount Algorithm
List to discard: {2, 3, 4, 8}; decrement the counts of its stringIDs.

StringID:       0 1 2 3 4 5 6 7 8 9
Counts before:  2 0 3 3 2 4 0 0 1 0   (T = 3, #candidates = 3)
Counts after:   2 0 2 2 1 4 0 0 0 0   (T = T - 1 = 2, #candidates = 4)

Many more ways to improve the speed of DiscardLists; this is just one example…
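The incremental update on this slide can be reproduced with a short sketch. The counts and the discarded list are the ones shown on the slide; the function names are assumptions for illustration.

```python
def count_candidates(counts, t):
    """Number of string ids whose count reaches the merging-threshold t."""
    return sum(1 for c in counts if c >= t)


def discard_incremental(counts, t, discarded_list):
    """Update ScanCount results after discarding one list, without re-merging."""
    new_counts = list(counts)
    for sid in discarded_list:
        new_counts[sid] -= 1  # the discarded list no longer contributes
    return new_counts, t - 1  # one more hole, so the threshold drops by 1


counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]  # counts for string ids 0..9
t = 3
new_counts, new_t = discard_incremental(counts, t, [2, 3, 4, 8])
print(count_candidates(counts, t), new_counts, new_t, count_candidates(new_counts, new_t))
```

Only the ids on the discarded list are touched, so the cost of re-estimating #candidates is proportional to that list's length rather than to a full merge.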
19
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
20
Approach 2: Combining Lists
BEFORE
- 2-grams: tf, vi, ir, ef, rv, ne, un, in, …
- Inverted lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {5, 6, 9}, {1, 2, 3, 9}, {1, 3, 9}, {7, 9}, {6, 9}, {1, 2, 4, 5, 6}
AFTER
- 2-grams: tf, vi, ir, ef, rv, ne, un, in, …
- Inverted lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 6, 9}, {1, 2, 3, 9}, {7, 9}, {6, 9}, {1, 2, 4, 5, 6}
- Lists combined: {5, 9} with {5, 6, 9}, and {1, 3, 9} with {1, 2, 3, 9}
Intuition: combine correlated lists.
21
Effects on Queries
- Merging-threshold T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse lists
  - More false positives
List-Merging Optimization
- Example 3-grams {sha, han, ang, ngh, gha, hai}: one combined list with refcount = 2, another combined list with refcount = 3
- Traverse each physical list once; the count for each stringID on a physical list is increased by the list's refcount instead of by 1
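The refcount optimization can be sketched as follows. The mapping from grams to physical lists and the list contents are hypothetical, chosen so that two query grams share one combined list (refcount 2) and three share another (refcount 3), as in the slide's example; the function name is likewise an assumption.

```python
from collections import Counter


def merge_with_refcounts(query_grams, gram_to_list_id, physical_lists, t):
    """Counting merge in which every physical list is traversed only once."""
    # refcount = number of query grams that point to the same physical list
    refcount = Counter(
        gram_to_list_id[g] for g in query_grams if g in gram_to_list_id
    )
    counts = Counter()
    for list_id, refs in refcount.items():
        for sid in physical_lists[list_id]:
            counts[sid] += refs  # add the refcount instead of 1
    return sorted(sid for sid, c in counts.items() if c >= t)


query_grams = ["sha", "han", "ang", "ngh", "gha", "hai"]
gram_to_list_id = {"sha": 0, "han": 0, "ang": 1, "ngh": 1, "gha": 1, "hai": 2}
physical_lists = {0: [1, 4], 1: [1, 2], 2: [2, 7]}  # hypothetical string ids
print(merge_with_refcounts(query_grams, gram_to_list_id, physical_lists, t=3))
```

Because a combined list holds every string that contained any of its grams, the counts are upper bounds on the true shared-gram counts, which is why T stays unchanged while false positives may increase.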
22
Choosing Lists to Combine
- Discovering candidate gram pairs
  - Frequent (q+1)-grams indicate correlated adjacent q-grams
  - Using Locality-Sensitive Hashing (LSH)
- Selecting candidate pairs to combine
  - Based on estimated cost on the query workload
  - Similar to DiscardLists
  - Different Incremental-ScanCount algorithm
23
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
24
Experiments
Datasets:
- Google WebCorpus (word grams)
- IMDB Actors
Queries: picked from the dataset, Zipf distributed; q = 3, Edit Distance = 2
Overview:
- Performance of flavors of DiscardLists & CombineLists
- Scalability with increasing index size
- Comparison with IR compression technique
- Comparison with VGRAM
- What if the workload changes from the training workload?
25
Experiments
[Plots: DiscardLists, CombineLists]
- Runtime decreases!
26
Experiments: Comparison with IR compression technique
[Plots: uncompressed vs. compressed index]
27
Experiments: Comparison with variable-length gram technique, VGRAM
[Plots: uncompressed vs. compressed index]
28
Future Work
- DiscardLists, CombineLists and IR compression could be combined
- When considering a filter tree: global vs. local decisions
- How to minimize the impact on performance if the workload changes
29
Conclusion
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space / time tradeoff
- Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
30
More Experiments
- What if the workload changes from the training workload?
31
More Experiments
- What if the workload changes from the training workload?