Published by Diane Quinn. Modified over 9 years ago.
1
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Speaker: Alexander Behm
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine
(2) Renmin University of China
2
Speaker: Alexander Behm. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Should clearly be “Niels Bohr”
3
Motivation: Record Linkage
Table 1:
Name | Hobbies | Address
Brad Pitt | … | …
Forest Whittacker | … | …
George Bush | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …
Table 2:
Phone | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | George Bush
… | … | Angelina Jolie
… | … | Forrest Whittaker
No exact match!
4
Motivation: Query Relaxation
http://www.google.com/jobs/britney.html
Actual queries gathered by Google
5
What is Approximate String Search?
Query against collection: Find entries similar to “Arnold Schwarseneger”
String Collection: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger, …
What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
How can we support these types of queries efficiently?
6
Approximate Query Answering
Sliding Window: irvine → 2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
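The sliding-window decomposition can be sketched in a few lines of Python. The function name `qgrams` is an assumption made for illustration; the slide only shows the result for "irvine".

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Slide a window of width q over s and collect every substring of length q."""
    # A string shorter than q yields an empty list.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("irvine"))        # ['ir', 'rv', 'vi', 'in', 'ne']
print(qgrams("shanghai", 3))   # ['sha', 'han', 'ang', 'ngh', 'gha', 'hai']
```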
7
Approximate Query Example
Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}
Look up the query's grams in the inverted index
[Figure: 2-gram index (tf, vi, ir, ef, rv, ne, un, in, …) pointing to inverted lists of stringIDs: 1 3 4 5 7 9; 5 9; 1 5; 1 2 3 9; 3 9; 7 9; 5 6 9; 1 2 4 5 6]
Lists retrieved for the query: 1 3 4 5 7 9; 1 5; 1 2 3 9; 7 9; 5 6 9
Count >= 3
Candidates = {1, 5, 9}
May have false positives
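A minimal sketch of the counting step, using the five inverted lists retrieved for the query on this slide. Which list belongs to which gram is not recoverable from the transcript, so the lists are passed without gram labels. The function name `scan_count` and the way T is computed (5 grams, edit distance 1, q = 2) are assumptions based on the formula T = #grams - ed * q given later in the talk.

```python
from collections import Counter


def scan_count(lists, t):
    """Count how often each string ID occurs across the lists; keep IDs seen >= t times."""
    counts = Counter()
    for ids in lists:
        counts.update(ids)
    return sorted(sid for sid, c in counts.items() if c >= t)


# The five lists looked up for the 2-grams of "irvine" (from the slide).
query_lists = [
    [1, 3, 4, 5, 7, 9],
    [1, 5],
    [1, 2, 3, 9],
    [7, 9],
    [5, 6, 9],
]
t = 5 - 1 * 2  # #grams - ed * q = 3
print(scan_count(query_lists, t))  # [1, 5, 9]
```

The candidates still have to be verified with the real similarity function, since the count filter may let false positives through.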
8
T-Occurrence Problem
Find elements whose number of occurrences across the lists is ≥ T
Lists are sorted in ascending order
Merge the lists
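One simple way to solve the T-occurrence problem on sorted lists is a heap-based multiway merge. This is a sketch of that generic approach, not necessarily one of the list-merging algorithms used in the talk; the name `t_occurrence` is hypothetical, and each list is assumed to be ascending with no repeated ID inside a single list.

```python
import heapq
from itertools import groupby


def t_occurrence(lists, t):
    """Merge ascending lists and return the elements that occur at least t times."""
    merged = heapq.merge(*lists)  # lazily yields all IDs in ascending order
    result = []
    for sid, group in groupby(merged):
        # Equal IDs are adjacent after the merge, so the group size is the count.
        if sum(1 for _ in group) >= t:
            result.append(sid)
    return result


lists = [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]]
print(t_occurrence(lists, 3))  # [1, 5, 9]
```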
9
Motivation: Compression
Inverted Index >> Source Data
Fit in memory?
Space Budget?
10
Motivation: Related Work
IR: lossless compression of inverted lists (disk-based)
- Delta representation + compact encoding
Inverted lists in memory: decompression overhead
Tune compression ratio?
Overcome these limitations in our setting?
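To make "delta representation + compact encoding" concrete, here is a sketch that stores the gaps between consecutive string IDs with a variable-byte code. Variable-byte is a stand-in chosen for brevity; the talk compares against Carryover-12, which is a different encoding. The function names `encode` and `decode` are assumptions.

```python
def encode(ids):
    """Delta-encode an ascending ID list, then variable-byte encode each gap."""
    out = bytearray()
    prev = 0
    for sid in ids:
        gap = sid - prev
        prev = sid
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)      # high bit marks the last byte of this gap
    return bytes(out)


def decode(data):
    """Inverse of encode: every lookup has to pay this decompression cost."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        if b & 0x80:
            cur |= (b & 0x7F) << shift
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
        else:
            cur |= b << shift
            shift += 7
    return ids


ids = [1, 3, 4, 5, 7, 9, 300, 100000]
assert decode(encode(ids)) == ids
```

Small gaps fit in one byte each, which is where the space saving comes from; the `decode` loop is the overhead the slide refers to for in-memory lists.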
11
Main Contributions
- Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations
12
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
13
Approach 1: Discarding Lists
[Figure: 2-gram index (tf, vi, ir, ef, rv, ne, un, in, …) with inverted lists of stringIDs: 1 3 4 5 7 9; 5 9; 1 5; 1 2 3 9; 3 9; 7 9; 5 6 9; 1 2 4 5 6. Some lists are discarded, leaving “holes”.]
14
Effects on Queries
- Decrease lower bound T on common grams
- Smaller T → more false positives
- T <= 0 → “panic”, scan entire string collection
- Surprise: fewer lists → faster queries (depends)
15
Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}
[Figure: 3-gram index (sha, han, ang, ngh, gha, hai, ter, uni, ing, …) with hole grams and regular grams marked]
Basis: each edit operation can “destroy” up to q = 3 grams
No holes: T = #grams - ed * q = 6 - 1 * 3 = 3
With holes: T' = T - #holes = 0 → Panic!
Does each edit operation really destroy q = 3 grams? Dynamic Programming for tighter T
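The arithmetic on this slide can be written as a small helper. This sketch implements only the simple bound; the tighter dynamic-programming bound mentioned on the slide is not shown. The function name `lower_bound` is an assumption, and the value of three hole grams is inferred from the slide's result T' = 0.

```python
def lower_bound(num_grams: int, ed: int, q: int, num_holes: int = 0) -> int:
    """Simple count filter: T = #grams - ed * q, reduced by one per hole gram."""
    return num_grams - ed * q - num_holes


t = lower_bound(num_grams=6, ed=1, q=3)                      # 3
t_holes = lower_bound(num_grams=6, ed=1, q=3, num_holes=3)   # 0
panic = t_holes <= 0  # True: the index cannot filter, scan the whole collection
print(t, t_holes, panic)
```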
16
Choosing Lists to Discard
Effect on Query:
- Unaffected
- Panic
- Slower or Faster
Good choice depends on query workload
Space budget: Many combinations of grams
Make a “reasonable” choice efficiently?
17
Choosing Lists to Discard
INPUT: Space Budget, Inverted lists, Workload
OUTPUT: Lists to discard
ALGORITHM: Greedy & Cost-Based
- Choose one list at a time
- Total estimated running time t of the workload (Query1, Query2, Query3, …)
- Estimated impact ∆t of discarding each list
- Incremental Update
[Figure: 2-gram index (tf, vi, ir, ef, rv, ne, un, in, …) next to the workload queries]
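The greedy loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the cost model `estimate_time` is a hypothetical callable supplied by the caller, the budget is measured in stored string IDs, and the estimate is recomputed from scratch for every candidate rather than updated incrementally as the slide describes.

```python
def choose_lists_to_discard(index, budget, estimate_time):
    """Greedily discard one list at a time until the index fits the budget.

    index: dict mapping gram -> list of string IDs
    budget: maximum total number of string IDs that may be stored
    estimate_time: hypothetical cost model, frozenset of discarded grams -> float
    """
    discarded = set()
    size = sum(len(ids) for ids in index.values())
    while size > budget:
        candidates = [g for g in index if g not in discarded]
        if not candidates:
            break
        # Pick the list whose removal gives the lowest estimated workload time.
        best = min(candidates,
                   key=lambda g: estimate_time(frozenset(discarded | {g})))
        discarded.add(best)
        size -= len(index[best])
    return discarded


# Toy data and toy cost model (both invented for this example): a query gets
# slower by one unit for every one of its grams whose list was discarded.
index = {"ir": [1, 5], "rv": [1, 2, 3, 9], "un": [1, 3, 4]}
workload = [{"ir", "rv"}, {"ir"}]
cost = lambda d: sum(len(d & query) for query in workload)
print(choose_lists_to_discard(index, 6, cost))  # {'un'}
```

With this toy model the list for "un" goes first because no query in the workload uses it, which mirrors the "unaffected" case from the previous slide.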
18
Estimating Query Times
- List-Merging: cost function, offline with linear regression
- Panic: #strings * avg similarity time
- Post-Processing: #candidates * avg similarity time
19
Estimating #candidates
Incremental-ScanCount Algorithm
List to discard: un, stringIDs 1 3 4
BEFORE: Counts (by stringID) 2 3 0 1 4 0, T = 3, #candidates = 2
Decrement the counts of the stringIDs on the discarded list
AFTER: Counts (by stringID) 2 2 0 0 3 0, T' = T - 1 = 2, #candidates = 3
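The update on this slide can be reproduced with a short sketch. It assumes the count array is indexed by string ID starting at 0, which is the reading under which the slide's before and after counts agree with discarding the list 1, 3, 4. The function names are hypothetical.

```python
def discard_list(counts, t, list_ids):
    """Incrementally update ScanCount state after discarding one query list."""
    new_counts = list(counts)
    for sid in list_ids:
        new_counts[sid] -= 1   # this string no longer gets credit for the gram
    return new_counts, t - 1   # one hole more, so the lower bound drops by one


def num_candidates(counts, t):
    """Number of string IDs whose count still reaches the threshold."""
    return sum(1 for c in counts if c >= t)


before = [2, 3, 0, 1, 4, 0]
after, t_after = discard_list(before, 3, [1, 3, 4])
print(num_candidates(before, 3))       # 2
print(after, t_after)                  # [2, 2, 0, 0, 3, 0] 2
print(num_candidates(after, t_after))  # 3
```

Only the IDs on the discarded list are touched, so the candidate estimate for a query is updated without re-merging its remaining lists.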
20
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
21
Approach 2: Combining Lists
[Figure: 2-gram index (tf, vi, ir, ef, rv, ne, un, in, …) with inverted lists of stringIDs: 1 3 4 5 7 9; 5 9; 5 6 9; 1 2 3 9; 1 3 9; 7 9; 6 9; 1 2 4 5 6. Some lists are combined, so several grams point to one list.]
22
Effects on Queries
- Lower bound T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse lists
  - More false positives
23
Speeding Up Queries
Query 3-grams {sha, han, ang, ngh, gha, hai}
[Figure: some query grams share a combined list with refcount = 2, others share a combined list with refcount = 3]
Traverse each physical list once.
The count of each stringID on the list increases by the list's refcount.
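A sketch of the refcount idea: query grams that map to the same physical list are grouped, the list is traversed once, and every string ID on it is credited with the number of query grams that reference the list. The gram-to-list mapping and the list contents below are invented for illustration; only the refcounts 2 and 3 come from the slide. The function name is hypothetical.

```python
from collections import Counter


def scan_count_combined(query_grams, gram_to_list, physical_lists, t):
    """ScanCount over combined lists: one traversal per physical list."""
    # refcount[list_id] = number of query grams that point to that list
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, rc in refcount.items():
        for sid in physical_lists[list_id]:
            counts[sid] += rc
    return sorted(sid for sid, c in counts.items() if c >= t)


# Hypothetical index: sha/han share list 0, ang/ngh/gha share list 1.
gram_to_list = {"sha": 0, "han": 0, "ang": 1, "ngh": 1, "gha": 1, "hai": 2}
physical_lists = {0: [1, 2, 5], 1: [2, 5, 7], 2: [5, 9]}
query = ["sha", "han", "ang", "ngh", "gha", "hai"]
print(scan_count_combined(query, gram_to_list, physical_lists, 3))  # [2, 5, 7]
```

Three physical lists are read instead of six logical ones, which is how combining can offset the cost of the lists becoming longer.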
24
Choosing Lists to Combine
Discovering candidate gram pairs:
- Frequent (q+1)-grams → correlated adjacent q-grams
- Locality-Sensitive Hashing (LSH)
Selecting candidate pairs to combine:
- Basis: estimated cost on query workload
- Similar to DiscardLists
- Different Incremental ScanCount algorithm
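The first discovery heuristic can be sketched as follows: a (q+1)-gram contains two adjacent q-grams, so if the (q+1)-gram is frequent, the inverted lists of those two q-grams overlap heavily and are good candidates for combining. The threshold `min_support`, the function name, and the sample strings are assumptions; the LSH-based discovery is not shown.

```python
from collections import Counter


def candidate_pairs(strings, q, min_support):
    """Return pairs of adjacent q-grams taken from frequent (q+1)-grams."""
    freq = Counter()
    for s in strings:
        # Use a set so each (q+1)-gram is counted at most once per string.
        freq.update({s[i:i + q + 1] for i in range(len(s) - q)})
    return sorted(
        (g[:q], g[1:])
        for g, n in freq.items()
        if n >= min_support and g[:q] != g[1:]
    )


strings = ["shanghai", "shantou", "shandong"]
print(candidate_pairs(strings, q=2, min_support=3))
# [('ha', 'an'), ('sh', 'ha')]
```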
25
Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
26
Experiments
Datasets:
- Google WebCorpus Word Grams
- IMDB Actors
- DBLP Titles
Overview:
- Performance & Scalability of DiscardLists & CombineLists
- Comparison with IR compression & VGRAM
- Changing workloads
10k Queries: Zipf distributed, from dataset
q=3, Edit Distance=2 (also Jaccard & Cosine)
27
Experiments
[Figure panels: DiscardLists, CombineLists]
Runtime decreases!
28
Comparison with IR compression
Carryover-12
[Figure panels: Uncompressed, Compressed]
29
Comparison with variable-length grams, VGRAM
[Figure panels: Uncompressed, Compressed]
30
Future Work
- Combine: DiscardLists, CombineLists and IR compression
- Filters for partitioning, global vs. local decisions
- Dealing with updates to index
31
Conclusions
- Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations
32
Thank You!
This work is part of The Flamingo Project
http://flamingo.ics.uci.edu
33
More Experiments
What if the workload changes from the training workload?