Download presentation
Presentation is loading. Please wait.
Published byAnnabel Marshall Modified over 6 years ago
1
Efficient Approximate Search on String Collections Part I
Marios Hadjieleftheriou Chen Li
2
DBLP Author Search
3
Try their names (good luck!)
Case Western AT&T--Research UCSD Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou
4
5
Better system?
6
People Search at UC Irvine
7
Web Search Errors in queries Errors in data
Bring query and meaningful results closer together Actual queries gathered by Google 7
8
Data Cleaning R S informix microsoft … infromix … mcrosoft
9
Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” Performance is important! 10 ms: 100 queries per second (QPS) 5 ms: 200 QPS
10
Outline Motivation Preliminaries Trie-based approach
Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II
11
Next… Preliminaries
12
Similarity Functions Similar to: Examples: a domain-specific function
returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey
13
Edit Distance A widely used metric to define string similarity
Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 13 13
14
Gram-based algorithms
Next… Gram-based algorithms List-merging algorithms [LLL08] Variable-length grams (VGRAM) [LWY07,YWL08]
15
“q-grams” of strings u n i v e r s a l 2-grams
16
Edit operation’s effect on grams
Fixed length: q u n i v e r s a l k operations could affect k * q grams If ed(s1,s2) <= k, then their # of common grams >= (|s1|- q + 1) – k * q 16
17
q-gram inverted lists at ch ck ic ri st ta ti tu uc id strings 1 2 3 4
1 2-grams at ch ck ic ri st ta ti tu uc id strings 1 2 3 4 rich stick stich stuck static
18
Searching using inverted lists
Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ti ic ck # of common grams >= 3 at ch ck ic ri st ta ti tu uc 4 2 3 1 id strings 1 2 3 4 rich stick stich stuck static 2-grams
19
Find elements whose occurrences ≥ T
T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T
20
Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13
21
List-Merging Algorithms
HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
22
Count # of occurrences of each element using a heap
Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap
23
MergeOpt Algorithm [SK04]
Binary search Long Lists: T-1 Short Lists
24
Example of MergeOpt Count threshold T≥ 4 Long Lists: 3 Short Lists: 2
1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4
25
ScanCount Count threshold T≥ 4 1 2 3 … 1 1 3 5 10 13 10 13 15 5 7 13
String ids # of occurrences Increment by 1 … 1 1 3 5 10 13 10 13 15 5 7 13 13 15 1 13 4 Result! 14 15 2 Count threshold T≥ 4 25
26
List-Merging Algorithms
HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
27
MergeSkip algorithm [BK02, LLL08]
Pop T-1 …… Min-heap Jump Greater or equals T-1
28
Example of MergeSkip Count threshold T≥ 4 minHeap Jump 1 5 10 13 15 1
7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4
29
DivideSkip Algorithm [LLL08]
Binary search MergeSkip Long Lists Short Lists
30
How many lists are treated as long lists?
31
Length Filtering s: t: Length: 10 By length only! Ed(s,t) ≤ 2
32
Positional Filtering Ed(s,t) ≤ 2 s a b (ab,1) t a b (ab,12)
33
A filter tree Combine filters with list-merging algorithms [LLL08]
34
Variable-length grams (VGRAM) [LWY07,YWL08]
Next… Variable-length grams (VGRAM) [LWY07,YWL08]
35
2-grams -> 3-grams? sht hti tic ick tic ick
Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick tic ick # of common grams >= 1 ati ich ick ric sta sti stu tat tic tuc uck 4 2 1 3 id strings 1 2 3 4 rich stick stich stuck static id strings 1 2 3 4 rich stick stich stuck static id strings 1 2 3 4 rich stick stich stuck static 3-grams
36
Observation 1: dilemma of choosing “q”
Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings 4 2 3 1 2-grams at ch ck ic ri st ta ti tu uc id strings 1 2 3 4 rich stick stich stuck static
37
Observation 2: skew distributions of gram frequencies
DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio
38
VGRAM: Main idea Grams with variable lengths (between qmin and qmax)
zebra ze(123) corrasion co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms
39
Challenges Generating variable-length grams?
Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms?
40
Challenge 1: String Variable-length grams?
Fixed-length 2-grams u n i v e r s a l Variable-length grams ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l
41
Representing gram dictionary as a trie
ni ivr sal uni vers
42
Step 2: Constructing a gram dictionary
qmin=2 qmax=4 Frequency-based [LYW07] Cost-based [YLW08]
43
Challenge 3: Edit operation’s effect on grams
Fixed length: q u n i v e r s a l k operations could affect k * q grams
44
Deletion affects variable-length grams
Not affected Not affected Affected i-qmax+1 i i+qmax- 1 Deletion
45
With 2 edit operations, at most 4 grams can be affected
Main idea For a string, for each position, compute the number of grams that could be destroyed by an operation at this position Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index Vector of s = <2,4,6,8,9> With 2 edit operations, at most 4 grams can be affected Use this number to do count filtering
46
Summary of VGRAM index
47
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min # of their common grams
48
Lower bound on # of common grams
Fixed length (q) u n i v e r s a l If ed(s1,s2) <= k, then their # of common grams >=: (|s1|- q + 1) – k * q Variable lengths: # of grams of s1 – NAG(s1,k)
49
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1 sh ht tick tick 2-grams 2-4 grams 2 4 1 3 … ck ic ich tic tick 1 2 4 3 … ck ic ti Lower bound = 3 id strings 1 2 3 4 rich stick stich stuck static id strings 1 2 3 4 rich stick stich stuck static id strings 1 2 3 4 rich stick stich stuck static Lower bound = 1
50
End of part I Motivation Preliminaries Trie-based approach
Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.