1
Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li, Yiming Lu
2
Example: a movie database
Find movies starring Schwarrzenger.
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Iron man | 2008 |
Schwarzenegger | The Terminator | 1984 |
 | The man | 2006 | Crime
3
In general: Gap between Queries and Data
Errors in the query:
The user doesn’t remember a string exactly
The user unintentionally types a wrong string
Query: Schwarrzenger
Data: Schwarzenegger …
4
Data may not be clean
Errors in the database:
Data is often not clean by itself, especially in data integration and cleansing.
Relation R (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Relation S (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger
5
Query may include errors
6
Problem definition: approximate string searches
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, …
Query q (Search): Samuel Jackson
Output: strings s that satisfy Sim(q,s) ≤ δ
7
Example Similarity Function: Edit Distance
A widely used metric to define string similarity
Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
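The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the talk; the function name `edit_distance` is hypothetical.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

Only two rows of the table are kept, so memory is proportional to the length of `s2`.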
8
Example: approximate string searches
Collection of strings s (Star): Thomas Hanks, Ton Hank, Tom Hanks, Tom J. Hanks, …
Query q (Search): Tom Hank
Output: strings s that satisfy ed(q,s) ≤ 2
9
Outline
Problem motivation
Preliminary: grams, inverted lists
Merge algorithms
Filtering technique
Conclusion
10
String Grams
q-grams: all substrings of length q of a string
For example, the 2-grams of "universal":
(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
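A minimal sketch of gram extraction, assuming no padding characters are added at the string boundaries (the slide shows none). The helper name `qgrams` is hypothetical.

```python
def qgrams(s, q=2):
    """Return the q-grams of s in order: every substring of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, so "universal" (9 characters) yields 8 2-grams.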
11
Inverted lists
Convert strings to gram inverted lists
Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
Each gram points to the sorted list of ids of the strings that contain it.
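One way to build such an index is sketched below. It assumes string ids are the 0-based positions in the input list, which matches the id lists shown in the main example; `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(strings, q=2):
    """Map each q-gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps a string from being listed twice under one gram.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)  # ids arrive in increasing order
    return dict(index)


index = build_inverted_index(["rich", "stick", "stich", "stuck", "static"])
print(index["st"], index["ti"], index["ic"], index["ck"])
```

Because strings are visited in id order, every list comes out sorted without an extra sort step.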
12
Main Example
Data: 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
Query: stick, ed(s,q) ≤ 1
Grams of the query: (st,ti,ic,ck)
Inverted lists:
st: 1,2,3,4
ti: 1,2,4
ic: 0,1,2,4
ck: 1,3
Merge (count >= 2): candidate string ids {1,2,3,4}. Performance bottleneck!
Double check for the real edit distance: final answers {1,2,3}
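The whole pipeline of this example can be sketched as follows. The slide only states the threshold "count >= 2"; the formula used here, (number of query grams) - k·q, is the standard count-filter bound and is an assumption about how that number was obtained. All function names are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (candidate ids after merging, final ids after verification)."""
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)
    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # assumed count-filter bound
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune
    else:
        counts = Counter()
        for gram in grams:
            counts.update(index.get(gram, []))  # merge step (the bottleneck)
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    # Double check the real edit distance on the surviving candidates.
    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return candidates, answers


print(search(["rich", "stick", "stich", "stuck", "static"], "stick", k=1))
# ([1, 2, 3, 4], [1, 2, 3])
```

The merge step here is a plain counter; the merge algorithms in the rest of the talk are alternatives for exactly that step.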
13
Sub-problem definition:
Given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.
14
Example
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Count threshold: 4
Result: 13
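As a reference point for the algorithms that follow, the sub-problem can be solved by brute force. The sketch assumes the flattened numbers of the example split into five lists, as the later slides (five heap entries, "Long Lists: 3, Short Lists: 2") suggest; `t_occurrence` is a hypothetical name.

```python
from collections import Counter


def t_occurrence(lists, T):
    """Brute force: count every element, keep those seen at least T times."""
    counts = Counter(v for lst in lists for v in lst)
    return sorted(v for v, c in counts.items() if c >= T)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(t_occurrence(lists, 4))  # [13]
```

This reads every element of every list, which is what the skipping algorithms try to avoid.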
15
Outline
Problem motivation
Preliminary
Merge algorithms: two previous algorithms, our three proposed algorithms
Filtering technique
Conclusion
16
Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip
17
Two previous algorithms (1)
Heap-based algorithm: push the frontier element of each list onto a min-heap, and count the # of occurrences of each element as it is popped from the heap.
18
Example of HeapMerger [Sarawagi et al 2004]
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
minHeap (first element of each list): 1, 10, 5, 13, 15
Count threshold ≥ 4
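A minimal sketch of the heap-based merge, written from the description on the slides rather than from the original implementation. It assumes each list is strictly increasing; `heap_merge` is a hypothetical name.

```python
import heapq


def heap_merge(lists, T):
    """Pop equal values off a min-heap of list frontiers and count them."""
    # Each heap entry is (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):  # advance this list's frontier
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, so the cost grows with the total list length times the log of the number of lists.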
19
Five Merge Algorithms
Previous: HeapMerger, MergeOpt [Sarawagi 2004]
New: ScanCount, MergeSkip, DivideSkip
20
Two previous algorithms (2)
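MergeOpt algorithm: treat the T-1 longest lists as long lists and the rest as short lists. Merge the short lists with a heap, then look up each candidate in the long lists by binary search.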
21
Example of MergeOpt [Sarawagi et al 2004]
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Long lists: 3 (binary search)
Short lists: 2 (min-heap)
Count threshold ≥ 4
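The long/short split can be sketched as below. This is an illustration of the idea on the slides, not the code of [Sarawagi 2004]; it uses `heapq.merge` from the standard library for the short lists, and `merge_opt` is a hypothetical name.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists, T):
    """Heap-merge the short lists, then binary search the T-1 longest lists."""
    ordered = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = ordered[:T - 1], ordered[T - 1:]
    result = []
    # heapq.merge yields the short lists as one sorted stream with duplicates.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)  # occurrences in the short lists
        for lst in long_lists:
            pos = bisect_left(lst, value)
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With only T-1 long lists, any value that occurs T times must appear in at least one short list, so scanning the short lists cannot miss an answer.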
22
Can we run faster?
23
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
24
Our new algorithms (1)
ScanCount algorithm: use an array to record the # of occurrences of each element
25
ScanCount Example
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Count threshold ≥ 4
Sample counts in the array: 1, 2, 4
Result: 13
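A minimal sketch of ScanCount, assuming string ids are integers in the range 0 to `num_strings` - 1 so they can index the array directly. The parameter name `num_strings` is an assumption made for this example.

```python
def scan_count(lists, T, num_strings):
    """Scan every list once, bumping a per-id counter."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

There is no heap and no comparison between lists; the price is an array as large as the id space.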
26
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
27
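Our new algorithms (2)
MergeSkip algorithm: pop T-1 elements from the min-heap, then jump in those T-1 lists to the first element that is not smaller than the new top of the heap.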
28
Example of MergeSkip
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Count threshold ≥ 4
29
Example of MergeSkip
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
minHeap: 1, 5, 10, 13, 15
Count threshold ≥ 4
30
Example of MergeSkip
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Pop 1, 5, 10
minHeap: 13, 15
Count threshold ≥ 4
31
Example of MergeSkip
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
Pop 1, 5, 10
minHeap: 13, 15
Jump ≥ 13
Count threshold ≥ 4
32
Example of MergeSkip
Inverted lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15)
minHeap: 13, 13, 13, 13, 15
Result: 13
Count threshold ≥ 4
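The pop-and-jump steps of the example can be sketched as follows. This is a reading of the slides, not the authors' implementation; it assumes strictly increasing lists, and `merge_skip` is a hypothetical name.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Heap merge that skips elements which cannot reach T occurrences."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(value)
            for _, i, pos in popped:  # advance each list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Pop more of the smallest entries until T-1 lists are set aside.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists are still active: no more answers
        target = heap[0][0]
        # Anything below target lives only in the T-1 popped lists, so jump.
        for _, i, pos in popped:
            nxt = bisect_left(lists[i], target, pos)
            if nxt < len(lists[i]):
                heapq.heappush(heap, (lists[i][nxt], i, nxt))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On the example, the first round pops 1, 5 and 10, finds 13 on top of the heap, and moves the three popped lists straight to 13, so 3 and 7 are never read.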
33
Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip
34
Our new algorithms (3)
DivideSkip algorithm: run MergeSkip on the short lists, then look up each candidate in the long lists by binary search.
Long lists: dynamic size
35
Size of long lists
How many lists are treated as long lists?
Cost:
[Figure: MergeOpt, binary search, long lists, short lists]
36
Size of long lists
How many lists are treated as long lists?
Cost:
[Figure: MergeSkip, binary search, long lists, short lists]
37
Decide the L value
A good balance in the tradeoff:
# of long lists L = T / (μ log M + 1)
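A hedged sketch of how the formula could drive the long/short split. Several things here are assumptions and not stated on the slide: M is taken to be the length of the longest list, the logarithm is base 2, μ is a tunable coefficient with an arbitrary default, and L is rounded down and capped at T-1. For brevity, a plain counter stands in for MergeSkip on the short lists. All names are hypothetical.

```python
import math
from bisect import bisect_left
from collections import Counter


def num_long_lists(T, M, mu):
    """L = T / (mu * log M + 1), rounded down (base-2 log is an assumption)."""
    return int(T / (mu * math.log2(max(M, 1)) + 1))


def divide_skip(lists, T, mu=0.01):
    if not lists:
        return []
    ordered = sorted(lists, key=len, reverse=True)
    L = min(num_long_lists(T, len(ordered[0]), mu), T - 1, len(ordered))
    long_lists, short_lists = ordered[:L], ordered[L:]
    # Stand-in for MergeSkip: count occurrences within the short lists only.
    short_counts = Counter(v for lst in short_lists for v in lst)
    result = []
    for value, count in short_counts.items():
        if count < T - L:  # cannot reach T even if present in every long list
            continue
        for lst in long_lists:
            pos = bisect_left(lst, value)
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= T:
            result.append(value)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))  # [13]
```

A larger μ gives fewer long lists: more work goes to the merge over short lists and less to binary searches.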
38
Empirical verification
Our formula for L achieves the best results among the options tested.
39
Experimental data sets
Three real data sets with various string lengths and data sizes:
DBLP data
IMDB data
Google Web corpus
40
Performance (DBLP data)
Running time per query with various algorithms
DivideSkip is the best one
41
# of elements read (DBLP data)
DivideSkip is the best one: it skips reading the most elements
42
Outline
Problem motivation
Preliminary
Merge algorithms
Filtering technique: length and positional filters [Gravano et al. VLDB 2001], filter tree
Conclusion and future work
43
Length Filtering
Ed(s,t) ≤ 2
t: length 10
A string s whose length differs from that of t by more than 2 can be pruned by length only!
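The length test is a one-line check: each edit operation changes the length by at most one, so two strings within edit distance k differ in length by at most k. The function name below is hypothetical.

```python
def passes_length_filter(s, t, k):
    """False means ed(s, t) > k is already certain from the lengths alone."""
    return abs(len(s) - len(t)) <= k


print(passes_length_filter("stick", "static", 2))    # lengths 5 and 6
print(passes_length_filter("a" * 10, "a" * 13, 2))   # lengths 10 and 13
```

A string that passes still needs the real edit distance check; the filter only rules strings out.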
44
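Positional Filtering
Positional gram: a gram together with its position in the string
For example, string abcd: {(ab,1),(bc,2),(cd,3)}
Ed(s,t) ≤ 2: gram (ab,1) of s and gram (ab,12) of t cannot be matched, because their positions differ by more than 2.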
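A minimal sketch of positional grams and the matching test, using 1-based positions as on the slide. Both function names are hypothetical.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(g1, g2, k):
    """Two positional grams may match only if equal and at most k apart."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                 # [('ab', 1), ('bc', 2), ('cd', 3)]
print(can_match(("ab", 1), ("ab", 12), k=2))     # False
```

With k edit operations a gram can shift by at most k positions, which is why the pair (ab,1) and (ab,12) is rejected for k = 2.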
45
Filter tree
root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list (leaf): 5 12 17 28 44
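One possible shape for such a tree is a nested dictionary keyed by length, then gram, then position, with an inverted list at each leaf. This is only a sketch of the structure drawn on the slide; the level order follows the slide, while the names `build_filter_tree` and `lookup` are hypothetical.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> increasing list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)  # 1-based position
    return tree


def lookup(tree, gram, length, pos, k):
    """Ids whose string length and gram position are both within k."""
    ids = set()
    for n in range(length - k, length + k + 1):
        grams = tree.get(n)  # .get avoids creating empty branches
        if not grams or gram not in grams:
            continue
        for p, lst in grams[gram].items():
            if abs(p - pos) <= k:
                ids.update(lst)
    return sorted(ids)


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
print(lookup(tree, "st", length=5, pos=1, k=1))  # [1, 2, 3, 4]
```

Each extra level splits the inverted lists into more, shorter lists, which is worth keeping in mind when reading the experimental results on filters.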
46
Surprising experimental results (DBLP)
Algorithm | No filter | Length | Length+Pos
Heap | 115.42 | 11.98 | 3.64
MergeOpt | 14.22 | 1.40 | 6.78
ScanCount | 30.91 | 2.68 | 2.14
MergeSkip | 10.12 | 1.09 | 2.65
DivideSkip | 2.23 | 0.76 | 1.96
Use filters wisely: more filters may be bad!
47
Conclusion
Three new merge algorithms: we run faster
Surprising experimental results: use filters wisely, more filters may be bad!
48
Thank you!
49
Backup: related work
Approximate string matching [Navarro 2001]
Fuzzy lookup in
Varied length grams [Li et al 2007]
50
Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, “Efficient Exact Set-similarity Joins”, in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, “Approximate string joins in a database (almost) for free”, in VLDB 2001
51
Reference
4. [Li 2007] C. Li, B. Wang and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams”, in VLDB 2007
5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching”, in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates”, in ACM SIGMOD 2004