1
1 Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently
Xiaochun Yang and Bin Wang (Northeastern University, China); Chen Li
2
2 Approximate selection queries
Data: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …
Query: Schwarrzenger
Query errors: limited knowledge about data; typos; limited input device (cell phone) input
Data errors: typos; web data; OCR
Applications: spellchecking; query relaxation; …
Similarity functions: edit distance; Jaccard; cosine; …
3
3 Performance is a big issue: answer queries interactively; many queries on a server. At 5 ms/query a server handles 200 queries/second; at 20 ms/query it handles 50 queries/second.
4
4 Outline Motivation Tightening lower bound of common strings Effects of adding a gram on index and queries Cost-based construction of gram dictionary Experiments
5
5 q-grams. String: bingon. Its 2-grams: bi, in, ng, go, on.
6
6 q-gram inverted lists (2-grams)
Strings:
id  string
1   bingo
2   bioinng
3   bitingin
4   biting
5   boing
6   going
Inverted lists (labeled D0 on the slide):
gram  string ids
bi    1, 2, 3, 4
bo    5
gi    3
go    1, 6
in    1, 2, 3, 3, 4, 5, 6
io    2
it    3, 4
ng    1, 2, 3, 4, 5, 6
nn    2
oi    2, 5, 6
ti    3, 4
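The lists on this slide can be rebuilt with a short Python sketch. The helper names `qgrams` and `build_inverted_lists` are illustrative and not from the talk. Each gram occurrence contributes one entry, which is why string 3 (bitingin) appears twice in the list for "in".

```python
from collections import defaultdict

def qgrams(s, q):
    """Return all overlapping grams of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q):
    """Map each gram to the 1-based ids of the strings that contain it.

    An id is repeated when the gram occurs more than once in a string.
    """
    lists = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for g in qgrams(s, q):
            lists[g].append(sid)
    return dict(lists)

strings = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
index = build_inverted_lists(strings, 2)
print(index["in"])  # ids of strings containing the gram "in"
print(index["oi"])
```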
7
7 Query processing (2-grams). Query: ED(bingon, ?) ≤ 1, evaluated over the same six strings and inverted lists as on slide 6. A matching string must share enough grams with the query: # of common grams >= 3.
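A minimal sketch of this filter-and-verify step, assuming fixed-length 2-grams and the count threshold (# of query grams) - k * q. The function names are hypothetical, and counting raw list entries (rather than merging lists with a dedicated algorithm) is a simplification chosen for brevity.

```python
from collections import Counter, defaultdict

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def search(strings, query, k, q):
    """Return (threshold, candidate ids, answer ids) for ED(query, ?) <= k."""
    lists = defaultdict(list)  # built per call here; normally built once offline
    for sid, s in enumerate(strings, start=1):
        for g in qgrams(s, q):
            lists[g].append(sid)
    query_grams = qgrams(query, q)
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything; every string is a candidate.
        candidates = list(range(1, len(strings) + 1))
    else:
        counts = Counter()
        for g in query_grams:
            counts.update(lists.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    answers = [sid for sid in candidates
               if edit_distance(query, strings[sid - 1]) <= k]
    return threshold, candidates, answers

strings = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
threshold, candidates, answers = search(strings, "bingon", k=1, q=2)
print(threshold, candidates, answers)
```

On the slide's data the threshold is 3, five of the six strings survive the count filter, and only string 1 (bingo) passes the edit-distance check.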
8
8 VGRAM: variable-length grams [VLDB07]. String: bingon. [2,3]-gram dictionary: bi, bin, bo, gi, go, in, ing, io, it, ng, nn, oi, ti.
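The following sketch shows one way a string could be decomposed with such a dictionary. It is a simplified reading of VGRAM and an assumption on my part, not the authors' code: at each position it takes the longest dictionary gram, falls back to the plain qmin-gram, and drops any gram that is contained in a previously emitted one. The name `vgram_decompose` is hypothetical, and the slide does not show the resulting grams.

```python
def vgram_decompose(s, gram_dict, qmin, qmax):
    """Return (position, gram) pairs for s under a [qmin, qmax]-gram dictionary."""
    grams = []
    covered_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - qmin + 1):
        best = s[p:p + qmin]  # fallback: the fixed-length qmin-gram
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in gram_dict:
                best = s[p:p + q]  # longest dictionary gram starting at p
                break
        end = p + len(best)
        if end > covered_end:  # otherwise an earlier gram already contains it
            grams.append((p, best))
            covered_end = end
    return grams

dictionary = {"bi", "bin", "bo", "gi", "go", "in", "ing",
              "io", "it", "ng", "nn", "oi", "ti"}
result = vgram_decompose("bingon", dictionary, qmin=2, qmax=3)
print([g for _, g in result])
```

Under these assumptions "bingon" yields four grams instead of the five fixed-length 2-grams, because "ng" is absorbed by "ing".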
9
9 Adopting VGRAM in algorithms: VGRAM uses the gram dictionary to turn a string (bingon) into grams and to compute the lower bound (# of common grams >= 3). [Figure: gram dictionary drawn as a tree with nodes n1 to n33]
10
10 Contributions of this study: tightening lower bounds using dynamic programming; a cost-based quantitative approach that analyzes and estimates query performance when adding each gram and automatically finds high-quality grams. [Figure: string collection, high-quality gram, gram dictionary]
11
11 Outline Motivation Tightening lower bound of common strings Effects of adding a gram on index and queries Cost-based construction of gram dictionary Experiments
12
12 Calculating lower bound, fixed length (q): if ed(s1, s2) <= k, then # of common grams >= (# of grams of s1) - k * q. Example string: b i i n d i n g.
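The fixed-length bound can be checked on small cases. The sketch below is illustrative only: the helper names and the two test strings are my own choices. It counts common grams as a multiset intersection and compares the count with (# of grams of s1) - k * q.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def common_grams(s1, s2, q):
    """Number of grams shared by s1 and s2, counting repeated grams."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())

def lower_bound(s1, q, k):
    """Fixed-length bound: each edit operation destroys at most q grams."""
    return len(qgrams(s1, q)) - k * q

s1 = "bingon"
bound = lower_bound(s1, q=2, k=1)          # 5 grams - 1 * 2
print(bound)
print(common_grams(s1, "bingo", 2))   # one deletion at the end
print(common_grams(s1, "bixgon", 2))  # one substitution in the middle
```

The substitution case meets the bound exactly (two grams destroyed), while the deletion at the end destroys only one gram.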
13
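13 Calculating lower bound, variable lengths: lower bound = (# of grams of s1) - NAG(s1, k), where NAG(s1, k) is the maximum number of grams of s1 that k edit operations can affect. Example string: b i i n d i n g, with per-position numbers of affected grams 1 2 3 2 3 2 1 1.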
14
14 Too pessimistic? k-Max: NAG(s, k) is taken as the sum of the k largest per-position values. For b i i n d i n g with values 1 2 3 2 3 2 1 1: NAG(s, 2) = 3 + 3 = 6.
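The k-Max estimate is a one-line computation. In this sketch the names `nag_kmax` and `vgram_lower_bound` are hypothetical, and the gram count passed to `vgram_lower_bound` is an assumed value, since the slide does not state how many grams the example string has.

```python
def nag_kmax(per_position, k):
    """k-Max estimate: sum of the k largest per-position affected-gram counts."""
    return sum(sorted(per_position, reverse=True)[:k])

def vgram_lower_bound(num_grams, per_position, k):
    """Lower bound on common grams; clamped at 0, where the filter prunes nothing."""
    return max(0, num_grams - nag_kmax(per_position, k))

values = [1, 2, 3, 2, 3, 2, 1, 1]  # per-position values from the slide
print(nag_kmax(values, 1))
print(nag_kmax(values, 2))
print(vgram_lower_bound(7, values, 2))  # 7 grams is an assumed count, for illustration
```

Summing the two largest values ignores that the two positions may affect some of the same grams, which is why the estimate can be too pessimistic.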
15
15 Tightening lower bound. Dynamic programming: tightening NAG(s, k). Subproblems: NAG(s[1, j], i), the value for the prefix s[1, j] with i edit operations. [Figure: string s, positions 1 to j, operation op i]
16
16 Dynamic programming: recurrence function. [Figure: string s with positions 1 to j, the value B[j], and operations op i-1 and op i]
17
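17 Dynamic programming
String: b i i n d i n g
Per-position values: 1 2 3 2 3 2 1 1
Table (slide label: NAG vector), one column per prefix of the string:
k=0:  0 0 0 0 0 0 0 0 0
k=1:  0 1 2 3 3 3 3 3 3
k=2:  0 1 2 3 4 5 5 5 5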
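The idea of counting shared grams only once can be sketched with a small dynamic program. This is not the recurrence from the talk, which the transcript does not preserve. It rests on two loudly stated assumptions: a gram counts as affected only when an edit falls inside its span, and the gram spans below are hypothetical, chosen solely so that their per-position counts equal the slide's values 1 2 3 2 3 2 1 1.

```python
def nag_table(spans, n, k):
    """table[i][j] = max number of grams hit by at most i edits in positions 1..j.

    spans: 1-based inclusive (start, end) gram spans, sorted by start, with
    non-decreasing ends (no gram contained in an earlier one). Under that
    ordering, the grams newly hit by an edit depend only on the previous edit.
    """
    cover = [set() for _ in range(n + 1)]  # cover[p]: grams containing position p
    for g, (start, end) in enumerate(spans):
        for p in range(start, end + 1):
            cover[p].add(g)
    table = [[0] * (n + 1)]  # zero edits affect nothing
    last = [0] * (n + 1)     # last[p]: best count when the right-most edit is at p
    for _ in range(k):
        new_last = [0] * (n + 1)
        for p in range(1, n + 1):
            best = len(cover[p])
            for prev in range(1, p):
                # Count only grams that the previous edit did not already hit.
                best = max(best, last[prev] + len(cover[p] - cover[prev]))
            new_last[p] = best
        last = new_last
        row = [0]
        for p in range(1, n + 1):
            row.append(max(row[-1], last[p]))
        table.append(row)
    return table

# Hypothetical spans over the 8 positions of "biinding" (not given on the slide).
spans = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 6), (6, 8)]
table = nag_table(spans, n=8, k=2)
for row in table:
    print(row)
```

With these assumed spans the last entry for k = 2 is 5, one less than the k-Max value of 6, because the two best positions share a gram.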
18
18 Outline Motivation Tightening lower bound of common strings Effects of adding a gram on index and queries Cost-based construction of gram dictionary Experiments
19
19 Effects on inverted lists: adding gram abc turns the gram dictionary {ab, bc} into {ab, bc, abc}. Example strings: --abc-- (contains abc) and --ab----bc-- (contains ab and bc separately).
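A small sketch of this effect, under a simplified decomposition rule that I am assuming (longest dictionary gram at each position, plain qmin-gram as fallback, grams contained in an earlier gram dropped). The three strings and the function names are hypothetical.

```python
from collections import defaultdict

def vgram_decompose(s, gram_dict, qmin, qmax):
    grams, covered_end = [], 0
    for p in range(len(s) - qmin + 1):
        best = s[p:p + qmin]  # fallback qmin-gram
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in gram_dict:
                best = s[p:p + q]
                break
        if p + len(best) > covered_end:  # skip grams inside an earlier gram
            grams.append(best)
            covered_end = p + len(best)
    return grams

def build_lists(strings, gram_dict, qmin, qmax):
    lists = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for g in vgram_decompose(s, gram_dict, qmin, qmax):
            lists[g].append(sid)
    return dict(lists)

strings = ["xabcx", "xabx", "xbcx"]  # illustrative strings
before = build_lists(strings, {"ab", "bc"}, 2, 3)
after = build_lists(strings, {"ab", "bc", "abc"}, 2, 3)
print(before["ab"], before["bc"])
print(after["ab"], after["bc"], after["abc"])
```

String 1 leaves the lists of ab and bc and moves to the new list of abc, so the existing lists shrink while one new list appears.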
20
20 Effects on query performance: decreases the query's inverted lists; changes the lower bound; changes the # of candidates.
21
21 Effects on query's inverted lists: adding gram abc turns the gram dictionary {ab, bc} into {ab, bc, abc}. For a query Q, adding the new gram abc either leaves the query's inverted lists unchanged or shrinks them. [Figure: three forms of Q: -------------, -----ab------, -----abc-----]
22
22 Effects on lower bound. Query: Q = ----abcd-----, ED(Q, ?) ≤ 1. [Figure: Q shown twice]
23
23 Effects on # of candidates: a change in the lower bound changes the # of candidates. Query Q: ----abcd----. Adding gram abc turns the gram dictionary {ab, bc} into {ab, bc, abc}.
24
24 Outline Motivation Tightening lower bound of common strings Effects of adding a gram on index and queries Cost-based construction of gram dictionary Experiments
25
25 Construct a gram dictionary [VLDB07]: qmin = 2, qmax = 4
26
26 Cost-based construction: qmin = 2
27
27 Outline Motivation Tightening lower bound of common strings Effects of adding a gram on index and queries Cost-based construction of gram dictionary Experiments
28
28 Data sets
Environment: GNU C++; Dell GX620 PC with an Intel Pentium 2.40GHz Dual Core CPU, 2GB memory, 250GB disk; Ubuntu (Linux) OS. Index structures were assumed to be in memory.
Data set        # of strings  Min length  Max length  Avg length  Range of # of injected edit operations
Article Titles  277,000       6           207         66          [1,6]
Movie Titles    855,000       8           249         35          [1,3]
Actor Names     1,200,000     4           74          17          [1,2]
29
29 Effect of Tightening Lower Bound. Setup: 1M actor names; gram dictionary constructed from 100,000 sample strings and 5,000 queries; qmin = 4.
30
30 Comparison with algorithm Prune [VLDB07] Dataset: 1M article titles Prune: qmin=5, qmax=7, T=2000, LargeFirst policy GramGen: 1% sampling ratio, 2000 queries, (qmin=5 automatically determined)
31
31 Choosing qmin. Construct gram dictionary: (a) 3,000 queries, (b) sample ratio = 2%
32
32 Conclusions: tightening the lower bound with dynamic programming; analysis of how adding a gram affects the index structure and the performance of queries; an efficient algorithm for automatically generating a high-quality gram dictionary.
33
33 Thank you Questions or Comments?
34
34 Related work: approximate string matching; q-grams, q-samples; inside DBMS; substring matching; set similarity join; estimation; selectivity of SQL LIKE substring queries; approximate string answers.