Published by Lee Anderson. Modified over 8 years ago.
1
1 An Efficient Index Structure for String Databases. Tamer Kahveci, Ambuj K. Singh. Department of Computer Science, University of California, Santa Barbara. http://www.cs.ucsb.edu/~tamer
2
2 Whole/Substring Matching Problem
Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size).
[Figure labels: query string, database string]
3
3 String Similarity Motivation
Applications:
Genetic sequence databases (NCBI).
Text databases, spell checkers, web search.
Video databases (e.g. VIRAGE, MEDIA360).
Database size is too large.
Most of the available techniques are in-memory.
The space requirement of current indexes is too large.
[Plot: base pairs (millions) by year]
4
4 Outline Motivation & background Our contribution Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion
5
5 Notation
q : query string.
m, n : lengths of strings.
r : range query radius.
ε = r/|q| : error rate.
6
6 String Similarity: an example
A C T - - T A G C
  R   I I       D
A A T G A T A G -
(R = replace, I = insert, D = delete)
7
7 Background
Edit operations: insert, delete, replace.
Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2.
Finding the edit distance is costly: with dynamic programming it takes O(mn) time and space, where m and n are the lengths of s1 and s2 [NW70, SW81].
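The dynamic-programming computation mentioned on this slide can be sketched as follows. This is a minimal illustration with unit costs for all three operations, not the authors' code; the function name `edit_distance` is chosen here for illustration. It keeps only two rows of the table, so it uses O(n) space, while the running time stays O(mn).

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) via dynamic programming."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4, as in the deck's example
```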
8
8 Related Work
Lossless search
Online:
[Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius.
[WM92] (Wu, Manber) binary masks, O(rn).
[BYN99] (Baeza-Yates, Navarro) NFA.
Offline (index based):
[Mye94] (Myers) condensed r-neighborhood.
[BYN97] (Baeza-Yates, Navarro) dictionary.
Lossy search
[AG90] (Altschul, Gish) BLAST.
FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
[GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.
9
9 Outline Motivation & background Our contribution Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion
10
10 Frequency Vector
Let s be a string over the alphabet Σ = {α_1, ..., α_σ}.
Let n_i be the number of occurrences of the character α_i in s, for 1 ≤ i ≤ σ. The frequency vector of s is then f(s) = [n_1, ..., n_σ].
Example: s = AATGATAG
f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]
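As a small illustration of this definition, the frequency vector can be computed by counting each alphabet character. The function name `freq_vector` and the fixed DNA alphabet order A, C, G, T are assumptions made for this sketch, chosen to match the slide's example.

```python
def freq_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Return [n_1, ..., n_sigma]: occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


print(freq_vector("AATGATAG"))  # [4, 0, 2, 2]
```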
11
11 Effect of Edit Operations on Frequency Vector
Delete: decreases an entry by 1.
Insert: increases an entry by 1.
Replace: insert + delete.
Example:
s = AATGATAG => f(s) = [4, 0, 2, 2]
(del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
(ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]
(A → C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
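The slide's example can be replayed in a few lines. This is only a sketch: `freq_vector` is an illustrative helper that assumes the alphabet order A, C, G, T, and the intermediate strings are the ones listed on the slide.

```python
def freq_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


# Each step applies one edit operation to the previous string.
steps = [
    ("start",          "AATGATAG"),
    ("delete G",       "AATATAG"),
    ("insert C",       "AACTATAG"),
    ("replace A by C", "ACCTATAG"),
]

for label, s in steps:
    print(f"{label:15} {s:9} {freq_vector(s)}")
```

A delete lowers one entry, an insert raises one entry, and a replace does both, so a single edit operation changes at most one entry upward and one entry downward.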
12
12 An Approximation to ED: Frequency Distance (FD1)
s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD1(f(s), f(q)) = 3
ED(q, s) = 4
FD1(f(s1), f(s2)) = max{pos, neg}.
FD1(f(s1), f(s2)) ≤ ED(s1, s2).
[Figure labels: f(q), f(s), FD1(f(q), f(s))]
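A minimal sketch of the frequency distance as defined on this slide: `pos` sums the entries where the first vector exceeds the second, `neg` sums the entries where it falls short, and the distance is the larger of the two. The function name `fd1` is chosen here for illustration.

```python
def fd1(u: list[int], v: list[int]) -> int:
    """Frequency distance max{pos, neg} between two frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # total surplus of u over v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # total deficit of u under v
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f(AATGATAG)
f_q = [2, 2, 1, 2]  # f(ACTTAGC)
print(fd1(f_s, f_q))  # 3, a lower bound on ED(q, s) = 4
```

Because one replace can fix a surplus and a deficit at once, the larger of the two sums, rather than their total, is what bounds the number of edit operations from below.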
13
13 An Illustration of Frequency Distance & Edit Distance
[Figure labels: Set of strings 1, Set of strings 2, v1, v2, Frequency Distance, Edit Distance]
14
14 Using Local Information: Wavelet Decomposition of Strings
s = AATGATAC => f(s) = [4, 1, 1, 2]
s = AATG + ATAC = s1 + s2
f(s1) = [2, 0, 1, 1]
f(s2) = [2, 1, 0, 1]
ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
15
15 Wavelet Decomposition of a String: General Idea
A_{i,j} = f(s(j·2^i : (j+1)·2^i - 1))
B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}
ψ(s) = [first wavelet coefficient, second wavelet coefficient]
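The recurrence on this slide can be sketched bottom-up: level 0 holds the frequency vectors of single characters, each A at level i is the sum of two neighbouring A vectors at level i-1, and each B is their difference. The names `freq_vector` and `wavelet_levels`, the DNA alphabet, and the power-of-two length restriction are assumptions made for this illustration.

```python
ALPHABET = "ACGT"  # assumed alphabet order, as in the deck's examples


def freq_vector(s: str) -> list[int]:
    return [s.count(ch) for ch in ALPHABET]


def wavelet_levels(s: str):
    """Return dicts A and B with A[i][j] = f(s[j*2**i : (j+1)*2**i]) and
    B[i][j] = A[i-1][2j] - A[i-1][2j+1] (componentwise)."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes the length is a power of two")
    A = {0: [freq_vector(ch) for ch in s]}
    B = {}
    for i in range(1, n.bit_length()):
        prev = A[i - 1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A[i] = [[x + y for x, y in zip(left, right)] for left, right in pairs]
        B[i] = [[x - y for x, y in zip(left, right)] for left, right in pairs]
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[3][0])  # [4, 1, 1, 2]  sum of the two half-string vectors
print(B[3][0])  # [0, -1, 1, 0] difference of the two half-string vectors
```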
16
16 Wavelet Decomposition & ED
Define FD(s1, s2) = max{FD1, FD2}.
17
17 Outline Motivation & background Our contribution Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN and range queries Experimental results Conclusion
18
18 MRS-Index Structure Creation [Figure labels: s1, w = 2^a, transform]
19
19 MRS-Index Structure Creation [Figure label: s1]
20
20 MRS-Index Structure Creation [Figure label: s1]
21
21 MRS-Index Structure Creation [Figure labels: s1; slide c times, c = box capacity]
22
22 MRS-Index Structure Creation [Figure label: s1]
23
23 MRS-Index Structure Creation [Figure labels: s1, w = 2^a, T_{a,1}]
24
24 Using Different Resolutions [Figure labels: s1; w = 2^a, T_{a,1}; w = 2^(a+1), T_{a+1,1}]
25
25 MRS-Index Structure
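One row of the index for a single string can be sketched as below. The slides describe sliding a window of length w over the string and grouping c consecutive window positions into one box (MBR). This sketch makes several assumptions: it stores only frequency vectors (the first wavelet coefficient), it represents a box as a pair of componentwise minimum and maximum vectors, and the names `freq_vector` and `build_row` are hypothetical. It recomputes each window from scratch for clarity rather than updating it incrementally.

```python
ALPHABET = "ACGT"  # assumed alphabet order


def freq_vector(s: str) -> list[int]:
    return [s.count(ch) for ch in ALPHABET]


def build_row(s: str, w: int, c: int):
    """Slide a window of length w over s and group every c consecutive
    window vectors into one minimum bounding rectangle (lo, hi)."""
    vectors = [freq_vector(s[i:i + w]) for i in range(len(s) - w + 1)]
    boxes = []
    for k in range(0, len(vectors), c):
        chunk = vectors[k:k + c]
        lo = [min(col) for col in zip(*chunk)]  # componentwise minimum
        hi = [max(col) for col in zip(*chunk)]  # componentwise maximum
        boxes.append((lo, hi))
    return boxes


for lo, hi in build_row("AATGATAG", w=4, c=2):
    print(lo, hi)
```

Repeating this for several window sizes w = 2^a, 2^(a+1), ... gives one row per resolution.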
26
26 MRS-index properties
Relative MBR volume (precision) decreases when c increases or w decreases.
MBRs are highly clustered.
[Plot: box volume vs. box capacity]
27
27 Outline Motivation & background Our contribution Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion
28
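28 Range Queries [KS01]
[Figure: a query of length 208 = 16 + 64 + 128 searched against index rows for w = 2^4, 2^5, 2^6, 2^7 over strings s1, s2, ..., sd; the radius labels in the figure were lost in extraction.]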
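The figure on this slide shows a query of length 208 split into pieces of length 16, 64 and 128, which is the binary decomposition of 208. A minimal sketch of that splitting step follows; the function name `decompose` is hypothetical, and the sketch ignores any lower or upper limit on the window sizes actually stored in the index.

```python
def decompose(length: int) -> list[int]:
    """Split a query length into distinct powers of two (its binary digits)."""
    return [1 << i for i in range(length.bit_length()) if (length >> i) & 1]


print(decompose(208))  # [16, 64, 128]
```

Each piece can then be matched against the index row whose window size equals the piece length.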
29
29 k-Nearest Neighbor Query [KSF+96, SK98] k = 3
30
30 k-Nearest Neighbor Query k = 3, r = edit distance to the 3rd closest substring
31
31 k-Nearest Neighbor Query k = 3 r
32
32 k-Nearest Neighbor Query k = 3
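The k-NN slides suggest a two-phase search: first pick k candidates and take r as the edit distance to the k-th closest of them, then run a range query with radius r. The sketch below illustrates that idea over a plain list of candidate strings rather than the MRS index, using the frequency distance as the filter. All function names are chosen for illustration, and the candidate list stands in for the database substrings.

```python
ALPHABET = "ACGT"  # assumed alphabet order


def freq_vector(s):
    return [s.count(ch) for ch in ALPHABET]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def knn(query, candidates, k):
    fq = freq_vector(query)
    # Phase 1: take the k candidates with the smallest frequency distance
    # and use their largest edit distance as the search radius r.
    by_fd = sorted(candidates, key=lambda s: fd1(freq_vector(s), fq))
    r = max(edit_distance(query, s) for s in by_fd[:k])
    # Phase 2: range query with radius r. Since FD1 <= ED, any candidate
    # with FD1 > r cannot be among the k nearest and is pruned unexamined.
    survivors = [s for s in candidates if fd1(freq_vector(s), fq) <= r]
    return sorted(survivors, key=lambda s: edit_distance(query, s))[:k]


data = ["AATGATAG", "ACTTAGC", "ACTTAGG", "TTTTTTT", "CCCCGGGG"]
print(knn("ACTTAGC", data, k=2))  # ['ACTTAGC', 'ACTTAGG']
```

The result is exact because at least k candidates lie within edit distance r, so every true nearest neighbour has frequency distance at most r and survives the filter.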
33
33 Outline Motivation & background Our contribution Experimental results Conclusion
34
34 Experimental Settings
w = {128, 256, 512, 1024}.
Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22.
Plotted results are from the chr18 dataset.
Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.
An NFA-based technique [BYN99] is implemented for comparison.
35
35 Experimental Results 1: Effect of Box Capacity (10-NN)
36
36 Experimental Results 2: Effect of Window Size (10-NN)
37
37 Experimental Results 3: k-NN queries
38
38 Experimental Results 4: Range Queries
39
39 Outline Motivation & background Our Contribution Experimental results Discussion & conclusion
40
40 Discussion
In-memory (index size is 1-2% of the database size).
Lossless search.
3 to 45 times faster than the NFA technique for k-NN queries.
2 to 12 times faster than the NFA technique for range queries.
Can be used to speed up any previously defined technique.
41
41 Future Work
Extend to weighted edit distance and affine gaps.
Extend to local similarity (substring/substring) search.
Compare the quality of answers and the speed with BLAST (lossy search).
Use as a preprocessing step for BLAST.
Apply the MRS index structure to larger alphabets (e.g. protein sequences).
42
42 Related Work
Lossless search
Online:
[Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius.
[WM92] (Wu, Manber) binary masks, O(rn).
[BYN99] (Baeza-Yates, Navarro) NFA.
Offline (index based):
[Mye94] (Myers) condensed r-neighborhood.
[BYN97] (Baeza-Yates, Navarro) dictionary.
Lossy search
[AG90] (Altschul, Gish) BLAST.
FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
[GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.
43
43 Related Work (Similar problems)
[BYP92] (Baeza-Yates, Perleberg) only replace is allowed.
[Gus97] (Gusfield) exact matching, suffix trees.
[JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.
44
44 THANK YOU
45
45 Frequency Distance to an MBR
[Figure labels: f(q), f(s), FD(f(q), f(s)); f(q), B, FD(f(q), B)]
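One simple way to measure the frequency distance from a query vector to a box B is sketched below. It is an assumption made for illustration, not necessarily the formula used by the authors: it charges only the amount by which each entry of f(q) lies outside the box, so it never exceeds the frequency distance to any vector inside the box. The name `fd_to_box` and the (lo, hi) box representation are hypothetical.

```python
def fd_to_box(fq: list[int], lo: list[int], hi: list[int]) -> int:
    """Lower bound on the frequency distance from fq to any vector v
    with lo <= v <= hi componentwise."""
    above = sum(q - h for q, h in zip(fq, hi) if q > h)  # excess over the box
    below = sum(l - q for q, l in zip(fq, lo) if q < l)  # shortfall under the box
    return max(above, below)


lo, hi = [1, 0, 1, 1], [2, 0, 1, 2]
print(fd_to_box([2, 2, 1, 2], lo, hi))  # 2: only the C count lies outside the box
print(fd_to_box([2, 0, 1, 1], lo, hi))  # 0: the vector is inside the box
```

A box whose bound already exceeds the query radius can be skipped without examining the strings it covers.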