
1 An Efficient Index Structure for String Databases. Tamer Kahveci, Ambuj K. Singh. Presented by Atul Ugalmugale and Nikita Rasam.

2 Issue: Quickly find all substrings in a large database that are similar to a given query string, using a small index structure. Many applications store, search, and analyze long sequences of discrete characters, which we call "strings". There is a frequent need to find similarities in genetic data, web data, and event sequences.

3 Applications. Information Retrieval: a typical application is text searching; given a large collection of documents and some text keywords, we want to find the documents that contain these keywords. Approximate matching also helps when searching keywords on the web: by "mtallica" we usually mean "metallica".

4 Computational Biology: the problem is similar here; we have a long DNA sequence and we want to find subsequences in it that approximately match a query sequence. …ATGCATACGATCGATT… …TGCAATGGCTTAGCTA… Animal species from the same family are bound to have more similar DNA.

5 Video data can be viewed as an event sequence if some pre-specified set of events is detected and stored as a sequence. Searching for similar event subsequences can then be used to find related video segments.

6 String search algorithms proposed so far are in-memory algorithms that scan the whole database for each query. The size of the string database grows faster than the available memory capacity, and extensive memory requirements make these search techniques impractical. They suffer from disk I/Os when the database is too large, and their performance deteriorates for long query patterns.

7 Similarity Metrics. The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the edit distance ED(s1, s2). Edit operations: Insert, Delete, Replace.

8 Suppose we have two strings x and y, e.g. x = kitten and y = sitting, and we want to transform x into y. A closer look: k i t t e n vs. s i t t i n g. 1st step: kitten → sitten (Replace). 2nd step: sitten → sittin (Replace). 3rd step: sittin → sitting (Insert). What is the edit distance between "survey" and "surgery"? survey → surgey (replace, +1) → surgery (insert, +1). Edit distance = 2.

9 In the general version of edit distance, different operations may have different costs, or the costs may depend on the characters involved. For example, replacement could be more expensive than insertion, or replacing "a" with "o" could be less expensive than replacing "a" with "k". This is called weighted edit distance.

10 Global Alignment. The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2. Given two strings s1 and s2, their global alignment is obtained by inserting spaces into s1 or s2 (including at the ends) so that the two strings are of the same length, and then writing them one against the other. Example: qacdbd and qawdb align as qac_dbd over qa_wdb_. Edits and alignments are dual: a sequence of edits can be converted into a global alignment, and an alignment can be converted into a sequence of edits.

11 Local Alignment. Given two strings X and Y, find substrings x of X and y of Y such that their alignment score (in the global sense) is maximum over all pairs of such substrings (empty substrings are allowed). Scoring: S(x, y) = +2 if x = y; -2 if x != y; -1 if x = '_' or y = '_'. Example: X = pqraxabcstvq, Y = yxaxbacsll; x = axabcs and y = axbacs align as a x a b _ c s over a x _ b a c s, scoring +2+2-1+2-1+2+2 = +8.

12 String Matching Problem. Whole matching: find the edit distance ED(q, s) between a data string s and a query string q. Substring matching: consider all substrings s[i:j] of s that are close to the query string. Two types of queries: a range query seeks all substrings of s that are within an edit distance of r to a given query q; a k-nearest neighbor query seeks the k closest substrings of s to q.

13 Challenges in solving the substring matching problem: finding the edit distance is very costly in both time and space; the strings in the database may be very long; the database size for most applications grows exponentially. New approach to overcome these challenges: define a lower-bound distance for substring searching, improve this lower bound by using the idea of wavelet transformation, and use the MRS index structure based on the aforementioned distance formulations.

14 A dynamic programming algorithm for computing the edit distance Problem: find the edit distance between strings x and y. Create a (|x|+1)×(|y|+1) matrix C, where Ci,j represents the minimum number of operations to match x1..i with y1..j. The matrix is constructed as follows. Ci,0 = I C0,j = j Ci,j = min{(Ci-1,j-1)+cost, replace (Ci,j-1)+1,insert (Ci-1,j)+1}delete cost = 0 if xi=yi, else 1
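The recurrence above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not the paper's):

```python
def edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    # C[i][j] = minimum edits to transform x[:i] into y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                       # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j                       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,  # replace (or match)
                          C[i][j - 1] + 1,         # insert
                          C[i - 1][j] + 1)         # delete
    return C[m][n]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("survey", "surgery"))  # 2
```

The two printed values match the worked examples on slide 8.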

15 How do we perform substring search? The same dynamic programming algorithm can be used to find the most similar substrings of a query string q. The difference is that we set C[0,j] = 0 for all j, since any text position could be the potential start of a match. If the similarity distance bound is k, we report all positions j where C[m,j] ≤ k (m is the last row, m = |q|).
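The substring variant can be sketched by keeping one column of C and reporting end positions where the last entry is at most k (names and 1-based end positions are our illustrative choices):

```python
def substring_search(q: str, s: str, k: int):
    """End positions j in s (1-based) such that some substring of s
    ending at j is within edit distance k of q."""
    m = len(q)
    C = list(range(m + 1))       # DP column for the empty text prefix
    ends = []
    for j in range(1, len(s) + 1):
        prev_diag = C[0]
        C[0] = 0                 # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if q[i - 1] == s[j - 1] else 1
            cur = min(prev_diag + cost,  # replace / match
                      C[i - 1] + 1,      # insert
                      C[i] + 1)          # delete
            prev_diag, C[i] = C[i], cur
        if C[m] <= k:
            ends.append(j)       # report: last row entry is within the bound
    return ends

print(substring_search("abc", "xxabcxx", 0))  # [5]
```

With k = 0 this degenerates to exact matching; larger k admits approximate substring matches.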

16 Frequency Vector. Let s be a string over the alphabet Σ = {α1, ..., α|Σ|}, and let n_i be the number of occurrences of the character α_i in s for 1 ≤ i ≤ |Σ|. The frequency vector of s is f(s) = [n1, ..., n|Σ|]. Example: s = AATGATAG, f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]. If f(s) = [v1, ..., v|Σ|], then Σ_i v_i = |s|. An edit operation on s has one of the following effects on f(s), for 1 ≤ i, j ≤ |Σ| and i ≠ j: v_i := v_i + 1; v_i := v_i - 1; or v_i := v_i + 1 and v_j := v_j - 1.

17 Effect of Edit Operations on the Frequency Vector. Delete decreases an entry by 1; Insert increases an entry by 1; Replace = Insert + Delete. Example: s = AATGATAG ⇒ f(s) = [4, 0, 2, 2]; (delete G) s = AAT.ATAG ⇒ f(s) = [4, 0, 1, 2]; (insert C) s = AACTATAG ⇒ f(s) = [4, 1, 1, 2]; (replace A → C) s = ACCTATAG ⇒ f(s) = [3, 2, 1, 2].
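The frequency vector and the effect of each edit operation can be checked with a short Python sketch (alphabet order A, C, G, T as on the slides):

```python
from collections import Counter

ALPHABET = "ACGT"

def freq_vector(s: str):
    """f(s) = [n_A, n_C, n_G, n_T]: per-character occurrence counts."""
    counts = Counter(s)
    return [counts[a] for a in ALPHABET]

print(freq_vector("AATGATAG"))  # [4, 0, 2, 2]
print(freq_vector("AATATAG"))   # delete G:         [4, 0, 1, 2]
print(freq_vector("AACTATAG"))  # insert C:         [4, 1, 1, 2]
print(freq_vector("ACCTATAG"))  # replace A with C: [3, 2, 1, 2]
```

Each single edit changes the vector exactly as the slide describes: one entry by ±1 for insert/delete, and two entries for a replace.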

18 Frequency Distance. Let u and v be integer points in |Σ|-dimensional space. The frequency distance FD1(u, v) between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighbor point at each step. Let s1 and s2 be two strings over the alphabet Σ = {α1, ..., α|Σ|}; then FD1(f(s1), f(s2)) ≤ ED(s1, s2).

19 An Approximation to ED: Frequency Distance (FD1). FD1(f(s1), f(s2)) = max{pos, neg}, and FD1(f(s1), f(s2)) ≤ ED(s1, s2). Example: s = AATGATAG ⇒ f(s) = [4, 0, 2, 2]; q = ACTTAGC ⇒ f(q) = [2, 2, 1, 2]. pos = (4-2) + (2-1) = 3; neg = (2-0) = 2; so FD1(f(s), f(q)) = 3, while ED(q, s) = 4.

20 Frequency Distance Calculation. /* u and v are |Σ|-dimensional integer points */ Algorithm FD1(u, v): posDist := negDist := 0; for i := 1 to |Σ|: if u[i] > v[i] then posDist := posDist + (u[i] - v[i]) else negDist := negDist + (v[i] - u[i]); return FD1(u, v) = max{ posDist, negDist }.
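The same algorithm in Python, using the pos/neg difference sums from the FD1 example on slide 19:

```python
def fd1(u, v):
    """Frequency distance FD1(u, v) = max(posDist, negDist); a lower
    bound on the edit distance of the underlying strings."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos_dist, neg_dist)

# f(AATGATAG) vs. f(ACTTAGC), as on slide 19
print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3  (<= ED(q, s) = 4)
```

Because every edit operation moves the frequency vector by at most one step, FD1 never exceeds ED.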

21 Wavelet Vector Computation. Let s = c1 c2 ... cn be a string over the alphabet Σ = {α1, ..., α|Σ|}. The k-th level wavelet transformation ψk(s) of s, 0 ≤ k ≤ log2 n, is defined as ψk(s) = [v_{k,0}, ..., v_{k,(n/2^k)-1}], where v_{k,i} = [A_{k,i}, B_{k,i}] with: A_{k,i} = f(c_i) for k = 0, and A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1} for 0 < k ≤ log2 n; B_{k,i} = 0 for k = 0, and B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1} for 0 < k ≤ log2 n, where 0 ≤ i ≤ (n/2^k) - 1.

22 Using Local Information: Wavelet Decomposition of Strings. s = AATGATAC ⇒ f(s) = [4, 1, 1, 2]. Split s = AATG + ATAC = s1 + s2, with f(s1) = [2, 0, 1, 1] and f(s2) = [2, 1, 0, 1]. The first wavelet coefficient is f(s1) + f(s2) = [4, 1, 1, 2]; the second wavelet coefficient is f(s1) - f(s2) = [0, -1, 1, 0].

23 Wavelet Decomposition of a String: General Idea. A_{i,j} = f(s(j2^i : (j+1)2^i - 1)) is the frequency vector of the j-th window of length 2^i (the first wavelet coefficient), and B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1} is the difference of the frequency vectors of its two halves (the second wavelet coefficient).

24 Wavelet Transformation: Example. s = TCAC, n = |s| = 4, alphabet {a, c, t}. ψ0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}] = [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3})] = [(f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0)] = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)]. ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]. ψ2(s) = [([1,2,1], [-1,0,1])]. In each pair, the first component is the first wavelet coefficient and the second is the second wavelet coefficient.
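The example above can be reproduced with a short sketch (alphabet order a, c, t as on this slide; |s| is assumed to be a power of two, and the function name is ours):

```python
def wavelet_transform(s, alphabet):
    """All levels of the wavelet transformation of s.  Level k holds
    pairs (A, B): A is the frequency vector of a window of length 2^k,
    and B is the difference of its two halves' frequency vectors."""
    A = [[1 if c == a else 0 for a in alphabet] for c in s]  # level-0 A's
    B = [[0] * len(alphabet) for _ in s]                     # level-0 B's = 0
    levels = [list(zip(A, B))]
    while len(A) > 1:
        A_next, B_next = [], []
        for i in range(0, len(A), 2):
            A_next.append([x + y for x, y in zip(A[i], A[i + 1])])
            B_next.append([x - y for x, y in zip(A[i], A[i + 1])])
        A, B = A_next, B_next
        levels.append(list(zip(A, B)))
    return levels

lv = wavelet_transform("TCAC", "ACT")
print(lv[2])  # [([1, 2, 1], [-1, 0, 1])]
```

Levels 1 and 2 of the result match ψ1(s) and ψ2(s) on this slide.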

25 Wavelet Distance Calculation

26 Maximum Frequency Distance Calculation. FD(s1, s2) = max{ FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }, where FD1 is the frequency distance and FD2 is the wavelet distance.

27-32 MRS-Index Structure Creation (figures). A window of length w = 2^a is slid along the string s1 and each window is transformed; after sliding c times (c = box capacity), the c transformed points are grouped into a box T_{a,1}, and the process continues with the next c windows.

33 Using Different Resolutions (figure). The same construction is repeated with window sizes w = 2^a, w = 2^{a+1}, ..., producing boxes T_{a,1}, T_{a+1,1}, ..., one row per resolution.

34 MRS-Index Structure (figure).

35 MRS-index properties. Relative MBR volume (precision) decreases when c increases or when w decreases. MBRs are highly clustered.

36 Frequency Distance to an MBR. Let q be a query string of length 2^i, where a ≤ i ≤ a + l - 1. Given an MBR B, we define FD(q, B) = min over s in B of FD(q, s).
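One way to evaluate FD to an MBR is to clamp the query's frequency vector into the box and take FD1 to the clamped point; this is our sketch, under the assumption (consistent with FD1's definition, since clamping shrinks every coordinate difference) that the coordinate-wise nearest point of B attains the minimum:

```python
def fd1(u, v):
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos_dist, neg_dist)

def fd_to_mbr(q_vec, lo, hi):
    """FD from a query's frequency vector to the MBR [lo, hi], where
    lo and hi are the box's lower and upper corner points."""
    nearest = [min(max(x, l), h) for x, l, h in zip(q_vec, lo, hi)]
    return fd1(q_vec, nearest)

print(fd_to_mbr([4, 0, 2, 2], [1, 1, 0, 0], [3, 3, 1, 1]))  # 3
```

A query vector that already lies inside the box gets distance 0, so FD(q, B) can safely be used to prune boxes during search.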

37 Range Search Algorithm

38 Range Queries. 1. Partition the query string q into subqueries q1, q2, q3, ... at the various resolutions (w = 2^4, 2^5, 2^6, 2^7, ...) available in the index. 2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε. 3. Read the disk pages corresponding to the last result set and postprocess them to eliminate false retrievals.

39 K-Nearest Neighbor Algorithm

40-42 k-Nearest Neighbor Query (figures, k = 3) [KSF+96, SK98].

43 k-Nearest Neighbor Query (figure, k = 3): r = edit distance to the 3rd closest substring.

44 Experimental Settings. w = {128, 256, 512, 1024}. Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22 (plotted results are from the chr18 dataset). Queries are selected randomly from the data set with 512 ≤ |q| ≤ 10000. An NFA-based technique [BYN99] is implemented for comparison.

45 Experimental Results 1: Effect of Box Capacity (10-NN). The cost of the MRS-index increases as the box capacity increases, but remains much lower than that of the NFA technique for all of these box capacities. Although using two wavelet coefficients slightly improves performance for the same box capacity, it doubles the size of the index structure; for the same amount of memory, the single-coefficient version performs better.

46 Experimental Results 2: Effect of Window Size (10-NN). The MRS-index structure outperforms the NFA technique for all window sizes, and its performance improves as the window size increases.

47 Experimental Results 3: k-NN Queries. Although the performance of the MRS-index structure drops for large values of k, it still performs better than the NFA technique. Speedups of up to 45 were achieved for 10 nearest neighbors; the speedup for 200 nearest neighbors is 3. As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.

48 Experimental Results 4: Range Queries. The MRS-index structure performed up to 12 times faster than the NFA technique. Its performance improved when the queries were selected from different data strings, because DNA strings have high self-similarity. The performance of the MRS-index structure deteriorates as the error rate increases, because the size of the candidate set grows with the error rate.

49 Discussion. In-memory (the index size is 1-2% of the database size). Lossless search. 3 to 45 times faster than the NFA technique for k-NN queries, and 2 to 12 times faster for range queries. Can be used to speed up any previously defined technique.

50 50 THANK YOU

