Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics.

1 Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.

2 BMI 731 - Winter '04. Overview: applications of queries; background on queries; current problem; solutions and our solution; comparison experiments and results; future work

3 Queries in general
– We need a metric distance function to measure the (dis)similarity between objects
– Dynamic programming algorithm: O(|string1| * |string2|) time and space, i.e. O(n^2), where n is the length of the strings
– Especially bad for genetic sequence queries, where the sequences are long

4 2 kinds of queries: ε-range queries
– Retrieve all objects whose similarity to the query exceeds a certain degree, i.e. whose distance to the query is at most ε

5 2 kinds of queries: k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
– No domain knowledge necessary
– Ex: 4-NN

6 2 kinds of queries: ε-range queries
– Require domain knowledge: data distribution and distance definition
– ε too small: none returned

7 2 kinds of queries: ε-range queries
– ε too large: all returned
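The two query types can be sketched as a linear scan over a small collection. This is a minimal illustration, not the presenters' implementation: the names `range_query`, `knn_query` and `abs_diff`, the integer data and the absolute-difference distance are all assumptions chosen to keep the example short.

```python
def range_query(db, q, eps, dist):
    """Return every object whose distance to q is at most eps."""
    return [x for x in db if dist(x, q) <= eps]


def knn_query(db, q, k, dist):
    """Return the k objects closest to q (ties keep database order)."""
    return sorted(db, key=lambda x: dist(x, q))[:k]


def abs_diff(a, b):
    # Hypothetical stand-in for a metric distance function.
    return abs(a - b)


db = [1, 5, 9, 14]
print(range_query(db, 6, 0.5, abs_diff))  # eps too small: nothing is returned
print(range_query(db, 6, 100, abs_diff))  # eps too large: everything is returned
print(knn_query(db, 6, 2, abs_diff))      # k-NN needs no threshold, only k
```

Choosing a useful eps requires knowing how distances are distributed in the data, whereas k can be chosen without that knowledge.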

8 Measuring similarity
– We need a metric distance function to measure the (dis)similarity between objects
– Edit distance (ED): three kinds of operations: insert (I), delete (D), replace (R)
– Example: ACTTAGC to AATGATAG
    A C T - - T A G C
      R   I I       D    → ED = 4
    A A T G A T A G -
– Dynamic programming algorithm: O(mn) time and space

9 DPA

10 DPA 2
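The dynamic programming algorithm (DPA) for edit distance can be sketched as follows. This is the standard textbook recurrence with unit costs for insert, delete and replace, written under the assumption that it matches what the DPA slides showed; the function name `edit_distance` is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance using a full (m+1) x (n+1) table: O(mn) time and space."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete s[i-1]
                dp[i][j - 1] + 1,         # insert t[j-1]
                dp[i - 1][j - 1] + cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The quadratic table is what makes exact comparison expensive for long sequences, which motivates the filtering approach in the following slides.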

11 String/Genome Data
– Queries ask for the substrings in the database that are most similar to the given string
– BLAST has ε-range queries
– Naïve search (linear scan): scalability problems
– How to handle size: use partial information rather than the whole database
– Approximate the string data (compress) → may fit in memory → may be used for indexing, clustering

12 How to Handle Size: 3 approaches to make use of compressed data
1. Prune irrelevant data; I/O only for non-pruned entries → calculate exact values for the non-pruned entries (especially ε-range queries)
2. Get approximate answers with virtually no I/O (I/O only for answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries

13 Overview: background on queries; current problem; transformation and indexing; comparison experiments and results; future work

14 Big Picture: general approach step by step
– Transform (large) string data into (hopefully smaller) multi-dimensional vectors (preprocessing)
– Develop a distance function df in the vector space to approximate the string similarity (preprocessing)
– Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)
– Implement one of the three approaches mentioned (query)

15 Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation into vector space: overlapping windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function

16 Using the index
1. Transformation: query sequence → vector
2a. Approximate query (k-NN or ε-range) on the index of vectors → done
2b. Exact query (k-NN or ε-range) on the index of vectors → candidate set
– The vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives
(continued →)

17 Using the index (continued)
3. Refine: I/O for the strings represented by the vectors in the candidate set; calculate ED for each of them (remove false positives)
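The filter-and-refine path (steps 1, 2b and 3) can be sketched for an ε-range query as below. This is a simplified illustration under several assumptions: whole strings stand in for windows, a linear scan over the vectors stands in for the multi-dimensional index, the filter uses single-character frequencies, and the names `fv1`, `fd1`, `edit_distance` and `filter_and_refine` are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def fd1(u, v):
    """Frequency distance on 1-gram vectors; a lower bound on edit distance."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s, t):
    """Unit-cost edit distance, keeping only two rows of the table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def filter_and_refine(db, query, eps):
    qv = fv1(query)                      # step 1: transform the query
    vectors = [fv1(s) for s in db]       # normally precomputed and indexed
    candidates = [s for s, v in zip(db, vectors) if fd1(v, qv) <= eps]   # step 2b
    answers = [s for s in candidates if edit_distance(s, query) <= eps]  # step 3
    return candidates, answers


db = ["ACTTAGC", "AATGATAG", "AATGATAC", "CCCCCCCC", "GATAAATG"]
candidates, answers = filter_and_refine(db, "AATGATAG", 1)
print(candidates)  # survivors of the filter, including one false positive
print(answers)     # exact answers after computing ED on the candidates only
```

Because the filter distance never exceeds the edit distance, pruning cannot discard a true answer; it can only let false positives such as `GATAAATG` through to the refinement step.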

18 1st step: partitioning into overlapping windows
– AACCGGTTACGTACGT…
– e.g. W = 6, shift = 2
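The windowing step can be sketched as a sliding slice. The function name `overlapping_windows` and the choice to drop a trailing partial window are assumptions; the input is only the prefix of the sequence shown on the slide.

```python
def overlapping_windows(seq, w, shift):
    """Return all length-w windows of seq whose start positions are multiples of shift."""
    # A trailing piece shorter than w is dropped (an assumption of this sketch).
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, shift)]


for window in overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2):
    print(window)
```

With W = 6 and a shift of 2, consecutive windows share four characters, so every substring of the database is covered by some window up to a small offset.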

19 2nd step: mapping windows into vector space
– Choose a tuple size k
– Associate an integer with each of the 4^k k-tuples
– The frequencies of those k-tuples form the vector
– If k = 2 → 4^k = 16 k-tuples: AA, AC, AG, AT, CA, CC, CG, CT, TA, TC, TG, TT, GA, GC, GG, GT

20 Example mapping
– The integers assigned: AA=0, AC=1, AG=2, AT=3, CA=4, CC=5, CG=6, CT=7, TA=8, TC=9, TG=10, TT=11, GA=12, GC=13, GG=14, GT=15
– Assume window AACCGG: AA, AC, CC, CG, GG all occur once
– 1100011000000010 is the matching vector (positions 0, 1, 5, 6 and 14 are set)
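The mapping of a window to its 2-tuple frequency vector can be sketched as below. The code uses the integer assignment listed on this slide, which places the T-prefixed tuples before the G-prefixed ones; the names `TUPLES`, `INDEX` and `fv2` are illustrative.

```python
# 2-tuples in the order of the integers assigned on the slide (AA=0 ... GT=15).
TUPLES = ["AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
          "TA", "TC", "TG", "TT", "GA", "GC", "GG", "GT"]
INDEX = {t: i for i, t in enumerate(TUPLES)}


def fv2(window):
    """Count how often each 2-tuple occurs in the window."""
    vec = [0] * len(TUPLES)
    for i in range(len(window) - 1):
        vec[INDEX[window[i:i + 2]]] += 1
    return vec


vector = fv2("AACCGG")
print("".join(str(c) for c in vector))
```

A window of length W contains W - 1 overlapping 2-tuples, so the entries of the vector always sum to W - 1.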

21 Different transformations & distance functions
– Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T): FV1
– 2 → 16 (frequencies of 2-tuples): FV2

22 Different transformations & distance functions 2
– WVn transformation
  – Split the string into halves x, y
  – Compute the FVn of x and y → FVx, FVy
  – Concatenate their sum and difference: [FVx + FVy, FVx − FVy]
– Wavelet 1 on example: TCACTTAG
  – 1st: divide into halves and find the FV1 transformation: x: TCAC → 1 2 0 1, y: TTAG → 1 0 1 2
  – 2nd: add and subtract: 2 2 1 3 0 2 −1 −1 → WV1
– Same operations on 2-tuples → WV2
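The WV1 transformation on the example can be sketched as follows. The names `fv1` and `wv1` are illustrative, and splitting at `len(s) // 2` is an assumption for strings of odd length, which the slide does not cover.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def wv1(s):
    """Concatenate the sum and the difference of the two halves' frequency vectors."""
    half = len(s) // 2
    x, y = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return sums + diffs


print(wv1("TCACTTAG"))  # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first half of the result equals the FV1 of the whole string, while the second half records how the characters are distributed between the two halves.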

23 Distance functions on the vector spaces
– All of them are proved to be lower bounds of the edit distance
– FD1 → distance on FV1
– FD2 → distance on FV2
– WD1 → distance on WV1
– WD2 → distance on WV2

24 Frequency distance FDn
Algorithm: FDn(n-gram frequencies u, v)
– posDist := negDist := 0
– For all dimensions u_i, v_i: if u_i > v_i then posDist += u_i − v_i, else negDist += v_i − u_i
– Return max(posDist, negDist) / n
Example (n = 1)
– u: ACTTAGC → 2, 2, 1, 2
– v: AATGATAG → 4, 0, 2, 2
– 2 − 4 < 0: negDist += |2 − 4|
– 2 − 0 > 0: posDist += |2 − 0|
– 1 − 2 < 0: negDist += |1 − 2|
– 2 − 2 = 0
– posDist: 2, negDist: 3 → FD1 is 3
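The frequency distance can be sketched directly from the algorithm on this slide, accumulating the positive and negative differences. The helper `ngram_frequencies` and its lexicographic ordering of n-grams are assumptions; the distance itself does not depend on the order of the dimensions.

```python
from collections import Counter
from itertools import product


def ngram_frequencies(s, n, alphabet="ACGT"):
    """Frequency of every n-gram over the alphabet, in lexicographic order."""
    grams = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    return [counts[g] for g in grams]


def frequency_distance(u, v, n):
    """max(total surplus of u over v, total surplus of v over u) / n."""
    pos_dist = neg_dist = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi
        else:
            neg_dist += vi - ui
    return max(pos_dist, neg_dist) / n


u = ngram_frequencies("ACTTAGC", 1)   # [2, 2, 1, 2]
v = ngram_frequencies("AATGATAG", 1)  # [4, 0, 2, 2]
print(frequency_distance(u, v, 1))    # 3.0
```

The result 3 stays below the edit distance of 4 for the same pair of strings, as a lower bound must.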

25 FDn: why a lower bound?
– On the example: we need to increase A by 2 and G by 1 → 3, and to decrease C by 2
– We may “increase + decrease” in a single operation if we can replace (back to slide #8)
– So in the best case the edit distance is only FD1
– But that may not be the case: more operations may be needed because of mismatched locations…
– The division by n is because a change in one character updates the frequency of n n-grams

26 Wavelet distance WDn
Algorithm: WDn(n-gram frequency wavelets u, v)
– Find posDist and negDist on u, v
– m := min(posDist, negDist)
– d := |posDist − negDist| / 2
– If m < d, return d / n; else return (d + (m − d)/2) / n
Example (n = 1)
– u: ACTC TAGC → 1 2 0 1, 1 1 1 1 → 2 3 1 2 0 1 −1 0
– v: AATG ATAG → 2 0 1 1, 2 0 1 1 → 4 0 2 2 0 0 0 0
– posDist: 3 + 1 = 4, negDist: 2 + 1 + 1 = 4
– m: 4, d: 0 → (0 + 4/2)/1 → return 2
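The wavelet distance for n = 1 can be sketched as below. Taking the absolute value of posDist − negDist when computing d is an assumption of this sketch (it keeps the distance symmetric in u and v); the names `fv1`, `wv1` and `wavelet_distance` are illustrative.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def wv1(s):
    """Sum and difference of the frequency vectors of the two halves of s."""
    half = len(s) // 2
    x, y = fv1(s[:half]), fv1(s[half:])
    return [a + b for a, b in zip(x, y)] + [a - b for a, b in zip(x, y)]


def wavelet_distance(u, v, n):
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    m = min(pos_dist, neg_dist)
    d = abs(pos_dist - neg_dist) / 2  # absolute value assumed here
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = wv1("ACTCTAGC")  # [2, 3, 1, 2, 0, 1, -1, 0]
v = wv1("AATGATAG")  # [4, 0, 2, 2, 0, 0, 0, 0]
print(wavelet_distance(u, v, 1))  # 2.0
```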

27 WDn: why a lower bound?
– Assume a string transformed into the wavelet [a_1, a_2, …, b_1, b_2, …]
– Largest change: posDist += 3, negDist −= 1, or vice versa
– So use this change whenever posDist ≠ negDist

28 Overview: background on queries; current problem; transformation and indexing; comparison experiments and results; future work

29 Experiment design
– Implemented the transformations and distance functions
– Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
– Ran queries with different parameters
  – Varying string size W and shift amount
  – Some containing an exact match, some not
  – For ε-range queries, different ε values
  – For k-NN queries, different k values

32 Sorted graphs
– To depict why our distance functions perform so well in k-NN
– Imitate what our k-NN approximation does, and graph the result: it sorts the data values in increasing order and takes the k nearest ones
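The k-NN approximation described here, sorting by the approximate distance and keeping the first k, can be sketched as follows. This is an illustration only: it uses the single-character frequency distance as the approximate distance, whole strings instead of windows, and the hypothetical names `fv1`, `fd1` and `approximate_knn`.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def fd1(u, v):
    """Frequency distance on 1-gram vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def approximate_knn(db, query, k):
    """Sort by the vector-space distance and keep the k nearest; no ED is computed."""
    qv = fv1(query)
    return sorted(db, key=lambda s: fd1(fv1(s), qv))[:k]


db = ["ACTTAGC", "AATGATAG", "AATGATAC", "CCCCCCCC"]
print(approximate_knn(db, "AATGATAG", 2))
```

The answer is exact only when the ordering induced by the approximate distance agrees with the ordering induced by edit distance, which is why a low variance of their ratio matters.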

33 [Figure; panels: 20 nearest, 50 nearest]

34 [Figure; panels: 20 nearest, 50 nearest]

35 Nature of the distance functions
– WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to the edit distance is much lower than that of the others, which is what you would like from a distance function

38 Results
– Tested the parameters obtained from these random experiments on real data
– Then also did the parameter extraction using real data

39 Comparison of index structures

40 Future work
– Check the applicability of these methods to other kinds of sequence data: text, image search
– Implement the index structure in the standalone program and evaluate its performance

