Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp Washington, DC. March 2003.

BMI Winter'042 Overview
Applications of queries
Background on queries
Current problem
Solutions and our solution
Comparison experiments and results
Future work

BMI Winter'043 Queries in general
We need a metric distance function
  –To measure the (dis)similarity between objects
Dynamic programming algorithm
  –O(|string1| * |string2|) time and space, i.e. O(n^2) where n is the length of the strings
  –Especially bad for genetic sequence queries, where the sequences are long

BMI Winter'044 2 kinds of queries
ε-range queries
  –Retrieve all objects within distance ε of the query

BMI Winter'045 2 kinds of queries
k-nearest neighbor (k-NN) queries
  –Retrieve the k most similar objects
No domain knowledge necessary
Ex: 4-NN

BMI Winter'046 2 kinds of queries
ε-range queries
Requires domain knowledge
  –Data distribution and distance definition
ε too small: none returned

BMI Winter'047 2 kinds of queries
ε-range queries
ε too large: all returned

BMI Winter'048 Measuring similarity
We need a metric distance function
  –To measure the (dis)similarity between objects
Edit Distance (ED)
  –Three kinds of operations: insert (I), delete (D), replace (R)
  –Example: ACTTAGC to AATGATAG
      A C T - - T A G C
        R   I I       D     ED = 4
      A A T G A T A G -
  –Dynamic programming algorithm
  –O(mn) time and space
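The dynamic programming algorithm named on this slide can be sketched as follows. This is a minimal illustration assuming unit cost for each insert, delete and replace; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) in O(mn) time and space."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between the prefixes s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                 # delete s[i-1]
                dp[i][j - 1] + 1,                 # insert t[j-1]
                dp[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return dp[m][n]


# The slide's example pair has edit distance 4.
print(edit_distance("ACTTAGC", "AATGATAG"))
```

The full table is kept for clarity; only the previous row is needed if the alignment itself is not required.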

BMI Winter'049 DPA

BMI Winter'0410 DPA 2

BMI Winter'0411 String/Genome Data
Queries ask for the substrings in the database that are most similar to a given string.
BLAST has ε-range queries
  –Naïve search (linear scan)
  –Scalability problems
How to Handle Size
  –Use partial information rather than the whole database
Approximate the string data (compress)
  → may fit in memory
  → may be used for indexing, clustering

BMI Winter'0412 How to Handle Size
3 approaches to make use of compressed data
1. Prune irrelevant data, I/O for non-pruned entries → calculate exact values for the non-pruned entries (especially ε-range queries)
2. Get approximate answers, virtually no I/O (I/O only for answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries

BMI Winter'0413 Overview
Background on queries
Current problem
Transformation and Indexing
Comparison experiments and results
Future work

BMI Winter'0414 Big Picture
General approach, step by step
Preprocessing
  –Transform (large) string data into (hopefully smaller) multi-dimensional vectors
  –Develop a distance function df in the vector space to approximate string similarity
  –Build a multi-dimensional index on top of the multi-dimensional vectors
Query
  –Implement one of the three approaches mentioned

BMI Winter'0415 Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation into vector space: overlapping windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function

BMI Winter'0416 Using the index
1. Transformation: query sequence → vector
2a. Approximate query (k-NN or ε-range) on the index of vectors → done
2b. Exact query (k-NN or ε-range) on the index of vectors → candidate set
The vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives
(Continued)

BMI Winter'0417 Using the index
3. Refine the candidate set
  –I/O for the strings represented by those vectors
  –Calculate ED for each of them (remove false positives)
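The filter-and-refine flow on these two slides can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: a linear scan over precomputed vectors stands in for the multi-dimensional index, single-character frequency vectors with the frequency distance serve as the lower-bounding filter, and all function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def fd1(u, v):
    """Frequency distance on single-character frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s, t):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def range_query(strings, query, eps):
    """Return (candidates, results) for an epsilon-range query."""
    q_vec = fv1(query)
    # Preprocessing step; a real system would store these vectors in an index.
    vectors = [fv1(s) for s in strings]
    # Filter: a lower bound above eps proves the edit distance is above eps.
    candidates = [s for s, v in zip(strings, vectors) if fd1(q_vec, v) <= eps]
    # Refine: compute the exact edit distance only for the candidates.
    results = [s for s in candidates if edit_distance(query, s) <= eps]
    return candidates, results


db = ["ACTTAGC", "AATGATAG", "GGGGGGGG", "ACTTAGG", "CGATTCA"]
candidates, results = range_query(db, "ACTTAGC", 1)
print(candidates)  # includes the false positive "CGATTCA"
print(results)
```

Because the filter is a lower bound, pruning never discards a true answer; it only leaves false positives such as "CGATTCA", which has the same character counts as the query but a different order.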

BMI Winter'0418 1st Step: Partitioning into overlapping windows
AACCGGTTACGTACGT…
e.g. window size W = 6
e.g. shift = 2
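A short sketch of the windowing step, using the slide's example values (window size 6, shift 2). The function name `overlapping_windows` is an assumption for illustration, and only the visible prefix of the slide's sequence is used.

```python
def overlapping_windows(sequence, w, shift):
    """Return all length-w windows of sequence, starting every `shift` positions."""
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, shift)]


for window in overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2):
    print(window)
```

A smaller shift gives more windows (finer coverage, larger index); a shift equal to the window size gives non-overlapping windows.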

BMI Winter'0419 2nd Step: Mapping windows into vector space
Choose a tuple size k
Associate an integer with each of the 4^k k-tuples
The frequencies of those k-tuples form the vector
If k = 2 → 4^k = 16 k-tuples
AA, AC, AG, AT, CA, CC, CG, CT, TA, TC, TG, TT, GA, GC, GG, GT

BMI Winter'0420 Example Mapping
The integers assigned
AA=0, AC=1, AG=2, AT=3, CA=4, CC=5, CG=6, CT=7, TA=8, TC=9, TG=10, TT=11, GA=12, GC=13, GG=14, GT=15
Assume window AACCGG
AA, AC, CC, CG, GG all occur once
<1,1,0,0,0,1,1,0,0,0,0,0,0,0,1,0> is the matching vector.
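The mapping for k = 2 can be sketched as follows, using the integer assignment exactly as listed on the slide. The names `TUPLES`, `INDEX` and `fv2` are illustrative assumptions.

```python
# 2-tuples in the order the slide assigns integers 0..15.
TUPLES = "AA AC AG AT CA CC CG CT TA TC TG TT GA GC GG GT".split()
INDEX = {t: i for i, t in enumerate(TUPLES)}


def fv2(window):
    """Frequency vector of overlapping 2-tuples in a window."""
    vec = [0] * len(TUPLES)
    for i in range(len(window) - 1):
        vec[INDEX[window[i:i + 2]]] += 1
    return vec


# AA, AC, CC, CG and GG each occur once in the slide's example window.
print(fv2("AACCGG"))
```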

BMI Winter'0421 Different transformations & Distance Functions
Tuple size → transformation size
  –1 → 4 (frequencies of A, C, G, T): FV1
  –2 → 16 (frequencies of 2-tuples): FV2

BMI Winter'0422 Different transformations & Distance Functions 2
WVn transformation
  –Split the string into halves x, y
  –Compute FVn for x and y → FVx, FVy
  –Concatenate their sum and difference: [FVx + FVy, FVx - FVy]
Wavelet 1 on example
  –TCACTTAG
  –1st: divide into halves and find the FV1 transformation
      x: TCAC → <1,2,0,1>     y: TTAG → <1,0,1,2>
  –2nd: add and subtract
      WV1 = <2,2,1,3, 0,2,-1,-1>
Same operations on 2-tuples: WV2
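A minimal sketch of the WV1 transformation on the slide's example. The component order A, C, G, T follows the frequency examples elsewhere in the deck; how odd-length strings are split is not stated in the slides, so the choice below (extra character goes to the second half) is an assumption, as are the function names.

```python
from collections import Counter

ALPHABET = "ACGT"  # component order used in the deck's frequency examples


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def wv1(s):
    """Concatenate the sum and the difference of the two halves' FV1 vectors."""
    half = len(s) // 2  # assumption: an odd extra character goes to the second half
    x, y = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return sums + diffs


print(wv1("TCACTTAG"))
```

The first four components equal FV1 of the whole string, so WV1 keeps everything FV1 has and adds coarse positional information through the difference half.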

BMI Winter'0423 Distance Functions on the Vector Spaces
All of them are proved to be lower bounds on the edit distance
FD1 → distance on FV1
FD2 → distance on FV2
WD1 → distance on WV1
WD2 → distance on WV2

BMI Winter'0424 Frequency Distance FDn
Algorithm
FDn(n-gram frequency vectors u, v)
  posDist := negDist := 0
  for all dimensions u_i, v_i
    –if u_i > v_i then posDist += u_i - v_i
    –else negDist += v_i - u_i
  return max(posDist, negDist) / n
Example (n=1)
u: ACTTAGC → <2,2,1,2>
v: AATGATAG → <4,0,2,2>
  –2-4 < 0: negDist += |2-4|
  –2-0 > 0: posDist += |2-0|
  –1-2 < 0: negDist += |1-2|
  –2-2 = 0
posDist: 2, negDist: 3
FD1 is 3
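The FDn pseudocode can be written out as a short function. This sketch accumulates the per-dimension differences, as the slide's worked example does; the name `frequency_distance` is illustrative.

```python
def frequency_distance(u, v, n=1):
    """FD_n between two n-gram frequency vectors of equal dimension."""
    pos_dist = neg_dist = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi  # surplus of u over v in this dimension
        else:
            neg_dist += vi - ui  # surplus of v over u (zero when equal)
    return max(pos_dist, neg_dist) / n


u = [2, 2, 1, 2]  # ACTTAGC: counts of A, C, G, T
v = [4, 0, 2, 2]  # AATGATAG
print(frequency_distance(u, v))  # 3.0, matching FD1 = 3 on the slide
```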

BMI Winter'0425 FDn: why a lower bound?
On the example (transforming u into v)
  –need to increase A by 2 and G by 1 → 3
  –need to decrease C by 2
We may "increase + decrease" in a single operation if we can replace (back to slide #8)
So in the best case the edit distance is only FD1
But this may not be the case; you may need more operations because of mismatched locations…
The division by n is because a change in one character updates the frequencies of up to n n-grams.

BMI Winter'0426 Wavelet Distance WDn
Algorithm
WDn(n-gram frequency wavelets u, v)
  Find posDist and negDist on u, v
  m := min(posDist, negDist)
  d := |posDist - negDist| / 2
  if m < d
    –return d / n
  else
    –return (d + (m - d)/2) / n
Example (n=1)
u: ACTC TAGC → <2,3,1,2, 0,1,-1,0>
v: AATG ATAG → <4,0,2,2, 0,0,0,0>
posDist: 3+1 = 4, negDist: 2+1+1 = 4
m: 4, d: 0
(0 + 4/2)/1
Return 2
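The WDn pseudocode can be sketched as below, checked against the slide's example. Taking the absolute value of posDist - negDist is an assumption made here so that the result does not depend on argument order; the function name is illustrative.

```python
def wavelet_distance(u, v, n=1):
    """WD_n between two wavelet-transformed n-gram frequency vectors."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    m = min(pos_dist, neg_dist)
    d = abs(pos_dist - neg_dist) / 2  # assumption: absolute difference
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = [2, 3, 1, 2, 0, 1, -1, 0]  # WV1 of ACTC|TAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # WV1 of AATG|ATAG
print(wavelet_distance(u, v))  # 2.0, matching the slide's result
```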

BMI Winter'0427 WDn: why a lower bound?
Assume a string is transformed into the wavelet [a_1, …, b_1, …]
Largest change: posDist += 3 and negDist -= 1, or vice versa
  –So use this change whenever posDist ≠ negDist

BMI Winter'0428 Overview
Background on queries
Current problem
Transformation and Indexing
Comparison experiments and results
Future work

BMI Winter'0429 Experiment Design
Implemented the transformations and distance functions
Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
Ran queries with different parameters
  –Varying string size W and shift amount
  –Some containing an exact match, some not
  –For ε-range queries, different ε values
  –For k-NN queries, different k values

BMI Winter'0432 Sorted Graphs
To depict why our distance functions perform so well in k-NN
Imitate what our k-NN approximation does, and graph the result
  –It sorts the data values in increasing order and takes the k nearest ones
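The sort-and-take-k step described here can be sketched as follows. This is only an illustration: it ranks by the frequency distance on single-character frequencies, whereas the deck evaluates several distance functions, and a full sort stands in for an index-based search. All names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[c] for c in ALPHABET]


def fd1(u, v):
    """Frequency distance on single-character frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def approximate_knn(strings, query, k):
    """Rank strings by vector-space distance to the query and keep the k nearest."""
    q_vec = fv1(query)
    ranked = sorted(strings, key=lambda s: fd1(q_vec, fv1(s)))
    return ranked[:k]


db = ["AATGATAG", "GGGGGGGG", "ACTTAGG", "ACTTAGC"]
print(approximate_knn(db, "ACTTAGC", 2))
```

No edit distance is computed at all, which is why this approach needs I/O only for the returned answers; its quality depends on how closely the vector distance tracks the edit distance.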

BMI Winter'0433 [Figure: sorted graphs; panels "nearest" and "50 nearest"]

BMI Winter'0434 [Figure: sorted graphs; panels "nearest" and "50 nearest"]

BMI Winter'0435 Nature of the distance functions
WD2 performs very well in k-NN even though it does not prune as well
  –The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function

BMI Winter'0438 Results
Tested the parameters obtained from these random experiments on real data.
Then also did the parameter extraction using real data.

BMI Winter'0439 Comparison of index structures

BMI Winter'0440 Future Work
Check the applicability of these methods to other kinds of sequence data
  –Text
  –Image search
Implement the index structure in the standalone program and evaluate its performance