Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases
O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.

Overview
– Applications of queries
– Background on queries
– Current problem
– Solutions and our solution
– Comparison experiments and results
– Future work

Queries in general
– We need a metric distance function
  – To measure the (dis)similarity between objects
– Dynamic programming algorithm
  – O(|string1| * |string2|) time and space, i.e. O(n^2) where n is the length of the strings
  – Especially bad for genetic sequence queries, where the sequences are long

2 kinds of queries
– ε-range queries
  – Retrieve all objects whose similarity to the query exceeds a given degree

2 kinds of queries
– k-nearest neighbor (k-NN) queries
  – Retrieve the k most similar objects
  – No domain knowledge necessary
  – Ex: 4-NN

2 kinds of queries
– ε-range queries
  – Require domain knowledge: data distribution and distance definition
  – ε too small: none returned

2 kinds of queries
– ε-range queries
  – ε too large: all returned
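The two query types can be sketched as naive linear scans. This is only an illustration: the function names, the toy database, and the Hamming distance used as a stand-in metric are assumptions, and the range query is assumed to mean "distance at most ε".

```python
import heapq

def range_query(db, query, eps, dist):
    """Return every object whose distance to the query is at most eps."""
    return [obj for obj in db if dist(obj, query) <= eps]

def knn_query(db, query, k, dist):
    """Return the k objects with the smallest distance to the query."""
    return heapq.nsmallest(k, db, key=lambda obj: dist(obj, query))

def hamming(a, b):
    """Toy stand-in metric for equal-length strings (not the edit distance)."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy database of equal-length strings.
db = ["ACGT", "ACGA", "TTTT", "ACCA"]

print(range_query(db, "GGGG", 1, hamming))  # eps too small: nothing returned
print(range_query(db, "GGGG", 4, hamming))  # eps too large: everything returned
print(knn_query(db, "ACGT", 2, hamming))    # k-NN needs no threshold
```

The range query needs a well-chosen ε, which depends on the data distribution and the distance definition, while the k-NN query always returns exactly k objects.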

Measuring similarity
– We need a metric distance function
  – To measure the (dis)similarity between objects
– Edit Distance (ED)
  – Three kinds of operations: insert (I), delete (D), replace (R)
  – Example: ACTTAGC to AATGATAG, ED = 4

        A C T - - T A G C
          R   I I       D
        A A T G A T A G -

  – Dynamic programming algorithm: O(mn) time and space
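A minimal sketch of the textbook dynamic-programming edit distance, with unit cost for insert, delete, and replace. The function name is hypothetical; the table layout is the standard one and is not taken from the slides.

```python
def edit_distance(s, t):
    """Edit distance between s and t using an (m+1) x (n+1) table."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace or match
    return dp[m][n]

print(edit_distance("ACTTAGC", "AATGATAG"))  # 4, as in the example above
```

Both loops touch every cell once, which is where the O(mn) time and space come from.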

DPA

DPA 2

String/Genome Data
– Asks for the substrings in the database most similar to the given string
– BLAST has ε-range queries
  – Naïve search (linear scan)
  – Scalability problems
– How to handle size
  – Partial information rather than the whole database
  – Approximate the string data (compress)
    → may fit in memory
    → may be used for indexing, clustering

How to Handle Size
3 approaches to make use of compressed data:
1. Prune irrelevant data, I/O for non-pruned entries → calculate exact values for non-pruned entries (especially ε-range queries)
2. Get approximate answers, virtually no I/O (I/O only for answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries

Overview
– Background on queries
– Current problem
– Transformation and Indexing
– Comparison experiments and results
– Future work

Big Picture
General approach, step by step:
– Preprocessing
  – Transform (large) string data into (hopefully smaller) multi-dimensional vectors
  – Develop a distance function df in the vector space to approximate the string similarity
  – Build a multi-dimensional index on top of the multi-dimensional vectors
– Query
  – Implement one of the three approaches mentioned

Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation into vector space: overlapping windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function

Using the index
1. Transformation: query sequence → vector
2a. Approximate query (k-NN or ε-range) on the index of vectors → done
2b. Exact query (k-NN or ε-range) on the index of vectors → candidate set: the vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives
(Continued)

Using the index (continued)
3. Refine the candidate set: I/O for the strings represented by those vectors; calculate ED for each of them (remove false positives)
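The filter-and-refine idea for an ε-range query can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, single-character frequencies serve as the vectors, a plain list scan stands in for the multi-dimensional index, and vectors are computed on the fly instead of being precomputed.

```python
def edit_distance(s, t):
    """Dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete
                            curr[j - 1] + 1,            # insert
                            prev[j - 1] + (sc != tc)))  # replace or match
        prev = curr
    return prev[-1]

def fv1(s):
    """Frequencies of A, C, G, T (assumed order)."""
    return [s.count(c) for c in "ACGT"]

def fd1(u, v):
    """Frequency distance on single-character frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)

def filter_and_refine(strings, query, eps):
    """Return (candidate set, exact answers) for an eps-range query."""
    qv = fv1(query)
    # Filter: the cheap lower bound prunes strings that cannot qualify.
    candidates = [s for s in strings if fd1(fv1(s), qv) <= eps]
    # Refine: the expensive exact distance removes false positives.
    answers = [s for s in candidates if edit_distance(s, query) <= eps]
    return candidates, answers

strings = ["ACTTAGC", "AATGATAG", "CCCCCCCC", "AATGATAC"]
candidates, answers = filter_and_refine(strings, "AATGATAG", 3)
print(candidates)  # ACTTAGC survives the filter but is a false positive
print(answers)
```

Because the filter distance never exceeds the edit distance, pruning by it cannot discard a true answer; it only leaves false positives for the refine step.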

1st step: Partitioning into overlapping windows
– Sequence: AACCGGTTACGTACGT…
– e.g. W = 6
– e.g. shift amount = 2
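A minimal sketch of the windowing step, using only the 16 characters of the sequence shown above. The function name is hypothetical, and dropping a trailing partial window is an assumption.

```python
def overlapping_windows(seq, w, shift):
    """Split seq into windows of length w that start every `shift` positions.

    Assumption: a trailing window shorter than w is dropped.
    """
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, shift)]

windows = overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2)
print(windows)
# ['AACCGG', 'CCGGTT', 'GGTTAC', 'TTACGT', 'ACGTAC', 'GTACGT']
```

A smaller shift produces more windows (and a larger index), while a larger shift risks missing the best alignment of a query against the sequence.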

2nd step: Mapping windows into vector space
– Choose a tuple size k
– Associate an integer with each of the 4^k k-tuples
– The frequencies of those k-tuples form the vector
– If k = 2 → 4^k = 16 k-tuples:
    AA, AC, AG, AT
    CA, CC, CG, CT
    GA, GC, GG, GT
    TA, TC, TG, TT

Example mapping
– The integers assigned:
    AA=0, AC=1, AG=2, AT=3
    CA=4, CC=5, CG=6, CT=7
    GA=8, GC=9, GG=10, GT=11
    TA=12, TC=13, TG=14, TT=15
– Assume window AACCGG
– AA, AC, CC, CG, GG all occur once
– 1100011000100000 is the matching vector
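The mapping can be sketched in a few lines. The function name is hypothetical, and the tuples are assumed to be numbered in lexicographic A, C, G, T order, which reproduces the vector given for AACCGG.

```python
from itertools import product

def frequency_vector(window, k, alphabet="ACGT"):
    """Count how often each of the len(alphabet)**k k-tuples occurs in window."""
    # Number the k-tuples in lexicographic order: AA=0, AC=1, ..., TT=15 for k=2.
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(window) - k + 1):
        vec[index[window[i:i + k]]] += 1
    return vec

vec = frequency_vector("AACCGG", 2)
print("".join(map(str, vec)))  # 1100011000100000
```

A window of length W contains W - k + 1 overlapping k-tuples, so the entries of the vector always sum to that value.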

Different transformations & distance functions
– Tuple size → transformation size
  – 1 → 4 (frequencies of A, C, G, T): FV1
  – 2 → 16 (frequencies of 2-tuples): FV2

Different transformations & distance functions 2
– WVn transformation
  – Split the string into halves x, y
  – Compute FVn for x and y → FVx, FVy
  – Concatenate their sum and difference: [FVx + FVy, FVx - FVy]
– WV1 on example TCACTTAG
  – 1st: divide into halves and find the FV1 transformation
      x: TCAC → 1 2 0 1
      y: TTAG → 1 0 1 2
  – 2nd: add and subtract
      WV1 = 2 2 1 3 0 2 -1 -1
– Same operations on 2-tuples give WV2
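A short sketch of the WVn transformation as described above. The function names are hypothetical, and for odd-length strings the extra character is assumed to go to the second half.

```python
from itertools import product

def frequency_vector(s, n, alphabet="ACGT"):
    """n-tuple frequencies of s, tuples numbered in lexicographic order."""
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=n))}
    vec = [0] * len(index)
    for i in range(len(s) - n + 1):
        vec[index[s[i:i + n]]] += 1
    return vec

def wavelet_vector(s, n):
    """Concatenate the sum and the difference of the two halves' FVn vectors."""
    half = len(s) // 2  # assumption: odd lengths put the extra char in y
    fx = frequency_vector(s[:half], n)
    fy = frequency_vector(s[half:], n)
    sums = [a + b for a, b in zip(fx, fy)]
    diffs = [a - b for a, b in zip(fx, fy)]
    return sums + diffs

print(wavelet_vector("TCACTTAG", 1))  # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first half of the result is a frequency summary of the whole string, while the second half records how those frequencies are distributed between the two halves.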

Distance functions on the vector spaces
– All of them are proved to be lower bounds on the edit distance
– FD1 → distance on FV1
– FD2 → distance on FV2
– WD1 → distance on WV1
– WD2 → distance on WV2

Frequency Distance FDn

Algorithm
FDn(n-gram frequencies u, v)
  posDist := negDist := 0
  for all dimensions u_i, v_i
    if u_i > v_i then posDist += u_i - v_i
    else negDist += v_i - u_i
  return max(posDist, negDist) / n

Example (n = 1)
  u: ACTTAGC → 2, 2, 1, 2
  v: AATGATAG → 4, 0, 2, 2
  – 2 - 4 < 0: negDist += |2 - 4|
  – 2 - 0 > 0: posDist += |2 - 0|
  – 1 - 2 < 0: negDist += |1 - 2|
  – 2 - 2 = 0
  posDist: 2, negDist: 3
  FD1 is 3
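The FDn pseudocode translates almost line for line into Python. The function name is hypothetical, and the vectors are assumed to be lists of n-gram frequencies in the same tuple order.

```python
def frequency_distance(u, v, n=1):
    """FDn: max of total surplus and total deficit between u and v, over n."""
    pos_dist = neg_dist = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi  # u has more of this n-gram than v
        else:
            neg_dist += vi - ui  # u has fewer (or equally many)
    return max(pos_dist, neg_dist) / n

u = [2, 2, 1, 2]  # ACTTAGC
v = [4, 0, 2, 2]  # AATGATAG
print(frequency_distance(u, v))  # 3.0
```

For this pair the edit distance is 4, so FD1 = 3 stays below it, as a lower bound should.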

FDn: why a lower bound?
– On the example
  – Need to increase A by 2 and G by 1 → 3
  – Need to decrease C by 2
– A replace can do an "increase + decrease" in one operation (back to slide #8)
– So in the best case the edit distance is only FD1
– But that may not be the case: you may need more operations because of mismatched locations
– The division by n is because a change in one character updates the frequencies of up to n n-grams

Wavelet Distance WDn

Algorithm
WDn(n-gram frequency wavelets u, v)
  find posDist and negDist on u, v
  m := min(posDist, negDist)
  d := (posDist - negDist) / 2
  if m < d
    return d / n
  else
    return (d + (m - d) / 2) / n

Example (n = 1)
  u: ACTC TAGC → 1201 1111 → 2 3 1 2 0 1 -1 0
  v: AATG ATAG → 2011 2011 → 4 0 2 2 0 0 0 0
  posDist: 3 + 1 = 4
  negDist: 2 + 1 + 1 = 4
  m: 4, d: 0
  (0 + 4/2) / 1 → return 2
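A sketch of the WDn pseudocode in Python. The function name is hypothetical. One assumption goes beyond the slide: d is computed from the absolute difference of posDist and negDist so that the result does not depend on the argument order; in the worked example the two are equal, so this choice does not affect the outcome there.

```python
def wavelet_distance(u, v, n=1):
    """WDn on two wavelet vectors u and v (same length, same tuple order)."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    m = min(pos_dist, neg_dist)
    # Assumption: absolute value, which the slide's formula does not show.
    d = abs(pos_dist - neg_dist) / 2
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n

u = [2, 3, 1, 2, 0, 1, -1, 0]  # ACTC TAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # AATG ATAG
print(wavelet_distance(u, v))  # 2.0
```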

WDn: why a lower bound?
– Assume a string transformed into the wavelet [a_1, …, b_1, …], with the sum entries a_i followed by the difference entries b_i
– Largest change: posDist += 3, negDist -= 1, or vice versa
  – So use this change whenever posDist ≠ negDist

Experiment Design
– Implemented the transformations and distance functions
– Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
– Ran queries with different parameters
  – Varying string size W and shift amount
  – Some containing an exact match, some not
  – For ε-range queries, different ε values
  – For k-NN queries, different k values

BMI 731 - Winter'04

Sorted Graphs
– To depict why our distance functions perform so well in k-NN
– Imitate what our k-NN approximation does, and graph the result
  – It sorts the data values in increasing order and takes the k nearest ones

[Figure panels: 20 nearest; 50 nearest]

[Figure panels: 20 nearest; 50 nearest]

Nature of the distance functions
– WD2 performs very well in k-NN even though it does not prune as well
  – The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function

Results
– Tested the parameters obtained from these random experiments on real data
– Then also did the parameter extraction using real data

Comparison of index structures

Future Work
– Check the applicability of these methods to other kinds of sequence data
  – Text
  – Image search
– Implement the index structure in the standalone program and evaluate its performance

