Dynamic Programming and Biological Sequence Comparison Part I

\course\eleg f\Topic-2a.ppt2
Topic II – Biological Sequence Alignment and Database Search
Part I (Topic-2a): Dynamic programming and sequence comparison
Part II (Topic-2b): Heuristic sequence alignment and database search (e.g. FASTA, BLAST)
Part III (Topic-2c): Multiple sequence alignment

Outline
- Concept of alignment
- Two algorithm design techniques
- Dynamic programming: examples
- Applying DP to sequence comparison
- The database search problem
- Heuristic algorithms for database search

Alignment
An alignment of two sequences satisfies the following conditions:
- The two sequences have the same length after spaces are inserted into either or both of them.
- No space in one sequence is aligned with a space in the other.
- Spaces may be inserted at the beginning or end of the sequences.

Biological Sequence Alignment and Database Search
1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. Differences are rare, say one per hundred characters. We want to find the places where the differences occur.
2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one that is similar to a suffix of the other.

Biological Sequence Alignment and Database Search (cont'd)
3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.
4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.
5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.

Two Related Algorithm Design Techniques
Breaking problems down:
- Divide and Conquer: starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively, and combine the partial solutions into a solution to the original problem.
- Dynamic Programming: starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.

Divide and Conquer – Example 1: Quick Sort
[Figure: Quick Sort example; an array is shown across two successive steps, each labelled "becomes".]
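
The Quick Sort figure did not survive extraction, so here is a minimal sketch of the divide-and-conquer pattern it illustrates. The function name `quick_sort` and the choice of the first element as pivot are assumptions made for illustration; the slide does not say which pivot rule it uses.

```python
def quick_sort(items):
    """Return a new sorted list using divide and conquer."""
    # Base case: a list of 0 or 1 elements is already sorted.
    if len(items) <= 1:
        return list(items)
    # Divide: split the remaining elements around a pivot.
    pivot = items[0]
    smaller = [x for x in items[1:] if x < pivot]
    larger_or_equal = [x for x in items[1:] if x >= pivot]
    # Conquer and combine: sort each part recursively, then concatenate.
    return quick_sort(smaller) + [pivot] + quick_sort(larger_or_equal)
```

The two recursive calls work on disjoint parts of the input, which is why divide and conquer is efficient here and, as the Fibonacci example shows, wasteful when subproblems overlap.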

Divide and Conquer – Example 2
The Fibonacci numbers
F_1 = 1, F_2 = 1
F_n = F_{n-1} + F_{n-2}
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

Fib(n) {
  if (n <= 2) return 1;              /* base cases F_1 = F_2 = 1 */
  else return Fib(n-1) + Fib(n-2);
}

Divide and Conquer – Example 2 (cont.)
F_1 = 1, F_2 = 1, F_n = F_{n-1} + F_{n-2}
[Figure: recursion tree for F(7). The root splits into F(6) + F(5), each of these splits again, and so on down to the leaves F(2) and F(1); subtrees such as F(5), F(4) and F(3) appear several times.]
[Table: n, F_n and the ratio F_n / F_{n-1}; values not recoverable.]
F_n / F_{n-1} ≈ 1.6, so F_n ≈ 1.6^n for n >> 1
T(n): #internal_nodes = #leaves - 1, and #leaves = F_n
T(n) = O(1.6^n): exponential time!

How to Compute the Fib Function Using the Dynamic Programming Method?

Dynamic Programming – Example 1
Fib(n) {
  int tab[n + 1];                /* entries tab[1..n]; tab[0] is unused */
  if (n <= 2) return 1;          /* F_1 = F_2 = 1 */
  tab[1] = 1; tab[2] = 1;
  for (j = 3; j <= n; j++)
    tab[j] = tab[j-1] + tab[j-2];
  return tab[n];
}
- Start by solving the smallest problems
- Use the partial solutions to solve bigger and bigger problems
- Extra memory (the table tab) stores the intermediate values
Linear time! T(n) = O(n)
Space-time tradeoff
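
To make the contrast between the two techniques concrete, the sketch below implements both versions and counts the recursive calls. The names `fib_recursive` and `fib_table` and the one-element list used as a call counter are illustrative choices, not part of the slides.

```python
def fib_recursive(n, counter):
    """Divide-and-conquer Fibonacci; counter[0] counts the calls made."""
    counter[0] += 1
    if n <= 2:  # base cases F_1 = F_2 = 1
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_table(n):
    """Bottom-up Fibonacci: fill a table starting from the smallest cases."""
    tab = [0] * (max(n, 2) + 1)  # tab[0] is unused
    tab[1] = tab[2] = 1
    for j in range(3, n + 1):
        tab[j] = tab[j - 1] + tab[j - 2]
    return tab[n]
```

For `n = 20` both functions return 6765, but the recursive version makes 2 * 6765 - 1 = 13529 calls (leaves plus internal nodes of the recursion tree), while the table version performs 18 additions.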

Sequence Comparison
Molecular sequence data are at the heart of computational biology:
- DNA sequences
- RNA sequences
- Protein sequences
We can think of these sequences as strings of letters:
- DNA: alphabet of 4 letters (A, T, C, G); RNA: alphabet of 4 letters (A, U, C, G)
- Protein: alphabet of 20 letters

code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine

Sequence Comparison (cont.)
Why compare sequences?
- Find similar genes/proteins
  - Allows us to predict function and structure
- Locate common subsequences in genes/proteins
  - Identifies common recurrent patterns
- Locate sequences that might overlap
  - Helps in sequence assembly

Sequence Comparison (cont.)
To compare the sequences we need to quantify the similarity.
Sequence X = A T A A G T
Sequence Y = A T G C A G T
Scoring: match = 1, mismatch = 0
Comparing position by position from the left end: total score = 2

Sequence Comparison (cont.)
Taking the positions of the letters into account (X shifted one position to the right):
Sequence Y = A T G C A G T
Sequence X =   A T A A G T
Scoring: match = 1, mismatch = 0
Total score = 3

Sequence Comparison (cont.)
How to take possible mutations into account? Allow a space (gap) inside a sequence:
Sequence Y = A T G C A G T
Sequence X = A T A - A G T
Scoring: match = 1, mismatch = 0, gap = -1
Total score: 5 matches, 1 mismatch, 1 gap = 5(1) + 1(0) + 1(-1) = 4
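
The scoring rule on this slide can be sketched as a small function that scores two already-aligned rows column by column. The name `score_alignment` and the use of `-` for a space are illustrative assumptions; the default scores are the ones given on the slide.

```python
def score_alignment(row_x, row_y, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length, where '-' marks a space."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a space may not be aligned with a space")
        if a == "-" or b == "-":
            total += gap        # a letter aligned with a space
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total
```

For example, `score_alignment("ATA-AGT", "ATGCAGT")` returns 4, the total shown on the slide.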

Applying DP to Sequence Comparison
Sequence X = GA
Sequence Y = AG
[Figure: tree enumerating every alignment of X and Y column by column, with the score of each complete alignment at the leaves.]
Enumerating all alignments: T(n,n) = O(k^n), exponential time!
At each branching point, choose the best score:
- max(-2, 0, -2) = 0
- max(-3, 0, -1) = 0
- max(-1, 0, -3) = 0
- max(-1, 0, -1) = 0
Total score = 0

Applying DP to Sequence Comparison (cont.)
Sequence X = GA
Sequence Y = AG
[Figure: the alignment tree for X and Y drawn over the DP grid, whose rows and columns are labelled with the letters of the two sequences.]
T(n,n) = O(n^2), polynomial time!

Questions
Question: when the DP comparison ends, how many distinct paths have been explored in total for this example?
Answer: let us count. Total = 13
Question: from 1 to 9, how many paths?
[Figure: DP grid for the example, labelled with the letters G, A and A, G.]
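
One way to check the count of 13 is to count paths through the DP grid, where each step moves down, right, or diagonally. This is a sketch under the assumption that "paths" means monotone paths from the top-left to the bottom-right corner of the (m+1) x (n+1) grid; the function name `count_alignment_paths` is illustrative.

```python
def count_alignment_paths(m, n):
    """Count grid paths from (0, 0) to (m, n) using down, right, diagonal steps."""
    # Along the top row and left column there is exactly one path.
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A path reaches (i, j) from above, from the diagonal, or from the left.
            paths[i][j] = paths[i - 1][j] + paths[i - 1][j - 1] + paths[i][j - 1]
    return paths[m][n]
```

For two sequences of length 2, `count_alignment_paths(2, 2)` gives 5 + 3 + 5 = 13. The count itself is computed with a dynamic-programming table, which mirrors the way the alignment algorithm avoids enumerating the paths one by one.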

DP algorithm for Sequence Comparison
m = length(X)
n = length(Y)
int S[0..m, 0..n]
for i = 0 to m do S[i,0] = i * g
for j = 0 to n do S[0,j] = j * g
for i = 1 to m do
  for j = 1 to n do
    S[i,j] = max( S[i-1,j] + g,
                  S[i-1,j-1] + sb[i,j],
                  S[i,j-1] + g )
return S[m,n]
g: gap score
sb[i,j]: substitution matrix score for letters X[i] and Y[j] (matrix indexed by A, T, C, G)
- Start by solving the smallest problems
- Extra memory to store intermediate values
- Use the partial solutions to solve bigger and bigger problems
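
The pseudocode can be turned into a short runnable sketch. It assumes the simple scoring used earlier in the deck (match = 1, mismatch = 0, g = -1) in place of a full substitution matrix; the function name `global_similarity` is illustrative.

```python
def global_similarity(x, y, match=1, mismatch=0, gap=-1):
    """Fill and return the (m+1) x (n+1) similarity matrix S for x and y."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # First column and first row: a prefix aligned entirely against spaces.
    for i in range(m + 1):
        S[i][0] = i * gap
    for j in range(n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,      # x[i] aligned with a space
                          S[i - 1][j - 1] + sub,  # x[i] aligned with y[j]
                          S[i][j - 1] + gap)      # y[j] aligned with a space
    return S
```

With X = GA and Y = AG the bottom-right entry is 0, and with X = ATAAGT and Y = ATGCAGT it is 4, matching the totals on the earlier slides. Python strings are 0-indexed, so `x[i - 1]` plays the role of X[i] in the pseudocode.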

The Substitution Matrix
For DNA we usually use identity matrices:
    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1
For proteins, more sensitive matrices, derived empirically, are used.
[Figure: protein substitution matrix with rows and columns labelled A B C D E F G H I K L M N P Q R S T V W Y Z; entries not recoverable.]
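
As a small illustration, an identity substitution matrix can be represented as a nested dictionary keyed by letters. The name `identity_matrix` and the dictionary layout are assumptions for this sketch; empirically derived protein matrices would be loaded from data rather than generated this way.

```python
def identity_matrix(alphabet="ATCG", match=1, mismatch=0):
    """Build sb[a][b]: `match` on the diagonal, `mismatch` everywhere else."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}
```

A lookup such as `identity_matrix()["A"]["G"]` then replaces the `sb[i,j]` term of the recurrence for the letters at positions i and j.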

Sequence Comparison Revisited
Sequence X = A T A A G T (rows), Sequence Y = A T G C A G T (columns)
m = length(X)
n = length(Y)
int S[0..m, 0..n]
for i = 0 to m do S[i,0] = i * g
for j = 0 to n do S[0,j] = j * g
for i = 1 to m do
  for j = 1 to n do
    S[i,j] = max( S[i-1,j] + g,
                  S[i-1,j-1] + sb[i,j],
                  S[i,j-1] + g )
return S[m,n]
[Figure: similarity matrix for X and Y. Selected cells are annotated with their three candidate values: the entry above plus (-1), the diagonal entry plus (+1) for a match or (0) for a mismatch, and the entry to the left plus (-1). Matrix entries not recoverable.]

What To Do Next? Answer: Finding alignments But, How?

Finding the Alignment(s)
[Figure: similarity matrix for X = A T A A G T (rows) and Y = A T G C A G T (columns). Starting from the bottom-right cell, the traceback moves at each step to a neighbouring cell whose candidate value produced the current entry, building the alignment from right to left. Matrix entries not recoverable.]
Global alignments:
A T G C A G T
A T A - A G T

A T G C A G T
A T - A A G T
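
The traceback can be sketched as follows. This version fills the similarity matrix and then follows every predecessor that explains a cell's value, so it reports all optimal global alignments. The function name `global_alignments` and the scores match = 1, mismatch = 0, gap = -1 are assumptions carried over from the earlier slides.

```python
def global_alignments(x, y, match=1, mismatch=0, gap=-1):
    """Return (optimal score, list of all optimal alignments as row pairs)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i - 1][j - 1] + sub, S[i][j - 1] + gap)

    results = []

    def walk(i, j, row_x, row_y):
        """Move from cell (i, j) back to (0, 0), prepending one column per step."""
        if i == 0 and j == 0:
            results.append((row_x, row_y))
            return
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                walk(i - 1, j - 1, x[i - 1] + row_x, y[j - 1] + row_y)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, x[i - 1] + row_x, "-" + row_y)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, "-" + row_x, y[j - 1] + row_y)

    walk(m, n, "", "")
    return S[m][n], results
```

For X = ATAAGT and Y = ATGCAGT this yields score 4 and the two alignments listed on the slide. Following a single predecessor per cell instead of all of them gives one alignment in O(m + n) steps; enumerating all ties can take much longer when many optimal alignments exist.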

How to Break a Tie? Should one report all? Or, report only one?

Advantage of DP Alignment Algorithms
- Build up the solution by determining the similarities between all pairs of prefixes of the two sequences
- Start with the shorter prefixes and use previously computed results to solve for longer prefixes

The Complexity of the DP Alignment Algorithm?
- Construction of the similarity matrix: O(mn)
- Finding an optimal alignment (traceback): O(m + n)

Global versus Local Alignments
A global alignment attempts to match all of one sequence against all of another:
LGPSTKQFGKGSSSRIWDN
|      ||||   |  |
LNQIERSFGKGAIMRLGDA
A local alignment attempts to match subsequences of the two sequences:
FGKG
||||
FGKG

How to Compute Local Alignment?

Applying DP to Local Alignment
Similarity matrix computation:
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
a[i,0] = 0 for i = 0…m
a[0,j] = 0 for j = 0…n
- If the best alignment up to some point has a negative score, it is better to start a new one than to extend the old one. This is what the 0 term expresses.
- Don't penalize gaps on the left and right ends!
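
A minimal sketch of this recurrence follows. The default scores (match = +1, mismatch = -1, g = -2) are taken from the worked example later in the deck, and the function name `local_similarity` is an illustrative choice.

```python
def local_similarity(x, y, match=1, mismatch=-1, gap=-2):
    """Return (matrix a, best score) for local alignment of x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: leading spaces are not penalized.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(0,                      # start a new alignment here
                          a[i][j - 1] + gap,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + gap)
            best = max(best, a[i][j])
    return a, best
```

For X = TGATGGAGGT and Y = GATAGG the best score is 3. Unlike the global version, the answer is the maximum over the whole matrix, not the bottom-right entry.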

Criteria for Finding a Local Alignment
- Find the entries with the maximum value in the similarity matrix
- For each such entry, construct a local alignment (see the next example)
- We may also be interested in near-optimal alignments
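
These criteria can be sketched in code: fill the matrix, locate every entry equal to the maximum, and trace back from each one until an entry of 0 is reached. The function name `best_local_alignments`, the default scores (match = +1, mismatch = -1, g = -2), and the tie rule (diagonal first, then up, then left) are assumptions for illustration; only one alignment is reported per maximum entry.

```python
def best_local_alignments(x, y, match=1, mismatch=-1, gap=-2):
    """Return (best score, one local alignment per maximum matrix entry)."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(0, a[i - 1][j] + gap, a[i - 1][j - 1] + sub, a[i][j - 1] + gap)
    best = max(max(row) for row in a)
    found = []
    if best == 0:
        return best, found  # no positive-scoring local alignment exists
    for i0 in range(1, m + 1):
        for j0 in range(1, n + 1):
            if a[i0][j0] != best:
                continue
            i, j, row_x, row_y = i0, j0, "", ""
            # Walk back until the alignment's starting point (an entry of 0).
            while a[i][j] > 0:
                sub = match if x[i - 1] == y[j - 1] else mismatch
                if a[i][j] == a[i - 1][j - 1] + sub:
                    row_x, row_y = x[i - 1] + row_x, y[j - 1] + row_y
                    i, j = i - 1, j - 1
                elif a[i][j] == a[i - 1][j] + gap:
                    row_x, row_y = x[i - 1] + row_x, "-" + row_y
                    i -= 1
                else:
                    row_x, row_y = "-" + row_x, y[j - 1] + row_y
                    j -= 1
            found.append((row_x, row_y))
    return best, found
```

For X = TGATGGAGGT and Y = GATAGG the best score is 3 and the result includes the alignment of GAT with GAT. Near-optimal alignments could be collected the same way by accepting entries slightly below the maximum.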
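
Applying DP to Local Alignment (cont.)
Sequence X = A T A A G T (rows), Sequence Y = A T G C A G T (columns)
Similarity matrix computation:
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
[Figure: similarity matrix for X and Y; entries not recoverable.]
Alignments shown:
A T G C A G T
A T - A A G T

A T G C A G T
A T A - A G T

A third alignment is garbled in the source: A T G C A A G T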
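
Local Alignment using DP
Sequence X = T G A T G G A G G T, Sequence Y = G A T A G G
g = -2; in the annotated cells a match scores (+1) and a mismatch scores (-1)
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
[Figure: substitution matrix over A, T, C, G and similarity matrix for X and Y; entries not recoverable. Example cells: 0 + (-2), 0 + (-1), 0 + (-2) for a mismatch and 0 + (-2), 0 + (+1), 0 + (-2) for a match.]
Local alignments shown:
T G A T - G G A G G T
  G A T A G G

T G A T G G A G G T
  G A T A G

T G A T G G A G G T
  G A T

T G A T G G A G G T
            A G G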

How to Break a Tie? Should one report all? Or, report only one?

Extensions to the Basic DP Method
- Improving space complexity
- Introducing general gap functions: a run of k consecutive spaces is more likely than k isolated spaces, so it should cost less
- Affine gap functions: w(k) = h + gk, where k is the number of consecutive spaces, h is charged once per run and g once per space
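
A short sketch shows why an affine gap function favours one long run of spaces over several isolated ones. The values h = -2 and g = -1 are hypothetical, chosen only for illustration, and are treated here as negative scores added to the alignment total; the function name `affine_gap_score` is likewise an assumption.

```python
def affine_gap_score(k, h=-2, g=-1):
    """Score w(k) = h + g*k for a run of k consecutive spaces (0 if k == 0)."""
    if k < 0:
        raise ValueError("run length must be non-negative")
    return 0 if k == 0 else h + g * k


one_run_of_three = affine_gap_score(3)          # h + 3g, h is paid once
three_single_gaps = 3 * affine_gap_score(1)     # 3(h + g), h is paid three times
```

With these values a single run of three spaces scores -5 while three isolated spaces score -9. With h = 0 the function reduces to the linear gap score used in the basic algorithm.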