Dynamic Programming and Biological Sequence Comparison Part I.


1 Dynamic Programming and Biological Sequence Comparison Part I

2 \course\eleg667-01-f\Topic-2a.ppt Topic II – Biological Sequence Alignment and Database Search
- Part I (Topic-2a): Dynamic programming and sequence comparison
- Part II (Topic-2b): Heuristic and database search (e.g. FASTA, BLAST) sequence alignment
- Part III (Topic-2c): Multiple sequence alignment

3 Outline
- Concept of alignment
- Two algorithm design techniques
- Dynamic programming: examples
- Applying DP to sequence comparison
- The database search problem
- Heuristic algorithms for database search

4 Alignment
- The two sequences have the same length after spaces are inserted into either or both of them
- No space in one sequence can be aligned with a space in the other
- Spaces can be inserted at the beginning or end of the sequences
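
The three conditions above can be checked mechanically. The following is a minimal sketch, assuming a space is written as "-"; the function name `is_valid_alignment` and its parameters are hypothetical and do not come from the slides.

```python
def is_valid_alignment(row_x, row_y, x, y, space="-"):
    """Return True if (row_x, row_y) is an alignment of x and y.

    Assumption: spaces are written with the `space` character ("-").
    """
    # Condition 1: both rows have the same length once spaces are inserted.
    if len(row_x) != len(row_y):
        return False
    # Condition 2: a space is never aligned with another space.
    if any(a == space and b == space for a, b in zip(row_x, row_y)):
        return False
    # Removing the spaces must give back the original sequences.
    return row_x.replace(space, "") == x and row_y.replace(space, "") == y


print(is_valid_alignment("ATA-AGT", "ATGCAGT", "ATAAGT", "ATGCAGT"))
```

Spaces at the beginning or end of a row need no special handling, since the check does not care where a space sits.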

5 Biological Sequence Alignment and Database Search
1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. The average frequency of differences is low, say, one in every hundred characters. We want to find the places where the differences occur.
2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to a suffix of the other.

6 Biological Sequence Alignment and Database Search (cont'd)
3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.
4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.
5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.

7 Breaking Problems Down: Two Related Algorithm Design Techniques
- Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem.
- Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.

8 Divide and Conquer – Example 1: Quick Sort
[Figure: quick sort applied to the list 9 1 25 4 15. The list is partitioned around 9 into 4 1 | 9 | 25 15; the sublist 4 1 becomes 1 4 and the sublist 25 15 becomes 15 25, giving 1 4 9 15 25.]
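
The divide-and-conquer pattern of the quick sort example can be sketched as follows. This is an illustrative version that builds new lists and uses the first element as the pivot; the slide does not say which pivot rule or in-place scheme it has in mind, so both choices are assumptions.

```python
def quick_sort(items):
    """Sort a list by divide and conquer (not in place; first element as pivot)."""
    if len(items) <= 1:
        return list(items)            # smallest subinstance: already sorted
    pivot = items[0]
    smaller = [v for v in items[1:] if v < pivot]    # divide
    larger = [v for v in items[1:] if v >= pivot]
    # Solve both subinstances recursively, then combine the partial solutions.
    return quick_sort(smaller) + [pivot] + quick_sort(larger)


print(quick_sort([9, 1, 25, 4, 15]))  # [1, 4, 9, 15, 25]
```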

9 Divide and Conquer – Example 2: The Fibonacci numbers
F_1 = 1, F_2 = 1, F_n = F_(n-1) + F_(n-2)
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
Fib(n) { if (n <= 2) return 1; else return Fib(n-1)+Fib(n-2); }

10 Divide and Conquer – Example 2 (cont.)
F_1 = 1, F_2 = 1, F_n = F_(n-1) + F_(n-2)
[Figure: recursion tree for F(7). F(7) expands into F(6) + F(5), each of which expands further, so subtrees such as F(5), F(4) and F(3) are recomputed many times.]
n:    1  2  3  4  5  6  7   8   9   10  11  …
F_n:  1  1  2  3  5  8  13  21  34  55  89  …
F_n / F_(n-1) ≈ 1.6, so F_n ≈ 1.6^n for n >> 1
T(n) ≈ #internal_nodes = #leaves - 1, but #leaves = F_n, so T(n) = O(1.6^n)
Exponential time!
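
To make the size of the recursion tree concrete, the sketch below counts the calls made by the recursive definition with base cases F(1) = F(2) = 1. The helper names `fib_naive` and `count_calls` are hypothetical.

```python
def fib_naive(n, counter):
    """Recursive Fibonacci with F(1) = F(2) = 1; counter[0] tallies every call."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def count_calls(n):
    """Return (F(n), number of calls made while computing it)."""
    counter = [0]
    value = fib_naive(n, counter)
    return value, counter[0]


print(count_calls(7))   # (13, 25): 13 leaves and 12 internal nodes
print(count_calls(10))  # (55, 109)
```

The call count is 2·F(n) - 1, which matches the relation #internal_nodes = #leaves - 1 with #leaves = F(n).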

11 How to Compute the Fib Function Using the Dynamic Programming Method?

12 Dynamic Programming – Example 1
Fib(n) { int tab[n+1]; /* indices 1..n are used; assumes n >= 2 */ tab[1] = 1; tab[2] = 1; for (j = 3; j <= n; j++) tab[j] = tab[j-1] + tab[j-2]; return tab[n]; }
- Start by solving the smallest problems
- Use the partial solutions to solve bigger and bigger problems
- Extra memory to store intermediate values
tab: 1 1 2 3 5 8 13 21 34 55 89 …
Linear time! T(n) = O(n)
Space-time tradeoff
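
A runnable version of the table-based idea might look like the sketch below. `fib_table` follows the slide's table; `fib_two_vars` is an added variation (not on the slide) showing that only the last two entries are needed. Both names are hypothetical, and both assume n >= 1.

```python
def fib_table(n):
    """Bottom-up Fibonacci with a table, F(1) = F(2) = 1 (assumes n >= 1)."""
    tab = [0] * (max(n, 2) + 1)      # indices 0..n; index 0 is unused
    tab[1] = tab[2] = 1
    for j in range(3, n + 1):
        tab[j] = tab[j - 1] + tab[j - 2]   # reuse the stored partial solutions
    return tab[n]


def fib_two_vars(n):
    """Same recurrence keeping only the last two values (O(1) extra space)."""
    prev, curr = 1, 1
    for _ in range(3, n + 1):
        prev, curr = curr, prev + curr
    return curr


print(fib_table(11), fib_two_vars(11))  # 89 89
```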

13 Sequence Comparison
Molecular sequence data are at the heart of Computational Biology
- DNA sequences
- RNA sequences
- Protein sequences
We can think of these sequences as strings of letters
- DNA & RNA: alphabet of 4 letters (A, T, C, G; RNA has U in place of T)
- Protein: alphabet of 20 letters
code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine

14 Sequence Comparison – (Cont.)
Why compare sequences?
- Find similar genes/proteins: allows prediction of function and structure
- Locate common subsequences in genes/proteins: identify common recurrent patterns
- Locate sequences that might overlap: helps in sequence assembly

15 Sequence Comparison – (Cont.)
To compare the sequences we need to quantify the similarity: matches = 1, mismatches = 0
Sequence X = A T A A G T
Sequence Y = A T G C A G T
Score        1 1 0 0 0 0 0    Total = 2

16 Sequence Comparison – (Cont.)
Taking positions of the letters into account (X shifted one position to the right): matches = 1, mismatches = 0
Sequence Y = A T G C A G T
Sequence X =   A T A A G T
Score        0 0 0 0 1 1 1    Total = 3

17 Sequence Comparison – (Cont.)
How to take possible mutations into account? Allow gaps: matches = 1, mismatches = 0, gap = -1
Sequence Y = A T G C A G T
Sequence X = A T A - A G T
Score        1 1 0 -1 1 1 1    Total = 4
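
The column-by-column scoring used on the last three slides can be sketched as below. The function name `score_alignment` and the use of "-" for a space are assumptions; the default scores are the slide's match = 1, mismatch = 0, gap = -1.

```python
def score_alignment(row_x, row_y, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two aligned rows ("-" denotes a space)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a space cannot be aligned with a space")
        if a == "-" or b == "-":
            total += gap          # one character aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ATA-AGT", "ATGCAGT"))          # 4
print(score_alignment("-ATAAGT", "ATGCAGT", gap=0))   # 3, the shifted comparison
```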

18 Applying DP to Sequence Comparison
Sequence X = GA, Sequence Y = AG
[Figure: tree enumerating every alignment of GA with AG, each leaf labelled with its score (values between -4 and 0). At each internal node the best score among its children is chosen, e.g. max(-2, 0, -2), max(-3, 0, -1), max(-1, 0, -3), max(-1, 0, -1); total score = 0.]
T(n,n) = O(k^n): exponential time!

19 Applying DP to Sequence Comparison
Sequence X = GA, Sequence Y = AG
[Figure: the same enumeration tree of alignments of GA with AG, now shown next to a table indexed by the letters of the two sequences; subtrees with the same remaining suffixes share a single table entry.]
T(n,n) = O(n^2): polynomial time!

20 Questions
- Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?
- Answer: let us count. Total = 13
       G   A
   0  -1  -2
A -1   0   0
G -2   0   0
[Figure: the nine cells of the matrix numbered 1 to 9. Question: from 1 to 9, how many paths? The figure enumerates every path from cell 1 to cell 9.]
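
The count of 13 can be reproduced with a small dynamic program of its own. The sketch assumes that a path moves one cell right, one cell down, or one cell diagonally at each step, which corresponds to the three cases of the alignment recurrence; the function name `count_alignment_paths` is hypothetical.

```python
def count_alignment_paths(m, n):
    """Count paths from the top-left to the bottom-right cell of an
    (m+1) x (n+1) matrix using right, down and diagonal steps."""
    # Only one path reaches any cell in the first row or first column.
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[m][n]


print(count_alignment_paths(2, 2))  # 13, for two sequences of length 2
```

The count grows quickly with sequence length, which is why enumerating alignments one by one is impractical while filling the matrix is not.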

21 DP algorithm for Sequence Comparison
int S[0..m, 0..n]
m = length(X)
n = length(Y)
for i = 0 to m do S[i,0] = i · g
for j = 0 to n do S[0,j] = j · g
for i = 1 to m do
  for j = 1 to n do
    S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g )
return S[m,n]
sb[i,j]: substitution matrix entry for the characters X[i] and Y[j]
    A T C G
A   1 0 0 0
T   0 1 0 0
C   0 0 1 0
G   0 0 0 1
- Start by solving the smallest problems
- Use the partial solutions to solve bigger and bigger problems
- Extra memory to store intermediate values
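
A runnable sketch of the pseudocode above follows. It assumes an identity-style substitution score (match = 1, mismatch = 0) and g = -1 as defaults; the function name `global_alignment_matrix` is hypothetical. Python strings are 0-indexed, so character i of X is `x[i - 1]`.

```python
def global_alignment_matrix(x, y, gap=-1, match=1, mismatch=0):
    """Fill the (m+1) x (n+1) similarity matrix for global alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * gap             # prefix of x aligned with spaces only
    for j in range(n + 1):
        S[0][j] = j * gap             # prefix of y aligned with spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,       # x[i] aligned with a space
                          S[i - 1][j - 1] + sub,   # x[i] aligned with y[j]
                          S[i][j - 1] + gap)       # y[j] aligned with a space
    return S


S = global_alignment_matrix("ATAAGT", "ATGCAGT")
print(S[-1][-1])  # 4
```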

22 The Substitution Matrix
- For DNA we usually use identity matrices:
    A T C G
A   1 0 0 0
T   0 1 0 0
C   0 0 1 0
G   0 0 0 1
- For proteins, more sensitive matrices, derived empirically, are used:
    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1
V   0  -2  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4  -6  -2  -2
W  -6  -5  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17   0  -6
Y  -3  -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10  -4
Z   0   2  -5   3   3  -5  -1   2  -2   0  -3  -2   1   0   3   0   0  -1  -2  -6  -4   3

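23 Sequence Comparison revisited
Similarity matrix for X = ATAAGT (rows) and Y = ATGCAGT (columns), with match = 1, mismatch = 0, g = -1:
         A   T   G   C   A   G   T
     0  -1  -2  -3  -4  -5  -6  -7
A   -1   1   0  -1  -2  -3  -4  -5
T   -2   0   2   1   0  -1  -2  -3
A   -3  -1   1   2   1   1   0  -1
A   -4  -2   0   1   2   2   1   0
G   -5  -3  -1   1   1   2   3   2
T   -6  -4  -2   0   1   1   2   4
int S[0..m, 0..n]
m = length(X)
n = length(Y)
for i = 0 to m do S[i,0] = i · g
for j = 0 to n do S[0,j] = j · g
for i = 1 to m do
  for j = 1 to n do
    S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g )
return S[m,n]
Worked entries for the first row (X[1] = A), in the order S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g:
S[1,1] = max( -1 + (-1), 0 + (+1), -1 + (-1) ) = 1
S[1,2] = max( -2 + (-1), -1 + (0), 1 + (-1) ) = 0
S[1,3] = max( -3 + (-1), -2 + (0), 0 + (-1) ) = -1
S[1,4] = max( -4 + (-1), -3 + (0), -1 + (-1) ) = -2
S[1,5] = max( -5 + (-1), -4 + (+1), -2 + (-1) ) = -3
S[1,6] = max( -6 + (-1), -5 + (0), -3 + (-1) ) = -4
S[1,7] = max( -7 + (-1), -6 + (0), -4 + (-1) ) = -5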

24 What To Do Next? Answer: Finding alignments. But how?

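25 Finding the Alignment(s)
Similarity matrix for X = ATAAGT (rows) and Y = ATGCAGT (columns):
         A   T   G   C   A   G   T
     0  -1  -2  -3  -4  -5  -6  -7
A   -1   1   0  -1  -2  -3  -4  -5
T   -2   0   2   1   0  -1  -2  -3
A   -3  -1   1   2   1   1   0  -1
A   -4  -2   0   1   2   2   1   0
G   -5  -3  -1   1   1   2   3   2
T   -6  -4  -2   0   1   1   2   4
Trace back from S[6,7] to S[0,0], at each cell following every predecessor that attains the maximum (terms in the order S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g):
S[6,7] = 4 = max( 2 + (-1), 3 + (+1), 2 + (-1) ): diagonal, T aligned with T
S[5,6] = 3 = max( 1 + (-1), 2 + (+1), 2 + (-1) ): diagonal, G aligned with G
S[4,5] = 2 = max( 1 + (-1), 1 + (+1), 2 + (-1) ): diagonal, A aligned with A
S[3,4] = 1 = max( 0 + (-1), 1 + (0), 2 + (-1) ): tie between diagonal (A aligned with C) and left (space aligned with C)
  diagonal branch: S[2,3] = 1 = max( -1 + (-1), 0 + (0), 2 + (-1) ): left, space aligned with G
  left branch: S[3,3] = 2 = max( 1 + (-1), 2 + (0), 1 + (-1) ): diagonal, A aligned with G
S[2,2] = 2 = max( 0 + (-1), 1 + (+1), 0 + (-1) ): diagonal, T aligned with T
S[1,1] = 1 = max( -1 + (-1), 0 + (+1), -1 + (-1) ): diagonal, A aligned with A
Global alignments:
A T G C A G T        A T G C A G T
A T A - A G T        A T - A A G T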
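
The traceback can be sketched in code as follows. This version reports a single optimal alignment and breaks ties by preferring the diagonal move, then the upward move, then the leftward move; that tie-breaking order is an assumption made for the sketch, not something the slides prescribe. The function name `global_align` is hypothetical.

```python
def global_align(x, y, gap=-1, match=1, mismatch=0):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * gap
    for j in range(n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i - 1][j - 1] + sub, S[i][j - 1] + gap)

    # Walk back from S[m][n] to S[0][0], one column of the alignment per step.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and x[i - 1] == y[j - 1]) else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(global_align("ATAAGT", "ATGCAGT"))  # (4, 'AT-AAGT', 'ATGCAGT')
```

With this tie-breaking order the sketch returns the alignment with X written as AT-AAGT; a different order would return the other optimal alignment, ATA-AGT.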

26 How to Break a Tie?
- Should one report all?
- Or, report only one?

27 Advantage of DP Alignment Algorithms
- Build up the solution by determining all similarities between arbitrary prefixes of the two sequences
- Start with the shorter prefixes and use previously computed results to solve for longer prefixes

28 The Complexity of the DP Alignment Algorithm?
- Finding an optimal alignment by traceback, once the matrix is built: O(m + n)
- Construction of the similarity matrix: O(mn)

29 Global versus Local Alignments
- A global alignment attempts to match all of one sequence against all of another:
LGPSTKQFGKGSSSRIWDN
|      ||||   |  |
LNQIERSFGKGAIMRLGDA
- A local alignment attempts to match subsequences of the two sequences:
-------FGKG--------
       ||||
-------FGKG--------

30 How to Compute Local Alignment?

31 Applying DP to Local Alignment
Similarity matrix computation:
a[i,j] = max( a[i,j-1]+g, a[i-1,j-1]+sb(i,j), a[i-1,j]+g, 0 )
a[i,0] = 0 for i = 0…m
a[0,j] = 0 for j = 0…n
- If the best alignment up to some point has a negative score, it's better to start a new one, rather than extend the old one.
- Don't penalize gaps on left and right ends!
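
The recurrence with the extra 0 term can be sketched as below. The sketch reports only the first maximum entry it meets in row-major order and traces back until it reaches a 0 entry; reporting one alignment rather than all of them, and the diagonal-first tie-breaking, are assumptions. The function name `local_align` and its default scores are hypothetical.

```python
def local_align(x, y, gap=-1, match=1, mismatch=0):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + sub,
                          a[i - 1][j] + gap, 0)   # 0 = start a new alignment
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)

    # Trace back from the maximum entry until a 0 entry is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and a[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("GATAGG", "TGATGGAGGT", gap=-2, match=1, mismatch=-1))
# (3, 'GAT', 'GAT')
```

Because the maximum is not required to sit in the last cell, leading and trailing parts of either sequence are simply left out of the reported alignment instead of being charged as gaps.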

32 Criteria for Finding a Local Alignment
- Find the entries with maximum values in the similarity matrix
- For each such entry, construct a local alignment
- See next example
- We may also be interested in near-optimal alignments

33 Applying DP to Local Alignment
Similarity matrix computation:
a[i,j] = max( a[i,j-1]+g, a[i-1,j-1]+sb(i,j), a[i-1,j]+g, 0 )
Similarity matrix for X = ATAAGT (rows) and Y = ATGCAGT (columns), with match = 1, mismatch = 0, g = -1:
        A   T   G   C   A   G   T
    0   0   0   0   0   0   0   0
A   0   1   0   0   0   1   0   0
T   0   0   2   1   0   0   1   1
A   0   1   1   2   1   1   0   1
A   0   1   1   1   2   2   1   0
G   0   0   1   2   1   2   3   2
T   0   0   1   1   2   1   2   4
The maximum entry is 4, in the last cell. Alignments traced back from it:
A T G C A G T        A T G C A G T
A T - A A G T        A T A - A G T

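34 Local Alignment using DP
a[i,j] = max( a[i,j-1]+g, a[i-1,j-1]+sb(i,j), a[i-1,j]+g, 0 )
sb = +1 for a match (diagonal of the A, T, C, G substitution matrix), -1 for a mismatch; g = -2
Similarity matrix for X = GATAGG (rows) and Y = TGATGGAGGT (columns):
        T   G   A   T   G   G   A   G   G   T
    0   0   0   0   0   0   0   0   0   0   0
G   0   0   1   0   0   1   1   0   1   1   0
A   0   0   0   2   0   0   0   2   0   0   0
T   0   1   0   0   3   1   0   0   1   0   1
A   0   0   0   1   1   2   0   1   0   0   0
G   0   0   1   0   0   2   3   1   2   1   0
G   0   0   1   0   0   1   3   2   2   3   1
Worked entries (terms in the order up, diagonal, left, 0):
a[1,1] = max( 0 + (-2), 0 + (-1), 0 + (-2), 0 ) = 0
a[1,2] = max( 0 + (-2), 0 + (+1), 0 + (-2), 0 ) = 1
The maximum value 3 occurs in four entries, giving the local alignments (Y on top, X below):
G A T        G A T G G        G A T - G G        A G G
G A T        G A T A G        G A T A G G        A G G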

35 How to Break a Tie?
- Should one report all?
- Or, report only one?

36 Extensions to the Basic DP Method
- Improving space complexity
- Introducing general gap functions: a run of k consecutive spaces is more likely than k isolated spaces, so it should be penalized less
- Affine gap functions: w(k) = h + gk
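
Both extensions can be illustrated briefly. The first function below is a sketch of one common way to reduce space when only the score is needed: keep two rows of the matrix instead of all of it. The second evaluates the affine gap function w(k) = h + gk. The function names, the sign convention (penalties written as negative scores) and the example values h = -2, g = -1 are assumptions made for illustration.

```python
def global_score_two_rows(x, y, gap=-1, match=1, mismatch=0):
    """Global alignment score using O(len(y)) memory (no traceback possible)."""
    prev = [j * gap for j in range(len(y) + 1)]          # row i - 1
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)                  # row i
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j] + gap, prev[j - 1] + sub, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def affine_gap(k, h, g):
    """Score of a run of k >= 1 consecutive spaces: w(k) = h + g*k."""
    return h + g * k


print(global_score_two_rows("ATAAGT", "ATGCAGT"))       # 4
print(affine_gap(3, h=-2, g=-1), 3 * affine_gap(1, h=-2, g=-1))  # -5 -9
```

With these example values one run of three spaces scores -5 while three separate single spaces score -9, which is the behaviour the affine function is meant to capture.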

