CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment
Roadmap Review of last lecture –Biology –Dynamic programming Sequence alignment
Polymer    Monomer
DNA        Deoxyribonucleotides
RNA        Ribonucleotides
Protein    Amino acids
Protein zoom-in
Composed of a chain of amino acids, from the N-terminal (H2N) to the C-terminal (COOH):
H2N - R - R - R - R - R - … - COOH
Each amino acid has an amino group (H2N), a carboxyl group (COOH), and a side chain (R):
     R
     |
H2N--C--COOH
     |
     H
Genome, Chromosome, Gene
DNA Replication
The process of copying a double-stranded DNA molecule
–Semi-conservative
5’-ACATGATAA-3’      5’-ACATGATAA-3’
3’-TGTACTATT-5’      3’-TGTACTATT-5’
Transcription
Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’
Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’
mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’
Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
DNA-RNA pairing (template base = mRNA base): A=U, T=A, C=G, G=C
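The strand relationships on this slide can be checked with a few lines of Python. This is only a sketch: the function names `template_strand` and `transcribe` are made up for illustration, and strands are handled as plain strings.

```python
# Watson-Crick complements for DNA bases.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def template_strand(coding_5to3: str) -> str:
    """Template strand written 3'->5', i.e. the position-by-position complement."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5to3)

def transcribe(coding_5to3: str) -> str:
    """mRNA written 5'->3': the coding strand with every T replaced by U."""
    return coding_5to3.replace("T", "U")

coding = "ACGTAGACGTATAGAGCCTAG"
print(template_strand(coding))  # TGCATCTGCATATCTCGGATC
print(transcribe(coding))       # ACGUAGACGUAUAGAGCCUAG
```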
The Genetic Code (figure: codon table)
Translation
The sequence of codons is translated to a sequence of amino acids
Gene: -GCT TGT TTA CGA ATT-
mRNA: -GCU UGU UUA CGA AUU-
Peptide: -Ala - Cys - Leu - Arg - Ile-
–Start codon: AUG (also codes for Met)
–Stop codons: UGA, UAA, UAG
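A minimal sketch of codon-by-codon translation. The table below is deliberately partial: it holds only the codons used on this slide plus the start and stop codons, so it is an illustration and not the full genetic code. The name `translate` is hypothetical.

```python
# Partial codon table (assumption: only the codons from the slide are included).
# Stop codons map to None.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg", "AUU": "Ile",
    "AUG": "Met",
    "UGA": None, "UAA": None, "UAG": None,
}

def translate(mrna: str) -> list:
    """Read the mRNA three letters at a time until a stop codon or the end."""
    peptide = []
    for k in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[k:k + 3]]
        if amino_acid is None:  # stop codon: translation ends here
            break
        peptide.append(amino_acid)
    return peptide

print(translate("GCUUGUUUACGAAUU"))  # ['Ala', 'Cys', 'Leu', 'Arg', 'Ile']
```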
Dynamic programming
What is dynamic programming?
–Solve an optimization problem by tabulating sub-problem solutions (memoization) rather than re-computing them
Elements of dynamic programming
Optimal sub-structures
–Optimal solutions to the original problem contain optimal solutions to sub-problems
–Solutions to sub-problems are independent
Overlapping sub-problems
–Some sub-problems appear in many solutions
–We should not solve any sub-problem more than once
Memoization and reuse
–Carefully choose the order in which sub-problems are solved
–Tabulate the solutions
–Bottom-up
Example
Find the shortest path in a grid from s = (0,0) to g = (3,3)
Optimal substructure
If a path P(s, g) is optimal, any sub-path P(s, x), where x is on P(s, g), is also optimal
Proof by contradiction
–Suppose P(s, x) is not the shortest, i.e., there is a path P’(s, x) with P’(s, x) < P(s, x)
–Construct a new path P’(s, g) = P’(s, x) + P(x, g)
–Then P’(s, g) < P(s, g), so P(s, g) is not the shortest
–Contradiction
Overlapping sub-problems Some sub-problems are used by many paths (0,0) -> (2,0) used by 3 paths
Memoization and reuse
Easy to tabulate and reuse
–Number of sub-problems ~ number of nodes
–P(s, x), for x in all nodes except s and g
Find an order such that no sub-problems need to be recomputed
–First compute the smallest sub-problems
–Use solutions of small sub-problems to solve large sub-problems
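A minimal sketch of this tabulation for the grid example. It assumes moves go only right or down, and that edge costs are given in two hypothetical tables: `right[i][j]` is the cost of the edge from node (i, j) to (i, j+1), and `down[i][j]` the cost from (i, j) to (i+1, j). The costs in the usage example are invented for illustration.

```python
def grid_shortest_path(right, down):
    """Length of the shortest right/down path from the top-left to the bottom-right node."""
    rows = len(right)          # number of node rows
    cols = len(right[0]) + 1   # number of node columns
    INF = float("inf")
    dist = [[INF] * cols for _ in range(rows)]
    dist[0][0] = 0
    # Row-by-row order: the nodes above and to the left are solved before (i, j).
    for i in range(rows):
        for j in range(cols):
            if i > 0:
                dist[i][j] = min(dist[i][j], dist[i - 1][j] + down[i - 1][j])
            if j > 0:
                dist[i][j] = min(dist[i][j], dist[i][j - 1] + right[i][j - 1])
    return dist[rows - 1][cols - 1]

# 3 x 3 nodes with made-up edge costs.
right = [[1, 3], [2, 1], [4, 1]]
down = [[2, 1, 5], [1, 3, 1]]
print(grid_shortest_path(right, down))  # 4
```

Each node is visited once and looks at no more than two predecessors, which is where the O(n^2) cost of the DP solution comes from.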
Example: shortest path
Analysis
For an n x n grid
Enumeration:
–Number of paths = (2n)! / (n!)^2
–Each path has 2n steps
–Total operations: 2n * (2n)! / (n!)^2, which grows like 2^(2n) up to a factor of about sqrt(n)
Recursive call: O(2^(2n))
DP: O(n^2)

          Enumeration    Recursion     DP
N = 5           2,520        1,032     60
N = 10      3,695,120    1,048,576    420
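The enumeration column can be reproduced directly from the path-count formula. A small sketch using `math.comb` from the standard library; the function names are hypothetical.

```python
from math import comb

def num_paths(n: int) -> int:
    """Number of right/down paths across an n x n grid: (2n)! / (n!)^2."""
    return comb(2 * n, n)

def enumeration_ops(n: int) -> int:
    """Each of the paths has 2n steps."""
    return 2 * n * num_paths(n)

for n in (5, 10):
    print(n, enumeration_ops(n), 2 ** (2 * n))
# 5 2520 1024
# 10 3695120 1048576
```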
Example: Fibonacci Seq
F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1
function fib(n)
  if (n == 0 or n == 1) return 1;
  else return fib(n-1) + fib(n-2);
Time complexity: O(1.62^n)
Example: Fibonacci Seq
function fib(n)
  F[0] = 1; F[1] = 1;
  for i = 2 to n
    F[i] = F[i-1] + F[i-2];
  end
  return F[n];
Time: O(n), space: O(n)
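Both versions can be written in Python to make the difference concrete. This is a sketch using the slide's convention F(0) = F(1) = 1; the call counter is added only for illustration.

```python
def fib_recursive(n, counter):
    """Naive recursion: the same sub-problems are recomputed many times."""
    counter[0] += 1
    if n == 0 or n == 1:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)

def fib_table(n):
    """Bottom-up tabulation: each F[i] is computed exactly once."""
    F = [1] * (n + 1)
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]

calls = [0]
print(fib_recursive(10, calls), calls[0])  # 89 177
print(fib_table(10))                       # 89
```

For n = 10 the recursive version makes 177 calls, while the table version runs a loop of 9 iterations.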
What if it is not so easy to figure out an order to fill in the table? Exercise
Today’s lecture Sequence alignment –Global alignment
Why seq alignment?
Similar sequences often have similar origin or function
–Two genes are said to be homologous if they share a common evolutionary history.
–Evolutionary history can tell us a lot about properties of a given gene
–Homology can be inferred from similarity between the genes
New protein sequences are routinely compared to sequence databases to search for proteins with the same or similar functions
Sequence alignment programs are among the most widely used computational tools in biology
Evolution at the DNA level
…ACGGTGCAGTC-ACCA…
…ACGTTGC-GTCCACCA…
Sequence edits: mutation (G to T), deletion (A), insertion (C)
Evolutionary Rates OK X X Still OK? next generation
Sequence conservation implies function
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t.
(1) |S’| = |T’| (where |S| = “length of S”), and
(2) removing all spaces in S’, T’ leaves S, T
S = AGGCTATCACCTGACCTCCAGGCCGATGCCC
T = TAGCTATCACGACCGCGGTCGATTTGCCCGAC
What is a good alignment?
Alignment: The “best” way to match the letters of one sequence with those of the other
How do we define “best”?
The score of aligning (characters or spaces) x & y is σ(x,y).
Score of an alignment: the sum of σ(S’[i], T’[i]) over all columns i = 1, …, |S’|
An optimal alignment: one with max score
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
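The two conditions in the definition translate directly into a check. A minimal sketch, assuming spaces are written as "-" and using a hypothetical function name `is_alignment`:

```python
def is_alignment(s, t, s_aligned, t_aligned, gap="-"):
    """True if (s_aligned, t_aligned) is an alignment of s and t:
    (1) equal lengths, (2) removing the gaps gives back s and t."""
    return (len(s_aligned) == len(t_aligned)
            and s_aligned.replace(gap, "") == s
            and t_aligned.replace(gap, "") == t)

S = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
T = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
S_aligned = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
T_aligned = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment(S, T, S_aligned, T_aligned))  # True
```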
Scoring Function
Sequence edits, starting from AGGCCTC:
–Mutations: AGGACTC
–Insertions: AGGGCCTC
–Deletions: AGG-CTC
Scoring function:
Match: +m
Mismatch: -s
Gap (indel): -d
More complex scoring function
Substitution matrix
–The similarity score of matching two letters a, b should reflect the probability that a and b are derived from the same ancestor
–It is usually defined by a log likelihood ratio (Durbin book)
–Active research area, especially for proteins
–Commonly used: PAM, BLOSUM
An example substitution matrix ACGT A3-2-2 C3 G3-2 T3
Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
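The arithmetic on this slide can be reproduced with a column-by-column scorer. This is a sketch: the alignment figure is not preserved in the text, so the pair of strings below is a made-up alignment chosen to have 3 matches, 2 mismatches, and 1 gap.

```python
def alignment_score(s_aligned, t_aligned, match=2, mismatch=-1, gap=-1):
    """Sum the per-column scores of an alignment."""
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

# Hypothetical alignment: 3 matches, 2 mismatches, 1 gap.
print(alignment_score("AGCTAC", "ATC-AG"))  # 3 = 3*2 - 2*1 - 1*1
```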
How to find it?
A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
  align A[i] with B[i], 1 ≤ i ≤ |A|
  align all other chars to spaces
  compute its value
  retain the max
end
output the retained alignment
Example: S = abcd, A = cd; T = wxyz, B = xz
-abc-d    a-bc-d
w--xyz    -w-xyz
Analysis
Assume |S| = |T| = n
Cost of evaluating one alignment: ≥ n
How many alignments are there: C(2n, n)
–pick n chars of S,T together
–say k of them are in S
–match these k to the k unpicked chars of T
Total time: ≥ n * C(2n, n) > 2^(2n), for n > 3
E.g., for n = 20, time is > 2^40 > 10^12 operations
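The counting argument can be checked numerically. Choosing k characters from S and n - k from T, summed over k, gives the same total as choosing n characters out of the 2n available. A short sketch with `math.comb`; the function name is hypothetical.

```python
from math import comb

def count_by_k(n: int) -> int:
    """Sum over k of (ways to pick k chars of S) * (ways to pick n - k chars of T)."""
    return sum(comb(n, k) * comb(n, n - k) for k in range(n + 1))

n = 20
print(count_by_k(n) == comb(2 * n, n))  # True
print(n * comb(2 * n, n))               # 2756930576400
print(2 ** 40)                          # 1099511627776
```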
Dynamic Programming
We will now describe a dynamic programming algorithm
Suppose we wish to align
x_1 …… x_M
y_1 …… y_N
Let F(i,j) = optimal score of aligning
x_1 …… x_i
y_1 …… y_j
Dynamic Programming (cont’d)
Notice three possible cases:
1. x_M aligns to y_N
   ~~~~~~~ x_M
   ~~~~~~~ y_N
   F(M,N) = F(M-1, N-1) + m, if x_M = y_N
   F(M,N) = F(M-1, N-1) - s, if not
2. x_M aligns to a gap
   ~~~~~~~ x_M
   ~~~~~~~ -
   F(M,N) = F(M-1, N) - d
3. y_N aligns to a gap
   ~~~~~~~ -
   ~~~~~~~ y_N
   F(M,N) = F(M, N-1) - d
Therefore:
F(M,N) = max { F(M-1, N-1) + σ(x_M, y_N),  F(M-1, N) - d,  F(M, N-1) - d }
σ(x_M, y_N) = m if x_M = y_N, and -s otherwise
Each sub-problem can be solved recursively
Generalize:
F(i,j) = max { F(i-1, j-1) + σ(x_i, y_j),  F(i-1, j) - d,  F(i, j-1) - d }
Be careful with the boundary conditions
Remember: –The recursive formula is for understanding the relationship between sub-problems –We cannot afford to really solve them recursively Number of sub-problems: –Each corresponds to calculating an F(i, j) –O(MN) of them –Solve all of them
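To see the relationship between the recurrence and the table, here is a sketch of the recursion with memoization, so that each F(i, j) is computed once. It assumes s and d are positive penalties (mismatch scores -s, gap scores -d) and uses the boundary conditions F(i, 0) = -i*d and F(0, j) = -j*d. The function name is hypothetical, and Python's recursion limit makes this suitable only for short sequences.

```python
from functools import lru_cache

def align_score_memo(x, y, m=1, s=1, d=1):
    """Optimal global alignment score via the recurrence, caching every F(i, j)."""
    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0:            # boundary: y_1..y_j aligned entirely to gaps
            return -j * d
        if j == 0:            # boundary: x_1..x_i aligned entirely to gaps
            return -i * d
        sigma = m if x[i - 1] == y[j - 1] else -s
        return max(F(i - 1, j - 1) + sigma,   # x_i aligned to y_j
                   F(i - 1, j) - d,           # x_i aligned to a gap
                   F(i, j - 1) - d)           # y_j aligned to a gap
    return F(len(x), len(y))

print(align_score_memo("AGTA", "ATA"))  # 2
```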
What order to fill? F(0,0) F(M,N)
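F(i, j) = max { F(i-1, j-1) + σ(x_i, y_j) [case 1],  F(i-1, j) - d [case 2],  F(i, j-1) - d [case 3] }
Each F(i, j) depends only on the three neighboring cells F(i-1, j-1), F(i-1, j), and F(i, j-1)
What order to fill?
Row by row (or column by column) from F(0,0) to F(M,N), so that F(i-1, j-1), F(i-1, j), and F(i, j-1) are already available when F(i, j) is computed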
Example
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (match +1, mismatch -1, gap -1)
Initialize F(i, 0) = -i*d and F(0, j) = -j*d, then fill in the remaining cells with the recurrence:

F(i,j)   i =   0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1    A    -1    1    0   -1   -2
j = 2    T    -2    0    0    1    0
j = 3    A    -3   -1   -1    0    2

Optimal alignment score: F(4,3) = 2
This only tells us the best score
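The table for this example can be reproduced with a short bottom-up fill. A sketch, assuming s and d are positive penalties (so the slide's mismatch = -1 and gap = -1 correspond to s = d = 1) and indexing the table as `F[i][j]` with i over x and j over y. The function name is hypothetical.

```python
def fill_table(x, y, m=1, s=1, d=1):
    """Return the full (len(x)+1) x (len(y)+1) score table F[i][j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)
    return F

F = fill_table("AGTA", "ATA")
print(F[4][3])  # 2
for j in range(4):  # print one row per character of y, as on the slide
    print([F[i][j] for i in range(5)])
```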
Trace-back
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (match +1, mismatch -1, gap -1)
F(i,j) = max { F(i-1, j-1) + σ(x_i, y_j),  F(i-1, j) - d,  F(i, j-1) - d }

F(i,j)   i =   0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1    A    -1    1    0   -1   -2
j = 2    T    -2    0    0    1    0
j = 3    A    -3   -1   -1    0    2

Start at F(4,3) and, at each cell, find which case of the recurrence produced its value:
F(4,3) = 2 = F(3,2) + 1 (A aligned to A)
F(3,2) = 1 = F(2,1) + 1 (T aligned to T)
F(2,1) = 0 = F(1,1) - 1 (G aligned to a gap)
F(1,1) = 1 = F(0,0) + 1 (A aligned to A)
Optimal Alignment: F(4,3) = 2
AGTA
A-TA
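Trace-back without stored pointers means re-checking, at each cell, which case of the recurrence gives the cell's value. A minimal sketch: the table is hard-coded from the example as `F[i][j]` (i over x, j over y), s and d are assumed to be positive penalties, and ties are broken in favor of the diagonal move, which is an arbitrary choice.

```python
def traceback(F, x, y, m=1, s=1, d=1):
    """Recover one optimal alignment from a filled score table."""
    i, j = len(x), len(y)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
            if F[i][j] == F[i - 1][j - 1] + sigma:   # x_i aligned to y_j
                top.append(x[i - 1]); bottom.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:     # x_i aligned to a gap
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:                                        # y_j aligned to a gap
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

# Score table of the example, F[i][j] for x = AGTA, y = ATA.
F = [[0, -1, -2, -3], [-1, 1, 0, -1], [-2, 0, 0, -1], [-3, -1, 1, 0], [-4, -2, 0, 2]]
print(traceback(F, "AGTA", "ATA"))  # ('AGTA', 'A-TA')
```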
In some cases, trace-back may be very time consuming Alternative solution: remember where you come from! –Trade-off: more memory
Using trace-back pointers
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (match +1, mismatch -1, gap -1)
While filling in each F(i,j), also store a pointer to the neighboring cell that gave the maximum

F(i,j)   i =   0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1    A    -1    1    0   -1   -2
j = 2    T    -2    0    0    1    0
j = 3    A    -3   -1   -1    0    2

Following the pointers from F(4,3): (4,3) -> (3,2) -> (2,1) -> (1,1) -> (0,0)
Optimal Alignment: F(4,3) = 2
AGTA
A-TA
The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j*d
   c. F(i, 0) = -i*d
2. Main Iteration. Filling in scores
   a. For each i = 1……M
        For each j = 1……N
          F(i, j) = max { F(i-1, j) - d [case 1],  F(i, j-1) - d [case 2],  F(i-1, j-1) + σ(x_i, y_j) [case 3] }
          Ptr(i, j) = UP if [case 1], LEFT if [case 2], DIAG if [case 3]
3. Termination. F(M, N) is the optimal score, and the optimal alignment can be traced back from Ptr(M, N)
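The three steps can be put together as follows. This is a sketch rather than a reference implementation: s and d are assumed to be positive penalties, the border pointers (UP along j = 0, LEFT along i = 0) are an addition so the trace-back can run to (0, 0), and ties are broken in the order UP, LEFT, DIAG.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment: returns (optimal score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # 2. Main iteration: fill in scores and pointers
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j] - d, "UP"),             # case 1
                          (F[i][j - 1] - d, "LEFT"),           # case 2
                          (F[i - 1][j - 1] + sigma, "DIAG")]   # case 3
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # 3. Termination: follow the pointers back from (M, N)
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```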
Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods
A variant of the basic algorithm:
Maybe it is OK to have an unlimited # of gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Then, we don’t want to penalize gaps in the ends
The Overlap Detection variant
Changes:
1. Initialization
   For all i, j: F(i, 0) = 0, F(0, j) = 0
2. Termination
   F_OPT = max { max_i F(i, N),  max_j F(M, j) }
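The two changes amount to a few lines of difference from the global algorithm. A score-only sketch, assuming s and d are positive penalties; the function name and the test sequences are made up for illustration.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at either end of x or y are free."""
    M, N = len(x), len(y)
    # Change 1: the first row and column are all zeros (leading gaps are free).
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)
    # Change 2: take the best score on the last row or column (trailing gaps are free).
    best_last_col = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_col, best_last_row)

# The suffix AGT of x overlaps the prefix AGT of y.
print(overlap_score("CCCAGT", "AGTGGG"))  # 3
```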
Different types of overlaps x y x y
A non-bio variant Shell command “diff” in unix –Given file1 and file2 –Find the difference between file1 and file2 –Similar to sequence alignment –How to score? Longest common subsequence (LCS) Match has score 1 No mismatch penalty No gap penalty
File1 A B C D E F File2 G B C E F
File1: A B C D E F
File2: G B C - E F
$ diff file1 file2
1c1
< A
---
> G
4d3
< D
LCS = 4
The LCS variant
Changes:
1. Initialization
   For all i, j: F(i, 0) = F(0, j) = 0
2. Filling in table
   F(i, j) = max { F(i-1, j),  F(i, j-1),  F(i-1, j-1) + σ(x_i, y_j) }
   where σ(x_i, y_j) = 1 if x_i = y_j and 0 otherwise.
3. Termination
   F_OPT = max { max_i F(i, N),  max_j F(M, j) }
   (this equals F(M, N), since without penalties F never decreases as i or j grows)
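A sketch of the LCS variant applied to the two example files, treating each file as a list of lines. Because there are no penalties, the table never decreases along a row or column, so the sketch simply returns F(M, N). The function name is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    M, N = len(a), len(b)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # F(i, 0) = F(0, j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if a[i - 1] == b[j - 1] else 0
            F[i][j] = max(F[i - 1][j], F[i][j - 1], F[i - 1][j - 1] + sigma)
    return F[M][N]

file1 = ["A", "B", "C", "D", "E", "F"]
file2 = ["G", "B", "C", "E", "F"]
print(lcs_length(file1, file2))  # 4 (the common lines B, C, E, F)
```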
What happens if you have 1 million lines of text in each file?
Slow
–What if the majority of the two files are the same? (e.g., two versions of a piece of software)
–Bounded DP
Memory inefficient
–At least 1000 GB memory (a 10^6 x 10^6 table has 10^12 cells)
–Linear-space algorithm, same time complexity
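One way to see why linear space is possible: each cell needs only the previous line of the table and the current one, so the optimal score can be computed while keeping two lines in memory. The sketch below does only that, with s and d assumed to be positive penalties and a hypothetical function name. Recovering the alignment itself in linear space takes a divide-and-conquer scheme such as Hirschberg's algorithm, which is not shown here.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(len(y)) memory."""
    prev = [-j * d for j in range(len(y) + 1)]        # F(0, j)
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)                # F(i, 0), rest filled below
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d)
        prev = curr                                   # drop the older line
    return prev[len(y)]

print(nw_score_linear_space("AGTA", "ATA"))  # 2
```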
See you next week