Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW:
2 Dynamic Programming Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.
3 Two key ingredients Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution: 1. Optimal substructures: each substructure is optimal (principle of optimality). 2. Overlapping subproblems: subproblems are dependent (otherwise, a divide-and-conquer approach is the choice).
4 Three basic components The development of a dynamic-programming algorithm has three basic components: –The recurrence relation (for defining the value of an optimal solution); –The tabular computation (for computing the value of an optimal solution); –The traceback (for delivering an optimal solution).
5 Fibonacci numbers The Fibonacci numbers are defined by the following recurrence: $F_0 = 0$, $F_1 = 1$, and $F_i = F_{i-1} + F_{i-2}$ for $i > 1$.
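6 How to compute F_10? [Figure: recursion tree. F_10 calls F_9 and F_8; F_9 calls F_8 and F_7; F_8 calls F_7 and F_6; and so on, so the same values are computed many times.]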
7 Tabular computation The tabular computation can avoid recomputation: fill in F_0, F_1, F_2, ..., F_10 from left to right, each entry from the two entries before it.
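The tabular computation can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual base cases F_0 = 0 and F_1 = 1; the function name `fib_table` is made up for this example.

```python
def fib_table(n):
    """Return the list [F_0, F_1, ..., F_n], filled in from left to right."""
    table = [0, 1]                     # assumed base cases: F_0 = 0, F_1 = 1
    for i in range(2, n + 1):
        # Each entry reuses the two stored entries before it: no recomputation.
        table.append(table[i - 1] + table[i - 2])
    return table[: n + 1]


print(fib_table(10)[10])  # 55
```

Each table entry is computed once, so F_n takes O(n) additions instead of the exponential number of calls made by the plain recursion.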
8 Maximum-sum interval Given a sequence of real numbers a_1 a_2 … a_n, find a consecutive subsequence with the maximum sum. Example sequence (partly lost in extraction): 9 –3 1 7 – –4 2 –7 6 – For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n^2) time.
9 O-notation: an asymptotic upper bound f(n) = O(g(n)) iff there exist two positive constants c and n_0 such that 0 ≤ f(n) ≤ c g(n) for all n ≥ n_0. [Figure: the curve c g(n) lies above f(n) for all n ≥ n_0.]
10 How do functions grow? (Assume one million operations per second.) [Table: running times of five functions of n (30n, 92n log n, 26n…, …n^3, and 2^n) for two values of n; most entries were lost in extraction. Surviving entries of the larger-n row: 2.6 min., 3.0 days, 22 yr.] For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!
11 Maximum-sum interval (The recurrence relation) Define S(i) to be the maximum sum of the intervals ending at position i. Then $S(i) = a_i + \max\{S(i-1), 0\}$: if S(i-1) < 0, concatenating a_i with its previous interval gives a smaller sum than a_i itself.
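12 Maximum-sum interval (Tabular computation) [Table: the example sequence with S(i) computed from left to right beneath it; the values were lost in extraction. The largest S(i) is the maximum sum.]
13 Maximum-sum interval (Traceback) [Table: the same S(i) row; tracing back from the largest S(i) to the position where the interval was last restarted delivers the maximum-sum interval. The values were lost in extraction.]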
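The three components for the maximum-sum interval (recurrence, tabular computation, traceback) fit into one linear scan. The sketch below assumes 0-based indices and a non-empty input; the name `max_sum_interval` and the sample list are illustrative only and are not the sequence from the slides.

```python
def max_sum_interval(a):
    """Return (best_sum, start, end), 0-based inclusive; a must be non-empty."""
    best_sum, best_start, best_end = a[0], 0, 0
    s, start = a[0], 0                 # s plays the role of S(i); start is where that interval begins
    for i in range(1, len(a)):
        if s < 0:                      # S(i-1) < 0: restart the interval at position i
            s, start = a[i], i
        else:                          # otherwise extend the interval ending at i-1
            s += a[i]
        if s > best_sum:               # remember the best S(i) and its interval (the traceback)
            best_sum, best_start, best_end = s, start, i
    return best_sum, best_start, best_end


print(max_sum_interval([2, -5, 4, -1, 3]))  # (6, 2, 4)
```

Keeping the start index alongside S(i) replaces an explicit traceback pass, so only O(1) extra space is needed.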
14 Two fundamental problems we recently solved (I) (joint work with Lin and Jiang) Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum: an O(n)-time algorithm. Example with U = 3 (sequence partly lost in extraction): 9 –3 1 7 – –4 2 –7 6 –
15 Joint work with Huang, Jiang and Lin Xiaoqiu Huang ( 黃曉秋 ) Iowa State University, USA Tao Jiang ( 姜濤 ) University of California Riverside, USA Yaw-Ling Lin ( 林耀鈴 ) Providence University, Taiwan
16 Two fundamental problems we recently solved (II) (joint work with Lin and Jiang) Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average: an O(n log L)-time algorithm. [The example, beginning "L =", was lost in extraction.]
17 Another example Given the sequence 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is 7, 6, 7 (boxed on the slide), which has the average value 20/3.
18 C+G rich regions Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time. ATGACTCGAGCTCGTCA Search for an interval of length at least L with the highest average.
19 Length-unconstrained version (maximum-average interval) The maximum element is the answer. It can be found in O(n) time.
20 Q: computing density in O(1) time? prefix-sum(i) = S[1] + S[2] + … + S[i] –all n prefix sums are computable in O(n) time. sum(i, j) = prefix-sum(j) – prefix-sum(i-1) density(i, j) = sum(i, j) / (j-i+1)
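The prefix-sum trick can be sketched as follows. Positions are 1-based as on the slide, with an extra entry p[0] = 0 so that prefix-sum(i-1) is defined for i = 1; the function names are chosen for this example.

```python
from itertools import accumulate


def prefix_sums(s):
    """Return p with p[0] = 0 and p[i] = s[1] + ... + s[i] (1-based positions)."""
    return [0] + list(accumulate(s))


def density(p, i, j):
    """Average of positions i..j (1-based, inclusive), in O(1) time."""
    return (p[j] - p[i - 1]) / (j - i + 1)


p = prefix_sums([2, 6.6, 6.6, 3, 7, 6, 7, 2])
print(density(p, 5, 7))  # about 6.67, the interval 7, 6, 7
```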
21 A naive algorithm A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time. Trying each length L, L+1, L+2, ..., n takes O(n^2) time in total.
22 An O(nL)-time algorithm (Huang, CABIOS '94) Observing that there exists an optimal interval of length less than 2L, we immediately have an O(nL)-time algorithm. The reason: we can bisect a region of length >= 2L into two segments, each of length >= L, and at least one of them has an average no smaller than that of the whole region.
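A minimal sketch of the O(nL)-time idea, assuming 0-based indices and prefix sums for O(1) averages. Only lengths L through 2L-1 are tried for each starting position; the function name is hypothetical.

```python
from itertools import accumulate


def max_average_interval(a, L):
    """Return (best_avg, i, j), 0-based inclusive, over intervals of length >= L."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    p = [0] + list(accumulate(a))      # p[k] = a[0] + ... + a[k-1]
    best = None
    for i in range(n - L + 1):
        # Right end j is inclusive; lengths run from L to at most 2L-1.
        for j in range(i + L - 1, min(i + 2 * L - 1, n)):
            avg = (p[j + 1] - p[i]) / (j - i + 1)
            if best is None or avg > best[0]:
                best = (avg, i, j)
    return best


avg, i, j = max_average_interval([2, 6.6, 6.6, 3, 7, 6, 7, 2], 2)
print(i, j)  # 4 6, the interval 7, 6, 7
```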
23 Good partners Find a good partner g(i) for each i = 1, 2, …, n-L+1: the right endpoint g(i) that maximizes avg[i, g(i)] among all intervals starting at i with length at least L.
24 Right-Skew Decomposition Partition S into substrings S_1, S_2, …, S_k such that –each S_i is a right-skew substring of S, i.e., the average of any prefix of S_i is always less than or equal to the average of the remaining suffix; –density(S_1) > density(S_2) > … > density(S_k). [Lin, Jiang, Chao] The decomposition is –unique; –computable in linear time.
25 An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02) Decreasingly right-skew decomposition (O(n) time)
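One way to obtain a decreasingly right-skew decomposition in linear time is to scan from right to left and merge a segment with its right neighbour whenever it is no denser than that neighbour. The sketch below is an illustration under that assumption, not a transcription of the authors' code; names and the 0-based indexing are choices made here.

```python
def right_skew_decomposition(a):
    """Return segments (start, end), 0-based inclusive, from left to right."""
    stack = []                          # top of stack = leftmost segment built so far
    for i in range(len(a) - 1, -1, -1):
        start, end, total = i, i, a[i]
        # Merge while the current segment is no denser than its right neighbour.
        # Densities are compared by cross-multiplication to avoid division.
        while stack:
            n_start, n_end, n_total = stack[-1]
            if total * (n_end - n_start + 1) <= n_total * (end - start + 1):
                stack.pop()
                end, total = n_end, total + n_total
            else:
                break
        stack.append((start, end, total))
    return [(s, e) for s, e, _ in reversed(stack)]


print(right_skew_decomposition([5, 1, 4]))  # [(0, 0), (1, 2)]
```

Every merge removes one segment from the stack, so the total number of merges is at most n and the scan runs in O(n) time.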
26 An O(n log L)-time algorithm Jumping tables that allow binary search. –O(log L) time for each position to find its good partner; therefore the total running time is O(n log L). We also implemented a program that linearly scans right-skew segments for the good partnership. In our empirical tests, its running time grew linearly.
27 Goldwasser, Kao, and Lu's recent progress An O(n)-time algorithm for the problem –Appeared in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), Rome, Italy, September 2002.
28 A new important observation i < j < g(j) < g(i) implies –density(i, g(i)) is no more than density(j, g(j)). [Figure: the interval [j, g(j)] nested inside the interval [i, g(i)].]
29 k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics) Maintain all candidates in a priority queue. A new k-best algorithm + algorithms on trees (Lin and Chao (2003))
30 Longest increasing subsequence (LIS) The longest increasing subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers a_1 a_2 … a_n. [The example sequence, with its increasing and non-increasing subsequences, was lost in extraction.] We want to find a longest one.
31 A naive approach for LIS Let L[i] be the length of a longest increasing subsequence ending at position i. $L[i] = 1 + \max_{0 \le j \le i-1} \{ L[j] \mid a_j < a_i \}$ (use a dummy a_0 = minimum, and L[0] = 0)
32 A naive approach for LIS $L[i] = 1 + \max_{0 \le j \le i-1} \{ L[j] \mid a_j < a_i \}$ [Table: the L[i] row for the example sequence, with the maximum length marked; values lost in extraction.] The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in O(n^2) time.
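The quadratic method, including the traceback, can be sketched as below. The input list in the usage line is an assumed example chosen so that 2, 3, 7, 8, 10, 13 is one of its longest increasing subsequences; the code may return a different subsequence of the same length.

```python
def lis_quadratic(a):
    """Return one longest increasing subsequence of a, in O(n^2) time."""
    n = len(a)
    if n == 0:
        return []
    length = [1] * n                   # length[i] plays the role of L[i]
    prev = [-1] * n                    # prev[i]: predecessor of position i, for the traceback
    for i in range(n):
        for j in range(i):
            if a[j] < a[i] and length[j] + 1 > length[i]:
                length[i], prev[i] = length[j] + 1, j
    i = max(range(n), key=length.__getitem__)   # where a longest subsequence ends
    out = []
    while i != -1:                     # traceback along the stored predecessors
        out.append(a[i])
        i = prev[i]
    return out[::-1]


print(len(lis_quadratic([9, 2, 5, 3, 7, 11, 8, 10, 13, 6])))  # 6
```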
33 Binary search Given an ordered sequence x_1 x_2 ... x_n, where x_1 < x_2 < ... < x_n, and a number y, a binary search finds the largest x_i such that x_i < y in O(log n) time. [Figure: the search range shrinks from n to n/2 to n/4, and so on.]
34 Binary search How many steps does a binary search need to reduce the problem size to 1? The size shrinks as n, n/2, n/4, n/8, …, 1, so O(log n) steps suffice.
35 An O(n log n) method for LIS Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k. [Table: BestEnd[1], BestEnd[2], …, BestEnd[6] for the example sequence; values lost in extraction.]
36 An O(n log n) method for LIS Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k. For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).
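A short sketch of the BestEnd method, using the standard-library `bisect_left` for the binary search. The list `best_end` is 0-based, so `best_end[k-1]` stands for BestEnd[k]; the function computes only the length, and the input in the usage line is an assumed example.

```python
from bisect import bisect_left


def lis_length(a):
    """Return the length of a longest strictly increasing subsequence of a."""
    best_end = []                      # best_end[k-1] plays the role of BestEnd[k]
    for x in a:
        k = bisect_left(best_end, x)   # first slot whose value is >= x
        if k == len(best_end):
            best_end.append(x)         # x extends the longest subsequence found so far
        else:
            best_end[k] = x            # x is a smaller ending number for length k+1
    return len(best_end)


print(lis_length([9, 2, 5, 3, 7, 11, 8, 10, 13, 6]))  # 6
```

The list `best_end` stays sorted after every update, which is what makes the binary search valid.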
37 Longest Common Subsequence (LCS) A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.
38 LCS For instance, Sequence 1: president Sequence 2: providence Their LCS is priden.
39 LCS Another example: Sequence 1: algorithm Sequence 2: alignment One of their LCSs is algm.
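40 How to compute LCS? Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n. len(i, j): the length of an LCS between a_1 a_2 … a_i and b_1 b_2 … b_j. With proper initializations (len(i, 0) = len(0, j) = 0), len(i, j) can be computed as follows: $len(i, j) = len(i-1, j-1) + 1$ if $a_i = b_j$; otherwise $len(i, j) = \max\{len(i-1, j),\ len(i, j-1)\}$.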
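The LCS computation can be sketched with the standard recurrence: extend the diagonal entry on a match, otherwise take the larger of the entries above and to the left. The table `length` stands for len(i, j) and the function name `lcs` is chosen for this example. When several LCSs exist, the traceback returns one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # length[i][j] plays the role of len(i, j); row 0 and column 0 stay 0.
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Traceback from (m, n) to recover one optimal solution.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(len(lcs("president", "providence")))  # 6
```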
45 Bioinformatics
46 Bioinformatics and Computational Biology-Related Journals: Bioinformatics (previously called CABIOS); Bulletin of Mathematical Biology; Computers and Biomedical Research; Genome Research; Genomics; Journal of Bioinformatics and Computational Biology; Journal of Computational Biology; Journal of Molecular Biology; Nature; Science
47 Bioinformatics and Computational Biology-Related Conferences: the first IEEE Computer Society Bioinformatics Conference (CSB 2002, CA, USA); Intelligent Systems for Molecular Biology (ISMB 2003, Brisbane, Australia); Pacific Symposium on Biocomputing (PSB 2003, Kauai, Hawaii, USA); The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2003, Berlin, Germany)
48 Bioinformatics and Computational Biology-Related Books: Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995); Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995); Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996); Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997); Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000); Introduction to Bioinformatics, by Arthur M. Lesk (2002)
49 Useful Websites MIT Biology Hypertextbook (…7001main.html); The International Society for Computational Biology; National Center for Biotechnology Information (NCBI, NIH); European Bioinformatics Institute (EBI); DNA Data Bank of Japan (DDBJ). [The URLs were lost in extraction.]
50 Sequence Alignment
51 Dot Matrix Sequence A: CTTAACT Sequence B: CGGATCAT [Figure: dot matrix with the letters of Sequence B along one axis and the letters of Sequence A along the other.]
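52 Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B:
    C---TTAACT  (Sequence A)
    CGGATCA--T  (Sequence B)
53 Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B:
    C---TTAACT
    CGGATCA--T
The three gap symbols in Sequence A form an insertion gap; C/C, T/T, and A/A columns are matches; the T/C column is a mismatch; the two gap symbols in Sequence B form a deletion gap.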
54 Alignment Graph Sequence A: CTTAACT Sequence B: CGGATCAT [Figure: grid graph over the two sequences; the alignment below corresponds to a path through the grid.]
    C---TTAACT
    CGGATCA--T
55 A simple scoring scheme Match: +8 (w(x, y) = 8, if x = y) Mismatch: -5 (w(x, y) = -5, if x ≠ y) Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
     C  -  -  -  T  T  A  A  C  T
     C  G  G  A  T  C  A  -  -  T
    +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
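Scoring a given alignment under this simple scheme is a column-by-column sum. A minimal sketch follows; the function names `w` and `alignment_score` are chosen for illustration.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column under the simple scheme."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```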
56 An optimal alignment: the alignment of maximum score Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n. S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j. With proper initializations, S_{i,j} can be computed as follows: $S_{i,j} = \max\{ S_{i-1,j} + w(a_i, -),\ S_{i,j-1} + w(-, b_j),\ S_{i-1,j-1} + w(a_i, b_j) \}$.
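57 Computing S_{i,j} [Figure: entry S_{i,j} is reached from S_{i-1,j} with w(a_i, -), from S_{i,j-1} with w(-, b_j), and from S_{i-1,j-1} with w(a_i, b_j); the optimal score is S_{m,n}.]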
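58 Initializations [Table: the matrix for A = CTTAACT and B = CGGATCAT; with a gap-symbol score of -3, the first row and the first column hold 0, -3, -6, and so on.]
59 S_{3,5} = ? [Table: the partially filled matrix; values lost in extraction.]
60 S_{3,5} = ? [Table: the completely filled matrix; the last entry, S_{7,8}, is the optimal score.]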
61 [Table: the filled matrix with the traceback path of an optimal alignment.]
     C  T  T  A  A  C  -  T
     C  G  G  A  T  C  A  T
    +8 -5 -5 +8 -5 +8 -3 +8 = 14
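The whole global-alignment pipeline (initializations, table filling, traceback) can be sketched as follows under the simple scoring scheme. Function and parameter names are illustrative; when several optimal alignments exist, the traceback returns one of them.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, row_a, row_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: a_1..a_i against gaps only
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: b_1..b_j against gaps only
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # Traceback from S[m][n] back to S[0][0].
    row_a, row_b, i, j = [], [], m, n
    while i > 0 or j > 0:
        diagonal = False
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = S[i][j] == S[i - 1][j - 1] + sub
        if diagonal:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


print(global_alignment("CTTAACT", "CGGATCAT")[0])  # 14
```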
62 Global Alignment vs. Local Alignment [Figure: a global alignment spans both sequences from end to end; a local alignment covers only a pair of substrings.]
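63 An optimal local alignment S_{i,j}: the score of an optimal local alignment ending at a_i and b_j. With proper initializations, S_{i,j} can be computed as follows: $S_{i,j} = \max\{ 0,\ S_{i-1,j} + w(a_i, -),\ S_{i,j-1} + w(-, b_j),\ S_{i-1,j-1} + w(a_i, b_j) \}$.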
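64 local alignment Match: 8 Mismatch: -5 Gap symbol: -3 [Table: the partially filled local-alignment matrix for A = CTTAACT and B = CGGATCAT; values lost in extraction.]
65 local alignment Match: 8 Mismatch: -5 Gap symbol: -3 [Table: the filled matrix; its largest entry is the best score.]
66 local alignment [Table: the filled matrix with the traceback path from the best score.]
     A  -  C  -  T
     A  T  C  A  T
    +8 -3 +8 -3 +8 = 18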
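A sketch of the local-alignment computation under the same scoring scheme: entries are clamped at 0, the best score is the largest entry, and the traceback stops at the first 0. Names are illustrative.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best_score, row_a, row_b) for one optimal local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j
    # Traceback from the best entry until an entry equal to 0 is reached.
    row_a, row_b, i, j = [], [], best_i, best_j
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_alignment("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```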
67 Affine gap penalties Match: +8 (w(x, y) = 8, if x = y) Mismatch: -5 (w(x, y) = -5, if x ≠ y) Each gap symbol: -3 (w(-,x)=w(x,-)=-3) Each gap is charged an extra gap-open penalty: -4.
    C---TTAACT
    CGGATCA--T
Alignment score: 12 - 4 - 4 = 4 (the simple score of 12, minus one gap-open penalty for each of the two gaps).
68 Affine gap penalties A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty. Three cases for alignment endings: 1. ...x over ...x (an aligned pair); 2. ...x over ...- (a deletion); 3. ...- over ...x (an insertion).
69 Affine gap penalties Let D(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.
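70 Affine gap penalties $D(i, j) = \max\{ D(i-1, j) - y,\ S(i-1, j) - x - y \}$; $I(i, j) = \max\{ I(i, j-1) - y,\ S(i, j-1) - x - y \}$; $S(i, j) = \max\{ S(i-1, j-1) + w(a_i, b_j),\ D(i, j),\ I(i, j) \}$.
71 Affine gap penalties [Figure: each cell holds three values S, I, and D; extending a gap costs -y, opening a gap from S costs -x-y, and the diagonal step into S adds w(a_i, b_j).]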
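A score-only sketch of alignment with affine gap penalties, using three tables D, I, and S as defined above: a gap is either extended at cost y or opened from S at cost x + y. The recurrences in the code are the standard three-table formulation, assumed here to match the slide's figure. The defaults x = 4 and y = 3 follow the running example, and the boundary values (a leading gap of length k costs x + k·y) are an assumption of this sketch.

```python
def affine_global_score(a, b, match=8, mismatch=-5, gap_open=4, gap_symbol=3):
    """Optimal global score when a gap of length k costs gap_open + k * gap_symbol."""
    neg_inf = float("-inf")
    m, n = len(a), len(b)
    S = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    D = [[neg_inf] * (n + 1) for _ in range(m + 1)]   # alignments ending with a deletion
    I = [[neg_inf] * (n + 1) for _ in range(m + 1)]   # alignments ending with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # assumed boundary: one leading gap of length i
        D[i][0] = S[i][0] = -gap_open - i * gap_symbol
    for j in range(1, n + 1):          # assumed boundary: one leading gap of length j
        I[0][j] = S[0][j] = -gap_open - j * gap_symbol
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend an existing gap (-y) or open a new one from S (-x - y).
            D[i][j] = max(D[i - 1][j] - gap_symbol, S[i - 1][j] - gap_open - gap_symbol)
            I[i][j] = max(I[i][j - 1] - gap_symbol, S[i][j - 1] - gap_open - gap_symbol)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, D[i][j], I[i][j])
    return S[m][n]


print(affine_global_score("CTTAACT", "CGGATCAT"))  # 10
```

With `gap_open=0` the computation reduces to the simple scoring scheme and returns 14 for the same pair of sequences.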
72 k best local alignments Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987); FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985); BLAST (Altschul et al., 1990; Altschul et al., 1997)
73 k best local alignments Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) –linear-space version: sim (Huang and Miller, 1991) –linear-space variants: sim2 (Chao et al., 1995); sim3 (Chao et al., 1997) –chaining local alignments: Gap3 (Huang and Chao, 2003). FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) –linear-space band alignment (Chao et al., 1992). BLAST (Altschul et al., 1990; Altschul et al., 1997) –Enhanced PatternHunter (Yang et al., coming soon)
74 FASTA 1) Find runs of identities, and identify regions with the highest density of identities. 2) Re-score using PAM matrix, and keep top scoring segments. 3) Eliminate segments that are unlikely to be part of the alignment. 4) Optimize the alignment in a band.
75 FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities. [Figure: dot plot of Sequence A against Sequence B.]
76 FASTA Step 2: Re-score using PAM matrix, and keep top scoring segments.
77 FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.
78 FASTA Step 4: Optimize the alignment in a band.
79 BLAST Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman) The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
80 The maximal segment pair measure A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences (for DNA: identities +5; mismatches -4). The MSP score may be computed in time proportional to the product of the sequence lengths. (How?) An exact procedure is too time-consuming. BLAST heuristically attempts to calculate the MSP score.
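One possible answer to the "(How?)" is a gap-free variant of the local-alignment recurrence: the best segment pair ending at (i, j) either extends the best pair ending at (i-1, j-1) or starts afresh. The sketch below is an assumption about the intended answer, not taken from the slides; it keeps two rows, so it uses time proportional to the product of the lengths and linear space.

```python
def msp_score(a, b, identity=5, mismatch=-4):
    """Score of the best ungapped pair of equal-length segments of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)          # best scores of segment pairs ending in the previous row
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            s = identity if x == y else mismatch
            # Either extend the pair ending at the previous diagonal cell or start afresh.
            cur[j] = max(prev[j - 1], 0) + s
            best = max(best, cur[j])
        prev = cur
    return best


print(msp_score("AGATCGAT", "GATC"))  # 20
```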
81 BLAST 1) Build the hash table for Sequence A. 2) Scan Sequence B for hits. 3) Extend hits.
82 BLAST Step 1: Build the hash table for Sequence A. (3-tuple example) For DNA sequences: Seq. A = AGATCGAT. The table has one entry for each 3-tuple from AAA to TTT; the non-empty entries list the start positions in A: AGA: 1; ATC: 3; CGA: 5; GAT: 2, 6; TCG: 4. For protein sequences: Seq. A = ELVIS. Add xyz to the hash table if Score(xyz, ELV) ≥ T; add xyz to the hash table if Score(xyz, LVI) ≥ T; add xyz to the hash table if Score(xyz, VIS) ≥ T.
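For DNA, building the table and scanning for exact hits can be sketched with a dictionary from 3-tuples to 1-based start positions. This is a toy illustration of the idea, not BLAST's actual data structure; the query `"CGATC"` is a made-up Sequence B.

```python
from collections import defaultdict


def build_word_table(seq, k=3):
    """Map each k-tuple of seq to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def find_hits(table, seq_b, k=3):
    """Return (position in A, position in B) pairs, 1-based, for exact word hits."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in table.get(seq_b[j:j + k], []):
            hits.append((i, j + 1))
    return hits


table = build_word_table("AGATCGAT")
print(table["GAT"])                # [2, 6]
print(find_hits(table, "CGATC"))   # [(5, 1), (2, 2), (6, 2), (3, 3)]
```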
83 BLAST Step 2: Scan Sequence B for hits.
84 BLAST Step 2: Scan Sequence B for hits. Step 3: Extend hits. Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.) BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
85 Remarks Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. The idea of filtration was used in both FASTA and BLAST.
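86 Linear-space ideas (Hirschberg, 1975; Myers and Miller, 1988) [Figure: the matrix is split at row m/2.]
87 Two subproblems: ½ original problem size [Figure: the two subproblems are split again at rows m/4 and 3m/4.]
88 Four subproblems: ¼ original problem size [Figure: the matrix with splits at rows m/4, m/2, and 3m/4.]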
89 Time and Space Complexity Space: O(M+N) Time: O(MN) × (1 + ½ + ¼ + …) = O(MN), since the series sums to at most 2.
90 Band Alignment (Joint work with W. Pearson and W. Miller) Sequence B Sequence A
91 Band Alignment in Linear Space The remaining subproblems are no longer only half of the original problem. In the worst case, this could cause an additional log n factor in time.
92 Band Alignment in Linear Space
93 Multiple sequence alignment (MSA) The multiple sequence alignment problem is to simultaneously align more than two sequences. Seq1: GCTC Seq2: AC Seq3: GATC
    GC-TC
    A---C
    G-ATC
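94 How to score an MSA? Sum-of-Pairs (SP-score): the score of the MSA is the sum of the scores of its induced pairwise alignments. Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)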
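The SP-score can be sketched as a sum over all pairs of rows. The pairwise scheme used here (match +8, mismatch -5, gap symbol -3) is borrowed from the earlier slides, and scoring a column of two gap symbols as 0 is an assumption of this sketch.

```python
from itertools import combinations


def sp_score(rows, match=8, mismatch=-5, gap=-3):
    """Sum-of-pairs score of an MSA given as equal-length rows."""
    def column(x, y):
        if x == "-" and y == "-":
            return 0                   # assumption: two gap symbols score 0
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    return sum(column(x, y)
               for r1, r2 in combinations(rows, 2)
               for x, y in zip(r1, r2))


print(sp_score(["GC-TC", "A---C", "G-ATC"]))  # 12
```

Under these assumptions the three induced pairwise alignments score -3, 18, and -3, which sum to 12.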
95 MSA for three sequences: an O(n^3) algorithm
96 General MSA For k sequences of length n: O(n^k). NP-complete (Wang and Jiang). The exact multiple alignment algorithms for many sequences are not feasible. Some approximation algorithms are given (e.g., 2 - l/k for any fixed l, by Bafna et al.).
97 Progressive alignment A heuristic approach proposed by Feng and Doolittle. It iteratively merges the most similar pairs. “Once a gap, always a gap” [Figure: guide tree over sequences A, B, C, D, E.] The time for progressive alignment in most cases is roughly the order of the time for computing all pairwise alignments, i.e., O(k^2 n^2).
98 Concluding remarks Three essential components of the dynamic-programming approach: –the recurrence relation –the tabular computation –the traceback The dynamic-programming approach has been used in a vast number of computational problems in bioinformatics.