CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms
Roadmap
– Review of last lecture
– More global sequence alignment algorithms
Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d
we can easily compute an optimal alignment by dynamic programming.
In a completed alignment between a pair of sequences X = x_1 x_2 … x_M and Y = y_1 y_2 … y_N, any column of the alignment falls into one of only three cases:
– x_i is aligned to y_j
– x_i is aligned to a gap
– y_j is aligned to a gap
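Since the alignment score F(M, N) is a sum over all aligned columns, it can be broken down as

$$F(M, N) = \max \begin{cases} F(M-1, N-1) + \sigma(x_M, y_N) \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases}$$

where $\sigma(a, b)$ is $m$ if $a = b$ and $-s$ otherwise.

And recursively:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$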
[Figure: the dynamic-programming matrix, filled from F(0,0) to F(M,N).]
Trace-back
[Figure: the filled F(i, j) matrix with the trace-back path; the matrix values did not survive extraction. The resulting alignment is:]
AGTA
A-TA
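The recurrence and trace-back can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch` and the default values m = s = d = 1 are assumptions, and ties during trace-back are broken in favor of the diagonal.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment with match m, mismatch -s, gap -d (assumed defaults)."""
    M, N = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,  # x_i aligned to y_j
                          F[i - 1][j] - d,        # x_i aligned to a gap
                          F[i][j - 1] - d)        # y_j aligned to a gap
    # Trace back from (M, N) to (0, 0), re-deriving which case was taken.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = m if i > 0 and j > 0 and x[i - 1] == y[j - 1] else -s
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Both time and space are O(MN) here, because the whole matrix is kept for the trace-back.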
Graph representation
[Figure: alignment grid graph for sequences S1 and S2, with source (0, 0) and sink (3, 4).]
– Number of steps: length of the alignment
– Path length: alignment score
– Alignment: find the longest path from (0, 0) to (3, 4)
– The general longest-path problem cannot be solved by DP. The longest path on this graph can be found by DP because no cycle is possible.
– Edges: horizontal and vertical edges are gaps (one direction is a gap in the 2nd sequence, the other a gap in the 1st sequence); diagonal edges are matches / mismatches
– Values on vertical/horizontal edges: -d
– Values on diagonal edges: m or -s
Question
If we change the scoring scheme, will the optimal alignment change?
– Original: match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: match = 2, mismatch = gap = -2?
Number of alignments
The number of alignments equals the number of distinct paths from (0, 0) to (m, n).
Example: aligning A with BC gives five alignments:
A-   A--   --A   -A-   -A
BC   -BC   BC-   B-C   BC
How to count? –Homework assignment –Hint: dynamic programming –Or analytically
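One way to follow the dynamic-programming hint is sketched below. It assumes that every path counts as a distinct alignment, so each node receives the paths arriving from its diagonal, upper, and left neighbors; the function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count distinct paths from (0, 0) to (m, n) using diagonal,
    vertical, and horizontal steps (a sketch of the DP hint)."""
    # Along the first row and first column there is exactly one path.
    C = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j] + C[i][j - 1]
    return C[m][n]


print(count_alignments(1, 2))  # 5
```

The value 5 for lengths 1 and 2 agrees with the five alignments of A with BC.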
However
Biologically meaningful “distinct” alignments may be far fewer.
– All three alignments below may be considered equivalent
– In each of them, A, B, and C are all aligned to gaps
A--   --A   -A-
-BC   BC-   B-C
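Number of alignments
We only care about who is aligned to whom, not where the gaps are.
For two sequences of length m and n, there may be k matches (aligned pairs), k = 0 to min(m, n). Choosing which k residues of each sequence are paired fixes the alignment, because paired residues must keep their order.
Number of alignments:

$$\sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} = \binom{m+n}{m}$$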
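Furthermore
Alternating gaps are discouraged / prohibited.
– Compare aligning A directly to B (one diagonal step, score m or -s) with placing A against a gap and then B against a gap (two gap steps, score -2d).
– With most scoring schemes, alternating gaps never appear in an optimal alignment. This holds as long as 2d > s, because then -s > -2d.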
Do we need a special trick to exclude alternating gaps? No. With most scoring schemes this is achieved automatically:
– 2d > s
Number of alignments
Homework assignment
Dynamic programming
– Multiple matrices
– Three states:
  – Came from diagonal: can go in any of the three directions
  – Came from left: cannot go down
  – Came from above: cannot turn right
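The three-state idea can be sketched with three matrices, one per "came from" state. This is one possible reading of the rules above, not the official homework solution; the names `count_no_alternating`, `D`, `L`, and `U` are hypothetical, and "above" is taken to mean a step from (i-1, j) to (i, j).

```python
def count_no_alternating(m, n):
    """Count paths from (0, 0) to (m, n) in which a gap in one sequence
    is never immediately followed by a gap in the other."""
    # D/L/U[i][j]: number of valid paths ending at (i, j) whose last step
    # was diagonal / from the left / from above.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    U = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 1  # the empty path is treated as "came from diagonal"
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # a diagonal step may follow any state
                D[i][j] = D[i - 1][j - 1] + L[i - 1][j - 1] + U[i - 1][j - 1]
            if j > 0:
                # a step from the left may not follow "came from above"
                L[i][j] = D[i][j - 1] + L[i][j - 1]
            if i > 0:
                # a step from above may not follow "came from left"
                U[i][j] = D[i - 1][j] + U[i - 1][j]
    return D[m][n] + L[m][n] + U[m][n]


print(count_no_alternating(1, 2))  # 2
```

For A against BC, only the two alignments that pair A with B or with C survive, since every all-gap alignment contains an alternating gap.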
Given two sequences of length M, N
– Time: O(MN) (ok)
– Space: O(MN) (bad)
– 1 Mb seq x 1 Mb seq = 1000 G memory
Can we do better?
In biology, this kind of alignment is unlikely to be meaningful.
[Figure: an alignment of the sequences abcde and vwxyz.]
A good alignment should lie near the diagonal of the matrix.
Bounded Dynamic Programming
If we know that x and y are very similar:
Assumption: # gaps(x, y) < k
Then x_i aligned to y_j implies | i – j | < k
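Bounded Dynamic Programming
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i – k)…min(N, i + k)

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k \\ F(i-1, j) - d, & \text{if } j < i + k \end{cases}$$

Here $\sigma(x_i, y_j)$ is the match/mismatch score ($m$ or $-s$).
Termination: same
[Figure: DP matrix with x_1 … x_M along one axis and y_1 … y_N along the other; only a band of width k around the diagonal is filled.]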
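A score-only version of the bounded recurrence might look like the sketch below. It is an illustration under assumptions: the name `banded_score`, the defaults m = s = d = 1, and the choice to return `None` when cell (M, N) falls outside the band are not from the slides. Each row stores only the 2k + 1 cells with | i – j | <= k, shifted so that column c holds j = i – k + c.

```python
def banded_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to the band |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return None  # the end cell (M, N) is not inside the band
    NEG = float("-inf")  # marks cells outside the band
    W = 2 * k + 1
    prev = [NEG] * W
    for j in range(0, min(N, k) + 1):  # row 0: F(0, j) = -j*d
        prev[j + k] = -j * d
    for i in range(1, M + 1):
        cur = [NEG] * W
        for j in range(max(0, i - k), min(N, i + k) + 1):
            c = j - i + k  # position of F(i, j) inside the stored row
            if j == 0:
                cur[c] = -i * d
                continue
            sub = m if x[i - 1] == y[j - 1] else -s
            best = prev[c] + sub                   # F(i-1, j-1), same offset
            if c > 0:                              # j > i - k
                best = max(best, cur[c - 1] - d)   # F(i, j-1)
            if c + 1 < W:                          # j < i + k
                best = max(best, prev[c + 1] - d)  # F(i-1, j)
            cur[c] = best
        prev = cur
    return prev[N - M + k]


print(banded_score("AGTA", "ATA", k=1))  # 2
```

Only two rows of width 2k + 1 are kept, so the score needs O(k) space and O(kM) time; recovering the alignment itself would require keeping all M rows, which is the O(kM) space quoted in the analysis.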
Analysis
– Time: O(kM) << O(MN)
– Space: O(kM) with some tricks
[Figure: the diagonal band of width 2k stored as a compact array with M rows.]
What if we don’t know k?
Iterate:
– For k = 2, 4, 8, 16, …
– For each k, we can compute an optimal bounded alignment with score S_k
– Stop when (min(N, M) – k) · m – 2kd < S_k, since we will not be able to get a higher score with a larger k
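The doubling loop can be sketched as follows. The stopping test is copied from the slide as stated; everything else is an assumption for illustration, including the names `banded_score` and `score_with_unknown_k`, the defaults m = s = d = 1, and the extra exit once the band covers the whole matrix.

```python
def banded_score(x, y, k, m=1, s=1, d=1):
    """Score of the best alignment inside the band |i - j| <= k, or None."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return None
    NEG, W = float("-inf"), 2 * k + 1
    prev = [NEG] * W
    for j in range(min(N, k) + 1):
        prev[j + k] = -j * d
    for i in range(1, M + 1):
        cur = [NEG] * W
        for j in range(max(0, i - k), min(N, i + k) + 1):
            c = j - i + k
            if j == 0:
                cur[c] = -i * d
                continue
            best = prev[c] + (m if x[i - 1] == y[j - 1] else -s)
            if c > 0:
                best = max(best, cur[c - 1] - d)
            if c + 1 < W:
                best = max(best, prev[c + 1] - d)
            cur[c] = best
        prev = cur
    return prev[N - M + k]


def score_with_unknown_k(x, y, m=1, s=1, d=1):
    """Double k until the slide's stopping test holds; returns (S_k, k)."""
    M, N = len(x), len(y)
    k = 2
    while True:
        S = banded_score(x, y, k, m, s, d)
        if S is not None:
            covers_everything = k >= max(M, N)  # band is the full matrix
            bound = (min(M, N) - k) * m - 2 * k * d
            if covers_everything or bound < S:
                return S, k
        k *= 2


print(score_with_unknown_k("AGTA", "ATA"))  # (2, 2)
```

Because k doubles each round, the total work is dominated by the last round, which is why the iteration costs little more than knowing k in advance.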
Linear space algorithm
If all we need is the alignment score but not the alignment, easy!
– We only need to keep two rows (if you are crafty enough, you only need one row)
But how do we get the alignment?
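The score-only case can be sketched as follows, keeping just the previous and the current row. The function name `nw_score_two_rows` and the defaults m = s = d = 1 are assumptions for illustration.

```python
def nw_score_two_rows(x, y, m=1, s=1, d=1):
    """Global alignment score in O(N) space; no trace-back is possible."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]  # row 0: F(0, j) = -j*d
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * N           # F(i, 0) = -i*d
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub,  # diagonal
                         prev[j] - d,        # from above
                         cur[j - 1] - d)     # from the left
        prev = cur                           # discard the older row
    return prev[N]


print(nw_score_two_rows("AGTA", "ATA"))  # 2
```

Since every row except the last is discarded, the trace-back pointers are gone, which is exactly the problem the following slides address.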
Linear space algorithm
When we finish, we know how we have aligned the ends of the sequences (x_M and y_N).
Naïve idea: repeat on the smaller subproblem F(M-1, N-1)
Time complexity: O((M+N)(MN))
Hirschberg’s idea
Divide and conquer!
Forward algorithm: align x_1 x_2 … x_{M/2} with Y.
F(M/2, k) represents the best alignment between x_1 x_2 … x_{M/2} and y_1 y_2 … y_k
Backward algorithm
Align reverse(x_{M/2+1} … x_M) with reverse(Y).
B(M/2, k) represents the best alignment between reverse(x_{M/2+1} x_{M/2+2} … x_M) and reverse(y_{k+1} y_{k+2} … y_N)
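Lemma
F(M/2, k) + B(M/2, k) is the score of the best alignment under the constraint that x_1 … x_{M/2} is aligned with y_1 … y_k, that is, the alignment path passes through node (M/2, k).

$$F(M, N) = \max_{k = 0 \ldots N} \big( F(M/2, k) + B(M/2, k) \big)$$

[Figure: the matrix split at row M/2; F(M/2, k) covers the part before column k*, B(M/2, k) the part after it.]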
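Example: the longest path from (0, 0) to (6, 6) is max_k ( LP(0, 0, 3, k) + LP(3, k, 6, 6) ), where LP(a, b, c, d) is the longest path from (a, b) to (c, d).
[Figure: grid from (0, 0) to (6, 6) with candidate midpoints (3, 0), (3, 2), (3, 4), (3, 6) on row 3.]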
Linear-space alignment
Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N.
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting row M/2 from k*
Conclusion: in O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2
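Linear-space alignment
Iterate this procedure on the two sub-problems!
[Figure: the matrix split at (M/2, k*) into an upper-left sub-problem of size M/2 x k* and a lower-right sub-problem of size M/2 x (N – k*).]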
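The whole divide-and-conquer procedure can be sketched as below. This is a minimal illustration rather than a reference implementation: the names `last_row` and `hirschberg`, the defaults m = s = d = 1, and the way the single-character base case is handled are assumptions. String slicing is used for readability, which copies substrings; an index-based version would avoid that.

```python
def last_row(x, y, m, s, d):
    """Last row of the score matrix for x vs. y, using two rows of space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return one optimal global alignment (ax, ay) of x and y."""
    M, N = len(x), len(y)
    if M == 0:
        return "-" * N, y
    if N == 0:
        return x, "-" * M
    if M == 1:
        # Base case: pair the single residue with its best partner in y,
        # unless leaving it against a gap scores higher.
        j = max(range(N), key=lambda t: m if y[t] == x else -s)
        sub = m if y[j] == x else -s
        if sub - (N - 1) * d >= -(N + 1) * d:
            return "-" * j + x + "-" * (N - 1 - j), y
        return x + "-" * N, "-" + y
    mid = M // 2
    F = last_row(x[:mid], y, m, s, d)               # forward scores F(M/2, k)
    B = last_row(x[mid:][::-1], y[::-1], m, s, d)   # backward, on reversed input
    # B[N - k] is the backward score for the suffix y[k:], i.e. B(M/2, k).
    k = max(range(N + 1), key=lambda t: F[t] + B[N - t])
    left = hirschberg(x[:mid], y[:k], m, s, d)
    right = hirschberg(x[mid:], y[k:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("AGTA", "ATA"))  # ('AGTA', 'A-TA')
```

Each call keeps only O(N) scores alive, and the two recursive calls together cover half of the matrix area of their parent.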
Analysis
Memory: O(N) for computation, O(N+M) to store the optimal alignment
Time:
– MN for the first iteration
– k · M/2 + (N – k) · M/2 = MN/2 for the second
– …
Total time:

$$MN + \frac{MN}{2} + \frac{MN}{4} + \frac{MN}{8} + \cdots = MN\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \cdots\right) = 2MN = O(MN)$$