Eugene W. Myers and Webb Miller
Outline: Introduction; Gotoh's algorithm; O(N)-space Gotoh's algorithm; Main algorithm; Implementation; Conclusion
Introduction. Space, not time, is the limiting factor. Hirschberg’s Algorithm: maximizing the similarity score of an alignment. Gotoh’s Algorithm: minimizing the difference score of a conversion. Goal: a linear-space version for affine gap penalties. With a megabyte of memory: Myers and Miller: sequences of length [value not shown]; Altschul and Erickson: sequences of length < 1070.
Transformation (1/2) Hirschberg’s Algorithm; Gotoh’s Algorithm; Aligned Pair; Affine Gap Penalties
Transformation (2/2) Match = 8, Mismatch = -5, Gap Symbol = -3, Gap-open = -4
Example (1/2)
              Hirschberg’s Algorithm   Gotoh’s Algorithm
Match                  8                      0
Mismatch              -5                     13
Gap-open              -4                      4
Gap Symbol            -3                      7
Example (2/2)
1. A: ACGGTTCAAG  B: ACGGTTCAAG
2. A: ACGGTTCAAG  B: ACGGATCAAG
3.
Columns: Hirschberg’s Algorithm; Gotoh’s Algorithm, cost C (minimum)
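The values in the example are consistent with a standard similarity-to-cost transformation. The slide's own formula did not survive, so the rule below is an assumption: an aligned pair costs M - score, a gap symbol costs M/2 - score, and gap-open flips sign, where M is the match score. The function names are illustrative only. A minimal sketch:

```python
def to_costs(match, mismatch, gap_open, gap_symbol):
    """Map similarity scores to difference costs (assumed rule, see lead-in)."""
    return {
        "match": match - match,                # M - s(a, a) = 0
        "mismatch": match - mismatch,          # M - s(a, b)
        "gap_open": -gap_open,                 # penalty becomes a positive cost
        "gap_symbol": match / 2 - gap_symbol,  # M/2 - score of one gap symbol
    }


def ungapped_score(a, b, match, mismatch):
    """Similarity score of two equal-length sequences aligned without gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def ungapped_cost(a, b, costs):
    """Difference cost of the same gap-free alignment."""
    return sum(costs["match"] if x == y else costs["mismatch"] for x, y in zip(a, b))


costs = to_costs(match=8, mismatch=-5, gap_open=-4, gap_symbol=-3)
score = ungapped_score("ACGGTTCAAG", "ACGGATCAAG", 8, -5)
cost = ungapped_cost("ACGGTTCAAG", "ACGGATCAAG", costs)
```

Under this rule, score and cost of the same alignment sum to M(m + n)/2, so maximizing one is the same as minimizing the other.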
R 黃博平
Some notations. $A_i$: the i-symbol prefix of A. $B_j$: the j-symbol prefix of B. $C(i, j)$: minimum cost of a conversion of $A_i$ to $B_j$.
Simple gap (1/4) A gap of length k costs gap(k) = h*k
Simple gap (2/4) [Figure: full dynamic-programming table for AAG against AGTAC] Space = O(n^2)
Simple gap(3/4) m/2
Simple gap(4/4) Forward score and backward score Space: O(m+n)
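The forward and backward scores can be sketched as follows for the simple gap cost gap(k) = h*k. This is a minimal illustration, not the presenters' code: the function names and the unit mismatch cost `unit` are assumptions.

```python
def last_row(a, b, h, w):
    """Cost of converting a to every prefix of b, keeping a single row."""
    row = [j * h for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        diag, row[0] = row[0], i * h           # diag holds C(i-1, j-1)
        for j, y in enumerate(b, start=1):
            best = min(row[j] + h,             # delete x
                       row[j - 1] + h,         # insert y
                       diag + w(x, y))         # replace x by y
            diag, row[j] = row[j], best
    return row


def split_column(a, b, h, w):
    """Find where an optimal path crosses row m/2 using O(m + n) space."""
    mid = len(a) // 2
    n = len(b)
    fwd = last_row(a[:mid], b, h, w)                   # forward scores
    bwd = last_row(a[mid:][::-1], b[::-1], h, w)       # backward scores
    totals = [fwd[j] + bwd[n - j] for j in range(n + 1)]
    j_star = min(range(n + 1), key=totals.__getitem__)
    return mid, j_star, totals[j_star]


def unit(x, y):
    """Assumed replacement cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


mid, j_star, total = split_column("AGTAC", "AAG", 1, unit)
```

The minimum of the forward-plus-backward sums equals the cost of the full conversion, which is what lets the problem be split at row m/2.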
Affine gap (1/8) A gap of length k costs gap(k) = g + k*h. Example: A T A A C T C G A A T C - - T
Affine gap (2/8) $C(i, j)$: minimum cost of a conversion of $A_i$ to $B_j$. $D(i, j)$: minimum cost of a conversion of $A_i$ to $B_j$ that deletes $a_i$. $I(i, j)$: minimum cost of a conversion of $A_i$ to $B_j$ that inserts $b_j$.
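Affine gap (3/8)
$$C(i,j) = \begin{cases} \min\{D(i,j),\ I(i,j),\ C(i-1,j-1) + w(a_i,b_j)\} & \text{if } i > 0 \text{ and } j > 0 \\ \mathrm{gap}(j) & \text{if } i = 0 \text{ and } j > 0 \\ \mathrm{gap}(i) & \text{if } i > 0 \text{ and } j = 0 \\ 0 & \text{if } i = 0 \text{ and } j = 0 \end{cases}$$
Affine gap (4/8)
$$D(i,j) = \begin{cases} \min\{D(i-1,j),\ C(i-1,j) + g\} + h & \text{if } i > 0 \text{ and } j > 0 \\ C(0,j) + g & \text{if } i = 0 \text{ and } j > 0 \end{cases}$$
Affine gap (5/8)
$$I(i,j) = \begin{cases} \min\{I(i,j-1),\ C(i,j-1) + g\} + h & \text{if } i > 0 \text{ and } j > 0 \\ C(i,0) + g & \text{if } i > 0 \text{ and } j = 0 \end{cases}$$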
Affine gap(6/8)
Affine gap (7/8) [Figure: the C, D, and I tables for A = AGTAC and B = AAG]
Affine gap (8/8) [Figure: the I, D, and C tables for A = AGTAC and B = AAG]
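The three tables can be filled in directly from the definitions of C, D, and I. The sketch below uses O(MN) space and is only an illustration: the function name, the unit mismatch cost, and the boundary values D(0, j) = C(0, j) + g and I(i, 0) = C(i, 0) + g are assumptions chosen to match the case conditions on the slides.

```python
INF = float("inf")


def gotoh_tables(a, b, g, h, w):
    """Fill the full C, D, I tables for gap(k) = g + k*h."""
    m, n = len(a), len(b)
    C = [[0.0] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]   # D(i, 0) is never read
    I = [[INF] * (n + 1) for _ in range(m + 1)]   # I(0, j) is never read
    for j in range(1, n + 1):
        C[0][j] = g + j * h                       # insert b_1..b_j
        D[0][j] = C[0][j] + g
    for i in range(1, m + 1):
        C[i][0] = g + i * h                       # delete a_1..a_i
        I[i][0] = C[i][0] + g
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j], C[i - 1][j] + g) + h   # extend or open a delete
            I[i][j] = min(I[i][j - 1], C[i][j - 1] + g) + h   # extend or open an insert
            C[i][j] = min(D[i][j], I[i][j],
                          C[i - 1][j - 1] + w(a[i - 1], b[j - 1]))
    return C, D, I


def unit(x, y):
    """Assumed replacement cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


C, D, I = gotoh_tables("AGTAC", "AAG", 2.0, 0.5, unit)
```

With g = 2.0, h = 0.5 and the assumed unit mismatch cost, `C[5][3]` is the cost of converting AGTAC to AAG.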
R 陳彥璋
Observation: the i-th rows of C and D depend only on rows i and i-1. The i-th row of I depends only on row i.
Linear Space: use two one-dimensional arrays (CC and DD) and three variables.
Linear Space
Algorithm
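A minimal sketch of the cost-only computation with two vectors and a few scalars. The names CC, DD, s, c, e, and t follow the slides; the function name and the unit mismatch cost are assumptions, and the code is an illustration rather than the presenters' implementation.

```python
def conversion_cost(a, b, g, h, w):
    """Minimum conversion cost of a to b in O(N) space, gap(k) = g + k*h."""
    n = len(b)
    CC = [0.0] * (n + 1)
    DD = [0.0] * (n + 1)          # DD[0] is never read
    t = g
    for j in range(1, n + 1):     # row 0: insert b_1..b_j
        t += h
        CC[j] = t
        DD[j] = t + g
    t = g
    for x in a:                   # rows 1..M
        s = CC[0]                 # s holds C(i-1, j-1)
        t += h
        c = CC[0] = t             # c holds C(i, j-1); column 0 is gap(i)
        e = t + g                 # e holds I(i, j-1)
        for j in range(1, n + 1):
            e = min(e, c + g) + h
            DD[j] = min(DD[j], CC[j] + g) + h
            c = min(DD[j], e, s + w(x, b[j - 1]))
            s = CC[j]
            CC[j] = c
    return CC[n]


def unit(x, y):
    """Assumed replacement cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


cost = conversion_cost("AGTAC", "AAG", 2.0, 0.5, unit)
```

With g = 2.0 and h = 0.5, t equals 4.5 after the fifth row is started, matching the state shown on the slides.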
[Figure sequence: the linear-space computation on A = AGTAC and B = AAG with g = 2.0 and h = 0.5, showing the C, D, and I tables next to the vectors CC and DD and the scalars s, c, e, and t. Initially t = 2.0. At i = 5, t = 4.5, and the inner loop starts at j = 1, with s, c, and e moving along the row. The last entry of CC is the optimal conversion cost.]
What is the conversion of AGTAC to AAG?
B 王柏易
Midpoint. Hirschberg (1975): recursive divide-and-conquer. Backward Computing; Forward Computing.
Gap Penalty [Figure: cell (i, j) and its neighbors (i-1, j-1), (i, j-1), and (i-1, j)]
Gap Penalty. $CC(j)$ = minimum cost of a conversion of $A_{i^*}$ to $B_j$. $DD(j)$ = minimum cost of a conversion of $A_{i^*}$ to $B_j$ that ends with a delete.
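Gap Penalty. $RR(N - j)$ = minimum cost of a conversion of $A_{i^*}^T$ to $B_j^T$. $SS(N - j)$ = minimum cost of a conversion of $A_{i^*}^T$ to $B_j^T$ that begins with a delete. Here the superscript T denotes a suffix: $A_i^T = a_{i+1} \dots a_M$ and $B_j^T = b_{j+1} \dots b_N$.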
Find Midpoint with Gap Penalty Backward Computing Forward Computing How to compute the midpoint?
R 李政緯
Midpoint. The difficulty in calculating the midpoint is that concatenating two partial conversions may coalesce two gaps into one, in which case the gap-open cost g is charged twice. This suggests considering min{CC + RR, DD + SS - g, II + JJ - g}.
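Midpoint. Recall that the algorithm above saves space by not storing II and JJ. The II + JJ - g term is not needed: an insertion run lies entirely within row i*, so taking j at the left end of the run leaves the whole gap in the backward half, and CC + RR already charges g only once. The expression reduces to min{CC + RR, DD + SS - g}.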
Midpoint. Remember that we should find $\min_{j \in [0, N]} \min\{CC(j) + RR(N-j),\ DD(j) + SS(N-j) - g,\ II(j) + JJ(N-j) - g\}$. [Figure: row i* with columns j and j+1]
Midpoint. Type 1 recurrence; Type 2 recurrence. [Figure: the two cases at (i*, j*)]
Example: A = agtac, B = aag, i* = 2
agtac
a__ag
Recursive calls on (a, a) and (ac, ag)
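The midpoint search can be sketched as below for the top-level call, using min{CC + RR, DD + SS - g}. The function names, the unit mismatch cost, and the restriction to sequences with at least two symbols in A are assumptions. The sketch only locates the midpoint; it does not perform the recursion or emit the conversion.

```python
def forward_pass(a, b, g, h, w):
    """Return (CC, DD) for converting all of a to each prefix of b."""
    n = len(b)
    CC = [0.0] * (n + 1)
    DD = [0.0] * (n + 1)
    t = g
    for j in range(1, n + 1):
        t += h
        CC[j] = t
        DD[j] = t + g
    t = g
    for x in a:
        s = CC[0]
        t += h
        c = CC[0] = t
        e = t + g
        for j in range(1, n + 1):
            e = min(e, c + g) + h
            DD[j] = min(DD[j], CC[j] + g) + h
            c = min(DD[j], e, s + w(x, b[j - 1]))
            s = CC[j]
            CC[j] = c
    DD[0] = CC[0]                 # column 0 is a pure delete
    return CC, DD


def midpoint(a, b, g, h, w):
    """Return (i_star, j_star, kind, cost); assumes len(a) >= 2."""
    i_star = len(a) // 2
    n = len(b)
    CC, DD = forward_pass(a[:i_star], b, g, h, w)
    RR, SS = forward_pass(a[i_star:][::-1], b[::-1], g, h, w)   # reversed suffixes
    best = None
    for j in range(n + 1):
        for kind, cost in ((1, CC[j] + RR[n - j]),
                           (2, DD[j] + SS[n - j] - g)):   # one gap spans row i*
            if best is None or cost < best[0]:
                best = (cost, j, kind)
    return i_star, best[1], best[2], best[0]


def unit(x, y):
    """Assumed replacement cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


result = midpoint("agtac", "aag", 2.0, 0.5, unit)
```

For the example with g = 2.0 and h = 0.5, the minimum is a type 2 midpoint at j* = 1, which agrees with the recursive calls on (a, a) and (ac, ag).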
R 涂宗瑋
Implementation: Storage Requirement; Memory vs. Sequence length; Compared with classic dynamic programming algorithm
Storage Requirement (1/4) Vectors CC, DD, RR, and SS: 4N words. An optimal conversion: M + N words. M = N = 38 40
Storage Requirement (2/4) 128*128 words for the table w of replacement costs
w            ASCII[1]   ASCII[2]   ASCII[3]   ASCII[4]   ASCII[…]   ASCII[128]
ASCII[1]     W1,1       W1,2       W1,3       W1,4       W1,…       W1,128
ASCII[2]     W2,1       W2,2       W2,3       W2,4       W2,…       W2,128
ASCII[3]     W3,1       W3,2       W3,3       W3,4       W3,…       W3,128
ASCII[4]     W4,1       W4,2       W4,3       W4,4       W4,…       W4,128
ASCII[…]     W…,1       W…,2       W…,3       W…,4       W…,…       W…,128
ASCII[128]   W128,1     W128,2     W128,3     W128,4     W128,…     W128,128
Storage Requirement (3/4) 16 words (4*4) for the table w of replacement costs
     A        T        C        G
A    W(A,A)   W(A,T)   W(A,C)   W(A,G)
T    W(T,A)   W(T,T)   W(T,C)   W(T,G)
C    W(C,A)   W(C,T)   W(C,C)   W(C,G)
G    W(G,A)   W(G,T)   W(G,C)   W(G,G)
Storage Requirement (4/4) M + N bytes for the sequences A and B. A and B could be compressed: for DNA sequences, only 2(M + N) bits are necessary.
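The 2-bits-per-symbol figure for DNA can be illustrated with a small packing routine. This is only a sketch: the code assignment A=0, C=1, G=2, T=3 and the function names are assumptions, and the sequence length must be kept separately.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit codes
LETTER = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four symbols per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for k, ch in enumerate(seq):
        out[k // 4] |= CODE[ch] << (2 * (k % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` symbols from packed bytes."""
    return "".join(LETTER[(data[k // 4] >> (2 * (k % 4))) & 3]
                   for k in range(length))


packed = pack("ACGGTTCAAG")
restored = unpack(packed, 10)
```

Ten symbols fit in three bytes here, compared with ten bytes at one byte per symbol.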
Memory vs. Sequence length: maximum length of sequences that can be aligned in a given amount of memory. Altschul and Erickson: 7MN-bit approach.
Memory (bytes)   Linear Space (w/o op.)   Linear Space (with op.)   Altschul and Erickson
64K              k                        k                         k
                 N = Memory / (4*4)       N = Memory / (6*4)        N = sqrt(Memory * 8 / 7)
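The three formulas can be evaluated directly. The sketch below assumes M = N and reads the factor 4 in "4*4" and "6*4" as a 4-byte word, which is an inference from the slide rather than something it states; the function name is illustrative.

```python
import math


def max_lengths(memory_bytes, word_bytes=4):
    """Largest N (with M = N) that fits in memory under each approach."""
    without_output = memory_bytes // (4 * word_bytes)    # CC, DD, RR, SS: 4N words
    with_output = memory_bytes // (6 * word_bytes)       # plus M + N = 2N words
    altschul_erickson = math.isqrt(memory_bytes * 8 // 7)   # 7MN bits in total
    return without_output, with_output, altschul_erickson


small = max_lengths(64 * 1024)
large = max_lengths(10**6)
```

The linear-space limits grow in proportion to memory, while the 7MN-bit limit grows only with its square root.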
Compared with classic dynamic programming algorithm: the classic dynamic programming algorithm is that of Wagner and Fischer (1974).
Compared with classic dynamic programming algorithm. Space: classic dynamic programming algorithm: O(MN); linear-space algorithm: O(N + lg M). Time: both O(MN), but in practice the linear-space algorithm is slower than the classic dynamic programming algorithm. Running time, linear-space : classic DP = 2.84 : 1.
R 林澤豪
Reduce problem [Figure: dynamic-programming grid for CGGATCAT against CTTAACT]
Reduce problem(cont.)
Reduce problem (cont.) [Figure: partition line at row m/2]