Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao
GenBank 200.0
GenBank 215.0
GenBank 220.0
orz’s sequence evolution
orz (kid), OTZ (adult), Orz (big head), Crz (motorcycle driver), on_ (soldier), or2 (bottom up), oΩ (back high), STO (the other way around), Oroz (me)
the origin?
their evolutionary relationships?
their putative functional relationships?
What? The truth is more important than the facts. THETR UTHIS MOREI
Dot Matrix
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
  Sequence A: C---TTAACT
  Sequence B: CGGATCA--T
Each column is a match, a mismatch, or part of a gap (an insertion gap or a deletion gap).
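Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: grid graph with B = C G G A T C A T across the columns and A = C T T A A C T down the rows; a path through the grid corresponds to the alignment C---TTAACT / CGGATCA--T.]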
A simple scoring scheme
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)

   C   -   -   -   T   T   A   A   C   T
   C   G   G   A   T   C   A   -   -   T
  +8  -3  -3  -3  +8  -5  +8  -3  -3  +8   = +12 (alignment score)
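The scoring scheme above can be sketched in a few lines of Python. This is only an illustration; the names `w` and `alignment_score` are chosen here and do not come from the slides.

```python
def w(x: str, y: str) -> int:
    """Column score under the slide's scheme: match +8, mismatch -5, gap symbol -3."""
    if x == "-" or y == "-":
        return -3
    return 8 if x == y else -5


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```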
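An optimal alignment -- the alignment of maximum score
Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
With the initializations $S_{0,0} = 0$, $S_{i,0} = S_{i-1,0} + w(a_i, -)$, and $S_{0,j} = S_{0,j-1} + w(-, b_j)$, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$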
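Computing $S_{i,j}$
[Figure: the entry in row i, column j is reached from (i-1, j-1) with weight w(ai, bj), from (i-1, j) with weight w(ai, -), and from (i, j-1) with weight w(-, bj); the optimal alignment score is Sm,n, the last entry of the table.]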
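A minimal sketch of the tabular computation, assuming the match/mismatch/gap values used on these slides as defaults. The function name `global_alignment_table` is hypothetical.

```python
def global_alignment_table(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the (m+1) x (n+1) table S, where S[i][j] is the best score
    for aligning a[:i] with b[:j] under a linear gap penalty."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,       # a_i aligned with a gap
                S[i][j - 1] + gap,       # b_j aligned with a gap
                S[i - 1][j - 1] + pair,  # a_i aligned with b_j
            )
    return S


S = global_alignment_table("CTTAACT", "CGGATCAT")
print(S[3][5], S[7][8])  # 5 14
```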
Initializations
Match: 8, Mismatch: -5, Gap symbol: -3
[Table: score table for A = CTTAACT (rows) and B = CGGATCAT (columns) with only the boundary filled in; the entries under C G G A T C A T in the first row are -3 -6 -9 -12 -15 -18 -21 -24.]
S3,5 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
[Table: partially filled score table for A = CTTAACT (rows) and B = CGGATCAT (columns).]
Candidates for S3,5: S2,5 + w(a3, -) = 7 - 3 = 4; S2,4 + w(a3, b5) = -3 + 8 = 5; S3,4 + w(-, b5) = -5 - 3 = -8
S3,5 = 5
Optimal score: S7,8 = 14
An optimal alignment of CTTAACT and CGGATCAT:
   C   T   T   A   A   C   -   T
   C   G   G   A   T   C   A   T
  +8  -5  -5  +8  -5  +8  -3  +8   = 14
[Table: completed score table for A = CTTAACT (rows) and B = CGGATCAT (columns); the optimal score is 14.]
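An optimal alignment can be recovered from the filled table by walking back from the last cell. The sketch below is one way to do it; the name `global_align` and the tie-breaking order (diagonal first, then up, then left) are choices made here, so other equally good alignments may exist.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + pair)

    # Traceback: at each cell, find a predecessor that explains its value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("CTTAACT", "CGGATCAT"))  # (14, 'CTTAAC-T', 'CGGATCAT')
```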
Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal alignment?
Initializations
Match: 8, Mismatch: -5, Gap symbol: -3
[Table: score table for A = CAATTGA (rows) and B = GAATCTGC (columns) with only the boundary filled in; the entries under G A A T C T G C in the first row are -3 -6 -9 -12 -15 -18 -21 -24.]
S4,2 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
[Table: partially filled score table for A = CAATTGA (rows) and B = GAATCTGC (columns).]
Candidates for S4,2: S3,2 + w(a4, -) = 0 - 3 = -3; S3,1 + w(a4, b2) = -11 - 5 = -16; S4,1 + w(-, b2) = -14 - 3 = -17
S4,2 = -3
S5,5 = ?
Candidates for S5,5: S4,5 + w(a5, -) = 16 - 3 = 13; S4,4 + w(a5, b5) = 19 - 5 = 14; S5,4 + w(-, b5) = 16 - 3 = 13
S5,5 = 14
Optimal score: S7,8 = 27
An optimal alignment of CAATTGA and GAATCTGC:
   C   A   A   T   -   T   G   A
   G   A   A   T   C   T   G   C
  -5  +8  +8  +8  -3  +8  +8  -5   = 27
[Table: completed score table for A = CAATTGA (rows) and B = GAATCTGC (columns); the optimal score is 27.]
Global Alignment vs. Local Alignment
Maximum-sum interval
Given a sequence of real numbers $a_1 a_2 \ldots a_n$, find a consecutive subsequence with the maximum sum.
9 -3 1 7 -15 2 3 -4 2 -7 6 -2 8 4 -9
For each position, we can compute the maximum-sum interval ending at that position in $O(n)$ time. Therefore, a naive algorithm runs in $O(n^2)$ time.
Computing a segment sum in O(1) time? Input: a sequence of real numbers a1a2…an Query: the sum of ai ai+1…aj
Computing a segment sum in O(1) time
prefix-sum(i) = a1 + a2 + … + ai
All n prefix sums are computable in O(n) time.
sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
[Figure: the segment from i to j is what remains of prefix-sum(j) after prefix-sum(i-1) is removed.]
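As a small illustration of the prefix-sum idea, assuming 1-based positions as on the slide and prefix-sum(0) = 0 (the helper names are made up for this sketch):

```python
from itertools import accumulate


def prefix_sums(a):
    """Return p with p[i] = a_1 + ... + a_i and p[0] = 0, built in O(n) time."""
    return [0, *accumulate(a)]


def segment_sum(p, i, j):
    """Sum of a_i..a_j (1-based, inclusive) in O(1) time."""
    return p[j] - p[i - 1]


a = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
p = prefix_sums(a)
print(segment_sum(p, 11, 14))  # 6 - 2 + 8 + 4 = 16
```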
Maximizing sum(i, j): O(n)-time Method 1
sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
For each location j, prefix-sum(j) is fixed. The maximum-sum interval ending at position j is therefore obtained by finding the minimum of prefix-sum(i-1) over all i ≤ j, where prefix-sum(0) = 0.
[Figure: the segment from i to j is what remains of prefix-sum(j) after prefix-sum(i-1) is removed.]
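Method 1 can be sketched as a single pass that keeps the smallest prefix sum seen so far. The function name and the 1-based return convention are assumptions of this sketch.

```python
def max_sum_interval_by_prefix(a):
    """Return (best_sum, i, j), 1-based inclusive, using running prefix sums."""
    if not a:
        raise ValueError("sequence must be non-empty")
    prefix = 0
    min_prefix, min_pos = 0, 0  # smallest prefix-sum(i-1) so far and its index i-1
    best = None
    for j, x in enumerate(a, start=1):
        prefix += x  # prefix-sum(j)
        candidate = prefix - min_prefix  # best interval ending at j
        if best is None or candidate > best[0]:
            best = (candidate, min_pos + 1, j)
        if prefix < min_prefix:  # becomes available for later end positions
            min_prefix, min_pos = prefix, j
    return best


a = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
print(max_sum_interval_by_prefix(a))  # (16, 11, 14)
```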
Maximum-sum interval (The recurrence relation): O(n)-time Method 2
Define S(i) to be the maximum sum of the intervals ending at position i.
$$S(i) = a_i + \max\{S(i-1),\ 0\}$$
If S(i-1) < 0, concatenating ai with its previous interval gives a smaller sum than ai itself.
Maximum-sum interval (Tabular computation)
ai  :   9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i):   9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
The maximum sum: 16
Maximum-sum interval (Traceback)
ai  :   9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i):   9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
The maximum-sum interval: 6 -2 8 4
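The tabular computation and traceback of Method 2 might look like the following sketch. The function name is hypothetical, and recording the start position during the forward pass is a choice made here in place of a separate backward scan.

```python
def max_sum_interval(a):
    """Return (S, best_sum, i, j): the table S(1..n) and the best interval
    (1-based, inclusive), using S(i) = a_i + max(S(i-1), 0)."""
    if not a:
        raise ValueError("sequence must be non-empty")
    S, starts = [], []
    for pos, x in enumerate(a, start=1):
        if S and S[-1] >= 0:
            S.append(S[-1] + x)        # extend the previous interval
            starts.append(starts[-1])
        else:
            S.append(x)                # previous sum is negative: start afresh
            starts.append(pos)
    end = max(range(len(a)), key=lambda k: S[k])  # first position of the maximum
    return S, S[end], starts[end], end + 1


a = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
S, best, i, j = max_sum_interval(a)
print(best, a[i - 1:j])  # 16 [6, -2, 8, 4]
```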
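An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at (i, j) between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
With the initializations $S_{i,0} = S_{0,j} = 0$, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \begin{cases} 0 \\ S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$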
Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Table: partially filled local-alignment score table for A = CTTAACT (rows) and B = CGGATCAT (columns).]
Candidates for the entry marked "?" (S4,7): 0; S3,7 + w(a4, -) = 2 - 3 = -1; S3,6 + w(a4, b7) = 5 + 8 = 13; S4,6 + w(-, b7) = 3 - 3 = 0
S4,7 = 13
The best score: 18
An optimal local alignment of CTTAACT and CGGATCAT:
   A   -   C   -   T
   A   T   C   A   T
  +8  -3  +8  -3  +8   = 18
[Table: completed local-alignment score table for A = CTTAACT (rows) and B = CGGATCAT (columns); the best score is 18.]
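A sketch of the local-alignment computation with traceback, under the same scoring defaults. The name `local_align` and the tie-breaking order are assumptions of this example; the traceback stops at the first cell whose score is 0.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # boundary stays 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + pair)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = bi, bj
    while S[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```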
Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal local alignment?
Did you get it right?
[Table: completed local-alignment score table for A = CAATTGA (rows) and B = GAATCTGC (columns); the largest entry is 37.]
An optimal local alignment of CAATTGA and GAATCTGC:
   A   A   T   -   T   G
   A   A   T   C   T   G
  +8  +8  +8  -3  +8  +8   = 37
[Table: completed local-alignment score table for A = CAATTGA (rows) and B = GAATCTGC (columns); the best score is 37.]
Osamu Gotoh
Affine gap penalties
Match: +8 (w(a, b) = 8, if a = b)
Mismatch: -5 (w(a, b) = -5, if a ≠ b)
Each gap symbol: -3 (w(-, b) = w(a, -) = -3)
Each gap is charged an extra gap-open penalty: -4.

   C   -   -   -   T   T   A   A   C   T
   C   G   G   A   T   C   A   -   -   T
  +8  -3  -3  -3  +8  -5  +8  -3  -3  +8   = +12
Two gaps, each charged -4. Alignment score: 12 - 4 - 4 = 4
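To make the gap-open charge concrete, here is a sketch that scores a given alignment column by column and adds the extra charge whenever a new gap starts. The function and parameter names are illustrative only.

```python
def affine_alignment_score(row_a, row_b, match=8, mismatch=-5,
                           gap_symbol=-3, gap_open=-4):
    """Score two equal-length alignment rows; each maximal run of gap
    symbols in one row counts as one gap and pays gap_open once."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score, prev = 0, None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of two gap symbols")
        state = "insertion" if x == "-" else "deletion" if y == "-" else "pair"
        if state == "pair":
            score += match if x == y else mismatch
        else:
            score += gap_symbol
            if state != prev:  # first symbol of a new gap
                score += gap_open
        prev = state
    return score


print(affine_alignment_score("C---TTAACT", "CGGATCA--T"))  # 4
```

Calling the same function with `gap_symbol=0` charges only the fixed amount per gap, whatever its length.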
Affine gap penalties
A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
This is the same as the scoring scheme that penalizes the first symbol of a gap x + y and each extended symbol y.
Three cases for alignment endings:
1. ...x / ...x (an aligned pair)
2. ...x / ...- (a deletion)
3. ...- / ...x (an insertion)
Affine gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
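Affine gap penalties (A gap of length k is penalized x + k·y.)
$$D(i,j) = \max \begin{cases} D(i-1,j) - y \\ S(i-1,j) - x - y \end{cases}$$
$$I(i,j) = \max \begin{cases} I(i,j-1) - y \\ S(i,j-1) - x - y \end{cases}$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + w(a_i, b_j) \\ D(i,j) \\ I(i,j) \end{cases}$$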
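Affine gap penalties
[Figure: each cell holds three values S, I, and D. D is reached from the D of the cell above with -y or from the S of the cell above with -x-y; I is reached from the I of the cell to the left with -y or from the S of the cell to the left with -x-y; S is reached from the S of the diagonal cell with w(ai, bj), or from the D or I of the same cell.]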
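A sketch of the three-table computation for affine gap penalties, with x and y given as positive penalties. The function name and the boundary initialization (a leading gap of length k costs x + k·y) are assumptions made for this example.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, x=4, y=3):
    """Best global alignment score when a gap of length k costs x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):  # a[:i] against nothing: one deletion gap
        D[i][0] = -x - i * y
        S[i][0] = D[i][0]
    for j in range(1, n + 1):  # b[:j] against nothing: one insertion gap
        I[0][j] = -x - j * y
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)  # extend or open
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)  # extend or open
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


# With x = 0 the scheme reduces to the linear gap penalty used earlier.
print(affine_global_score("CTTAACT", "CGGATCAT", x=0, y=3))  # 14
```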
Constant gap penalties
Match: +8 (w(a, b) = 8, if a = b)
Mismatch: -5 (w(a, b) = -5, if a ≠ b)
Each gap symbol: 0 (w(-, b) = w(a, -) = 0)
Each gap is charged a constant penalty: -4.

   C   -   -   -   T   T   A   A   C   T
   C   G   G   A   T   C   A   -   -   T
  +8   0   0   0  +8  -5  +8   0   0  +8   = +27
Two gaps, each charged -4. Alignment score: 27 - 4 - 4 = 19
Constant gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
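Constant gap penalties
With a constant penalty x per gap (the affine scheme x + k·y with y = 0):
$$D(i,j) = \max \begin{cases} D(i-1,j) \\ S(i-1,j) - x \end{cases}$$
$$I(i,j) = \max \begin{cases} I(i,j-1) \\ S(i,j-1) - x \end{cases}$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + w(a_i, b_j) \\ D(i,j) \\ I(i,j) \end{cases}$$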
Restricted affine gap penalties
A gap of length k is penalized x + f(k)·y, where f(k) = k for k ≤ c and f(k) = c for k > c.
Five cases for alignment endings:
1. ...x / ...x (an aligned pair)
2. ...x / ...- (a deletion)
3. ...- / ...x (an insertion)
4. and 5. for long gaps
Restricted affine gap penalties
D(i, j) vs. D’(i, j)
Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length ≤ c. Then D(i, j) ≥ D’(i, j).
Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length ≥ c. Then D(i, j) ≤ D’(i, j).
Max{S(i,j) - x - k·y, S(i,j) - x - c·y} = S(i,j) - x - f(k)·y, because f(k) = min(k, c) and y ≥ 0.
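The identity above can be checked numerically. The sketch below assumes y ≥ 0 and uses a hypothetical helper `restricted_gap_penalty` for x + f(k)·y.

```python
def restricted_gap_penalty(k, x, y, c):
    """Penalty x + f(k)*y with f(k) = k for k <= c and f(k) = c for k > c."""
    return x + min(k, c) * y


def best_of_two(score, k, x, y, c):
    """The larger of the ordinary affine charge and the capped charge."""
    return max(score - x - k * y, score - x - c * y)


# For every gap length, taking the maximum of the two charges is the same
# as applying the restricted penalty directly.
x, y, c, score = 4, 3, 5, 100
for k in range(1, 12):
    assert best_of_two(score, k, x, y, c) == score - restricted_gap_penalty(k, x, y, c)
print(restricted_gap_penalty(3, x, y, c), restricted_gap_penalty(9, x, y, c))  # 13 19
```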