Chapter 4 The Sequence Alignment Problem

The Sequence Alignment Problem
Introduction
–What, Who, Where, Why, When, How
The Sequence Alignment Problem
The Local Alignment Problem
The Affine Gap Penalty

Introduction
What
–Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.
–Output: The alignment of S1, S2, …, Sn that has the optimal score.
Who
–Biologists want to know the secrets of DNA sequences.
–Computer scientists take it as an interesting problem.

Introduction (Cont’)
Where
–Bioinformatics.
Why
–To determine how close two species are.
–Data compression.
When
–Constructing evolutionary trees.
How
–This is why we are here.
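
The Sequence Alignment Problem
S1 = GAACTG, S2 = GAGCTG
A scoring function f is
–+2 if the i-th character of S1 is aligned with the j-th character of S2 and the two characters are equal
–-1 otherwise (a mismatch, or a character aligned with a space).
Alignment 1:
S1 = GAACTG---
S2 = GA---GCTG
Score = 3 x (+2) + 6 x (-1) = 0
Alignment 2:
S1 = GAACTG
S2 = GAGCTG
Score = 5 x (+2) + 1 x (-1) = 9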

The Dynamic Programming Approach
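A minimal sketch of a dynamic programming solution, assuming the scoring function of the previous example (+2 for a match, -1 for a mismatch or a space). Here A[i][j] is the best score for aligning the first i characters of S1 with the first j characters of S2, and each cell is the maximum over the three ways the last column can look. The function and parameter names are illustrative, not taken from the slides.

```python
def global_alignment_score(s1, s2, match=2, other=-1):
    """Best global alignment score under a match / otherwise scoring function."""
    n, m = len(s1), len(s2)
    # A[i][j] = best score for aligning s1[:i] with s2[:j]
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = i * other          # s1[:i] aligned against spaces only
    for j in range(1, m + 1):
        A[0][j] = j * other          # s2[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else other
            A[i][j] = max(
                A[i - 1][j - 1] + pair,   # align s1[i-1] with s2[j-1]
                A[i - 1][j] + other,      # align s1[i-1] with a space
                A[i][j - 1] + other,      # align s2[j-1] with a space
            )
    return A[n][m]


score = global_alignment_score("GAACTG", "GAGCTG")
print(score)  # 9, the score of the second alignment in the example
```

The table has (n+1) x (m+1) cells and each cell takes constant time, so the sketch runs in O(nm) time.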

The Local Alignment Problem
Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.
Output: Subsequences Si’ of consecutive characters of Si (1 <= i <= n) such that the score obtained by aligning the Si’ is the highest among all possible such subsequences of the Si.
Aligning the whole sequences:
S1 = abbbcc
S2 = adddcc
Score = 3 x 2 + 3 x (-1) = 3
Aligning only the best subsequences:
S1’ = cc
S2’ = cc
Score = 2 x 2 = 4

The Local Alignment Problem (Cont’)
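A minimal sketch of local alignment by dynamic programming, assuming the same +2 / -1 scoring function and a Smith-Waterman style recurrence (an assumption; the slides do not name the method). The only changes from the global case are that every cell may restart at 0 and that the answer is the largest cell anywhere in the table. Names are illustrative.

```python
def local_alignment_score(s1, s2, match=2, other=-1):
    """Best score over all pairs of substrings of s1 and s2."""
    n, m = len(s1), len(s2)
    # H[i][j] = best score of an alignment ending at s1[i-1], s2[j-1] (or 0)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else other
            H[i][j] = max(
                0,                        # start a fresh local alignment here
                H[i - 1][j - 1] + pair,   # extend with an aligned pair
                H[i - 1][j] + other,      # extend with a space in s2
                H[i][j - 1] + other,      # extend with a space in s1
            )
            best = max(best, H[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))  # 4, from aligning "cc" with "cc"
```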

The Affine Gap Penalty
Consider the following two sequences.
–S1 = ACTTGATCC
–S2 = AGTTAGTAGTCC
An optimal alignment of this pair under the original scoring function is as follows.
–S1 = ACTT-G-A-TCC
–S2 = AGTTAGTAGTCC
Original score = 12
An alignment that keeps the spaces together in one gap is as follows.
–S1 = ACTT---GATCC
–S2 = AGTTAGTAGTCC
Original score = 6

The Affine Gap Penalty (Cont’)
A gap is caused by a mutational event that removed a run of residues.
A single mutational event is more likely than several events, so one long gap is often preferable to several short gaps.
An affine gap penalty is defined as Pg + k Pe for a gap with k spaces, k >= 1, where Pg, Pe >= 0. Pg is charged once for opening the gap and Pe once for each space in it.

The Affine Gap Penalty (Cont’)
Using our previous scoring function, further let Pg = 4 and Pe = 1.
Three gaps of one space each:
–S1 = ACTT-G-A-TCC
–S2 = AGTTAGTAGTCC
–Score = 8 x 2 - 1 - 3 x (4 + 1 x 1) = 16 - 1 - 15 = 0
One gap of three spaces:
–S1 = ACTT---GATCC
–S2 = AGTTAGTAGTCC
–Score = 6 x 2 - 3 x 1 - (4 + 3 x 1) = 12 - 3 - 7 = 2
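The two scores can be checked with a short sketch that scores a given pairwise alignment under an affine gap penalty. It assumes +2 for a match, -1 for a mismatch, and Pg + k Pe for each maximal run of k spaces in one row; the function name and parameters are illustrative, and no column is assumed to contain two spaces.

```python
def affine_score(row1, row2, match=2, mismatch=-1, p_g=4, p_e=1):
    """Score an already-aligned pair of rows with an affine gap penalty."""
    assert len(row1) == len(row2)
    score = 0
    gap_row = None  # row holding the space in the previous column, if any
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            current = 1 if x == "-" else 2
            if current != gap_row:
                score -= p_g          # a new gap is opened
            score -= p_e              # every space pays the extension cost
            gap_row = current
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


print(affine_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # 0
print(affine_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # 2
```

With Pg = 0 and Pe = 1 the same function reproduces the original scores of 12 and 6, which shows that the preference for the single long gap comes from the gap-opening term alone.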

The Multiple Sequence Alignment Problem
Consider the following case where three sequences are involved.
S1 = ATTCGAT
S2 = TTGAG
S3 = ATGCT

[Figures: the two-sequence alignment problem; the three-sequence alignment problem.]

A very good alignment of these three sequences is shown as follows.
S1 = ATTCGAT
S2 = -TT-GAG
S3 = AT--GCT
Note that the alignment between every pair of sequences is quite good.
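
The Gusfield Approximation Algorithm for the Sum of Pairs Multiple Sequence Alignment Problem
We define the distance between two sequences Si and Sj induced by an alignment, written d(Si, Sj), as the number of columns of the alignment in which the rows of Si and Sj differ. A character against a space counts as a difference; a space against a space does not.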

d(Si, Sj) has the following characteristics:
(1) d(Si, Si) = 0
(2) d(Si, Sj) + d(Si, Sk) >= d(Sj, Sk)
Given two sequences Si and Sj, the minimum induced distance is denoted as D(Si, Sj).
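A small sketch of the induced distance, assuming that every column in which two rows differ costs 1 (this unit cost is an assumption, chosen because it agrees with the worked distances in this chapter). It is applied to the three-row alignment of ATTCGAT, TTGAG and ATGCT shown earlier and checks the two characteristics on that alignment. The function name is illustrative.

```python
from itertools import permutations


def induced_distance(row_a, row_b):
    """Number of columns in which two rows of the same alignment differ."""
    assert len(row_a) == len(row_b)
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


rows = {"S1": "ATTCGAT", "S2": "-TT-GAG", "S3": "AT--GCT"}

# Characteristic (1): a row is at distance 0 from itself.
zero_on_self = all(induced_distance(r, r) == 0 for r in rows.values())

# Characteristic (2): d(Si,Sj) + d(Si,Sk) >= d(Sj,Sk) for every ordering.
triangle_ok = all(
    induced_distance(rows[i], rows[j]) + induced_distance(rows[i], rows[k])
    >= induced_distance(rows[j], rows[k])
    for i, j, k in permutations(rows, 3)
)

print(induced_distance(rows["S1"], rows["S2"]))  # 3
print(zero_on_self, triangle_ok)                 # True True
```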

S1 = ATGCTC
S2 = AGAGC
S3 = TTCTG
S4 = ATTGCATGC
We align the four sequences in pairs.
S1 = ATGCTC
S2 = A-GAGC
D(S1, S2) = 3
S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3
S1 = AT-GC-T-C
S4 = ATTGCATGC
D(S1, S4) = 3
S2 = AGAGC
S3 = TTCTG
D(S2, S3) = 5
S2 = A--G-A-GC
S4 = ATTGCATGC
D(S2, S4) = 4
S3 = -TT-C-TG-
S4 = ATTGCATGC
D(S3, S4) = 4
D(S1, S2) + D(S1, S3) + D(S1, S4) = 9
D(S2, S1) + D(S2, S3) + D(S2, S4) = 12
D(S3, S1) + D(S3, S2) + D(S3, S4) = 12
D(S4, S1) + D(S4, S2) + D(S4, S3) = 11
Given a set S of k sequences, the center of this set is the sequence Si that minimizes the sum of D(Si, Sj) over all other sequences Sj in S. In this example the center is S1, whose sum is 9.
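The pairwise distances and the choice of center can be reproduced with a short sketch. It assumes that D is the unit-cost edit distance (each mismatch or space costs 1), which agrees with all six values listed for this example; the names are illustrative.

```python
def min_distance(s, t):
    """Minimum induced distance D(s, t) under unit costs, by dynamic programming."""
    prev = list(range(len(t) + 1))      # distances from the empty prefix of s
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(
                prev[j - 1] + (x != y),  # align x with y
                prev[j] + 1,             # align x with a space
                cur[j - 1] + 1,          # align y with a space
            ))
        prev = cur
    return prev[-1]


seqs = {"S1": "ATGCTC", "S2": "AGAGC", "S3": "TTCTG", "S4": "ATTGCATGC"}

# For each candidate center, add up its distances to all other sequences.
totals = {
    a: sum(min_distance(seqs[a], seqs[b]) for b in seqs if b != a)
    for a in seqs
}
center = min(totals, key=totals.get)

print(totals)   # {'S1': 9, 'S2': 12, 'S3': 12, 'S4': 11}
print(center)   # S1
```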

Align S2 with S1
S1 = ATGCTC
S2 = A-GAGC
Add S3 by aligning S3 with S1
S1 = ATGCTC
S3 = -TTCTG
=>
S1 = ATGCTC
S2 = A-GAGC
S3 = -TTCTG

Add S4 by aligning S4 with S1
S1 = AT-GC-T-C
S4 = ATTGCATGC
=>
S1 = AT-GC-T-C
S2 = A--GA-G-C
S3 = -T-TC-T-G
S4 = ATTGCATGC
App <= 2 Opt, that is, the sum-of-pairs distance of this alignment is at most twice that of an optimal alignment.
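The merging step can be sketched as follows: every space that any pairwise alignment puts into the center S1 is kept, and the other rows receive a space in the same column. This is one possible way to implement the merge, not necessarily how the slides implement it, and the helper names are illustrative. The input is the list of pairwise alignments (center row, other row) used above.

```python
def gap_profile(center_row):
    """Spaces before each center character, plus trailing spaces."""
    profile, run = [], 0
    for ch in center_row:
        if ch == "-":
            run += 1
        else:
            profile.append(run)
            run = 0
    profile.append(run)
    return profile


def expand(center_row, other_row, target):
    """Re-space one pairwise alignment so the center has target[k] spaces in slot k."""
    out, slot, run = [], 0, 0
    for c, o in zip(center_row, other_row):
        if c == "-":
            out.append(o)                         # character inserted against the center
            run += 1
        else:
            out.append("-" * (target[slot] - run))  # pad up to the widest gap here
            out.append(o)
            slot += 1
            run = 0
    out.append("-" * (target[slot] - run))
    return "".join(out)


def center_star(pairs):
    """Merge pairwise alignments that all share the same center sequence."""
    profiles = [gap_profile(c) for c, _ in pairs]
    target = [max(col) for col in zip(*profiles)]
    center = expand(pairs[0][0], pairs[0][0], target)
    return [center] + [expand(c, o, target) for c, o in pairs]


pairs = [
    ("ATGCTC", "A-GAGC"),         # S1 with S2
    ("ATGCTC", "-TTCTG"),         # S1 with S3
    ("AT-GC-T-C", "ATTGCATGC"),   # S1 with S4
]
for row in center_star(pairs):
    print(row)
```

For this input the four printed rows are the rows of the final alignment shown above.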

The Minimal Spanning Tree Preservation Approach for Multiple Sequence Alignment
S1 = ATGCTC
S2 = ATGAGC
S3 = TTCTG
S4 = ATGCATGC
Step 1: Find the pairwise distances optimally by the dynamic programming algorithm.
S1 = ATGCTC
S2 = ATGAGC
D(S1, S2) = 2
S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3
S1 = ATGC-T-C
S4 = ATGCATGC
D(S1, S4) = 2
S2 = ATGAGC
S3 = TTCTG-
D(S2, S3) = 4
S2 = ATG-A-GC
S4 = ATGCATGC
D(S2, S4) = 2
S3 = -TTC-TG-
S4 = ATGCATGC
D(S3, S4) = 4

Table: The Distance Matrix D
      S1  S2  S3  S4
S1     0   2   3   2
S2     2   0   4   2
S3     3   4   0   4
S4     2   2   4   0
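A minimal spanning tree of D can be found with any standard algorithm; the sketch below uses Kruskal's algorithm with a small union-find, which is a choice made here and not something the slides specify. Three edges tie at weight 2, so more than one minimal spanning tree exists. The edge order in the dictionary is chosen so that the tie is broken the same way as in this example, giving e(S1, S2), e(S2, S4) and e(S1, S3).

```python
def kruskal(nodes, dist):
    """Return the edges (a, b, weight) of a minimal spanning tree."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # sorted() is stable, so equal weights keep their dictionary order.
    for (a, b), w in sorted(dist.items(), key=lambda item: item[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # the edge joins two components
            parent[root_a] = root_b
            tree.append((a, b, w))
    return tree


D = {
    ("S1", "S2"): 2, ("S2", "S4"): 2, ("S1", "S4"): 2,
    ("S1", "S3"): 3, ("S2", "S3"): 4, ("S3", "S4"): 4,
}
tree = kruskal(["S1", "S2", "S3", "S4"], D)
print(tree)  # [('S1', 'S2', 2), ('S2', 'S4', 2), ('S1', 'S3', 3)]
```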

[Figure: a minimal spanning tree MST(D) on the nodes S1, S2, S4, S3.]
For e(S1, S2)
S1 = ATGCTC
S2 = ATGAGC
For e(S2, S4)
S1 = (ATG-C-TC)
S2 = ATG-A-GC
S4 = ATGCATGC
For e(S1, S3)
S1 = ATG-C-TC
S2 = (ATG-A-GC)
S3 = TT--C-TG
S4 = (ATGCATGC)

Table: The Distance Matrix Dm
[Figure: a minimal spanning tree MST(Dm) on the nodes S1, S2, S3, S4.]
Theorem: MST(D) is equal to MST(Dm).
Corollary: Let e(a, b) and e(c, d) be two edges on MST(D). If D(a, b) < D(c, d), then Dm(a, b) < Dm(c, d).
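The corollary can be checked on this example with a short sketch. It assumes that Dm holds the distances induced by the final four-row alignment, with each differing column costing 1, and it takes the tree edges e(S1, S2), e(S2, S4), e(S1, S3) used in the construction. The variable names are illustrative.

```python
from itertools import combinations

rows = {"S1": "ATG-C-TC", "S2": "ATG-A-GC", "S3": "TT--C-TG", "S4": "ATGCATGC"}
D = {
    ("S1", "S2"): 2, ("S1", "S3"): 3, ("S1", "S4"): 2,
    ("S2", "S3"): 4, ("S2", "S4"): 2, ("S3", "S4"): 4,
}
tree_edges = [("S1", "S2"), ("S2", "S4"), ("S1", "S3")]


def induced(row_a, row_b):
    """Columns in which two rows of the multiple alignment differ."""
    return sum(x != y for x, y in zip(row_a, row_b))


# Distance matrix induced by the multiple alignment.
Dm = {(a, b): induced(rows[a], rows[b]) for a, b in combinations(sorted(rows), 2)}

# An induced distance can never be smaller than the optimal pairwise one.
never_smaller = all(Dm[e] >= D[e] for e in D)

# On the tree edges the optimal distances are kept exactly in this example.
preserved = all(Dm[e] == D[e] for e in tree_edges)

# Corollary: the strict order of tree edges under D carries over to Dm.
order_kept = all(
    Dm[e1] < Dm[e2]
    for e1 in tree_edges for e2 in tree_edges
    if D[e1] < D[e2]
)

print(Dm)
print(never_smaller, preserved, order_kept)  # True True True
```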