1
4 - 1 Chap 4 The Sequence Alignment Problem
2
4 - 2 The Sequence Alignment Problem
- Introduction: What, Who, Where, Why, When, How
- The Sequence Alignment Problem
- The Local Alignment Problem
- The Affine Gap Penalty
3
4 - 3 Introduction
What
- Input: Two (or more) sequences S_1, S_2, …, S_n, and a scoring function f.
- Output: The alignment of S_1, S_2, …, S_n that has the optimal score.
Who
- Biologists want to know the secrets of DNA sequences.
- Computer scientists take it as an interesting problem.
4
4 - 4 Introduction (Cont'd)
Where
- Bioinformatics.
Why
- To determine how close two species are.
- Data compression.
When
- Constructing evolutionary trees.
How
- This is why we are here.
5
4 - 5 The Sequence Alignment Problem
S_1 = GAACTG, S_2 = GAGCTG
A scoring function f is
- +2 if the i-th character of S_1 is aligned with the j-th character of S_2 and the two characters are equal
- -1 otherwise (a mismatch, or a character aligned with a gap).

GAACTG---
GA---GCTG
Score = 3 x (+2) + 6 x (-1) = 0

GAACTG
GAGCTG
Score = 5 x (+2) + 1 x (-1) = 9
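The two scores on this slide can be checked column by column. The following is a minimal sketch, assuming the alignment is given as two equal-length strings with '-' for a space; the function name `score_alignment` is illustrative and not from the slides.

```python
def score_alignment(a: str, b: str, match: int = 2, other: int = -1) -> int:
    """Score two aligned strings of equal length, one column at a time."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        # +2 only when both symbols are the same residue; a mismatch or a
        # column containing a gap '-' scores -1.
        total += match if (x == y and x != "-") else other
    return total


print(score_alignment("GAACTG---", "GA---GCTG"))  # 3 matches, 6 other columns
print(score_alignment("GAACTG", "GAGCTG"))        # 5 matches, 1 mismatch
```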
6
4 - 6 The Dynamic Programming Approach
7
4 - 7 The Dynamic Programming Approach(Cont’)
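The recurrence shown on these two slides did not survive in this transcript. As a stand-in, here is a minimal sketch of the standard global alignment dynamic program under the scoring function from slide 4 - 5 (+2 for a match, -1 for a mismatch or a gap). It is an assumption that the slides use this same formulation; the names `global_alignment_score` and `A` are illustrative.

```python
def global_alignment_score(s1: str, s2: str, match: int = 2, other: int = -1) -> int:
    """Best global alignment score of s1 and s2 (score only, no traceback)."""
    n, m = len(s1), len(s2)
    # A[i][j] = best score for aligning the prefixes s1[:i] and s2[:j].
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = i * other          # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        A[0][j] = j * other          # s2[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = A[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            up = A[i - 1][j] + other     # s1[i-1] aligned with a gap
            left = A[i][j - 1] + other   # s2[j-1] aligned with a gap
            A[i][j] = max(diag, up, left)
    return A[n][m]


print(global_alignment_score("GAACTG", "GAGCTG"))
```

The table has (n+1) x (m+1) cells and each cell takes constant time, so the running time is O(nm).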
8
4 - 8 The Local Alignment Problem
Input: Two (or more) sequences S_1, S_2, …, S_n, and a scoring function f.
Output: Subsequences (contiguous substrings) S_i' of S_i, 1 <= i <= n, such that the score obtained by aligning the S_i' is the highest among all possible choices of substrings of the S_i.

S_1 = abbbcc
S_2 = adddcc
Score = 3 x 2 + 3 x (-1) = 3

S_1' = cc
S_2' = cc
Score = 2 x 2 = 4
9
4 - 9 The Local Alignment Problem(Cont’)
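The content of this slide is missing from the transcript. A common way to solve the local alignment problem is the Smith-Waterman variant of the dynamic program, in which a cell is never allowed to drop below 0. The sketch below assumes that variant and the +2 / -1 scoring function; it is not necessarily the formulation on the slide, and `local_alignment_score` is an illustrative name.

```python
def local_alignment_score(s1: str, s2: str, match: int = 2, other: int = -1) -> int:
    """Best local alignment score of s1 and s2 (score only, no traceback)."""
    n, m = len(s1), len(s2)
    # H[i][j] = best score of a local alignment ending at s1[i-1], s2[j-1];
    # 0 means "start a fresh alignment here".
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            H[i][j] = max(0, diag, H[i - 1][j] + other, H[i][j - 1] + other)
            best = max(best, H[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))  # the pair cc / cc
```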
10
4 - 10 The Affine Gap Penalty
Consider the following two sequences.
- S_1 = ACTTGATCC
- S_2 = AGTTAGTAGTCC
An optimal alignment of this pair under the original scoring function is as follows.
- S_1 = ACTT-G-A-TCC
- S_2 = AGTTAGTAGTCC
Original score = 12
An alignment that keeps the spaces together in a single gap is as follows.
- S_1 = ACTT---GATCC
- S_2 = AGTTAGTAGTCC
Original score = 6
11
4 - 11 The Affine Gap Penalty (Cont'd)
A gap is caused by a mutational event that removed a sequence of residues. A single mutational event is more likely than several events, so one long gap is often preferable to several short gaps.
An affine gap penalty is defined as P_g + k P_e for a gap with k spaces (k >= 1), where P_g, P_e >= 0. P_g is charged once per gap and P_e once per space in the gap.
12
4 - 12 The Affine Gap Penalty (Cont'd)
Using our previous scoring function for matches and mismatches, let P_g = 4 and P_e = 1.
- S_1 = ACTT-G-A-TCC
- S_2 = AGTTAGTAGTCC
- Score = 8 x 2 - 1 - 3 x (4 + 1 x 1) = 16 - 1 - 15 = 0
- S_1 = ACTT---GATCC
- S_2 = AGTTAGTAGTCC
- Score = 6 x 2 - 3 x 1 - (4 + 3 x 1) = 12 - 3 - 7 = 2
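The arithmetic above can be reproduced with a short scoring routine. This is a sketch under the slide's parameters (match +2, mismatch -1, P_g = 4, P_e = 1); it assumes a gap is a maximal run of '-' within one row and that no column has '-' in both rows. The name `affine_score` is illustrative.

```python
def affine_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                 p_g: int = 4, p_e: int = 1) -> int:
    """Score an alignment where a gap of k spaces costs p_g + k * p_e."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap_row = None  # row (0 or 1) holding the gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            if row != prev_gap_row:
                total -= p_g      # a new gap opens: pay P_g once
            total -= p_e          # every space in the gap pays P_e
            prev_gap_row = row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


print(affine_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # three gaps of one space
print(affine_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # one gap of three spaces
```

Under this penalty the single-gap alignment scores higher (2 versus 0), which reverses the ranking given by the original scoring function (6 versus 12).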
13
4 - 13 The Multiple Sequence Alignment Problem
Consider the following case, where three sequences are involved.
S_1 = ATTCGAT
S_2 = TTGAG
S_3 = ATGCT
14
4 - 14 In the two-sequence alignment problem. In the three-sequence alignment problem.
15
4 - 15 A very good alignment of these three sequences is shown as follows.
S_1 = ATTCGAT
S_2 = -TT-GAG
S_3 = AT--GCT
Note that the alignment between every pair of sequences is quite good.
16
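4 - 16 The Gusfield Approximation Algorithm for the Sum of Pairs Multiple Sequence Alignment Problem
We define the distance between two aligned characters as 0 if they are equal and 1 otherwise, which is the definition consistent with the worked examples that follow.
The distance between two sequences induced by the alignment is defined as the sum of these distances over all columns of the alignment.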
17
4 - 17 d(S_i, S_j) has the following characteristics:
(1) d(S_i, S_i) = 0
(2) d(S_i, S_j) + d(S_i, S_k) >= d(S_j, S_k)
Given two sequences S_i and S_j, the minimum induced distance is denoted as D(S_i, S_j).
18
4 - 18
S_1 = ATGCTC
S_2 = AGAGC
S_3 = TTCTG
S_4 = ATTGCATGC
We align the four sequences in pairs.
S_1 = ATGCTC
S_2 = A-GAGC
D(S_1, S_2) = 3
S_1 = ATGCTC
S_3 = TT-CTG
D(S_1, S_3) = 3
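The distance induced by a given alignment can be computed by counting the columns whose two symbols differ. This sketch assumes a cost of 0 for equal symbols and 1 for a mismatch or a residue against a space, which reproduces the values on this slide; `induced_distance` is an illustrative name.

```python
def induced_distance(a: str, b: str) -> int:
    """Number of columns in which two aligned rows differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column costs 1 when the symbols differ (mismatch or residue vs '-'),
    # and 0 when they are equal (including '-' against '-').
    return sum(1 for x, y in zip(a, b) if x != y)


print(induced_distance("ATGCTC", "A-GAGC"))
print(induced_distance("ATGCTC", "TT-CTG"))
```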
19
4 - 19
S_1 = AT-GC-T-C
S_4 = ATTGCATGC
D(S_1, S_4) = 3
S_2 = AGAGC
S_3 = TTCTG
D(S_2, S_3) = 5
S_2 = A--G-A-GC
S_4 = ATTGCATGC
D(S_2, S_4) = 4
20
4 - 20
S_3 = -TT-C-TG-
S_4 = ATTGCATGC
D(S_3, S_4) = 4
D(S_1, S_2) + D(S_1, S_3) + D(S_1, S_4) = 9
D(S_2, S_1) + D(S_2, S_3) + D(S_2, S_4) = 12
D(S_3, S_1) + D(S_3, S_2) + D(S_3, S_4) = 12
D(S_4, S_1) + D(S_4, S_2) + D(S_4, S_3) = 11
Given a set S of k sequences, the center of this set is the sequence S_i that minimizes the sum of D(S_i, X) over all other sequences X in S. Here the center is S_1, with a sum of 9.
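Choosing the center amounts to taking the row of the distance matrix with the smallest sum. A minimal sketch using the six pairwise distances from slides 4 - 18 to 4 - 20; the dictionary layout and the names `dist` and `center` are illustrative.

```python
# Pairwise minimum induced distances taken from the slides.
D = {
    ("S1", "S2"): 3, ("S1", "S3"): 3, ("S1", "S4"): 3,
    ("S2", "S3"): 5, ("S2", "S4"): 4, ("S3", "S4"): 4,
}
names = ["S1", "S2", "S3", "S4"]


def dist(a: str, b: str) -> int:
    """Symmetric lookup, with D(X, X) = 0."""
    if a == b:
        return 0
    return D[(a, b)] if (a, b) in D else D[(b, a)]


def center(seq_names):
    """Sequence whose summed distance to all the others is smallest."""
    return min(seq_names, key=lambda c: sum(dist(c, x) for x in seq_names))


for c in names:
    print(c, sum(dist(c, x) for x in names))
print("center:", center(names))
```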
21
4 - 21 Align S_2 with S_1
S_1 = ATGCTC
S_2 = A-GAGC
Add S_3 by aligning S_3 with S_1
S_1 = ATGCTC
S_3 = -TTCTG
=>
S_1 = ATGCTC
S_2 = A-GAGC
S_3 = -TTCTG
22
4 - 22 Add S_4 by aligning S_4 with S_1
S_1 = AT-GC-T-C
S_4 = ATTGCATGC
=>
S_1 = AT-GC-T-C
S_2 = A--GA-G-C
S_3 = -T-TC-T-G
S_4 = ATTGCATGC
App <= 2 Opt.
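The merging step on slides 4 - 21 and 4 - 22 can be sketched as follows: whenever one pairwise alignment places spaces in the center sequence, the same spaces are inserted into every other row. This is an illustrative sketch, not the slides' own procedure; the names `gaps_before` and `merge_star` are assumptions, and the input is assumed to be a list of (aligned center, aligned other) pairs that all contain the same center sequence.

```python
def gaps_before(aligned_center: str):
    """counts[p] = spaces directly before center residue p; last entry = trailing."""
    counts = [0]
    for ch in aligned_center:
        if ch == "-":
            counts[-1] += 1
        else:
            counts.append(0)
    return counts


def merge_star(pairs):
    """Merge pairwise alignments against a common center into one alignment."""
    center = pairs[0][0].replace("-", "")
    per_pair = [gaps_before(c) for c, _ in pairs]
    # Each slot of the center needs as many spaces as the greediest pair wants.
    need = [max(col) for col in zip(*per_pair)]
    rows = ["".join("-" * need[p] + center[p:p + 1] for p in range(len(need)))]
    for (_, other), have in zip(pairs, per_pair):
        out, i = [], 0
        for p in range(len(need)):
            out.append("-" * (need[p] - have[p]))  # spaces this pair lacks
            out.append(other[i:i + have[p]])       # columns it already has
            i += have[p]
            out.append(other[i:i + 1])             # column of center residue p
            i += 1
        rows.append("".join(out))
    return rows


pairs = [("ATGCTC", "A-GAGC"), ("ATGCTC", "-TTCTG"), ("AT-GC-T-C", "ATTGCATGC")]
for row in merge_star(pairs):
    print(row)
```

The first row returned is the center S_1 and the remaining rows follow the order of `pairs`, so this call prints the four rows of the final alignment on the slide.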
23
4 - 23 The Minimal Spanning Tree Preservation Approach for Multiple Sequence Alignment
S_1 = ATGCTC
S_2 = ATGAGC
S_3 = TTCTG
S_4 = ATGCATGC
Step 1 finds the pairwise distances optimally by the dynamic programming algorithm.
S_1 = ATGCTC
S_2 = ATGAGC
D(S_1, S_2) = 2
24
4 - 24
S_1 = ATGCTC
S_3 = TT-CTG
D(S_1, S_3) = 3
S_1 = ATGC-T-C
S_4 = ATGCATGC
D(S_1, S_4) = 2
S_2 = ATGAGC
S_3 = TTCTG-
D(S_2, S_3) = 4
25
4 - 25
S_2 = ATG-A-GC
S_4 = ATGCATGC
D(S_2, S_4) = 2
S_3 = -TTC-TG-
S_4 = ATGCATGC
D(S_3, S_4) = 4
Table: The Distance Matrix D
       S_1  S_2  S_3  S_4
S_1     0    2    3    2
S_2     2    0    4    2
S_3     3    4    0    4
S_4     2    2    4    0
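Step 1 can be sketched as a minimizing dynamic program. Assuming a cost of 1 for each mismatch and each space (the cost consistent with the values above), the minimum induced distance is the classic edit distance. The function name `min_induced_distance` is illustrative.

```python
def min_induced_distance(s1: str, s2: str) -> int:
    """Minimum number of mismatch or space columns over all alignments."""
    n, m = len(s1), len(s2)
    # T[i][j] = minimum distance between the prefixes s1[:i] and s2[:j].
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i
    for j in range(1, m + 1):
        T[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = T[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            T[i][j] = min(diag, T[i - 1][j] + 1, T[i][j - 1] + 1)
    return T[n][m]


seqs = {"S1": "ATGCTC", "S2": "ATGAGC", "S3": "TTCTG", "S4": "ATGCATGC"}
for a, b in [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
             ("S2", "S3"), ("S2", "S4"), ("S3", "S4")]:
    print(a, b, min_induced_distance(seqs[a], seqs[b]))
```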
26
4 - 26 A minimal spanning tree MST(D) with edges e(S_1, S_2) of weight 2, e(S_2, S_4) of weight 2, and e(S_1, S_3) of weight 3.
For e(S_1, S_2)
S_1 = ATGCTC
S_2 = ATGAGC
For e(S_2, S_4)
S_1 = (ATG-C-TC)
S_2 = ATG-A-GC
S_4 = ATGCATGC
27
4 - 27 For e(S_1, S_3)
S_1 = ATG-C-TC
S_2 = (ATG-A-GC)
S_3 = TT--C-TG
S_4 = (ATGCATGC)
Table: The Distance Matrix D_m
28
4 - 28 A minimal spanning tree MST(D_m) on S_1, S_2, S_3, S_4 with edge weights 2, 3, and 2.
Theorem: MST(D) is equal to MST(D_m).
Corollary: Let e(a, b) and e(c, d) be two edges on MST(D). If D(a, b) < D(c, d), then D_m(a, b) < D_m(c, d).
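The theorem can be checked on this example with a small Kruskal sketch. D holds the pairwise distances from slides 4 - 23 to 4 - 25, and D_m is recomputed from the final alignment on slide 4 - 27 by counting differing columns (an assumed 0/1 column cost). The names `kruskal`, `induced_distance`, and `msa` are illustrative.

```python
from itertools import combinations


def induced_distance(a: str, b: str) -> int:
    """Number of columns in which two aligned rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def kruskal(names, dist):
    """Minimal spanning tree as a list of (a, b, weight) edges."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, a, b in sorted((dist[a, b], a, b) for a, b in combinations(names, 2)):
        ra, rb = find(a), find(b)
        if ra != rb:                       # edge joins two components
            parent[ra] = rb
            tree.append((a, b, w))
    return tree


names = ["S1", "S2", "S3", "S4"]
D = {("S1", "S2"): 2, ("S1", "S3"): 3, ("S1", "S4"): 2,
     ("S2", "S3"): 4, ("S2", "S4"): 2, ("S3", "S4"): 4}
msa = {"S1": "ATG-C-TC", "S2": "ATG-A-GC", "S3": "TT--C-TG", "S4": "ATGCATGC"}
Dm = {(a, b): induced_distance(msa[a], msa[b]) for a, b in combinations(names, 2)}

print("MST(D): ", kruskal(names, D))
print("MST(Dm):", kruskal(names, Dm))
```

In D the three edges among S_1, S_2, and S_4 all have weight 2, so the minimal spanning tree of D is not unique and this sketch may pick e(S_1, S_4) where the slides use e(S_2, S_4). Both trees have total weight 7, and the tree found for D_m uses exactly the slides' edges.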