Download presentation
Presentation is loading. Please wait.
1
Sequence comparison: Dynamic programming
Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
2
Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need a method for scoring alignments, and an algorithm for finding the alignment with the best score. The alignment score is calculated using a substitution matrix gap penalties. The algorithm for finding the best alignment is dynamic programming (DP).
3
A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix: A C G T 10 -5
4
How many possibilities?
GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C How many different possible alignments of two sequences of length n exist?
5
How many possibilities?
GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C How many different alignments of two sequences of length n exist? x102 x105 x1011 x1017 x1023 FYI for two sequences of length m and n, possible alignments number: 2n choose n - the binomial coefficient
6
DP matrix 5 G A T C GA CA j 0 1 2 3 etc. i 1 2 3 4 5
10 -5 5 The value at (i,j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence. GA CA init. row and column j etc. G A T C i 1 2 3 4 5
7
A C G T 10 -5 DP matrix GAA CA- G A T C 5 1 Moving horizontally in the matrix introduces a gap in the sequence along the left edge.
8
A C G T 10 -5 GA- CAT DP matrix G A T C 5 1 Moving vertically in the matrix introduces a gap in the sequence along the top edge.
9
Moving diagonally in the matrix aligns two residues
C G T 10 -5 GAA CAT DP matrix G A T C 5 Moving diagonally in the matrix aligns two residues
10
Initialization G A T C Start at top left and move progressively A C G
10 -5 Initialization Start at top left and move progressively G A T C
11
A C G T 10 -5 G - Introducing a gap G A T C -4
12
A C G T 10 -5 - C Introducing a gap G A T C -4
13
Complete first row and column
G T 10 -5 Complete first row and column ----- CATAC G A T C -4 -8 -12 -16 -20
14
Three ways to get to i=1, j=1
C G T 10 -5 Three ways to get to i=1, j=1 G- -C j etc. G A T C -4 -8 i 1 2 3 4 5
15
Three ways to get to i=1, j=1
C G T 10 -5 -G C- j etc. G A T C -4 -8 i 1 2 3 4 5
16
Three ways to get to i=1, j=1
C G T 10 -5 G C j etc. G A T C -5 i 1 2 3 4 5
17
Accept the highest scoring of the three
10 -5 Accept the highest scoring of the three G A T C -4 -8 -12 -16 -20 -5 Then simply repeat the same rule progressively across the matrix
18
A C G T 10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?
19
DP matrix G A T C -4 -8 -12 -16 -20 -5 ? -G CA G- CA --G CA- -4 -9 -12
10 -5 -G CA G- CA --G CA- DP matrix -4 -9 -12 G A T C -4 -8 -12 -16 -20 -5 ? -4 -4
20
DP matrix G A T C -4 -8 -12 -16 -20 -5 -G CA G- CA --G CA- -4 -9 -12
10 -5 -G CA G- CA --G CA- DP matrix -4 -9 -12 G A T C -4 -8 -12 -16 -20 -5 -4 -4
21
A C G T 10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?
22
A C G T 10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5
23
A C G T 10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?
24
What is the alignment associated with this entry?
10 -5 Traceback G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 What is the alignment associated with this entry? (just follow the arrows back - this is called the traceback)
25
DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 -G-A CATA A C G T
10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 -G-A CATA
26
Continue and find the optimal global alignment, and its score.
10 -5 DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 ? Continue and find the optimal global alignment, and its score.
27
A C G T 10 -5 Best alignment starts at bottom right and follows traceback arrows to top left G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17
28
One best traceback GA-ATC CATA-C G A T C -4 -8 -12 -16 -20 -5 -9 -13
10 -5 GA-ATC CATA-C One best traceback G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17
29
Another best traceback
G T 10 -5 GAAT-C -CATAC Another best traceback G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17
30
GAAT-C -CATAC GA-ATC CATA-C G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1
10 -5 GAAT-C -CATAC GA-ATC CATA-C G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17
31
Multiple solutions GA-ATC
CATA-C When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them. In our example, all of the alignments at the left have equal scores. GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC
32
DP in equation form Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
33
DP equation graphically
take the max of these three
34
Dynamic programming Yes, it’s a weird name.
DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.
35
What you should know Scoring a pairwise alignment requires a substitution matrix and gap penalties. Dynamic programming (DP) is an efficient algorithm for finding an optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to that position. DP iteratively fills in the matrix using a simple mathematical rule. How to use DP to find an alignment.
36
Practice problem: find a best pairwise alignment of GAATC and AATTC
10 -5 G A T C d = -4
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.