Pair-wise Sequence Alignment (II). Introduction to bioinformatics 2008, Lecture 6. Centre for Integrative Bioinformatics VU


What can sequence alignment tell us about structure? HSSP (Sander & Schneider, 1991): ≥30% sequence identity

Sequence alignment: history of the dynamic programming algorithm
– 1970: Needleman-Wunsch global pair-wise alignment. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3).
– 1981: Smith-Waterman local pair-wise alignment. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147.

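Global dynamic programming
$$
H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\
H(i-1,j) - g & \text{(vertical)} \\
H(i,j-1) - g & \text{(horizontal)}
\end{cases}
$$
S(i, j) is the value from the residue exchange matrix and g is the gap penalty. This is a recursive formula.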
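The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes a simple match/mismatch score of +1/-1 in place of a residue exchange matrix, and the names `simple_score` and `global_align` are hypothetical.

```python
def simple_score(x, y):
    """Stand-in for the exchange matrix S(i, j): +1 match, -1 mismatch (assumed)."""
    return 1 if x == y else -1


def global_align(a, b, score=simple_score, g=2):
    """Global alignment with a linear gap penalty g (a positive number)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # Row and column 0: leading end-gaps are penalised in global alignment.
    for i in range(1, n + 1):
        H[i][0] = -g * i
    for j in range(1, m + 1):
        H[0][j] = -g * j
    # Forward step: fill every cell from its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # diagonal
                H[i - 1][j] - g,                               # vertical
                H[i][j - 1] - g,                               # horizontal
            )
    # Trace back from the lower-right cell to the upper-left cell.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] - g:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("ACGT", "AGT"))
```

When several paths tie, this sketch prefers the diagonal move; other tie-breaking rules give alignments with the same score.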

Dynamic programming: the cell [i, j] contains the score of the best-scoring alignment of subsequences 1..i and 1..j, that is, the subsequences up to [i, j]. Cell [i, j] does not ‘know’ what that best-scoring alignment is (it is one out of many possibilities).

Global dynamic programming: PAM250, Gap = 6 (linear), sequences SPEARE and SHAKE
[Figure: PAM250 exchange values for SPEARE against SHAKE, and the filled search matrix]
The exchange values are copied from the PAM250 matrix (see earlier slide) and represent the S(i, j) values in the DP formula (back two slides). Source: Higgs & Attwood, p. 124. Note: there are errors in the matrices!
Resulting alignment:
SPEARE
S-HAKE
The easy algorithm is only for linear gap penalties. Start in the upper-left cell before either sequence (circled in red in the figure); the path will end in the lower-right cell (circled in blue).

DP is a two-step process
– Forward step: calculate scores
– Trace back: start at highest score and reconstruct the path leading to the highest score
These two steps lead to the highest scoring alignment (the optimal alignment). This is guaranteed when you use DP!

Global dynamic programming
$$
H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\
H(i-1,j) - g & \text{(vertical)} \\
H(i,j-1) - g & \text{(horizontal)}
\end{cases}
$$
Problem with the simple DP approach: it can only handle linear gap penalties. It is not suitable for affine and concave penalties, although the algorithm can be extended to deal with affine penalties (preceding lecture).

Global dynamic programming using affine penalties
Looking back from cell (i, j), we can adapt the algorithm for affine gap penalties by looking back to four more cells (magenta in the figure).
[Figure: search matrix with rows i-2, i-1, i and columns j-2, j-1, j. Annotations: ‘If you came from here, gap was opened, so apply gap-open penalty’ and ‘If you came from here, gap was already open, so apply gap-extension penalty’]
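One common way to put the open/extend distinction into code is to keep three scores per cell, one for each way the alignment can end (residue pair, gap in one sequence, gap in the other). The sketch below shows that standard three-state formulation, which is not necessarily the exact scheme drawn on the slide. It assumes a gap of length k costs Pi + k*Pe (the penalty definition used later in this lecture) and a hypothetical +1/-1 match/mismatch score; `affine_global_score` is an illustrative name.

```python
NEG_INF = float("-inf")


def simple_score(x, y):
    """Assumed stand-in for an exchange matrix: +1 match, -1 mismatch."""
    return 1 if x == y else -1


def affine_global_score(a, b, Pi, Pe, score=simple_score):
    """Global alignment score where a gap of length k costs Pi + k * Pe."""
    n, m = len(a), len(b)
    # M: a[i] paired with b[j]; X: a[i] opposite a gap; Y: b[j] opposite a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(Pi + i * Pe)   # leading end-gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(Pi + j * Pe)   # leading end-gap of length j
    opening = Pi + Pe              # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = max(
                M[i - 1][j] - opening,  # gap was opened here
                X[i - 1][j] - Pe,       # gap was already open: extend it
                Y[i - 1][j] - opening,
            )
            Y[i][j] = max(
                M[i][j - 1] - opening,
                Y[i][j - 1] - Pe,
                X[i][j - 1] - opening,
            )
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("ACGT", "AGT", Pi=3, Pe=1))
```

With Pi = 0 the penalty becomes linear, so the result should agree with the simple linear-gap algorithm.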

Time and memory complexity of DP
– The time complexity is O(n²): if you align two sequences of n residues, you need to perform n² algorithmic steps (the square search matrix has n² cells that need to be filled).
– The memory (space) complexity is also O(n²): if you align two sequences of n residues, you need a square search matrix of n by n containing n² cells.
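As a rough worked example of what quadratic growth means in practice, assume each matrix cell is stored as a 4-byte integer (an assumption made here for illustration, not a figure from the lecture):

$$
n = 10^{4} \;\Rightarrow\; n^{2} = 10^{8} \text{ cells}, \qquad 10^{8} \times 4 \text{ bytes} = 4 \times 10^{8} \text{ bytes} \approx 400 \text{ MB}
$$

Doubling n to 2 × 10⁴ quadruples both the number of steps and the memory under the same assumption.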

Global dynamic programming: all types of gap penalties
$$
S_{i,j} = s_{i,j} + \max \begin{cases}
\max_{0<x<i-1} \{ S_{x,j-1} - \mathrm{Gap}(i-x-1) \} \\
S_{i-1,j-1} \\
\max_{0<y<j-1} \{ S_{i-1,y} - \mathrm{Gap}(j-y-1) \}
\end{cases}
$$
The complexity of this DP algorithm is increased from O(n²) to O(n³). The gap length is known exactly, so any gap penalty regime can be used.
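A direct, unoptimised Python sketch of this recurrence is given below. Several details are assumptions rather than statements from the slide: row and column 0 hold the leading end-gap penalties, the final score is read by looking back from a virtual cell just past the end of both sequences (so trailing end-gaps are penalised), the gap function is one for which splitting a gap is never cheaper (true for linear, affine and concave penalties), and scoring is a hypothetical +1/-1 match/mismatch. The name `general_gap_global` is illustrative.

```python
def simple_score(x, y):
    """Assumed stand-in for an exchange matrix: +1 match, -1 mismatch."""
    return 1 if x == y else -1


def general_gap_global(a, b, gap, score=simple_score):
    """Global alignment score for an arbitrary gap-penalty function gap(length)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap(i)          # leading end-gap
    for j in range(1, m + 1):
        S[0][j] = -gap(j)

    def best_previous(i, j):
        """Best score to arrive from, if residues i and j were the next pair."""
        best = S[i - 1][j - 1]                       # no gap
        for x in range(0, i - 1):                    # column j-1, gap of i-x-1
            best = max(best, S[x][j - 1] - gap(i - x - 1))
        for y in range(0, j - 1):                    # row i-1, gap of j-y-1
            best = max(best, S[i - 1][y] - gap(j - y - 1))
        return best

    # Each of the n*m cells scans a row and a column: O(n^3) overall.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = score(a[i - 1], b[j - 1]) + best_previous(i, j)
    # Virtual end cell: accounts for trailing end-gaps.
    return best_previous(n + 1, m + 1)


print(general_gap_global("ACGT", "AGT", gap=lambda k: 2 * k))    # linear
print(general_gap_global("ACGT", "AGT", gap=lambda k: 3 + k))    # affine
```

Because the gap function is passed in as a parameter, swapping penalty regimes needs no change to the algorithm itself, which is the point the slide makes.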

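Global dynamic programming if affine penalties are used
$$
S_{i,j} = s_{i,j} + \max \begin{cases}
\max_{0<x<i-1} \{ S_{x,j-1} - G_o - (i-x-1)\,G_e \} \\
S_{i-1,j-1} \\
\max_{0<y<j-1} \{ S_{i-1,y} - G_o - (j-y-1)\,G_e \}
\end{cases}
$$
G_o is the gap-opening penalty and G_e is the gap-extension penalty.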

DP recipe for using affine gap penalties
M[i,j] holds the score of the optimal alignment (the highest scoring alignment up to [i,j]). Check:
– the preceding row until j-2: apply appropriate score and gap penalties
– the preceding column until i-2: apply appropriate score and gap penalties
– and cell[i-1, j-1]: apply score for cell[i-1, j-1]

Global dynamic programming: affine penalties, Gap_o = 10, Gap_e = 2, sequences DWVTALK and TDWVLK
[Figure: exchange values and filled search matrix for DWVTALK against TDWVLK]
The exchange values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix). The extra bottom row and rightmost column give the final global alignment scores.


Global dynamic programming

Semi-global pairwise alignment
– Global alignment: all gaps are penalised
– Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised
Example with end-gaps: MSTGAVLIY--TS aligned against GGILLFHRTSGTSNS

Semi-global dynamic programming: PAM250, Gap = 6 (linear), sequences SPEARE and SHAKE
[Figure: PAM250 exchange values for SPEARE against SHAKE, and the filled search matrix]
The exchange values are copied from the PAM250 matrix (see earlier slide). Source: Higgs & Attwood, p. 124. Note: there are errors in the matrices!
Resulting alignment:
SPEARE
-SHAKE
The easy algorithm is only for linear gap penalties. Start in the upper-left cell before either sequence (circled in red in the figure). The path will end in the cell with the highest score anywhere in the bottom row or rightmost column.
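In code, the semi-global variant differs from the global one in only two places: row and column 0 stay at zero, and the final score is the best value in the bottom row or rightmost column. The sketch below assumes a +1/-1 match/mismatch score instead of PAM250 and a linear gap penalty of 2; `semi_global_score` is a hypothetical name.

```python
def simple_score(x, y):
    """Assumed stand-in for an exchange matrix: +1 match, -1 mismatch."""
    return 1 if x == y else -1


def semi_global_score(a, b, score=simple_score, g=2):
    """Semi-global alignment score: end-gaps cost nothing, internal gaps cost g."""
    n, m = len(a), len(b)
    # Row 0 and column 0 are left at 0: leading end-gaps are free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] - g,
                H[i][j - 1] - g,
            )
    # Trailing end-gaps are free: take the best cell on the bottom or right edge.
    bottom_row = max(H[n])
    right_column = max(H[i][m] for i in range(n + 1))
    return max(bottom_row, right_column)


# A short sequence placed inside a longer one (like a gene in a genome).
print(semi_global_score("TTTACGTTT", "ACG"))
```

In this example the three flanking residues on each side cost nothing, whereas a global alignment would charge for all six gap positions.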

Semi-global dynamic programming: two examples with different gap penalties
[Figure: two filled search matrices]
These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).

Semi-global pairwise alignment
Applications of semi-global alignment:
– Finding a gene in a genome
– Placing a marker onto a chromosome
– Generally: if one sequence is much longer than the other
Danger: if gap penalties are high, divergent sequences can give really bad alignments.

There are three kinds of alignments
– Global alignment
– Semi-global alignment
– Local alignment

Local dynamic programming (Smith & Waterman, 1981)
[Figure: search matrix for LCFVMLAGSTVIVGTR against EDASTILCGS, built from an amino acid exchange matrix and gap penalties (open, extension); the search matrix contains negative numbers]
Resulting local alignment:
AGSTVIVG
A-STILCG

Local dynamic programming (Smith and Waterman, 1981): basic algorithm
$$
H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\
H(i-1,j) - g & \text{(vertical)} \\
H(i,j-1) - g & \text{(horizontal)} \\
0
\end{cases}
$$

Example: local alignment of two sequences
Align two DNA sequences:
– GAGTGA
– GAGGCGA (note the length difference)
Parameters of the algorithm:
– Match: score(A,A) = 1
– Mismatch: score(A,T) = -1
– Gap: g = -2
$$
M[i,j] = \max \begin{cases}
M[i,j-1] - 2 \\
M[i-1,j] - 2 \\
M[i-1,j-1] \pm 1 \\
0
\end{cases}
$$

The algorithm. Step 1: init
– Create the matrix
– Initiation: no beginning row/column; just apply the equation M[i, j] = max(M[i, j-1] - 2, M[i-1, j] - 2, M[i-1, j-1] ± 1, 0)

The algorithm. Step 2: fill in
– Perform the forward step with the same equation
– When the matrix is full, we’re done: find the highest cell anywhere in the matrix
[Figure: search matrix for GAGTGA against GAGGCGA being filled in]

The algorithm. Step 3: trace back
– Reconstruct the path leading to the highest scoring cell
– Trace back until zero: the alignment path can begin and terminate anywhere in the matrix
Alignment:
GAG
GAG
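The three steps (init, fill in, trace back) can be sketched in Python as follows, using the parameters of this example (match 1, mismatch -1, 2 subtracted per gap position). The function name `local_align` is hypothetical, and the sketch keeps an explicit zero-filled row and column 0 as a convenience, which has the same effect as having no beginning row/column.

```python
def local_align(a, b, match=1, mismatch=-1, g=2):
    """Local alignment with a linear gap penalty; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    # Step 1: init. Row/column 0 are zeros, so no end-gap is ever charged.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    # Step 2: fill in, remembering the highest cell anywhere in the matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i][j - 1] - g, M[i - 1][j] - g, M[i - 1][j - 1] + s, 0)
            if M[i][j] > best:
                best, best_cell = M[i][j], (i, j)
    # Step 3: trace back from the highest cell until a zero is reached.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - g:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("GAGTGA", "GAGGCGA"))
```

If several cells share the highest score, this sketch reports the first one found in row order; the slides do not say which to prefer.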

Local dynamic programming: Match/mismatch = 1/-1, Gap = 2, sequences ATGACGT and TAGACT
[Figure: search matrix for ATGACGT against TAGACT]
Fill the matrix (forward pass), then trace back from the highest cell anywhere in the matrix until you reach 0 or the beginning of a sequence.
Resulting local alignment: GAC

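Local dynamic programming (Smith & Waterman, 1981)
$$
S_{i,j} = \max \begin{cases}
s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_{\mathrm{i}} - (i-x-1)\,P_{\mathrm{x}} \} \\
s_{i,j} + S_{i-1,j-1} \\
s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_{\mathrm{i}} - (j-y-1)\,P_{\mathrm{x}} \} \\
0
\end{cases}
$$
Here s_{i,j} is the exchange value for residues i and j, Pi is the gap opening penalty and Px is the gap extension penalty. This is the general DP algorithm, which is suitable for linear, affine and concave penalties, although for the example here affine penalties are used.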

Local dynamic programming

Global or local pairwise alignment?
[Figure: schematic comparison of local and global alignment for sequences built from segments A, B and C]

Globin fold: α protein, myoglobin, PDB: 1MBN
Alpha-helices are labelled ‘A’ (blue) to ‘H’ (red). The D helix can be missing in some globins: what happens with the alignment if D-helix containing globin sequences are aligned with ‘D-less’ ones?

β sandwich: β protein, immunoglobulin, PDB: 7FAB
Immunoglobulin structures have variable regions where numbers of amino acids can vary substantially.

TIM barrel: α/β protein, Triose phosphate IsoMerase, PDB: 1TIM
The evolutionary history of this protein family has been the subject of rigorous debate. Arguments have been made in favor of both convergent and divergent evolution. Because of the general lack of sequence homology, the ancestry of this molecule is still a mystery.

What does all this mean for alignments?
– Alignments need to be able to skip anything from secondary structural elements up to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence).
– Depending on the gap penalties chosen, the algorithm might have difficulty making such long gaps (for example when using high affine gap penalties), resulting in an incorrect alignment.
– Alignments are only meaningful for homologous sequences (with a common ancestor).

There are three kinds of pairwise alignments
– Global alignment: align all residues in both sequences; all gaps are penalised
– Semi-global alignment: align all residues in both sequences; end gaps are not penalised (zero end gap penalties)
– Local alignment: align one part of each sequence; end gaps are not applicable
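The three kinds can share one forward step and differ only in initialisation, in whether negative cells are reset to zero, and in where the final score is read. The sketch below illustrates this for linear gap penalties, with an assumed +1/-1 match/mismatch score and gap penalty 2; the function name `alignment_score` and the mode strings are hypothetical.

```python
def alignment_score(a, b, mode, match=1, mismatch=-1, g=2):
    """Score of the best 'global', 'semi-global' or 'local' alignment."""
    if mode not in ("global", "semi-global", "local"):
        raise ValueError("unknown mode: " + mode)
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    if mode == "global":
        # Only global alignment penalises leading end-gaps.
        for i in range(1, n + 1):
            H[i][0] = -g * i
        for j in range(1, m + 1):
            H[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s, H[i - 1][j] - g, H[i][j - 1] - g)
            # Only local alignment resets negative cells to zero.
            H[i][j] = max(best, 0) if mode == "local" else best
    if mode == "global":
        return H[n][m]                                   # lower-right cell
    if mode == "semi-global":
        return max(max(H[n]), max(row[m] for row in H))  # bottom row / right column
    return max(max(row) for row in H)                    # anywhere in the matrix


for mode in ("global", "semi-global", "local"):
    print(mode, alignment_score("TTTACGTTT", "ACG", mode))
```

For this pair the global score is dragged down by six penalised end-gap positions, while the semi-global and local scores are not.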

Easy global DP recipe for using affine gap penalties (after Gotoh)
M[i,j] holds the score of the optimal alignment (the highest scoring alignment up to [i, j]). At each cell [i, j] in the search matrix, take the maximum coming from:
– any cell in the preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties;
– any cell in the preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties;
– or cell[i-1, j-1]: add score for cell[i, j].
Select the highest scoring cell in the bottom row and rightmost column and do the trace-back.
Penalty = Pi + gap_length*Pe
$$
S_{i,j} = s_{i,j} + \max \begin{cases}
\max_{0<x<i-1} \{ S_{x,j-1} - P_{\mathrm{i}} - (i-x-1)\,P_{\mathrm{e}} \} \\
S_{i-1,j-1} \\
\max_{0<y<j-1} \{ S_{i-1,y} - P_{\mathrm{i}} - (j-y-1)\,P_{\mathrm{e}} \}
\end{cases}
$$

Let’s do an example: global alignment
Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2), sequences DWVTALK and TDWVLK.
– Row and column ‘0’ are filled with 0, -12, -14, -16, … if global alignment is used (for N-terminal end-gaps).
– An extra row and column at the end are used to calculate the score including C-terminal end-gap penalties.
– Only ‘non-diagonal’ arrows are indicated for clarity (no arrow means that you go back to the earlier diagonal cell).
– Cell (D2, T4) can alternatively come from two cells (same score): ‘high-road’ or ‘low-road’.
[Figure: PAM250 exchange values and filled search matrix for DWVTALK against TDWVLK]
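The row and column ‘0’ values quoted for this example follow directly from the affine penalty Pi + gap_length*Pe. A short Python check, with hypothetical helper names `gap_penalty` and `initial_row`:

```python
def gap_penalty(length, Pi=10, Pe=2):
    """Affine penalty for a gap of the given length: Pi + length * Pe."""
    return Pi + length * Pe


def initial_row(n, Pi=10, Pe=2):
    """Row (or column) 0 for global alignment: 0, then growing end-gap penalties."""
    return [0] + [-gap_penalty(k, Pi, Pe) for k in range(1, n + 1)]


print(initial_row(3))  # [0, -12, -14, -16]
```

For semi-global and local alignment the same row would simply be all zeros, because end-gaps are not penalised.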

Let’s do another example: semi-global alignment
Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2), sequences DWVTALK and TDWVLK.
A starting row and column ‘0’, and an extra column at the right or extra row at the bottom, are not necessary when using semi-global alignment (zero end-gaps). The rest works as under global alignment.
[Figure: PAM250 exchange values and filled search matrix for DWVTALK against TDWVLK]

Easy local DP recipe for using affine gap penalties (after Gotoh)
M[i,j] holds the score of the optimal alignment (the highest scoring alignment up to [i, j]). At each cell [i, j] in the search matrix, take the maximum coming from:
– any cell in the preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties;
– any cell in the preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties;
– cell[i-1, j-1]: add score for cell[i, j];
– or 0.
Select the highest scoring cell anywhere in the matrix and trace back until a zero-valued cell or the start of the sequence(s).
Penalty = Pi + gap_length*Pe
$$
S_{i,j} = \max \begin{cases}
s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_{\mathrm{i}} - (i-x-1)\,P_{\mathrm{e}} \} \\
s_{i,j} + S_{i-1,j-1} \\
s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_{\mathrm{i}} - (j-y-1)\,P_{\mathrm{e}} \} \\
0
\end{cases}
$$
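The local recipe can be sketched in Python by scanning the preceding row and column at every cell and clamping the result at zero. This is an illustrative sketch only: scoring uses a caller-supplied function rather than PAM250, the gap penalty is passed in as a function of gap length, and `local_general_score` is a hypothetical name. It returns the score only, without the trace-back.

```python
def local_general_score(a, b, score, gap):
    """Best local alignment score for an arbitrary gap-penalty function gap(length)."""
    n, m = len(a), len(b)
    # Row/column 0 are zeros: no start/end rows or columns are needed.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = S[i - 1][j - 1]                       # cell [i-1, j-1]
            for x in range(1, i - 1):                    # preceding column, until i-2
                prev = max(prev, S[x][j - 1] - gap(i - x - 1))
            for y in range(1, j - 1):                    # preceding row, until j-2
                prev = max(prev, S[i - 1][y] - gap(j - y - 1))
            # Each negative scoring cell is set to zero.
            S[i][j] = max(0, score(a[i - 1], b[j - 1]) + prev)
            best = max(best, S[i][j])                    # highest cell anywhere
    return best


def affine(k, Pi=3, Pe=1):
    """Example affine penalty Pi + k * Pe with assumed values."""
    return Pi + k * Pe


def match_score(x, y):
    """Assumed scoring for this example: +2 match, -3 mismatch."""
    return 2 if x == y else -3


print(local_general_score("AAAACCCC", "AAAAGCCCC", match_score, affine))
```

In the example call, bridging the single extra G with one gap (cost 4) joins the two runs of four matches, which scores higher than either run alone.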

Let’s do yet another example: local alignment
Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2), sequences DWVTALK and TDWVLK.
– Extra start/end columns/rows are not necessary (no end-gaps).
– Each negative scoring cell is set to zero.
– The highest scoring cell may be found anywhere in the search matrix after calculating it.
– Trace the highest scoring cell back to the first cell with zero value (or the beginning of one or both sequences).
[Figure: PAM250 exchange values and filled search matrix for DWVTALK against TDWVLK]

Dot plots
– A way of representing (visualising) sequence similarity without doing dynamic programming (DP)
– Make the search matrix as for DP, but locally represent sequence similarity by averaging using a sliding window

Dot-plots Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix -- typically the central cell holds the window score (e.g. sum, average)
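A minimal Python sketch of such a sliding diagonal window is shown below. It assumes the window score is the number of identical residue pairs in the window and that a dot is drawn when this count reaches a threshold; the function name `dot_plot` and its parameters are hypothetical. The cell at the window centre holds the score, and window positions that fall outside either sequence simply contribute nothing.

```python
def dot_plot(a, b, window=10, min_matches=6):
    """Return the set of (i, j) cells whose diagonal window has enough matches."""
    half = window // 2
    dots = set()
    for i in range(len(a)):
        for j in range(len(b)):
            matches = 0
            # Slide along the diagonal through (i, j), centred on that cell.
            for k in range(-half, window - half):
                p, q = i + k, j + k
                if 0 <= p < len(a) and 0 <= q < len(b) and a[p] == b[q]:
                    matches += 1
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: '*' for a dot, '.' otherwise."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a))
    )


seq = "ABCDEFG"
print(render(seq, seq, dot_plot(seq, seq, window=3, min_matches=3)))
```

In this self-comparison only the main diagonal lights up, and the two end cells are missing because their windows are cut off by the sequence boundary.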

Dot-plots: a simple way to visualise sequence similarity
Can be a bit messy, though...
Filter: 6/10 residues have to match...

Dot-plots, what about...
– Insertions/deletions: DNA and proteins
– Duplications (e.g. tandem repeats): DNA and proteins
– Inversions: DNA

Dot-plots, self-comparison
[Figure: self-comparison dot plots showing a direct repeat, a tandem repeat and an inverted repeat]

A heuristic
– Heuristics: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

For your first exam D1, make sure you understand and can carry out:
1. the ‘simple’ DP algorithm for global, semi-global and local alignment (using linear gap penalties, but make sure you know the extension of the basic algorithm for affine gap penalties), and
2. the general DP algorithm for global, semi-global and local alignment (using linear, affine and concave gap penalties)!