Bioinformatics I, Fall 2002. Dynamic programming algorithm: pairwise comparisons


Comparing two sequences requires a method that is both reliable and efficient.
Exhaustively scoring every possible alignment would find the best one, but it takes too much time: the number of possible alignments grows very rapidly with sequence length.

Dynamic programming: strategy
Break the alignment problem into small pieces.
Optimize the first piece.
Then extend into the second piece; since the first piece is already optimized, the program only needs to optimize the extension.
Continue until the end of the comparison.

Gaps
Remember that we need to penalize gaps (mimicking evolution).
Simplest gap scoring: assign the same penalty (d) to every gap position. This linear scoring is not very realistic, because a single insertion or deletion event can span several residues.
More advanced gap scoring: assign a larger penalty (d) to the first position of a gap and a smaller penalty (e) to each following position of the same gap. This is affine scoring.
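To make the difference between the two gap schemes concrete, here is a minimal Python sketch. The function names and the affine values d = 12, e = 2 are illustrative assumptions, not values from the course; only the linear penalty of 8 appears later in these slides.

```python
def linear_gap_cost(length, d=8):
    """Total penalty for a gap of `length` positions, each costing d."""
    return d * length


def affine_gap_cost(length, d=12, e=2):
    """Total penalty: d for the first gap position, e for each further one.

    d = 12 and e = 2 are assumed example values.
    """
    if length <= 0:
        return 0
    return d + (length - 1) * e


# Compare how the two schemes charge for gaps of increasing length.
for g in (1, 2, 5):
    print(g, linear_gap_cost(g), affine_gap_cost(g))
```

Under affine scoring one long gap is cheaper than several short gaps with the same total length, which is the behaviour the slide calls more realistic.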

Global alignment: Needleman-Wunsch
What you need to start:
Matrix of the sequences to be aligned (example: the sequence example from the text).
Substitution matrix; choose one that makes sense (example: BLOSUM50).
Gap penalty (example: -8 per gap position, i.e. d = 8).
Start at 0 (top left); this allows a "gap" at the beginning of the alignment.

Dynamic programming process
Fill in the matrix starting from the top left.
Each time you move off the diagonal (across or down), you add the gap penalty (-8 here) to the score in the cell you started from.
Each time you move along a diagonal, you add the score from the substitution matrix for the two residues being aligned.

Fill in the values for "gaps" at the beginning (start with 0)

          H     E     A     G
     0
P
A
W
H

For example, if you aligned the H with an empty space, you would get a score of -8 for that cell in this example:

HEAG
-PAWH

The arrow in the matrix indicates that the score was obtained by adding to the 0.

If you aligned both H and E with an empty space, you would get a score of -16 in the E cell, because you add the gap penalty onto the score in the preceding cell (you did not move diagonally); the arrow comes from the -8.

HEAG
--PAWH

Similar reasoning allows you to fill in the rest of the first row and the first column

          H     E     A     G
     0   -8   -16   -24   -32
P   -8
A  -16
W  -24
H  -32
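The border initialization can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `init_matrix` is hypothetical, and the matrix is stored as F[i][j] with i counting residues of the first sequence and j residues of the second, so it is the transpose of the picture on the slide.

```python
def init_matrix(x, y, d=8):
    """Return a (len(x)+1) by (len(y)+1) matrix with the gap-only borders filled.

    F[i][0] scores x[:i] aligned entirely against gaps, F[0][j] likewise for y[:j].
    """
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = -d * i
    for j in range(1, len(y) + 1):
        F[0][j] = -d * j
    return F


F = init_matrix("HEAG", "PAWH")
print([F[i][0] for i in range(5)])  # border for HEAG: [0, -8, -16, -24, -32]
print(F[0])                         # border for PAWH: [0, -8, -16, -24, -32]
```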

Now there are 3 possibilities for filling each remaining matrix element. If you align P with H, you move from the 0 along the diagonal, so you add the substitution matrix value of -2, giving 0 + (-2) = -2.

          H     E     A     G
     0   -8   -16   -24   -32
P   -8   -2
A  -16
W  -24
H  -32

Or you could start with H aligned with a gap, and then align P with a gap. This reaches the P/H cell from the -8 above it, giving -8 - 8 = -16.

H-
-P

Or you could start with P aligned with a gap, and then align H with a gap. This reaches the P/H cell from the -8 to its left, again giving -8 - 8 = -16.

-H
P-

We choose the highest value (here -2, from the diagonal) and record both the value and the cell we came from to get it (the arrow).

          H     E     A     G
     0   -8   -16   -24   -32
P   -8   -2
A  -16
W  -24
H  -32

Now we get to the P/E matrix element. There are 3 ways we could get to this position:

HE..      HE..      HE-..
-P..      P-..      --P..

Note that only one of these possibilities actually aligns P with E: the one that moves diagonally.
These possibilities have different scores; we enter the highest score and draw an arrow to the matrix element from which we moved to get this score.

HE..      Score = -8 + (-1) = -9    (diagonal: P aligned with E)
-P..

HE..      Score = -2 - 8 = -10      (from the P/H cell: E aligned with a gap)
P-..

HE-.      Score = -16 - 8 = -24     (from the cell above: P aligned with a gap)
--P.

          H     E     A     G
     0   -8   -16   -24   -32
P   -8   -2    -9
A  -16
W  -24
H  -32

In this case, the highest score from the three parent matrix elements was along the diagonal.
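The choice made for a single cell can be written as a small Python function. This is a sketch with hypothetical names (`best_parent`, and the move labels); the numbers are the ones from the P/H and P/E cells above, with d = 8.

```python
def best_parent(diag, left, up, s, d):
    """Return (score, move) for one cell given the scores of its three parents.

    diag, left, up: scores in the diagonal, left and upper neighbour cells
    s: substitution score for the two residues of this cell
    d: linear gap penalty (a positive number that is subtracted)
    """
    candidates = [
        (diag + s, "diagonal"),  # align the two residues with each other
        (left - d, "left"),      # column residue aligned with a gap
        (up - d, "up"),          # row residue aligned with a gap
    ]
    return max(candidates, key=lambda c: c[0])


# P/H cell: parents 0 (diagonal), -8 (left), -8 (up); s(P,H) = -2
print(best_parent(diag=0, left=-8, up=-8, s=-2, d=8))    # (-2, 'diagonal')
# P/E cell: parents -8 (diagonal), -2 (left), -16 (up); s(P,E) = -1
print(best_parent(diag=-8, left=-2, up=-16, s=-1, d=8))  # (-9, 'diagonal')
```

The returned move plays the role of the arrow drawn on the slides.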

Using the same logic, you can fill in all the other cells in the matrix.
We can also express this process using matrix notation.
x and y are the sequences; x_1...x_i and y_1...y_j are their initial parts.
F is the matrix; F(i,j) is the score of the best alignment between the initial part of x (up to x_i) and the initial part of y (up to y_j).

Remember, the strategy is to optimize the first pieces and then extend. So we are looking for the best score for F(i,j), which can come from extending F(i-1,j-1) (diagonal), F(i-1,j) (across) or F(i,j-1) (down).
Since we started at the beginning of the sequences, this process takes account of all possible alignments, giving us the best one.

We can express this by:

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) - d \\
F(i,\,j-1) - d
\end{cases}
\]

where s = score from the substitution matrix and d = linear gap penalty.
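The recurrence translates almost line for line into code. The following Python sketch fills the whole matrix; the function names are hypothetical, and the scoring function in the usage example (1 for a match, 0 for a mismatch) is an assumed stand-in for a real substitution matrix such as BLOSUM50.

```python
def nw_fill(x, y, s, d):
    """Fill the global alignment matrix; F[i][j] scores x[:i] against y[:j].

    s: function returning the substitution score for two residues
    d: linear gap penalty (positive, subtracted for each gap position)
    """
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = -d * i
    for j in range(1, len(y) + 1):
        F[0][j] = -d * j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # diagonal
                F[i - 1][j] - d,                          # x_i aligned with a gap
                F[i][j - 1] - d,                          # y_j aligned with a gap
            )
    return F


def identity(a, b):
    """Assumed toy scoring: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


F = nw_fill("GAT", "GT", identity, d=1)
print(F[-1][-1])  # score of the best global alignment: 1
```

The work is proportional to the product of the two sequence lengths, which is what makes this practical where exhaustive comparison is not.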

So now what?
Now we look for the path through the matrix that gives the final score. In this kind of global alignment, the last (bottom right) cell of the matrix holds, by definition, the best score for the alignment.
Looking for the path is called traceback: you follow the pointers that got you to the end (like Hansel and Gretel ...).

By following the arrows, you can arrive at the alignment.
Only one alignment is found in this treatment, but the algorithm can be modified to recover more than one (see the example from the text).
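A traceback can be sketched in Python as follows. This is one possible implementation under stated assumptions: instead of storing arrows, it recomputes at each cell which parent produced the score, and it breaks ties by preferring the diagonal, so it reports a single alignment. The names `nw_align` and `identity` and the toy scoring are hypothetical.

```python
def nw_align(x, y, s, d):
    """Global alignment by matrix fill plus traceback.

    Returns (score, aligned_x, aligned_y), with '-' marking gap positions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Walk back from the last cell to the origin, collecting columns in reverse.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def identity(a, b):
    """Assumed toy scoring: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


print(nw_align("GAT", "GT", identity, d=1))  # (1, 'GAT', 'G-T')
```

Recovering all optimal alignments would require following every parent that ties for the maximum rather than only the first one.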

In-class exercise II
Using identity scoring and a gap penalty d = 1 (consider spaces before and after the ends of the sequences to be gaps), complete the matrix on the following slide.
Do a traceback to find the optimal alignment.

In-class exercise II: complete the matrix

          G    A    A    C    T    T    A
     0
A
C
C
T
T
T

In-class exercise III
Use the Gap program to align sequences in the nosalign file.
Vary the gap initiation penalty and the gap extension penalty; compare the alignments.
Change the substitution matrix, keeping all other variables the same; compare the alignments.
Use the Gap program to align two unrelated sequences.

Instructions for Gap exercise
In seqlab, bioinfI.list, select nosalign; get into Editor.
Select 2 sequences; use the info button if necessary to find out what these sequences are; select Edit -> Remove gaps -> All gaps.
Select Functions -> Pairwise Comparison -> Gap.
Select Options; select "penalize end gaps like other gaps", then Close, then Run.

Note the quality score of this alignment.
Now systematically vary the gap penalties and the substitution matrices and run the program (always penalizing end gaps) on the same pair of sequences; note the quality score for each variation.
See what happens if you don't penalize end gaps.
Don't save this as anything; just go to the main list when you are done.

Go back to the main list; select unrelated; use the info button to find out what these sequences are.
Run the Gap program (penalizing end gaps).
Is this alignment meaningful? Check by using the "Generate statistics from randomized alignments" feature in Options; choose preserving nucleotide or amino acid composition and take the other defaults; note the average score and standard deviation from the randomizations and compare them to the score of the alignment.
See what happens when you don't penalize end gaps.
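The idea behind a randomization check can be illustrated with a short Python sketch. This is not the Gap program's implementation; it is an assumed, simplified version in which one sequence is shuffled (which preserves its composition), each shuffled copy is realigned, and the real score is compared with the mean and standard deviation of the shuffled scores. The function names, the toy scoring and the default of 100 trials are all assumptions.

```python
import random
import statistics


def nw_score(x, y, s, d):
    """Global alignment score with linear gap penalty d, keeping only two rows."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                           prev[j] - d,
                           cur[j - 1] - d))
        prev = cur
    return prev[-1]


def shuffle_stats(x, y, s, d, trials=100, seed=0):
    """Return (real score, mean shuffled score, standard deviation, z-score)."""
    rng = random.Random(seed)
    real = nw_score(x, y, s, d)
    scores = []
    for _ in range(trials):
        letters = list(y)
        rng.shuffle(letters)  # same composition as y, random order
        scores.append(nw_score(x, "".join(letters), s, d))
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    z = (real - mean) / sd if sd > 0 else None
    return real, mean, sd, z


def identity(a, b):
    """Assumed toy scoring: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


print(shuffle_stats("GATTACAGATTACA", "GATTACAGATTACA", identity, d=1))
```

A real score many standard deviations above the shuffled mean suggests the alignment reflects more than composition alone; a score within the shuffled range suggests it does not.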

Local alignment: Smith-Waterman
This is very similar to Needleman-Wunsch, with two major differences:
It must allow for starting a new alignment rather than extending an existing one.
It must allow for the alignment to end before the end of the sequences.

Allowing for starting a new alignment is done by allowing F(i,j) to take the value 0 if all other options are < 0:

\[
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) - d \\
F(i,\,j-1) - d
\end{cases}
\]

Allowing for the alignment to end before the end of the sequences is taken care of by looking for the highest score anywhere in the matrix and starting the traceback from that cell, continuing until a 0 is reached.
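Both changes fit into a short Python sketch of local alignment. As before this is an illustration under assumptions: the function names are hypothetical, the scoring in the usage example (+2 for a match, -1 for a mismatch, d = 2) is an assumed toy scheme, and ties are broken by preferring the diagonal and the first highest-scoring cell found.

```python
def sw_align(x, y, s, d):
    """Local alignment with linear gap penalty d.

    Returns (score, aligned_x, aligned_y) for one best-scoring local alignment.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay at 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                   # start a new alignment
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def match_mismatch(a, b):
    """Assumed toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(sw_align("AAGATCC", "TTGATTT", match_mismatch, d=2))  # (6, 'GAT', 'GAT')
```

For the zero option to be meaningful, the scoring scheme has to give negative scores to mismatches; with the 1/0 identity scoring used for global alignment, a local alignment would never be cut off by a mismatch.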

In-class exercise IV
Use Bestfit to find local alignments for the same sequences (from nosalign and from unrelated) you used in the previous exercises; note that you do not have an option about penalizing end gaps as you did in Gap.
Vary the same parameters you did before.
Use randomizations to evaluate the alignments.

Affine gap penalties
To distinguish between a gap initiation penalty (d) and a gap extension penalty (e), we have to distinguish between a sequence element aligned with another sequence element and one aligned to a gap.
There are two ways to be aligned to a gap: x_i aligned to a gap in y, or y_j aligned to a gap in x.

In building our F matrix, remember that we want to maximize the score of the extension, so we have to take into account the three parent possibilities. We simply define 3 cases and how to extend each of them, so the algorithm extends according to which case it starts from.
M(i,j) = best score up to (i,j) given that x_i is aligned with y_j
I_x(i,j) = best score up to (i,j) given that x_i is aligned with a gap in y
I_y(i,j) = best score up to (i,j) given that y_j is aligned with a gap in x

\[
M(i,j) = \max
\begin{cases}
M(i-1,\,j-1) + s(x_i, y_j) \\
I_x(i-1,\,j-1) + s(x_i, y_j) \\
I_y(i-1,\,j-1) + s(x_i, y_j)
\end{cases}
\]

\[
I_x(i,j) = \max
\begin{cases}
M(i-1,\,j) - d \\
I_x(i-1,\,j) - e
\end{cases}
\]

\[
I_y(i,j) = \max
\begin{cases}
M(i,\,j-1) - d \\
I_y(i,\,j-1) - e
\end{cases}
\]
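The three recurrences can be sketched in Python for the global case. The slides do not give boundary conditions, so the initialization below is an assumption: a leading gap of length k costs d + (k - 1)e, and states that cannot occur are set to minus infinity. The function names and the toy scoring (+2 match, -1 mismatch) are also assumptions.

```python
NEG_INF = float("-inf")


def affine_score(x, y, s, d, e):
    """Global alignment score with gap-open penalty d and gap-extension penalty e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned with y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned with a gap in y
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned with a gap in x
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e   # assumed boundary: one leading gap in y
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e   # assumed boundary: one leading gap in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


def match_mismatch(a, b):
    """Assumed toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


# Best alignment is ACGGGT over AC---T: three matches (6) minus one gap of length 3 (3 + 1 + 1).
print(affine_score("ACGGGT", "ACT", match_mismatch, d=3, e=1))  # 1
```

As written in the recurrences, a gap in one sequence cannot be followed directly by a gap in the other; that restriction is kept here because it is what the slide states.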