Multiple alignment: Feng-Doolittle algorithm



Why multiple alignments?
Alignment of more than two sequences usually gives better information about conserved regions and function (more data).
It gives a better estimate of significance when working with a sequence of unknown function.
Multiple alignments must be used when establishing phylogenetic relationships.

Dynamic programming extended to many dimensions?
No: it uses up too much computer time and space.
E.g., with sequences of 200 amino acids, a pairwise alignment must evaluate $4 \times 10^{4}$ matrix elements.
With 3 sequences: $8 \times 10^{6}$ matrix elements.
With 6 sequences: $6.4 \times 10^{13}$ matrix elements.
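The growth in matrix size quoted on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `dp_cells` is made up for illustration, and it follows the slide's convention of counting L^N cells for N sequences of length L (a real dynamic-programming lattice would have (L+1)^N cells, which does not change the conclusion).

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in a full dynamic-programming lattice for n_seqs sequences
    of equal length seq_len, counted as seq_len ** n_seqs (slide convention)."""
    return seq_len ** n_seqs


# Reproduce the slide's numbers for sequences of 200 residues.
for n in (2, 3, 6):
    print(f"{n} sequences: {dp_cells(200, n):.1e} matrix elements")
```

Each added sequence multiplies the work by the sequence length, which is why exact multidimensional dynamic programming becomes impractical after a handful of sequences.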

Need to find a more efficient method
Sacrifice the certainty of an optimum alignment for the certainty of a good alignment that is found faster.

Feng-Doolittle algorithm
Does all pairwise alignments and scores them.
Converts pairwise scores to “distances”:
$$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$$
$S_{\mathrm{obs}}$ = pairwise alignment score
$S_{\mathrm{rand}}$ = expected score for a random alignment
$S_{\mathrm{max}}$ = average of the self-alignment scores of the two sequences

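As $S_{\mathrm{obs}}$ approaches $S_{\mathrm{rand}}$ (increasing evolutionary distance), $S_{\mathrm{eff}}$ goes down toward 0. Because $S_{\mathrm{eff}}$ lies between 0 and 1, its logarithm is negative; taking the negative log makes the distance measure positive.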
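The score-to-distance conversion can be sketched as a small Python function. The function name and the example scores are invented for illustration, and the natural logarithm is an assumption, since the slide writes only "log".

```python
import math


def feng_doolittle_distance(s_obs: float, s_rand: float, s_max: float) -> float:
    """Convert a pairwise alignment score to a distance, D = -log(S_eff).

    s_obs:  observed pairwise alignment score
    s_rand: expected score for a random alignment
    s_max:  average of the two self-alignment scores
    """
    if s_max <= s_rand:
        raise ValueError("s_max must exceed s_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("s_obs must exceed s_rand for the log to be defined")
    return -math.log(s_eff)  # natural log: an assumption, see lead-in


# Hypothetical scores: the observed score is halfway between random and maximum.
print(feng_doolittle_distance(s_obs=120, s_rand=20, s_max=220))
```

Identical sequences give an effective score of 1 and a distance of 0; as the observed score falls toward the random expectation, the distance grows without bound.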

Once the distances have been calculated, construct a guide tree (more in the phylogeny class); it tells in what order to group the sequences.
Sequences can be aligned with sequences or with groups; groups can be aligned with groups.
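How a distance table turns into a grouping order can be sketched with simple average-linkage (UPGMA-style) clustering. This is a stand-in chosen for brevity: the slides do not say which tree-building method is used, and the function name, sequence names, and distances below are all hypothetical.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the merge order implied by average-linkage clustering.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    Each merge is a pair of clusters; a cluster is a tuple of names.
    """
    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge it.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical distances between four sequences.
pairs = {("A", "B"): 0.10, ("C", "D"): 0.20, ("A", "C"): 0.80,
         ("A", "D"): 0.90, ("B", "C"): 0.70, ("B", "D"): 0.85}
dist = {frozenset(k): v for k, v in pairs.items()}
for step, (left, right) in enumerate(guide_tree_order(list("ABCD"), dist), 1):
    print(f"Alignment {step}: {left} with {right}")
```

In this toy case A is aligned with B first, then C with D, and finally the two groups are aligned with each other.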

Sequence-sequence alignments: dynamic programming.
Sequence-group alignments: all possible pairwise alignments between the sequence and the members of the group are tried; the highest-scoring pair determines how the sequence is aligned to the group.
Group-group alignments: all possible pairwise alignments of sequences between the two groups are tried; the highest-scoring pair determines how the groups are aligned.
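The "try every pair, keep the best" rule can be sketched as follows. The scoring function is a plain global-alignment score with a linear gap penalty and toy match/mismatch values, all of which are assumptions made for illustration; a real implementation would use a substitution matrix and the gap penalties from the exercise. Group members are given here as ungapped sequences.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (toy scoring values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_pair(group_a, group_b, score=nw_score):
    """Return the highest-scoring (sequence from group_a, sequence from group_b) pair.

    A single sequence is treated as a group of one, so the same function
    covers sequence-group and group-group alignments.
    """
    return max(((x, y) for x in group_a for y in group_b),
               key=lambda pair: score(*pair))


print(best_pair(["AAAA"], ["AAAA", "CCCC"]))
```

The alignment of the winning pair then dictates where gaps are placed when the two groups are merged.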

Example (figure): Seq1, Seq2, Seq3, Seq4, and Seq5 are combined step by step through Alignment 1, Alignment 2, and Alignment 3 into the final alignment.

Notice that this method does not guarantee the optimum alignment, just a good one.
Gaps are preserved from alignment to alignment: “once a gap, always a gap”.
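One way to picture "once a gap, always a gap" is that a later alignment step may only add whole gap columns to an existing group; it never removes or moves the gaps already there. The helper below is a hypothetical sketch of that bookkeeping, not code from any of the programs mentioned.

```python
from collections import Counter


def insert_gap_columns(group, positions):
    """Insert an all-gap column before each listed column index.

    group:     list of equal-length aligned rows (strings with '-' for gaps)
    positions: column indices in the existing alignment; an index equal to
               the alignment width appends a column at the end
    Existing gaps are kept exactly where they are.
    """
    counts = Counter(positions)
    width = len(group[0])
    new_rows = []
    for row in group:
        pieces = []
        for col in range(width + 1):
            pieces.append("-" * counts.get(col, 0))  # newly opened gap(s)
            if col < width:
                pieces.append(row[col])              # original column, unchanged
        new_rows.append("".join(pieces))
    return new_rows


# The group already contains a gap in its first row; a new gap column is
# opened before column 2 when the group is aligned to something else.
print(insert_gap_columns(["AC-GT", "ACTGT"], [2]))
```

Every row of the group receives the new gap, and the gap that was already present in the first row survives.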

In-class exercise
Retrieve sequences from multalign.apr into BioScout.
Run Gap in BioScout on all combinations of the sequences in multalign.apr; use a gap penalty of 6 and an extension penalty of 2.
Record the alignment score of each pairwise comparison.
Save the pairwise alignments.

In-class exercise, cont.
Use the raw alignment scores as distance measures; make a guide tree based on these scores.
In Vector NTI, select all sequences in multalign.apr (in the sequence pane); choose Alignment from the toolbar at the top; choose Alignment Setup from the pulldown; choose multiple alignment; take the defaults and choose OK; choose Alignment again, and this time choose Align Selected Sequences from the pulldown.

In-class exercise, cont.
Note that ClustalW does some other things that the Pileup program discussed on the tape does not; we are going to ignore those things for the moment.
Compare ClustalW’s guide tree (visible in the Phylogenetic Tree Pane; tab at the bottom of the window) with yours.

In-class exercise, cont.
Carefully examine ClustalW’s alignment; compare it to the individual pairwise alignments you saved. Are there differences?

Start refining the alignment:
Use structural information if you have it.
Find patterns if you don’t.
Use the amino acid structure handout from the beginning of class for substitution decisions!

ClustalW
Most widely used multiple alignment method.
Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex, and it gives generally superior results.
The ad hoc nature of the program can be mysterious.

Advantageous differences
Gap penalties vary locally:
  By observed frequency (in a database) of gaps after each residue.
  By simple structure prediction: lower gap penalties in probable loop regions.
  By proximity to existing gaps: higher gap penalties within 8 residues of an existing gap.
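The third rule, higher penalties near existing gaps, can be sketched as a per-position penalty list. Only the 8-residue window comes from the slide; the base penalty, the doubling factor, and the function name are illustrative placeholders and are not ClustalW's actual values or code. The other two rules are not modeled.

```python
def local_gap_open_penalties(row, base_penalty=10.0, window=8, boost=2.0):
    """Per-position gap-opening penalties for one aligned row.

    Positions within `window` columns of an existing gap ('-') get
    base_penalty * boost; all other positions keep base_penalty.
    base_penalty and boost are placeholder values for illustration.
    """
    gap_positions = [i for i, ch in enumerate(row) if ch == "-"]
    penalties = []
    for i, ch in enumerate(row):
        near_gap = any(0 < abs(i - g) <= window for g in gap_positions)
        if near_gap and ch != "-":
            penalties.append(base_penalty * boost)
        else:
            penalties.append(base_penalty)
    return penalties


row = "A" * 10 + "-" + "AA"  # a single existing gap at column 10
print(local_gap_open_penalties(row))
```

Raising the penalty near an existing gap discourages the aligner from scattering many small gaps close together.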

Advantages, cont.
Change in substitution matrix choice depending on the distance computed for the guide tree:
  Substitution matrix families.
Profile construction (more later):
  Weighting of sequences in profiles depending on the evolutionary distance computed for the guide tree.
  More similar sequences get less weight than less similar sequences.
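The idea that near-duplicate sequences should count for less can be illustrated with a deliberately simplified weighting: each sequence is weighted by its mean distance to the others. This is not ClustalW's scheme, which derives weights from the guide tree; the function name and distances are hypothetical, and the sketch assumes at least two sequences with a nonzero total distance.

```python
def divergence_weights(names, dist):
    """Weight each sequence by its mean distance to all other sequences.

    names: list of sequence names (at least two)
    dist:  dict mapping frozenset({a, b}) -> distance
    Returns weights that sum to 1; divergent sequences get larger weights.
    """
    mean_dist = {
        n: sum(dist[frozenset((n, m))] for m in names if m != n) / (len(names) - 1)
        for n in names
    }
    total = sum(mean_dist.values())
    return {n: d / total for n, d in mean_dist.items()}


# A and B are nearly identical; C is distant from both.
dist = {frozenset(("A", "B")): 0.1,
        frozenset(("A", "C")): 0.8,
        frozenset(("B", "C")): 0.8}
print(divergence_weights(["A", "B", "C"], dist))
```

Here A and B share most of their information, so each receives a smaller weight than C and the pair does not dominate the profile.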

In-class exercise II
Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time; this is done in Alignment Setup.
After each run with a different change, save the alignment project with a descriptive name that you can remember (e.g., gap20 or blosum).
Compare the alignment results obtained with the different parameter changes.

MultAlin
MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments.
It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step.
Because this leads to cycles of tree building and alignment, MultAlin can take a long time to run. It stops when the overall alignment score stops improving.
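The cycle of tree building and realignment can be sketched as a generic loop. This shows only the control flow described on the slide; `build_tree`, `align_with_tree`, and `score` are hypothetical callables supplied by the caller, and `max_rounds` is an added safety limit, none of which come from MultAlin itself.

```python
def iterate_until_converged(sequences, build_tree, align_with_tree, score,
                            max_rounds=50):
    """Alternate tree building and alignment until the score stops improving.

    build_tree(sequences, alignment_or_None) -> guide tree
    align_with_tree(sequences, tree)         -> multiple alignment
    score(alignment)                         -> number, higher is better
    All three callables are placeholders provided by the caller.
    """
    tree = build_tree(sequences, None)          # first tree: pairwise data only
    best = align_with_tree(sequences, tree)
    best_score = score(best)
    for _ in range(max_rounds):
        tree = build_tree(sequences, best)      # recalculate tree from alignment
        candidate = align_with_tree(sequences, tree)
        candidate_score = score(candidate)
        if candidate_score <= best_score:       # no improvement: stop
            break
        best, best_score = candidate, candidate_score
    return best, best_score
```

Each pass costs a full progressive alignment, which is why the method can run for a long time on large inputs.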

Scoring a multiple sequence alignment
Assumptions:
  Sequences (rows) are independent.
  Positions (columns) are independent.
Neither assumption is true …
The score of a column is the (possibly weighted) sum of all the pairwise comparisons (i.e., substitution matrix values) within that column.
The score of a multiple alignment is the sum of the scores for all columns.
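The column-by-column scoring described here can be sketched in a few lines. The toy `pair_score` values stand in for a real substitution matrix, and multiplying the two sequence weights for each pair is one possible reading of "possibly weighted"; both are assumptions made for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for a substitution matrix lookup."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, weights=None):
    """Sum of (weighted) pairwise scores over every column of the alignment.

    alignment: list of equal-length aligned rows
    weights:   optional per-sequence weights (default: 1.0 for every row)
    """
    n = len(alignment)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for column in zip(*alignment):             # one column at a time
        for i, j in combinations(range(n), 2):  # every pair of rows
            total += weights[i] * weights[j] * pair_score(column[i], column[j])
    return total


print(sum_of_pairs(["AC-T", "ACGT", "AG-T"]))
```

For this alignment the four columns contribute 3, -1, -4, and 3, for a total of 1.0. Because columns are scored separately and rows pairwise, the score embodies both independence assumptions listed on the slide.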