MCB 5472 Lecture #6: Sequence alignment March 27, 2014.

Slides:



Advertisements
Similar presentations
Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Advertisements

Introduction to Bioinformatics
Molecular Evolution Revised 29/12/06
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Heuristic alignment algorithms and cost matrices
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
1 Protein Multiple Alignment by Konstantin Davydov.
Sequence Analysis Tools
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL Arunesh Mishra CMSC 838 Presentation Authors : Dmitri Mikhailov,
Multiple alignment: heuristics
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 23rd, 2014.
Multiple Sequence Alignments
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Chapter 5 Multiple Sequence Alignment.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Multiple sequence alignment
Biology 4900 Biocomputing.
Multiple Sequence Alignment
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Xuhua Xia Sequence Alignment Xuhua Xia
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Terminology of phylogenetic trees
Protein Sequence Alignment and Database Searching.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Applied Bioinformatics Week 8 Jens Allmer. Practice I.
Multiple sequence alignments Introduction to Bioinformatics Jacques van Helden Aix-Marseille Université (AMU), France Lab.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment Che-Lun Hung,
Sequence Alignment Only things that are homologous should be compared in a phylogenetic analysis Homologous – sharing a common ancestor This is true for.
Sequence Alignment Xuhua Xia
Copyright OpenHelix. No use or reproduction without express written consent1.
MUSCLE An Attractive MSA Application. Overview Some background on the MUSCLE software. The innovations and improvements of MUSCLE. The MUSCLE algorithm.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Doug Raiford Lesson 5.  Dynamic programming methods  Needleman-Wunsch (global alignment)  Smith-Waterman (local alignment)  BLAST Fixed: best Linear:
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
1 Multiple Sequence Alignment(MSA). 2 Multiple Alignment Number of sequences >2 Global alignment Seek an alignment that maximizes score.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Multiple Alignment Anders Gorm Pedersen / Henrik Nielsen
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
A Hybrid Algorithm for Multiple DNA Sequence Alignment
Sequence Based Analysis Tutorial
Presentation transcript:

MCB 5472 Lecture #6: Sequence alignment March 27, 2014

Sequence alignment As you have seen, sequence alignment is key to nearly all experiments in molecular evolution Thus far we have discussed local alignment as implemented in BLAST Global alignment: Aligns sequences over their entire length Assumes that sequences for alignment are homologous

Recall from BLAST lecture: Sequence alignment is scored: According to a substitution matrix Some substitutions are more likely than others Using affine gaps Gap opening and extension are considered separately Reflects biological reality that Alignment score is the sum of substitution and gaps scores

Pairwise alignments

Needleman-Wunsch Needleman and Wunsch (1970) J. Mol. Biol. 48: The first algorithm to computer the optimum alignment between two sequences using dynamic programming i.e., examines many possible solutions and picks the best Implemented as EMBOSS needle program Global alignments: assumes sequences should be aligned over their entire lengths

Needleman-Wunsch Works by scoring alignments sequentially and evaluating scoring for each alignment position based on previous scores Gaps in one position may force an unlikely substitution later on Calculating these scores represents a dynamic programming sub-problem; computationally efficient Evaluates all possibilities but also maps the best option Guarantees finding the best path according to the parameters used

Smith-Waterman Smith and Waterman (1981) J. Mol. Biol. 147: Local alignment version of Needleman-Wunsch Guaranteed to find the statistically best local alignment BLAST only evaluates a subset of alignment possibilities Implimented in FASTA search program

Multiple sequence alignment

Extending similar dynamic programming approaches to calculate all possible sequence alignments quickly becomes impossible Various tools therefore use different heuristic approaches to align multiple sequences Different specializations and/or motivations Different computational efficiency

ClustalW One of the first widely used multiple sequence alignment programs Thompson et al. (1994) Nuc. Acids Res. 22: Larkin et al. (2007) Bioinformatics 23: ClustalX: Widely used version with a graphical interface

ClustalW Step #1a: Align all pairs of sequences separately Current default: count kmers conserved between sequences Can also be global alignments (original defaults) Step #1b: Calculate distance matrix from pairwise comparisons Step #2: Cluster distance matrix using neighbor- joining or UPGMA algorithms to create a “guide tree” Step #3: Midpoint root tree and weight branches by sequence similarity

ClustalW guide trees Guide trees are no substitute for full phylogenetic analysis!!! Not based on multiple sequence alignment Are only a rough approximation of the true relationships between sequences Even though they can be produced by ClustalW they should not be used for detailed analysis!

ClustalW Step #4: Progressive alignment Starting from most similar sequences on guide tree, align each to each other Uses full dynamic programming alignment methods (cf. N-W) including substitution and gap penalties Gap opening parameters vary based on sequence position to favor alignment to preexisting gaps Any gaps introduced are maintained during subsequent alignment iterations

Progressive alignment example Gaps introduced earlier are propagated into later alignments M Q T I F L H - I W M Q T I F L H I W L Q S W L S F L Q S W L - S F M Q T I F L H - I W L Q S - W L - S - F Edgar (2004) BMC Bioinformatics 5:133

ClustalW Thompson et al. (1994) Nuc. Acids Res. 22: Step #1 Step #2 Step #3 Step #4

Thompson et al. (1994) Nuc. Acids Res. 22:

ClustalW No guarantee of optimal alignment Early errors propagated to more divergent sequences This is true of all multiple sequence programs Reasonably accurate when all sequences are ~ <40% identical Scales reasonably well to a few thousand sequences Easy to run! Has been superseded by better programs

ClustalW command line Easy! Type “clustalw” and follow along

MUSCLE Edgar (2004) Nuc. Acids Res. 32: Edgar (2004) BMC Bioinformatics 5:133 Designed to improve speed and accuracy over older multiple sequence alignment programs like clustalw Now a preferred alignment method, especially for high-throughput studies

MUSCLE Stage #1: create draft progressive alignment Step #1-1: calculate distance between genomes using kmers, create distance matrix Step #1-2: cluster distance matrix into guide tree using UPGMA algorithm Step #1-3: conduct progressive multiple sequence alignment Accuracy sacrificed for speed at this step So far, same as clustalw but faster and less accurate

MUSCLE Stage #2: refine progressive alignment Most inaccuracy in Stage #1 due to using kmer distances to create guide tree Step #2-1: re-create distance matrix using Kimura distances (more accurate, need input multiple sequence alignment) Step #2-2: cluster distance matrix using UPGMA Step #2-3: recalculate progressive alignment, omitting alignments that stayed the same from Step #1-3 Result: more accurate alignment than Stage #1

MUSCLE Stage #3: Alignment refinement Step #3-1: moving from root to tip in the tree from step #2-2, remove that node and spit the alignment into two subalignments Step #3-2: compute alignment profiles for each subalignment Step #3-3: re-align profiles to each other Step #3-4: determine if new alignment has a better score than the previous one, if so keep new one and goto step #3-1 using the next node in the tree Stop when scores stop improving

Profile alignment Can yield better gap placement by removing biases from original pairwise alignments Edgar (2004) BMC Bioinformatics 5:133

MUSCLE Edgar (2004) Nuc. Acids Res. 32:

MAFFT Katoh et al. (2002) Nucl. Acids Res. 30: Katoh & Standley (2013) Mol. Biol. Evol. 30: Similar to MUSCLE, designed to improve speed and accuracy of multiple sequence alignment vs. clustalw

MAFFT Instead of progressive alignments, MAFFT uses a “fast Fourier transform” Creates local alignment blocks based on physico- chemical properties of amino acids (esp. volume & polarity) Very fast! Does not require calculating alignments exhaustively, rather how blocks link together Statistical framework the same for pairwise and multiple alignments

MAFFT Scoring system dramatically simplified relative to clustalw (uses complicated heuristic normalizations) Contains an optional iterative refinement method (similar to MUSCLE) Newer versions contain robust profile alignment methods (i.e., aligning alignments) Speedup and accuracy similar to MUSCLE

PRANK Loytynoja and Goldman (2008) Science 320: Motivation: alignment programs typically group gaps together Gaps represent insertion/deletion (indel) evolutionary events Result: multiple evolutionary events are grouped together

clustalw clustalw SIV gp120 protein alignment Reconstruction of indels implies 8 independent deletions (unlikely) Loytynoja and Goldman (2008) Science 320:

PRANK Progressive alignments using substitution matrices sequentially evaluates alignments on a column-by-column basis Individual columns by themselves lack sufficient information to accurately reflect evolution of the entire sequence PRANK evaluates gap conservation during alignment refinement to decide if the gap should be used during subsequent alignment steps

PRANK of same SIV gp120 Loytynoja and Goldman (2008) Science 320:

PRANK Better models of indel events Sequence alignments are not artificially compressed (i.e., shorter than true alignments) Computational cost

SATé Liu et al. (2012) Syst. Biol. 61: Liu et al. (2009) Science 324: Something different: performs alignment and tree estimation simultaneously

SATé Multiple iterations creating new trees and alignments each time Trees constructed using Multiple Likelihood methods (v. robust) ML trees function as “guide” trees for subsequent iterations Liu et al. (2009) Science 324:

SATé Alignments subdivided along longest tree branch each iteration Many splits having increasing phylogenetic resolution Alignments merged into master alignment Tree recalculated Liu et al. (2012) Syst. Biol. 61:90-106

Which alignment method is best? Head-to-head analyses rarely cover all possibilities Depends on the expected output E.g., comparison to reference alignment E.g., effect on tree construction E.g., effect on identifying site-specific selection Trade-offs: speed vs accuracy

Alignment score compared to known reference Compute time Thompson et al. (2011) PLoS One 6:e18093 Which alignment method is best?

Loytynoja and Goldman (2008) Science 320:

Different algorithms can give different results PCoA plot of trees constructed using different alignment algorithms Blackburne and Whelan (2013) Mol. Biol. Evol. 30:

Different algorithms can give different results Correlation of sites identified as under selection using different sequence alignment algorithms Blackburne and Whelan (2013) Mol. Biol. Evol. 30:

Note: nucleotides vs. proteins Recall: proteins are more conserved compared to nucleotides Sequence alignment is therefore more robust using protein sequences Gaps in nucleotide alignments should reflect codon structures

Note: nucleotides vs. proteins Software exists to convert between protein and nucleotide sequences for alignment PAL2NAL Suyama et al. (2006) Nucl. Acids Res. 34:W609-W612http:// MEGA6: GUI version Tamura et al. (2013) Mol. Biol. Evol. 30: Doesn’t scale fantastically compared to terminal but user-friendly

Sequence masking Another way to deal with poor alignments is to remove regions thought to be inaccurate before further analysis Lose information, but optimize sensitivity/specificity tradeoff Common for phylogenetic applications Gaps are often used during phylogenetic reconstruction, so are better to remove if not actually informative Not entirely without controversy

Gblocks Talavera and Castresana (2007) Syst. Biol. 56: ks.htmlhttp://molevol.cmima.csic.es/castresana/Gbloc ks.html Identifies blocks of sequences aligned with high confidence E.g., few gaps, few columns lacking sequence conservation, confidently-aligned flanking regions

GUIDANCE Penn et al. (2010) Nucl. Acids Res. 38:W23- W28 Create alignment guide trees based on alignment columns Score compared to master alignment

Summary: Choice of multiple sequence alignment program will affect downstream analyses Different trade-offs to approach No substitute for manual inspection and correcting alignments when the resulting phylogeny really, really matters!-