Clustal Ω for Protein Multiple Sequence Alignment
- Des Higgins (Conway Institute, University College Dublin, Ireland), “Clustal Omega for Protein Multiple Sequence Alignment,” presentation at ISMB/ECCB
- Sievers et al., “Fast, scalable generation of high quality protein multiple sequence alignments using Clustal Omega,” unpublished manuscript
Presented by Hershel Safer in Ron Shamir’s group meeting, 17 August 2011

Outline
- Background on multiple sequence alignment (MSA)
- Considerations for a new MSA tool
- Clustal Ω
- Benchmarking: Methods and issues
- Benchmarking results
- References

Example of MSA: Globins [figure from Higgins 2011]

Example continued: Red columns are alpha helices [figure from Higgins 2011]

Approaches to finding MSAs
- Exact solution using dynamic programming: finding the “optimal” MSA for N sequences of length L takes time O(L^N).
- Progressive alignment: greedy heuristic that mimics evolution.
  - Start by creating a guide tree that specifies “evolutionary closeness.” Complexity is O(N^2) for fixed L.
  - Build increasingly large sub-alignments in the order specified by the guide tree. Complexity is O(N).
  - Works for up to a few thousand sequences.
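To make the complexity figures on this slide concrete, here is a small Python sketch that counts the work in each approach. The function names are illustrative only, and the counts are simplifications: the exact method is counted as the number of cells in an N-dimensional dynamic-programming lattice, the guide tree as the number of pairwise comparisons in a full distance matrix, and the progressive stage as the number of merges in a binary guide tree.

```python
from math import comb


def exact_dp_cells(n_seqs: int, length: int) -> int:
    """Cells in the N-dimensional DP lattice: (L + 1)^N, i.e. O(L^N)."""
    return (length + 1) ** n_seqs


def guide_tree_pairs(n_seqs: int) -> int:
    """Pairwise comparisons for a full distance matrix: N(N-1)/2, i.e. O(N^2)."""
    return comb(n_seqs, 2)


def progressive_merges(n_seqs: int) -> int:
    """A binary guide tree over N leaves has N - 1 internal nodes (one merge each)."""
    return n_seqs - 1


# Compare the three counts for sequences of length 100.
for n in (2, 3, 5, 10):
    print(n, exact_dp_cells(n, 100), guide_tree_pairs(n), progressive_merges(n))
```

Even at N = 5 the exact lattice has about 10^10 cells, while the progressive approach needs 10 pairwise comparisons and 4 merges.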

Example of progressive alignment [three figure slides from Higgins 2011]

Features of progressive alignment
Advantages
- Fast
- Gives pretty good results on large problems
- Provides good basis for manual tweaking
Disadvantages
- Hard to know if a solution is good: there is no objective function
- Errors are not corrected. Once two sequences are aligned, they keep the same relative alignment (e.g., later indels apply identically to both sequences).

Consistency criterion
- Addresses problem of errors introduced by early mis-alignments
- Use library of pairwise alignments that is created for building the guide tree
- For each pair of aligned residues in the library, check their alignment in other pairwise alignments.
- Scores for progressive alignment are modified to reflect consistency across the entire library of pairwise comparisons. Helps avoid early mis-alignment.
- Complexity: worst case O(N^3 L^2), in practice O(N^3 L).
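The consistency idea can be sketched in Python as a T-Coffee-style "library extension". This is an illustration under assumptions, not the code of any particular tool: the library is assumed to be a dict mapping a pair of residues to a weight, each residue is a hypothetical (sequence name, position) tuple, and support through a third sequence is taken as the minimum of the two linking weights.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(library):
    """Re-weight residue pairs by their support through third sequences.

    library: dict mapping (residue_x, residue_y) -> weight, where each residue
    is a (sequence_name, position) tuple taken from a pairwise alignment.
    Returns a dict keyed by frozenset({residue_x, residue_y}).
    """
    # Index every residue's directly aligned partners.
    neighbours = defaultdict(dict)
    for (x, y), weight in library.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(float)
    # Direct evidence from the pairwise alignment of the two sequences.
    for (x, y), weight in library.items():
        extended[frozenset((x, y))] += weight
    # Indirect evidence: x and y are both aligned to the same residue z.
    for z, linked in neighbours.items():
        for x, y in combinations(sorted(linked), 2):
            if x[0] != y[0]:  # only pair residues from different sequences
                extended[frozenset((x, y))] += min(linked[x], linked[y])
    return dict(extended)


library = {
    (("A", 0), ("B", 0)): 100,
    (("A", 0), ("C", 0)): 80,
    (("C", 0), ("B", 0)): 60,
    (("A", 1), ("B", 2)): 50,  # no third sequence supports this pair
}
extended = extend_library(library)
print(extended[frozenset((("A", 0), ("B", 0)))])  # 100 direct + 60 via C
print(extended[frozenset((("A", 1), ("B", 2)))])  # 50, unchanged
```

Pairs that agree with the rest of the library gain weight, so the progressive stage is less likely to commit to an early alignment that the other sequences contradict. Checking every pair against every third sequence is the source of the cubic factor in N.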

Two kinds of popular MSA tools
Fast (<10,000 sequences)
- Clustal W
- MAFFT (with --partree, can handle >>10,000 sequences)
- Muscle
- Kalign
Accurate but slow (<100s of sequences)
- T-Coffee
- ProbCons
- MSAProbs

Why a new MSA tool?
Starting to see uses for MSAs with hundreds of thousands of sequences
- Metagenomics
- Next-generation sequencing

Goals for a new MSA tool
Want a tool that scales well (time and space) to hundreds of thousands of sequences and still gives accurate results
- Scalability: Up to several hours to align hundreds of thousands of sequences on a desktop computer
- Accuracy: Similar to Clustal W

Clustal Ω: Possibly the last MSA tool you will need
- Building guide tree: Use mBed to cluster in time O(N log(N))
- Progressive alignment: Use HHalign to sequentially align pairs of profile HMMs
- Take advantage of existing alignments
  - External profile alignment: Use an existing profile HMM of sequences homologous to input set to help align input set
  - Iterate guide tree construction and/or progressive alignment
  - Add sequences to existing alignments without starting from scratch

Building guide tree using mBed
Reduces quadratic time/space of clustering and guide-tree construction to O(N log(N))
1. Cluster sequences
   a. Select log2(N) seed sequences
   b. Compute distance from each sequence to all seeds, using k-tuple distance measure (k=2) for unaligned sequences
   c. Cluster sequences using k-means
2. Build guide tree
   a. Construct UPGMA sub-tree separately for each cluster (use UPGMA code from Muscle)
   b. Link sub-trees using distances between clusters
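Steps 1a and 1b can be sketched in Python as follows. This is a minimal illustration, not the mBed code. The seed count follows the slide as transcribed (rounded up). The seed-selection rule (evenly spaced picks from the length-sorted input) and the exact form of the k-tuple distance (one minus the fraction of shared k-tuples) are assumptions made for the example. The clustering and UPGMA steps are omitted.

```python
from collections import Counter
from math import ceil, log2


def ktuple_distance(a: str, b: str, k: int = 2) -> float:
    """Alignment-free distance: 1 - shared k-tuples / k-tuples in the shorter sequence."""
    tuples_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    tuples_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((tuples_a & tuples_b).values())  # multiset intersection
    shortest = min(sum(tuples_a.values()), sum(tuples_b.values()))
    return 1.0 - shared / shortest if shortest else 1.0


def embed(seqs, k: int = 2):
    """Represent each sequence by its distances to a small set of seed sequences."""
    n_seeds = max(1, ceil(log2(len(seqs))))
    # Assumed seed choice: evenly spaced picks from the length-sorted input.
    by_length = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    stride = len(by_length) / n_seeds
    seeds = [seqs[by_length[int(j * stride)]] for j in range(n_seeds)]
    return [[ktuple_distance(s, seed, k) for seed in seeds] for s in seqs]


sequences = ["VTISCTGSSSNIGAG", "VTISCTGTSSNIGS", "LRLSCSSSGFIFSS", "VTISCTGSSS"]
for seq, vector in zip(sequences, embed(sequences)):
    print(seq, [round(d, 2) for d in vector])
```

Each sequence becomes a short vector, so only N times the number of seeds distances are computed instead of all N(N-1)/2 pairs, and the vectors can be passed to k-means directly.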

Progressive alignment using HHalign
- HHalign is a method for pairwise alignment of profile HMMs
- It was designed to search HMM databases to identify remote homologs (sequence identity <20%)
- In Clustal Ω, sequences and sub-alignments are converted to profile HMMs. Transition, insertion, and deletion probabilities are computed, and pseudo-counts are added as needed.
- HHalign is used to align sub-alignments, in the order defined by the guide tree.
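As a rough illustration of how a sub-alignment becomes a profile, the Python sketch below computes per-column residue probabilities with pseudo-counts. It is a deliberate simplification: it covers only match-state emission probabilities, ignores the transition, insertion, and deletion probabilities mentioned above, and uses a flat add-one pseudo-count, which is an assumption for the example and not HHalign's actual scheme.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column, pseudocount: float = 1.0):
    """Residue probabilities for one alignment column; gaps are ignored."""
    counts = Counter(ch for ch in column if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def alignment_profile(rows, pseudocount: float = 1.0):
    """One probability table per column of an aligned block (equal-length rows)."""
    return [column_profile(col, pseudocount) for col in zip(*rows)]


sub_alignment = ["VTISC-", "VTISCT", "LRLSC-"]
profile = alignment_profile(sub_alignment)
print(round(profile[4]["C"], 3))  # conserved cysteine column
print(round(profile[0]["V"], 3))  # column with V, V, L
```

The pseudo-counts keep every residue's probability above zero, so a residue that has not yet been observed in a column can still be scored when two profiles are aligned.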

External profile alignment (EPA)
- Take advantage of existing HMMs to guide pairwise alignment in early stages, avoiding seemingly good alignments that are bad in the context of the entire MSA
- If the kinds of sequences are known, can often find a relevant HMM in Pfam.
- Contribution of external profile decreases as sub-alignments get larger, as larger sub-alignments contain the information that would come from the external profile.
- Overhead: Can triple the alignment time

EPA performance [figure]

Iteration instead of EPA
- Can bootstrap profile information if external profile is not available or not desired
- MSA of original sequences can be converted to HMM and used as in EPA
- MSA can also be used to rebuild guide tree
- Can iterate this process
- Can decouple iteration of guide-tree construction and HMM construction: can freeze one and just iterate the other, or iterate both

Iteration performance [figure]

Availability of Clustal Ω
- Download a copy (Unix/Linux, Windows, Mac)
- EBI website
- Galaxy analysis system (coming soon?)

Benchmark databases for MSA
BAliBASE
- Collection of manually refined MSAs based on 3D structural superposition
- Annotated core blocks: highly conserved regions that can be reliably aligned
- Occasionally updated to represent the kinds of complex sequences encountered in real problems, as the kinds of alignments attempted change
- Divided into reference sets that represent different kinds of alignment challenges
Other MSA benchmark DBs: Prefab, Homstrad, Oxbench, SABmark, IRMbase

Clustal Ω benchmarking approach
- Compared to 11 other MSA programs
- Score is fraction of columns identical in generated and reference alignments
- Used 3 benchmark databases
  - BAliBASE: Consider only core regions of alignments
  - Prefab
  - HomFam: Created for this work to test scalability to many sequences. Combined Homstrad families with corresponding Pfam families. Only tested with “fast” tools.
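The column score described on this slide can be sketched in Python as below. This is an illustrative implementation under assumptions, not the benchmarking code used in the study: alignments are lists of equal-length gapped strings with "-" as the gap character, rows appear in the same order in both alignments, and a column counts as identical only if it places exactly the same residue (by position in its sequence) in every row.

```python
def columns_as_residue_indices(alignment):
    """Describe each column by the residue index it holds per row (None for a gap)."""
    next_index = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for row, ch in enumerate(column):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(next_index[row])
                next_index[row] += 1
        columns.append(tuple(entry))
    return columns


def column_score(test, reference) -> float:
    """Fraction of reference columns reproduced exactly in the test alignment."""
    test_columns = set(columns_as_residue_indices(test))
    ref_columns = [
        col for col in columns_as_residue_indices(reference)
        if any(index is not None for index in col)  # skip all-gap columns
    ]
    if not ref_columns:
        return 0.0
    return sum(col in test_columns for col in ref_columns) / len(ref_columns)


reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(column_score(test, reference))       # 0.6: three of five columns match
print(column_score(reference, reference))  # 1.0
```

A benchmark that considers only core regions would apply the same calculation after restricting the reference columns to the annotated core blocks.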

Problems with benchmarking databases
- DBs include questionable alignments
- DBs have biased coverage of fold families and kinds of proteins
- Test results may be biased if similar methods used to construct DB and in MSA tool (e.g., pairwise alignment method)
- Focus on core blocks over-estimates accuracy because these regions are more easily aligned
- Including gaps is problematic: Gap position is not considered, and a misplaced gap can improve the accuracy score.
- Amount of sequence divergence in DB alignments: twilight zone (20-35% identity) vs. higher or lower
- Sum-of-pairs vs. column scores
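To illustrate the last point, here is a minimal Python sketch of a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The input format (lists of equal-length gapped strings with "-" for gaps, rows in the same order) and the function names are assumptions for this example.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column; a residue is (row, index in sequence)."""
    next_index = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, ch in enumerate(column):
            if ch != "-":
                present.append((row, next_index[row]))
                next_index[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference) -> float:
    """Fraction of reference residue pairs that the test alignment also aligns."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


reference = ["ACGT", "ACGT", "A-GT"]
test = ["ACGT", "ACGT", "AG-T"]
print(sum_of_pairs_score(test, reference))  # 0.8: 8 of 10 reference pairs recovered
```

Here only the third sequence is misaligned, so most residue pairs are still correct. A column-based score is stricter, because one misplaced residue invalidates its whole column, which is one reason the two measures can rank methods differently.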

Problems with benchmarking databases, cont’d.
- How representative is the benchmark?
  - Method may behave well on benchmark, not in real world
  - Method may behave well in real world, not on benchmark
- Conclusion of Edgar: “protein alignment assessment is more challenging than generally realized, and skepticism is appropriate for claims that method rankings or advances can be reliably measured by current benchmarks.”

BAliBASE benchmark [figure]

Prefab benchmark [figure]

HomFam benchmark [figure]

Scalability of running time [figure]

Additional references
- Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency]
- Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm]
- Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm]
- Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.

Additional references, cont’d.
- Mizuguchi et al. (1998), “HOMSTRAD: A database of protein structure alignments for homologous families,” Protein Sci 7:2469.
- Edgar (2004), “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res 32:1792. [Introduced PREFAB benchmarking DB]
- Edgar (2010), “Quality measures for protein alignment benchmarks,” Nucleic Acids Res 38:2145.
- Aniba et al. (2010), “Issues in bioinformatics benchmarking: The case study of multiple sequence alignment,” Nucleic Acids Res 38:7353.