Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.


Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Which Tool for Which Sequence ?

In-SilVo Biology
– In Silico Biology: making sense of digital data
– In Vivo Biology: recording data in a living cell
– In SilVo Biology: connect In-Vitro and In-Vivo
– In-Vivo: high-throughput recording
– In-Silico: high-throughput analysis

Is it Possible to Compare all Types of Sequences ?
Non Transcribed World
– Genes/Full Genomes: Lagan, TBA
– Promoter Regions: Meta-Aligner, Motif Finders
– Nucleosome: ???
Multiple Genome Aligners
– Not Very Accurate
– Very Fast
– Deal with rearrangements

Multiple Genome Alignments and re-sequencing
Before
– Re-sequence Human Genomes
– Map the Reads onto the reference genome
Now
– Re-sequence
– Assemble
– Align
– Non trivial with very large datasets

Is it Possible to Compare all Types of Sequences ?
RNA Comparison
– Less Accurate than Proteins
– Secondary Structures
ncRNA World
– Sankoff: Time $O(L^{2n})$, Space $O(L^{3n})$
– Consan
– R-Coffee

Is it Possible to Compare all Types of Sequences ?
Protein Comparisons
– Very Accurate
– 3D-Structure Improves it
Protein Aligners
– ClustalW
– T-Coffee
– 3D-Coffee

What Changes with 1000 Genomes?

Phylogeny Vs Function
Function
– Low Level => Biochemistry => Protein Domains
– High Level => Metabolic Pathway => Orthology
Orthology
– Phylogenetic Analysis
– Phylogenetic Analysis => Accurate Alignments

Using The Tree
Correct Tree => Correct Orthologous Assignment => Correct Functional Prediction

The Alignment that Hides The Forest…

Why So Much interest for MSA methods???

Phylogenetic Trees and Multiple Sequence Alignments

Positive Selection and MSAs

Multiple Genome Alignments

Genomic Era: The Goal
– Sequences: interspecies
– 1 Billion: Re-sequencing
Incorporation of ALL experimental Data
– Structure, Genomic, ChIP-chip, ChIP-Seq…
Alignments suitable for all applications of comparative genomics
– Homology Modeling (function)
– Functional Analysis
– Phylogenetic Reconstruction
– 3D-Modelling
Accurate Alignments for ALL kinds of data
– Non Transcribed DNA
– Transcribed DNA
– Translated DNA

Genomic Era Challenges
Accuracy
– Proteins: 30% is the limit
– DNA/RNA: 70% is the limit
Scale
– With too many sequences, algorithms lose accuracy
Data Integration
– Structure
– Homology
– Genomic Structure
– Function
– Proteomics
Methods
– Wealth of alternative methods
– Poorly Characterized

Consistency and Data Integration
Most methods rely on the progressive algorithm.
Consistency based methods have been designed as an extension.
Consistency based alignment methods have been designed to:
– Better extract the signal contained in the data
– Integrate/Confront existing methods
– Integrate/Confront heterogeneous types of Information

The Progressive Alignment Algorithm

T-Coffee and Consistency…
Sequences:
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
Multiple alignment:
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD          THE ---- FA-T CAT

T-Coffee and Consistency…
Primary library (pairwise alignments and their primary weights):
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---        Prim. Weight = 88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT       Prim. Weight = 77
SeqA GARFIELD THE LAST FAT CAT
SeqD          THE ---- FAT CAT        Prim. Weight = 100
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT       Prim. Weight = 100
SeqC GARFIELD THE VERY FAST CAT
SeqD          THE ---- FA-T CAT       Prim. Weight = 100

T-Coffee and Consistency…
Library extension for the pair SeqA/SeqB:
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---        Weight = 88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT       Weight = 77
SeqA GARFIELD THE LAST FA-T CAT
SeqD          THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT       Weight = 100
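The weights on the extension slide can be reproduced with a few lines of Python. This is only a sketch working at the level of whole pairwise alignments, as the slide does; it assumes that the weight of an indirect path through a third sequence is the minimum of the two primary weights along that path. The SeqB/SeqD primary weight is not shown on the slides and is set to 100 here as an assumption. The names `PRIMARY`, `weight` and `triplet_weights` are illustrative and are not part of T-Coffee.

```python
# Primary weights copied from the slide. The SeqB/SeqD entry is NOT on the
# slide: 100 is an assumption made so the path through SeqD can be evaluated.
PRIMARY = {
    frozenset(("SeqA", "SeqB")): 88,
    frozenset(("SeqA", "SeqC")): 77,
    frozenset(("SeqA", "SeqD")): 100,
    frozenset(("SeqB", "SeqC")): 100,
    frozenset(("SeqC", "SeqD")): 100,
    frozenset(("SeqB", "SeqD")): 100,  # assumed, not shown on the slide
}

SEQUENCES = ["SeqA", "SeqB", "SeqC", "SeqD"]


def weight(x, y):
    """Primary weight of the pairwise alignment between x and y."""
    return PRIMARY[frozenset((x, y))]


def triplet_weights(a, b, sequences):
    """Weight of aligning a and b through each intermediate sequence c.

    Each indirect path a-c-b is scored with the weaker of its two links.
    """
    return {
        c: min(weight(a, c), weight(c, b))
        for c in sequences
        if c not in (a, b)
    }


direct = weight("SeqA", "SeqB")
via = triplet_weights("SeqA", "SeqB", SEQUENCES)
indirect_support = sum(via.values())

print("direct SeqA/SeqB weight:", direct)
print("weights through intermediates:", via)
print("combined indirect support:", indirect_support)
```

On the slide both indirect paths propose the same alternative SeqA/SeqB alignment (the one that places FAST opposite FA-T), so under these assumptions that alternative collects more support than the direct alignment.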

T-Coffee and Consistency…

Methods Data Scalability

A Brief History of Consistency A Long Chain of Small Contributions…

Consistency Based Algorithms
Gotoh (1990)
– Iterative strategy using consistency
Martin Vingron (1991)
– Dot Matrices Multiplications
– Accurate but too stringent
Dialign (1996, Morgenstern)
– Consistency
– Agglomerative Assembly
T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm
ProbCons (2004, Do)
– T-Coffee with a Bayesian Treatment
– Biphasic Gap Penalty
AMAP (Schwarz, 2007)
– ProbCons Consistency
– Replace Progressive alignment with simulated Annealing
– Hard to distinguish from ProbCons
FSA (Pachter, 2009)
– AMAP with automated parameter estimation
– Hard to distinguish from ProbCons

Choosing the right modeling method M-Coffee

Combining Many MSAs into ONE MUSCLE MAFFT ClustalW ??????? T-Coffee

Consistency and Accuracy

Integrating New Types of Data Template Based Sequence Alignments

[Diagram labels: Experimental Data; TARGET; Templates; Template Aligner; Template Alignment; Template-Sequence Alignment; Template based Alignment of the Sequences; Primary Library]

Exploring The Template World
Template           Generator      Alignment Method             Mode
RNA Structure      Prediction     RNA Aligner                  R-Coffee
Protein Structure  BLAST vs PDB   3D Aligner                   3D-Coffee
Profile            BLAST vs NR    Profile/Profile Alignment    PSI-Coffee
Gene Structure     ENSEMBL        Genome Aligner               Exoset
Promoter           Transfac       Meta-Aligner                 Meta-Coffee

3D-Coffee/Expresso Incorporating Structural Information

Expresso: Finding the Right Structure
[Diagram labels: Sources; BLAST; Templates; SAP; Template Alignment; Source Template Alignment; Remove Templates; Library]

PSI-Coffee Homology Extension

Exploring The Template World

What is Homology Extension ?
[Figure: a single L residue that could be aligned with either of two L residues]
– Simple scoring schemes result in alignment ambiguities

What is Homology Extension ?
[Figure: the same residues replaced by profile columns (mostly L, with I and V), labelled Profile 1 and Profile 2]

PSI-Coffee: Homology Extension
[Diagram labels: Sources; BLAST; Templates; Profile Aligner; Template Alignment; Source Template Alignment; Remove Templates; Library]

Benchmarks

Method                    Template  Score  Comment
ClustalW-2   Progressive  NO        22.74
PRANK        Gap          NO        26.18  Science 2008
MAFFT        Iterative    NO        26.18
Muscle       Iterative    NO        31.37
ProbCons     Consistency  NO        40.80
ProbCons     MonoPhasic   NO        37.53
T-Coffee     Consistency  NO        42.30
M-Coffee4    Consistency  NO        43.60
PSI-Coffee   Consistency  Profile   53.71
PROMALS      Consistency  Profile   55.08
PROMALS-3D   Consistency  PDB
3D-Coffee    Consistency  PDB       61.00  Expresso
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).

BB11 benchmark table, highlighted rows: Consistency

BB11 benchmark table, highlighted rows: Homology Extension

BB11 benchmark table, highlighted rows: Structural Extension

T-Coffee and The World
BLAST / SOAP
– Some Templates are obtained with a BLAST
– Queries can be sent to the EBI or the NCBI
– No Need for a Local BLAST installation
[Diagram label: Users sequences]

Incorporating RNA Information Within the T-Coffee Algorithm

ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
** *--**---*-**------**
[Figure: stem-loop drawings of the two sequences]

ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
** *--**---*-**------**
CC--AGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAAC--GGAGG
** * *** * * *** **
Sequence Alignment (Maximizing Identity)
– Incorrect
– Not Predictive

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

R-Coffee Extension
[Figure: TC Library; an aligned C/C pair with Score X and its base-paired G/G partners with Score Y]
Goal: Embedding RNA Structures Within The T-Coffee Libraries
The R-extension can be added on top of any existing method.

R-Coffee + Structural Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Stemloc, Mlocarna, Murlet, Pmcomp, T-Lara, Foldalign, Dynalign, Consan
Improvement = # R-Coffee wins - # R-Coffee losses, over 170 test sets

R-Coffee + Regular Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Poa, Pcma, Prrn, ClustalW, Mafft_fftns, ProbConsRNA, Muscle, Mafft_ginsi
Improvement = # R-Coffee wins - # R-Coffee losses, over 388 test sets

Genomic Era Challenges Conclusion Template Based Alignments Meta-Methods M-Coffee Homology Extension (Proteins) R-Coffee Scaled Consistency

Open Questions Accurately Aligning non transcribed DNA Coping with One Billion Human Genomes

Comparative Bioinformatics
University College Dublin – Des Higgins – Orla O’Sullivan – Iain Wallace (UCD, IE)
Berlin Free University – Knut Reinert – Tobias Rausch
Swiss Institute of Bioinformatics – Ioannis Xenarios – Sebastien Moretti
Comparative Bioinformatics – Merixell Oliva – Giovanni Bussoti – Carsten Kemena – Emanuele Rainieri – Ionas Erb – Jia Ming Chang – Matthias Zytneki

Why So Much Interest For Multiple Alignments ? Extrapolation Motifs/Patterns Phylogeny Profiles Structure Prediction SNP Analysis Reactivity Analysis Regulatory Elements

Phylogeny Vs Function: Applications Comparative Genomics => New Medium New Medium => New Clinical Test

Detecting ncRNAs in silico: a long way to go… RNAse P

Obtaining the Structure of a ncRNA is difficult Hard to Align The Sequences Without the Structure Hard to Predict the Structures Without an Alignment

R-Coffee: Modifying T-Coffee at the Right Place Incorporation of Secondary Structure information within the Library Two Extra Components for the T-Coffee Scoring Scheme – A new Library – A new Scoring Scheme

G-INS-i, H-INS-i and F-INS-i use pairwise alignment information when constructing a multiple alignment. The two options ([HF]-INS-i) incorporate local alignment information and do NOT USE FFT.

Molecular Biology Within the System Biology Era Protein A Interacts with Regulates Inhibits Protein B

Molecular Biology In the 1000 Genomes Era Protein A Interacts with Regulates Inhibits Protein B Variation Within Species: CNVs of A and SNPs Conservation Across Species

Systems Biology vs Comparative Genomics
Systems Biology => Systems can be Understood
Comparative Genomics => Systems can Evolve through Selection

Phylogeny Vs Function: Applications
– Important => Application
– Possible => Many New Genomes
– Challenging => Too Many New Genomes

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Comparing Methods MAFFT

Some Benchmark: BB11 BaliBase
– BB11: 38 highly divergent (less than 25% identity) datasets from BaliBase
– BB11 predicts 78% of the results measured on other datasets (Blackshields, Higgins)

PhD Fellowships

What ‘s in a Multiple Sequence Alignment Evolution Inertia Common Ancestry Shows up In the sequences Selection Important Features Are Preserved Functional Constraint Same Function Same Sequence Convergence Phylogenetic Footprint, Evolutionary Trace …


What’s in a Multiple Alignment ?
The MSA contains what you put inside:
– Structural Similarity
– Evolutionary Similarity
– Sequence Similarity
You can view your MSA as:
– A record of evolution
– A summary of a protein family
– A collection of experiments made for you by Nature…

Producing The Right Alignment
Multiple Sequence Alignments Influence Phylogenetic Trees
Choice of Method is not Neutral
– Different Methods
– Different Alignments
– Different Trees
Using the Right Models ensures Producing the Right Tree

Model Based Alignments vs Naïve Alignments
Naïve Alignment
– Lexicographic Alignment
– Maximizing the number of identities
– At best using a substitution matrix
Model Based Alignments
– Using a model
– Protein structure information
– RNA Structure information
– Combining/Confronting Modeling methods
Template based Alignments
– Model based Alignments through the use of Templates

T-Coffee and Model Based Alignments T-Coffee Algorithm Expresso: Aligning Protein Structures R-Coffee: Aligning RNA structures M-Coffee: Combining methods

T-Coffee and Consistency…

When Sequences Are not Enough 3D-Coffee and Expresso

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Where to Trust Your Alignments Most Methods Agree Most Methods Disagree

Conclusion
– Model Based Alignments Give the best Accuracy
– Template based alignment is a very efficient way to turn Naïve aligners into model based aligners
– Sequence Alignments are not necessarily reliable over their entire lengths

Mangel M., Samaniego F.J., Abraham Wald’s Work on Aircraft Survivability, J. American Statistical Association 79 (1984)

Building and Using Models Angstrom

Computing the Correct Alignment is a Complicated Problem

Stochastic Optimization

Exploration of Complex Optimization Problems With Multiple Constraints
– Genomic Alignments
– RNA Alignments
Generation of Population of Suboptimal Solutions
– Quality = f(optimality)
Specification of Consistency Objective Function of T-Coffee

Three Types of Algorithms
– Progressive: ClustalW
– Iterative: Muscle
– Consistency Based: T-Coffee and ProbCons

T-Coffee and Consistency…
– Each Library Line is a Soft Constraint (a wish)
– You can’t satisfy them all
– You must satisfy as many as possible (The easy ones)

Consistency Based Algorithms: T-Coffee
Gotoh (1990)
– Iterative strategy using consistency
Martin Vingron (1991)
– Dot Matrices Multiplications
– Accurate but too stringent
Dialign (1996, Morgenstern)
– Consistency
– Agglomerative Assembly
T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm
ProbCons (2004, Do)
– T-Coffee with a Bayesian Treatment

How Good Is My Method ?

Structures Vs Sequences

T-Coffee Results Validation Using BaliBase

Too Many Methods for ONE Alignment M-Coffee

Estimating the Accuracy of your MSA

What To Do Without Structures

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Expresso: Finding the Right Structure Why Not Using Structure Based Alignments

Template Based Multiple Sequence Alignments

[Diagram labels: Sources; Templates (Structure, Profile, …); Template Aligner; Template Alignment; Source Template Alignment; Remove Templates; Library]

Method       Score        Templates    Prefab  Homstrad
ClustalW     Matrix
Kalign       Matrix
MUSCLE       Matrix
T-Coffee     Consistency
ProbCons     Consistency
Mafft        Consistency
M-Coffee     Consistency
MUMMALS      Consistency
Clustal-db   Matrix       Profiles
PRALINE      Matrix       Profiles
PROMALS      Consistency  Profiles
SPEM         Matrix       Profiles
EXPRESSO     Consistency  Structures           *
T-Lara       Consistency  Structures
Table 1. Summary of all the methods described in the review. Validation figures were compiled from several sources and selected for their compatibility. Prefab refers to validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets having less than 30% identity. The source of each figure is indicated by a reference. *The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.

Improving The Evaluation

How Do We Perform In The Twilight Zone?
– Consistency Based Methods Have an Edge
– Hard to tell Methods Apart
– Sequence Alignment is NOT solved

More Than Structure based Alignments
Structural Correctness Is Only the Easy Side of the Coin.
In practice MSAs are intermediate models used to generate other models:
Data       Model Type    Benchmark
Homology   Profile       Yes
Evolution  Trees         No
Structure  3D-Structure  CASP
Function   Annotation    No

Conclusion
Template based Multiple Sequence Alignments
– Projecting any relevant information onto the sequences
– Using this Information
Need for new evaluation procedures
– Functional Analysis
– Phylogenetic Analysis
– Homology Search (Profiles)
– Homology Modelling
Integrating data => Making sure your bits of data can fight with one another

Turning Data into Models
Data
– Columbus considered that the landmass occupied 225°, leaving only 135° of water (Marinus of Tyre, 70 AD).
– Columbus believed that 1° represented only 56 miles (Alfraganus, IXth century).
– He knew there was an island named Japan off the coast of China…
Model
– Circumference of the Earth: 25,255 km at most
– Canary Islands to Japan: 3,700 km (Reality: 12,000 km)

The More Structures The Merrier Average Improvement over T-Coffee Struc/Seq Ratio

The Right Mix of Methods

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Applications

Looking-Up The DNA Behind The Sequences: PROTOGENE

SAR Analysis
– Correlate Alignment Variations with Reactivity
– Application to the Human Kinome
– Collaboration with Sanofi-Aventis
Main Issue:
– Training problem => Proper Benchmarking

ncRNA Multiple Alignments with R-Coffee Laundering the Genome Dark Matter Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

No Plane Today…

ncRNAs Comparison
And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of them
– Open question
– is a common guess
– Harder to detect than proteins

ncRNAs can have different sequences and Similar Structures

ncRNAs are Difficult to Align
– Same Structure => Low Sequence Identity
– Small Alphabet, Short Sequences => Alignments often Non-Significant

Obtaining the Structure of a ncRNA is difficult Hard to Align The Sequences Without the Structure Hard to Predict the Structures Without an Alignment

The Holy Grail of RNA Comparison: Sankoff’s Algorithm
Simultaneous Folding and Alignment
– Time Complexity: $O(L^{2n})$
– Space Complexity: $O(L^{3n})$
In Practice, for Two Sequences:
– 50 nucleotides: 1 min, 6 M
– 100 nucleotides: 16 min, 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T
Forget about
– Multiple sequence alignments
– Database searches
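The practical timings quoted for two sequences can be related to the time exponent with a little arithmetic. This is a minimal sketch: it takes the time exponent 2n quoted on the slide at face value and uses the 1-minute run at 50 nucleotides as the baseline. The function `extrapolate` and the constant names are illustrative only.

```python
N_SEQUENCES = 2
TIME_EXPONENT = 2 * N_SEQUENCES  # exponent as quoted on the slide, n = 2

BASE_LENGTH = 50    # nucleotides
BASE_MINUTES = 1.0  # running time quoted for the baseline length


def extrapolate(base_value, base_length, length, exponent):
    """Scale a cost measured at base_length to another length, assuming the
    cost grows as length ** exponent."""
    return base_value * (length / base_length) ** exponent


minutes = {
    length: extrapolate(BASE_MINUTES, BASE_LENGTH, length, TIME_EXPONENT)
    for length in (50, 100, 200, 400)
}

for length, value in minutes.items():
    print(f"{length:4d} nt: {value:7.0f} min = {value / 60:6.1f} h = {value / 1440:5.2f} days")
```

Under these assumptions each doubling of the length multiplies the running time by 16, giving 16 minutes at 100 nucleotides, a little over 4 hours at 200 and a little under 3 days at 400, in line with the rounded figures on the slide.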

The Next Best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
– Tree-shaped HMMs
– Made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of Banded DP

Going Multiple…. Structural Aligners

Game Rules Using Structural Predictions – Produces better alignments – Is Computationally expensive Use as much structural information as possible while doing as little computation as possible…

Adapting T-Coffee To RNA Alignments

T-Coffee and Consistency…

Consistency: Conflicts and Information
[Diagram: pairwise alignments among sequences W, X, Y and Z; labels "Y is unhappy" and "X is unhappy"]
Partly Consistent => Less Reliable
Fully Consistent => More Reliable

[Diagram labels: RNA Sequences; RNAplfold; Secondary Structures; Consan or Mafft / Muscle / ProbCons; Primary Library; R-Coffee Extension; R-Coffee Extended Primary Library; R-Score; Progressive Alignment Using The R-Score]

R-Coffee Scoring Scheme
[Figure: an aligned C/C pair whose base-pairing partners form an aligned G/G pair]
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
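The R-Score rule above can be written down in a few lines. This is only a sketch of the formula on the slide, not R-Coffee's code: `tc_score` is a hypothetical dictionary of T-Coffee scores indexed by residue positions in two sequences, and `pairs_x` / `pairs_y` are hypothetical maps from a residue to its predicted base-pairing partner. Falling back to the plain TC-Score when either residue is unpaired is an assumption.

```python
def r_score(tc_score, pairs_x, pairs_y, i, j):
    """R-Score of residue i (sequence x) aligned with residue j (sequence y).

    Returns the larger of the TC-Score of the pair itself and the TC-Score
    of the pair formed by the two base-pairing partners.
    """
    direct = tc_score.get((i, j), 0)
    partner_i = pairs_x.get(i)
    partner_j = pairs_y.get(j)
    if partner_i is None or partner_j is None:
        # Assumption: without a partner on both sides, keep the plain score.
        return direct
    return max(direct, tc_score.get((partner_i, partner_j), 0))


# Toy example: the C/C pair (0, 0) scores poorly, its G/G partners score well.
tc_score = {(0, 0): 10, (5, 6): 40, (2, 2): 25}
pairs_x = {0: 5, 5: 0}  # residue 0 of x pairs with residue 5 of x
pairs_y = {0: 6, 6: 0}  # residue 0 of y pairs with residue 6 of y

cc = r_score(tc_score, pairs_x, pairs_y, 0, 0)
gg = r_score(tc_score, pairs_x, pairs_y, 5, 6)
unpaired = r_score(tc_score, pairs_x, pairs_y, 2, 2)
print(cc, gg, unpaired)
```

In this toy case the weakly supported C/C pair inherits the score of its well-supported G/G partners, which is how structural information ends up in the library.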

Validating R-Coffee

RNA Alignments are harder to validate than Protein Alignments
Protein Alignments => Use of Structure based Reference Alignments
RNA Alignments => No Real structure based reference alignments
– The structures are mostly predicted from sequences
– Circularity

BraliBase and the BraliScore
Database of Reference Alignments
– 388 multiple sequence alignments
– Evenly distributed between 35 and 95 percent average sequence identity
– Each contains 5 sequences selected from the RNA family database Rfam
– The reference alignment is based on an SCFG model built on the full Rfam seed dataset (~100 sequences)

BraliBase: SPS Score
Reference: Rfam MSA
SPS = Number of Identically Aligned Pairs / Number of Aligned Pairs
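The SPS ratio can be computed directly from two alignments of the same sequences. This is a minimal sketch, assuming alignments are given as lists of equal-length gapped strings with `-` as the gap character, and that the denominator counts the residue pairs aligned in the reference. The function names are illustrative.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified as (sequence index, residue index), so the
    result does not depend on where the gaps sit.
    """
    pairs = set()
    counters = [0] * len(msa)
    for column in zip(*msa):
        present = []
        for seq_index, char in enumerate(column):
            if char != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference):
    """Fraction of the reference's aligned pairs reproduced by the test."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["AC-GU", "ACAGU"]
test = ["ACG-U", "ACAGU"]
score = sps(test, reference)
print(score)
```

In the toy example the reference aligns four residue pairs and the test alignment reproduces three of them.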

BraliBase: SCI Score
RNApfold on each sequence: (((…)))…((..)) => ΔG Seq1, ΔG Seq2, ΔG Seq3, ΔG Seq4, ΔG Seq5, ΔG Seq6
RNAalifold on the alignment: (((…)))…((..)) => ΔG ALN, with a covariance term (Cov)
SCI = ΔG ALN / Average ΔG SeqX

BRaliScore Braliscore= SCI*SPS
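The two scores combine as a simple product. The sketch below assumes that the SCI is the consensus folding energy of the alignment divided by the mean folding energy of the individual sequences, and that the energies have already been obtained from a folding program. The numbers are made up for illustration and the function names are not taken from BraliBase.

```python
def sci(alignment_energy, sequence_energies):
    """Structure conservation index: consensus energy over mean single energy."""
    mean_single = sum(sequence_energies) / len(sequence_energies)
    return alignment_energy / mean_single


def braliscore(sci_value, sps_value):
    """Braliscore as defined on the slide: SCI * SPS."""
    return sci_value * sps_value


# Illustrative energies in kcal/mol (made-up values, not benchmark data).
example_sci = sci(-9.0, [-10.0, -8.0, -12.0])
example_score = braliscore(example_sci, 0.75)
print(example_sci, example_score)
```

An SCI close to 1 means the alignment folds about as well as the sequences do on their own, so the product rewards alignments that are both structurally coherent and close to the reference.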

RM-Coffee + Regular Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Poa, Pcma, Prrn, ClustalW, Mafft_fftns, ProbConsRNA, Muscle, Mafft_ginsi
RM-Coffee4: 0.71 / 0.74 / 84

How Best is the Best….
Method        vs. R-Coffee-Consan   vs. RM-Coffee4
M-Locarna     234 ***               183 **
Stral         169 ***                62
FoldalignM
Murlet        130 *                 -12
Rnasampler    129 *                 -27
T-Lara        125 *                 -30
Poa           241 ***               217 ***
T-Coffee      241 ***               199 ***
Prrn          232 ***               198 ***
Pcma          218 ***               151 ***
Proalign      216 ***               150 **
Mafft fftns   206 ***               148 *
ClustalW      203 ***               136 ***
Probcons      192 ***               128 *
Mafft ginsi   170 ***               115
Muscle        169 ***               111

Range of Performances Effect of Compensated Mutations

Conclusion/Future Directions T-Coffee/Consan is currently the best MSA protocol for ncRNAs Testing how important is the accuracy of the secondary structure prediction Going deeper into Sankoff’s territory: predicting and aligning simultaneously

Credits and Web Servers Andreas Wilm Des Higgins Sebastien Moretti Ioannis Xenarios Cedric Notredame CGR, SIB, UCD

Prank Vs Prank

Gop=0 Gep=0

Prank Vs Prank The reconstruction of evolutionary homology -- including the correct placement of insertion and deletion events -- is only feasible for rather closely-related sequences. PRANK is not meant for the alignment of very diverged protein sequences. If sequences are very different, the correct homology cannot be reconstructed with confidence and -srv/prank/

Do Benchmarks All Tell the same story? Based on

ProbCons: Different Primary Library
$\mathrm{Score} = \sum \big(\min(xz, zk)\big) / \max(xz, zk)$
$\mathrm{Score}(x_i \sim y_j \mid x, y, z) \propto \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$
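The second formula is a sum over the positions of the intermediate sequence z, which amounts to a matrix product of two posterior match matrices. This is a minimal sketch for a single intermediate sequence, using plain nested lists; the matrices and the function name `consistency_transform` are illustrative and are not taken from ProbCons.

```python
def consistency_transform(p_xz, p_zy):
    """Combine match probabilities through an intermediate sequence z.

    p_xz[i][k] is P(x_i ~ z_k | x, z) and p_zy[k][j] is P(z_k ~ y_j | z, y).
    The result [i][j] is the sum over k of their products, i.e. the quantity
    the score is proportional to.
    """
    n_k = len(p_zy)
    n_j = len(p_zy[0])
    return [
        [sum(row[k] * p_zy[k][j] for k in range(n_k)) for j in range(n_j)]
        for row in p_xz
    ]


# Toy posteriors for sequences of length 2 (made-up values).
p_xz = [[0.9, 0.1],
        [0.0, 0.8]]
p_zy = [[1.0, 0.0],
        [0.0, 0.5]]

p_xy = consistency_transform(p_xz, p_zy)
print(p_xy)
```

The pair (x_0, y_0) ends up strongly supported because both of its links through z_0 are strong, while (x_0, y_1) receives little support.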