Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.

Presentation transcript:

Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

What is T-Coffee?
Tree-based Consistency Objective Function for Alignment Evaluation
– Progressive Alignment
– Consistency

Progressive Alignment (Feng and Doolittle, 1988; Taylor, 1989): Clustering

Dynamic Programming Using A Substitution Matrix Progressive Alignment
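The dynamic programming step named on this slide can be illustrated with a minimal global (Needleman-Wunsch style) pairwise aligner. This is a sketch under stated assumptions, not the T-Coffee implementation: the function names, the toy match/mismatch scorer standing in for a substitution matrix, and the linear gap penalty are all illustrative choices.

```python
def toy_score(x, y):
    """Toy stand-in for a substitution matrix (assumption): +1 match, -1 mismatch."""
    return 1 if x == y else -1


def needleman_wunsch(a, b, score=toy_score, gap=-2):
    """Globally align strings a and b with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("FAT", "FAST")
```

In a progressive aligner, a step of this kind is applied repeatedly along the guide tree, first to sequences and then to the partial alignments already built.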

- Depends on the ORDER of the sequences (tree).
- Depends on the CHOICE of the sequences.
- Depends on the PARAMETERS: substitution matrix, penalties (Gop, Gep), sequence weights, tree-making algorithm.

Consistency? Consistency is an attempt to use alignment information at very early stages

T-Coffee and Consistency…
Prim. Weight = 88
  SeqA GARFIELD THE LAST FAT CAT
  SeqB GARFIELD THE FAST CAT ---
Prim. Weight = 77
  SeqA GARFIELD THE LAST FA-T CAT
  SeqC GARFIELD THE VERY FAST CAT
Prim. Weight = 100
  SeqA GARFIELD THE LAST FAT CAT
  SeqD          THE ---- FAT CAT
Prim. Weight = 100
  SeqB GARFIELD THE ---- FAST CAT
  SeqC GARFIELD THE VERY FAST CAT
Prim. Weight = 100
  SeqC GARFIELD THE VERY FAST CAT
  SeqD          THE ---- FA-T CAT

T-Coffee and Consistency…
Weight = 88
  SeqA GARFIELD THE LAST FAT CAT
  SeqB GARFIELD THE FAST CAT ---
Weight = 77
  SeqA GARFIELD THE LAST FA-T CAT
  SeqC GARFIELD THE VERY FAST CAT
  SeqB GARFIELD THE ---- FAST CAT
Weight = 100
  SeqA GARFIELD THE LAST FA-T CAT
  SeqD          THE ---- FA-T CAT
  SeqB GARFIELD THE ---- FAST CAT
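The weights on this slide can be combined into an extended weight for a single residue pair. The sketch below assumes the rule from the published T-Coffee description, which the slide itself does not spell out: an alignment of SeqA and SeqB through a third sequence contributes the minimum of the two primary weights involved, and the contributions are summed with the direct weight. The function name and the SeqB/SeqD primary weight of 100 are assumptions; the slide only shows the resulting triplet weight of 100.

```python
def extended_weight(direct, via):
    """Extended library weight for one residue pair.

    direct: primary weight of the direct pairwise alignment if it pairs
            these two residues, otherwise 0.
    via:    list of (weight_to_intermediate, weight_from_intermediate)
            tuples, one per third sequence that supports the pair.
    """
    return direct + sum(min(w1, w2) for w1, w2 in via)


# Pair "F of FAT in SeqA" with "F of FAST in SeqB":
# not supported by the direct A/B alignment (weight 88),
# supported through SeqC (77, 100) and through SeqD (100, 100; assumed).
consistent_pair = extended_weight(0, [(77, 100), (100, 100)])

# Pair "F of FAT in SeqA" with "C of CAT in SeqB":
# supported only by the direct A/B alignment.
misaligned_pair = extended_weight(88, [])
```

Under these assumptions the consistent pairing scores 177 against 88 for the pairing suggested by the direct alignment alone, which is how the extended library can override a misleading pairwise alignment.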

T-Coffee and Consistency…

Where Do The Primary Alignments Come From? Primary Alignments – Primary Library Source – Any valid Third Party Method

T-Coffee and Consistency…

Using the T-Coffee Multiple Sequence Alignment Package II – M-Coffee Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

What is the Best MSA Method?
More than 50 MSA methods
Some methods are fast and inaccurate
– Mafft, muscle, kalign
Some methods are slow and accurate
– T-Coffee, ProbCons
Some methods are slow and inaccurate…
– ClustalW

Why Not Combine Them?
All methods give different alignments
Their agreement is an indication of accuracy
t_coffee -method mafft_msa,muscle_msa

Combining Many MSAs into ONE
[Diagram labels: MUSCLE; MAFFT; ClustalW; ???????; T-Coffee]

Where to Trust Your Alignments
[Figure labels: Most Methods Agree; Most Methods Disagree]
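The idea that agreement between methods indicates accuracy can be made concrete by counting, for every pair of residues, how many of the input alignments place them in the same column. The sketch below is an illustration of that idea only, not M-Coffee's actual scoring; the function names and the toy alignments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column.

    msa: dict mapping sequence name -> gapped string (all the same length).
    A residue is identified as (sequence name, index ignoring gaps).
    """
    names = sorted(msa)
    length = len(msa[names[0]])
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_support(msas):
    """Fraction of the input alignments that support each residue pair."""
    counts = Counter()
    for msa in msas:
        counts.update(aligned_pairs(msa))
    return {pair: count / len(msas) for pair, count in counts.items()}


# Two toy alignments of the same sequences, as two methods might produce.
method_1 = {"A": "FA-T", "B": "FAST"}
method_2 = {"A": "FAT-", "B": "FAST"}
support = pair_support([method_1, method_2])
```

Pairs with support close to 1 are those where most methods agree; pairs with low support mark regions where the alignment deserves less trust.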

What To Do Without Structures

Using the T-Coffee Multiple Sequence Alignment Package III – Template Based Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Sometimes Sequences are Not Enough
Sequence-based alignments are limited in accuracy
– 30% for proteins
– 70% for DNA
It is hard to correctly align sequences whose similarity is below these values
– Twilight zone

One Solution: Template Based Alignment
Replace the sequence with something more informative
– PDB structure: Expresso
– Profile: PSI-Coffee
– RNA structure: R-Coffee

Template Based Multiple Sequence Alignments
[Diagram labels: Sources; Templates (Structure, Profile, …); Template Aligner; Template Alignment; Source Template Alignment; Remove Templates; Library]

Expresso: Finding the Right Structure
[Diagram labels: Sources; BLAST; Templates; SAP; Template Alignment; Source Template Alignment; Remove Templates; Library]

PSI-Coffee: Homology Extension
[Diagram labels: Sources; BLAST; Templates; Profile Aligner; Template Alignment; Source Template Alignment; Remove Templates; Library]

What is Homology Extension?
[Figure: L residues with an ambiguous match marked "?"]
- Simple scoring schemes result in alignment ambiguities

What is Homology Extension?
[Figure: the L residues shown with their homologs as profile columns containing L, I and V; labels: Profile 1, Profile 2]
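One way to see why profiles remove the ambiguity is to score whole profile columns instead of single residues. The sketch below is an assumption-laden illustration, not PSI-Coffee's scoring scheme: the average-over-all-pairs column score, the identity-based toy scorer, and the example columns are all hypothetical.

```python
from itertools import product


def identity_score(x, y):
    """Toy residue scorer (assumption): 1 for identical residues, else 0."""
    return 1 if x == y else 0


def column_score(col_a, col_b, score=identity_score):
    """Average substitution score over all residue pairs of two profile columns."""
    pairs = list(product(col_a, col_b))
    return sum(score(x, y) for x, y in pairs) / len(pairs)


# As single residues, an L matches either candidate L equally well.
single_1 = identity_score("L", "L")
single_2 = identity_score("L", "L")

# Once each residue is replaced by the column of its homologs, the two
# candidates are no longer equivalent (hypothetical columns).
query_column = "LLLL"
candidate_1 = "LLLL"   # conserved leucine
candidate_2 = "IVIL"   # variable hydrophobic position
profile_1 = column_score(query_column, candidate_1)
profile_2 = column_score(query_column, candidate_2)
```

With these toy columns the two single-residue scores are identical, while the profile scores differ, so the aligner has a basis for choosing between the candidates.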

Method      Type         Template  Score  Comment
ClustalW-2  Progressive  NO        22.74
PRANK       Gap          NO        26.18  Science 2008
MAFFT       Iterative    NO        26.18
Muscle      Iterative    NO        31.37
ProbCons    Consistency  NO        40.80
ProbCons    MonoPhasic   NO        37.53
T-Coffee    Consistency  NO        42.30
M-Coffee4   Consistency  NO        43.60
PSI-Coffee  Consistency  Profile   53.71
PROMAL      Consistency  Profile   55.08
PROMAL-3D   Consistency  PDB
3D-Coffee   Consistency  PDB       61.00  Expresso
Score: fraction of correct columns when compared with a structure-based reference (BB11 of BaliBase).

[Diagram labels: Experimental Data …; TARGET; Templates; Template Aligner; Template-Sequence Alignment; Template Alignment; Primary Library; Template based Alignment of the Sequences]

Using the T-Coffee Multiple Sequence Alignment Package IV – RNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

ncRNAs Comparison
And ENCODE said…
“nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who are they?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How many of them?
– Open question
– … is a common guess
– Harder to detect than proteins

ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
** *--**---*-**------**
[Figure: hairpin drawings of the two sequences, with stem strands GAACGGACC / CTTGCCTGG and CTTGCCTCC / GAACGGAGG]

The Holy Grail of RNA Comparison: Sankoff's Algorithm

The Holy Grail of RNA Comparison: Sankoff's Algorithm
Simultaneous Folding and Alignment
– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))
In practice, for two sequences:
– 50 nucleotides: 1 min, 6 M
– 100 nucleotides: 16 min, 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T
Forget about
– Multiple sequence alignments
– Database searches
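The running times quoted on this slide follow directly from the time bound as stated there: for n = 2 sequences the exponent is 2n = 4, so doubling the length multiplies the time by 2^4 = 16. The short sketch below projects the timings from the 50-nucleotide baseline; the function name and default arguments are illustrative, and the exponent is taken from the slide as written.

```python
def projected_minutes(length, base_length=50, base_minutes=1.0, n_sequences=2):
    """Scale the baseline running time assuming time grows as L^(2n)."""
    exponent = 2 * n_sequences
    return base_minutes * (length / base_length) ** exponent


for length in (50, 100, 200, 400):
    minutes = projected_minutes(length)
    print(f"{length} nt: {minutes:.0f} min ({minutes / 60:.1f} h)")
```

The projection gives 16 minutes at 100 nucleotides, 256 minutes (about 4.3 hours) at 200, and 4096 minutes (about 2.8 days) at 400, in line with the "16 min", "4 hours" and "3 days" on the slide.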

R-Coffee
[Diagram labels: RNA Sequences; RNAplfold; Secondary Structures; Consan or Mafft / Muscle / ProbCons; Primary Library; R-Coffee Extension; R-Coffee Extended Primary Library; R-Score; Progressive Alignment Using The R-Score]

R-Coffee Extension
[Figure labels: TC Library; G G, Score X; C C, Score Y]
Goal: embedding RNA structures within the T-Coffee libraries.
The R-extension can be added on top of any existing method.
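One plausible reading of this figure is that when two aligned residues are each predicted to be base-paired, the library weight of their match is also credited to the match between their pairing partners. The sketch below illustrates that reading only; it is an assumption, not the R-Coffee source code, and the function name, the data layout, and the example weights are hypothetical.

```python
def r_extend(library, pairs_a, pairs_b):
    """Propagate library weights to base-pair partners.

    library: dict mapping (i, j) -> weight, where i indexes sequence A
             and j indexes sequence B.
    pairs_a, pairs_b: dicts mapping a residue index to the index of its
             predicted base-pairing partner in the same sequence.
    """
    extended = dict(library)
    for (i, j), weight in library.items():
        if i in pairs_a and j in pairs_b:
            partner_match = (pairs_a[i], pairs_b[j])
            extended[partner_match] = extended.get(partner_match, 0) + weight
    return extended


# Hypothetical example: residue 0 pairs with residue 8 in sequence A,
# and residue 0 pairs with residue 7 in sequence B.
pairs_a = {0: 8, 8: 0}
pairs_b = {0: 7, 7: 0}
library = {(0, 0): 10, (8, 7): 5}
extended = r_extend(library, pairs_a, pairs_b)
```

Because the function only reads and rewrites a generic pair-to-weight library, this kind of extension can sit on top of a library produced by any aligner, which matches the slide's closing remark.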

R-Coffee + Regular Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Poa, Pcma, Prrn, ClustalW, Mafft_fftnts, ProbConsRNA, Muscle, Mafft_ginsi
Improvement = # R-Coffee wins - # R-Coffee losses

RM-Coffee + Regular Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Poa, Pcma, Prrn, ClustalW, Mafft_fftnts, ProbConsRNA, Muscle, Mafft_ginsi
RM-Coffee4: 0.71 / 0.74 / 84

R-Coffee + Structural Aligners
Columns: Method; Avg Braliscore (direct, +T, +R); Net Improv. (+T, +R)
Methods: Stemloc, Mlocarna, Murlet, Pmcomp, T-Lara, Foldalign, Dyalign, Consan
RM-Coffee4: 0.71 / 0.74 / 84

Using the T-Coffee Multiple Sequence Alignment Package V – DNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Aligning Genomic DNA
Main problem
– Telling a good alignment from a bad one
Strategy:
– Tuning on Orthologous Promoter Detection
– Evaluation on ChIP-Seq Data

Aligning Genomic DNA
– Tuning of gap penalties
– Design of a dinucleotide substitution matrix

Aligning Genomic DNA

– gDNA is very heterogeneous
– Each genomic feature requires its own aligner
– Aligning non-orthologous regions with a global aligner is impossible
– Pro-Coffee is designed to align orthologous promoter regions

Using the T-Coffee Multiple Sequence Alignment Package VI – Wrap Up Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Which Flavor?
Fast Alignments
– M-Coffee with Fast Aligners: mafft, muscle, kalign
Difficult Protein Alignments
– Expresso
– PSI-Coffee
RNA Alignments
– R-Coffee
Promoter Alignments
– Pro-Coffee
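The choice of flavor can be written down as a small lookup that builds a command line. This is a sketch only: the slides do not give the command-line options for each flavor, so the `-mode` names below are assumptions that should be checked against the documentation of the installed T-Coffee version, and the task labels and function name are hypothetical.

```python
# Assumed "-mode" names; verify them against the installed t_coffee.
FLAVORS = {
    "fast": "fmcoffee",                 # M-Coffee with fast aligners
    "protein_with_structures": "expresso",
    "protein_with_profiles": "psicoffee",
    "rna": "rcoffee",
    "promoter": "procoffee",
}


def build_command(task, fasta_path):
    """Return the argument list for running t_coffee on one FASTA file."""
    if task not in FLAVORS:
        raise ValueError(f"unknown task: {task!r}")
    return ["t_coffee", fasta_path, "-mode", FLAVORS[task]]


command = build_command("rna", "sequences.fasta")
```

The resulting list can be passed to `subprocess.run` once the mode names have been confirmed for the local installation.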