Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
What is T-Coffee ? Tree Based Consistency based Objective Function for Alignment Evaluation – Progressive Alignment – Consistency
Progressive Alignment Feng and Dolittle, 1988; Taylor 1989 Clustering
Dynamic Programming Using A Substitution Matrix Progressive Alignment
-Depends on the ORDER of the sequences (Tree). -Depends on the CHOICE of the sequences. -Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.
Consistency? Consistency is an attempt to use alignment information at very early stages
T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100 SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100 SeqD THE ---- FA-T CAT
T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100 SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100 SeqD THE ---- FA-T CAT SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT
T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT
T-Coffee and Concistency…
Where Do The Primary Alignments Come From? Primary Alignments – Primary Library Source – Any valid Third Party Method
T-Coffee and Concistency…
Using the T-Coffee Multiple Sequence Alignment Package II – M-Coffee Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
What is the Best MSA method ? More than 50 MSA methods Some methods are fast and inacurate – Mafft, muscle, kalign Some methods are slow and accurate – T-Coffee, ProbCons Some Methods are slow and inacurate… – ClustalW
Why Not Combining Them ? All Methods give different alignments Their Agreement is an indication of accuracy t_coffee –method mafft_msa, muscle_msa
Combining Many MSAs into ONE MUSCLE MAFFT ClustalW ??????? T-Coffee
Where to Trust Your Alignments Most Methods Agree Most Methods Disagree
What To Do Without Structures
Using the T-Coffee Multiple Sequence Alignment Package III – Template Based Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
Sometimes Sequences are Not Enough Sequence based alignments are limited in accuracy – 30% for proteins – 70% for DNA It is hard to align correctly sequences whose similarity is below these values – Twilight zone
One Solution: Template Based Alignment Replace the sequence with something more informative – PDB Structure Expresso – Profile PSI-Coffee – RNA-StructureR-Coffee
Template Based Multiple Sequence Alignments -Structure -Profile -… Sources Templates Library Template Aligner Template Alignment Source Template Alignment Remove Templates Templates -Structure -Profile -…
Expresso: Finding the Right Structure Sources Templates Library BLAST SAP Template Alignment Source Template Alignment Remove Templates Templates
PSI-Coffee: Homology Extension Sources Templates Library BLAST Template Alignment Source Template Alignment Remove Templates Templates Profile Aligner
What is Homology Extension ? LL L ? -Simple scoring schemes result in alignment ambiguities
What is Homology Extension ? LL L L L L L L L L L I V I L L L L L L L Profile 1 Profile 2
What is Homology Extension ? LL L L L L L L L L L I V I L L L L L L L Profile 1 Profile 2
Method TemplateScoreComment ClustalW-2ProgressiveNO22.74 PRANKGapNO26.18Science2008 MAFFTIterativeNO26.18 MuscleIterativeNO31.37 ProbConsConsistencyNO40.80 ProbConsMonoPhasicNO37.53 T-CoffeeConsistencyNO42.30 M-Coffe4ConsistencyNO43.60 PSI-CoffeeConsistencyProfile53.71 PROMALConsistencyProfile55.08 PROMAL-3DConsistencyPDB D-CoffeeConsistencyPDB61.00Expresso Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Experimental Data … TARGET Experimental Data … TARGET Template Aligner Template-Sequence Alignment Primary Library Template Alignment Template based Alignment of the Sequences Templates TARGET
Using the T-Coffee Multiple Sequence Alignment Package IV – RNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
ncRNAs Comparison And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions” Who Are They? – tRNA, rRNA, snoRNAs, – microRNAs, siRNAs – piRNAs – long ncRNAs (Xist, Evf, Air, CTN, PINK…) How Many of them – Open question – is a common guess – Harder to detect than proteins.
ncRNAs Can Evolve Rapidly CCAGGCAAGACGGGACGAGAGTTGCCTGG CCTCCGTTCAGAGGTGCATAGAACGGAGG ** *--**---*-**------** GAACGGACCGAACGGACC CTTGCCTGGCTTGCCTGG G G A A CC A C G G A G A C G CTTGCCTCCCTTGCCTCC GAACGGAGGGAACGGAGG G G A A CC A C G G A G A C G
The Holy Grail of RNA Comparison: Sankoff’ Algorithm
The Holy Grail of RNA Comparison Sankoff’ Algorithm Simultaneous Folding and Alignment – Time Complexity: O(L 2n ) – Space Complexity: O(L 3n ) In Practice, for Two Sequences: – 50 nucleotides: 1 min.6 M. – 100 nucleotides 16 min.256 M. – 200 nucleotides 4 hours 4 G. – 400 nucleotides3 days3 T. Forget about – Multiple sequence alignments – Database searches
RNA Sequences Secondary Structures Primary Library R-Coffee Extended Primary Library Progressive Alignment Using The R-Score RNAplfold Consan or Mafft / Muscle / ProbCons R-Coffee Extension R-Score
C C R-Coffee Extension G G TC Library G G Score X C C Score Y C C G G Goal: Embedding RNA Structures Within The T-Coffee Libraries The R-extension can be added on the top of any existing method.
R-Coffee + Regular Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R Poa Pcma Prrn ClustalW Mafft_fftnts ProbConsRNA Muscle Mafft_ginsi Improvement= # R-Coffee wins - # R-Coffee looses
RM-Coffee + Regular Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R Poa Pcma Prrn ClustalW Mafft_fftnts ProbConsRNA Muscle Mafft_ginsi RM-Coffee40.71 /0.74 / 84
R-Coffee + Structural Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R Stemloc Mlocarna Murlet Pmcomp T-Lara Foldalign Dyalign Consan RM-Coffee40.71 /0.74 / 84
Using the T-Coffee Multiple Sequence Alignment Package V – DNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
Aligning Genomic DNA Main problem – Tell a good alignment from a bad one Strategy: – Tuning on Orthologous Promoter Detection – Evaluation on ChIp-Seq Data
Aligning Genomic DNA Main problem – Tell a good alignment from a bad one Strategy: – Tuning on Orthologous Promoter Detection – Evaluation on ChIp-Seq Data
Aligning Genomic DNA Tuning of Gap Penalties Design of a di- nucleotide substitution matrix
Aligning Genomic DNA
gDNA is very heterogenous Each genomic feature requires its own aligner Aligning non-orthologous regions with a global aligner is impossible Pro-Coffee is designed to align orthologous promoter regions
Using the T-Coffee Multiple Sequence Alignment Package VI – Wrap Up Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
Which Flavor? Fast Alignments – M-Coffee with Fast Aligners: mafft, muscle, kalign Difficult Protein Alignments – Expresso – PSI-Coffee RNA Alignments – R-Coffee Promoter Alignments – Pro-Coffee