Presentation is loading. Please wait.

Presentation is loading. Please wait.

Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.

Similar presentations


Presentation on theme: "Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program."— Presentation transcript:

1 Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

2 What is T-Coffee ? Tree Based Consistency based Objective Function for Alignment Evaluation – Progressive Alignment – Consistency

3 Progressive Alignment Feng and Dolittle, 1988; Taylor 1989 Clustering

4 Dynamic Programming Using A Substitution Matrix Progressive Alignment

5 -Depends on the ORDER of the sequences (Tree). -Depends on the CHOICE of the sequences. -Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.

6 Consistency? Consistency is an attempt to use alignment information at very early stages

7 T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD -------- THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100 SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100 SeqD -------- THE ---- FA-T CAT

8 T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD -------- THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100 SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100 SeqD -------- THE ---- FA-T CAT SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD -------- THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT

9 T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD -------- THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT

10 T-Coffee and Concistency…

11 Where Do The Primary Alignments Come From? Primary Alignments – Primary Library Source – Any valid Third Party Method

12 T-Coffee and Concistency…

13

14 Using the T-Coffee Multiple Sequence Alignment Package II – M-Coffee Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

15 What is the Best MSA method ? More than 50 MSA methods Some methods are fast and inacurate – Mafft, muscle, kalign Some methods are slow and accurate – T-Coffee, ProbCons Some Methods are slow and inacurate… – ClustalW

16 Why Not Combining Them ? All Methods give different alignments Their Agreement is an indication of accuracy t_coffee –method mafft_msa, muscle_msa

17 Combining Many MSAs into ONE MUSCLE MAFFT ClustalW ??????? T-Coffee

18

19 Where to Trust Your Alignments Most Methods Agree Most Methods Disagree

20 What To Do Without Structures

21 Using the T-Coffee Multiple Sequence Alignment Package III – Template Based Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

22 Sometimes Sequences are Not Enough Sequence based alignments are limited in accuracy – 30% for proteins – 70% for DNA It is hard to align correctly sequences whose similarity is below these values – Twilight zone

23 One Solution: Template Based Alignment Replace the sequence with something more informative – PDB Structure Expresso – Profile PSI-Coffee – RNA-StructureR-Coffee

24 Template Based Multiple Sequence Alignments -Structure -Profile -… Sources Templates Library Template Aligner Template Alignment Source Template Alignment Remove Templates Templates -Structure -Profile -…

25 Expresso: Finding the Right Structure Sources Templates Library BLAST SAP Template Alignment Source Template Alignment Remove Templates Templates

26 PSI-Coffee: Homology Extension Sources Templates Library BLAST Template Alignment Source Template Alignment Remove Templates Templates Profile Aligner

27 What is Homology Extension ? LL L ? -Simple scoring schemes result in alignment ambiguities

28 What is Homology Extension ? LL L L L L L L L L L I V I L L L L L L L Profile 1 Profile 2

29 What is Homology Extension ? LL L L L L L L L L L I V I L L L L L L L Profile 1 Profile 2

30 Method TemplateScoreComment ClustalW-2ProgressiveNO22.74 PRANKGapNO26.18Science2008 MAFFTIterativeNO26.18 MuscleIterativeNO31.37 ProbConsConsistencyNO40.80 ProbConsMonoPhasicNO37.53 T-CoffeeConsistencyNO42.30 M-Coffe4ConsistencyNO43.60 PSI-CoffeeConsistencyProfile53.71 PROMALConsistencyProfile55.08 PROMAL-3DConsistencyPDB57.60 3D-CoffeeConsistencyPDB61.00Expresso Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).

31 Experimental Data … TARGET Experimental Data … TARGET Template Aligner Template-Sequence Alignment Primary Library Template Alignment Template based Alignment of the Sequences Templates TARGET

32 Using the T-Coffee Multiple Sequence Alignment Package IV – RNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

33 ncRNAs Comparison And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions” Who Are They? – tRNA, rRNA, snoRNAs, – microRNAs, siRNAs – piRNAs – long ncRNAs (Xist, Evf, Air, CTN, PINK…) How Many of them – Open question – 30.000 is a common guess – Harder to detect than proteins.

34 ncRNAs Can Evolve Rapidly CCAGGCAAGACGGGACGAGAGTTGCCTGG CCTCCGTTCAGAGGTGCATAGAACGGAGG **-------*--**---*-**------** GAACGGACCGAACGGACC CTTGCCTGGCTTGCCTGG G G A A CC A C G G A G A C G CTTGCCTCCCTTGCCTCC GAACGGAGGGAACGGAGG G G A A CC A C G G A G A C G

35 The Holy Grail of RNA Comparison: Sankoff’ Algorithm

36 The Holy Grail of RNA Comparison Sankoff’ Algorithm Simultaneous Folding and Alignment – Time Complexity: O(L 2n ) – Space Complexity: O(L 3n ) In Practice, for Two Sequences: – 50 nucleotides: 1 min.6 M. – 100 nucleotides 16 min.256 M. – 200 nucleotides 4 hours 4 G. – 400 nucleotides3 days3 T. Forget about – Multiple sequence alignments – Database searches

37 RNA Sequences Secondary Structures Primary Library R-Coffee Extended Primary Library Progressive Alignment Using The R-Score RNAplfold Consan or Mafft / Muscle / ProbCons R-Coffee Extension R-Score

38 C C R-Coffee Extension G G TC Library G G Score X C C Score Y C C G G Goal: Embedding RNA Structures Within The T-Coffee Libraries The R-extension can be added on the top of any existing method.

39 R-Coffee + Regular Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R ----------------------------------------------------------- Poa0.620.650.70 48154 Pcma0.620.640.67 34120 Prrn0.640.610.66-63 45 ClustalW0.650.650.69 -7 83 Mafft_fftnts0.680.680.72 17 68 ProbConsRNA0.690.670.71-49 39 Muscle0.690.690.73-17 42 Mafft_ginsi0.700.680.72-49 39 ----------------------------------------------------------- Improvement= # R-Coffee wins - # R-Coffee looses

40 RM-Coffee + Regular Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R ----------------------------------------------------------- Poa0.620.650.70 48154 Pcma0.620.640.67 34120 Prrn0.640.610.66-63 45 ClustalW0.650.650.69 -7 83 Mafft_fftnts0.680.680.72 17 68 ProbConsRNA0.690.670.71-49 39 Muscle0.690.690.73-17 42 Mafft_ginsi0.700.680.72-49 39 ----------------------------------------------------------- RM-Coffee40.71 /0.74 / 84

41 R-Coffee + Structural Aligners MethodAvg BraliscoreNet Improv. direct +T+R +T+R ----------------------------------------------------------- Stemloc0.620.750.76 104113 Mlocarna0.660.690.71 101133 Murlet0.730.700.72 -132-73 Pmcomp0.730.730.73 142145 T-Lara0.740.740.69 -36 -8 Foldalign0.750.770.77 72 73 ----------------------------------------------------------- Dyalign---0.630.62 ------ Consan---0.790.79 ------ ----------------------------------------------------------- RM-Coffee40.71 /0.74 / 84

42 Using the T-Coffee Multiple Sequence Alignment Package V – DNA Alignments Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

43 Aligning Genomic DNA Main problem – Tell a good alignment from a bad one Strategy: – Tuning on Orthologous Promoter Detection – Evaluation on ChIp-Seq Data

44 Aligning Genomic DNA Main problem – Tell a good alignment from a bad one Strategy: – Tuning on Orthologous Promoter Detection – Evaluation on ChIp-Seq Data

45 Aligning Genomic DNA Tuning of Gap Penalties Design of a di- nucleotide substitution matrix

46 Aligning Genomic DNA

47 gDNA is very heterogenous Each genomic feature requires its own aligner Aligning non-orthologous regions with a global aligner is impossible Pro-Coffee is designed to align orthologous promoter regions

48 Using the T-Coffee Multiple Sequence Alignment Package VI – Wrap Up Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

49 Which Flavor? Fast Alignments – M-Coffee with Fast Aligners: mafft, muscle, kalign Difficult Protein Alignments – Expresso – PSI-Coffee RNA Alignments – R-Coffee Promoter Alignments – Pro-Coffee

50 www.tcoffee.org

51


Download ppt "Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program."

Similar presentations


Ads by Google