Classifying MSA Packages Multiple Sequence Alignments in the Genome Era Cédric Notredame Information Génétique et Structurale CNRS-Marseille, France.

What’s in a Multiple Alignment? Structural Criteria – Residues are arranged so that those playing a similar role end up in the same column. Evolutive Criteria – Residues are arranged so that those having the same ancestor end up in the same column. Similarity Criteria – As many similar residues as possible in the same column
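
The similarity criterion can be made concrete with a small scoring function. The sketch below is only an illustration, not the objective function of any package named in these slides: it counts, column by column, the pairs of identical residues. The function name and the toy alignment are hypothetical.

```python
from itertools import combinations

def identical_pairs_score(alignment):
    """Count pairs of identical residues sharing a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Gaps are ignored, so they neither add to nor subtract from the score.
    """
    total = 0
    for column in zip(*alignment):
        residues = [r for r in column if r != '-']
        total += sum(1 for a, b in combinations(residues, 2) if a == b)
    return total

# Hypothetical toy alignment of three short sequences.
toy = ["ACGT",
       "AC-T",
       "AGGT"]
print(identical_pairs_score(toy))
```

A real objective function would typically replace the identity test with a substitution matrix and add gap penalties.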

The MSA contains what you put inside… You can view your MSA as: – A record of evolution – A summary of a protein family – A collection of experiments made for you by Nature…

Multiple Alignments: What Are They Good For???

Computing the Correct Alignment is a Complicated Problem

A Taxonomy of Multiple Sequence Alignment Packages – Objective Function – Assembly Algorithms

The Objective Function

The Assembly Algorithm

A Tale of Three Algorithms – Progressive: ClustalW – Iterative: Muscle – Consistency Based: T-Coffee and Probcons

ClustalW Algorithm – Paula Hogeweg: First Description (1981) – Taylor, Doolittle: Reinvention in 1989 – Higgins: Most Successful Implementation

ClustalW

Muscle Algorithm: Using The Iteration – AMPS: First Iterative Algorithm (Barton, 1987) – Stochastic methods: Genetic Algorithms and Simulated Annealing (Notredame, 1995) – Prrp: Ancestor of MUSCLE and MAFFT (1996) – Muscle: the most successful iterative strategy to this day

Consistency Based Algorithms. Gotoh (1990) – Iterative strategy using consistency. Martin Vingron (1991) – Dot Matrices Multiplications – Accurate but too stringent. Dialign (1996, Morgenstern) – Consistency – Agglomerative Assembly. T-Coffee (2000, Notredame) – Consistency – Progressive algorithm. ProbCons (2004, Do) – T-Coffee with a Bayesian Treatment.

T-Coffee and Consistency…
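
To illustrate what consistency means in practice, here is a minimal sketch of a T-Coffee-style library extension. It assumes a primary library that maps pairs of residues to weights. The weight of each pair is then reinforced by every third residue aligned to both, using the minimum of the two connecting weights. The data layout and the names `extend_library` and `primary` are hypothetical, and the weights are made up.

```python
from collections import defaultdict

def extend_library(primary):
    """Extend a primary library of residue-pair weights by consistency.

    `primary` maps ((seq_a, pos_a), (seq_b, pos_b)) -> weight.
    The extended weight of a pair is its primary weight plus, for every
    residue c of a third sequence aligned to both, min(W(a, c), W(c, b)).
    """
    neighbours = defaultdict(dict)
    for (a, b), w in primary.items():
        neighbours[a][b] = w
        neighbours[b][a] = w

    residues = sorted(neighbours)
    extended = {}
    for a in residues:
        for b in residues:
            if a >= b or a[0] == b[0]:
                continue  # visit each pair once, never within one sequence
            score = neighbours[a].get(b, 0)
            for c, w_ac in neighbours[a].items():
                if c[0] in (a[0], b[0]):
                    continue  # the intermediate must come from a third sequence
                w_cb = neighbours[c].get(b)
                if w_cb is not None:
                    score += min(w_ac, w_cb)
            if score:
                extended[(a, b)] = score
    return extended

# Hypothetical weights for the first residue of sequences A, B and C.
primary = {
    (("A", 0), ("B", 0)): 50,
    (("A", 0), ("C", 0)): 80,
    (("C", 0), ("B", 0)): 60,
}
print(extend_library(primary))
```

In this toy case the A/B pair gains support from C: its weight rises from 50 to 50 + min(80, 60) = 110.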

Probcons: A Bayesian T-Coffee $\mathrm{Score} = \sum \frac{\min(xz, zk)}{\max(xz, zk)}$ $\mathrm{Score}(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$
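
The probabilistic score on this slide amounts to a matrix product over the positions of the intermediate sequence z. The sketch below shows that single step in plain Python for one intermediate sequence. The published method also averages over all intermediate sequences, which is left out here. The function name and the posterior values are made up for illustration.

```python
def consistency_via(p_xz, p_zy):
    """Approximate the match score of x_i ~ y_j through one sequence z.

    p_xz[i][k] is P(x_i ~ z_k | x, z) and p_zy[k][j] is P(z_k ~ y_j | z, y).
    The result [i][j] is the sum over k of their product.
    """
    n_z = len(p_zy)
    n_y = len(p_zy[0])
    return [
        [sum(row[k] * p_zy[k][j] for k in range(n_z)) for j in range(n_y)]
        for row in p_xz
    ]

# Hypothetical posterior match probabilities for two-residue sequences.
p_xz = [[0.9, 0.1],
        [0.0, 0.8]]
p_zy = [[0.7, 0.0],
        [0.2, 0.6]]
for row in consistency_via(p_xz, p_zy):
    print([round(v, 2) for v in row])
```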

Evaluating Methods… Who is the best? Says who…?

Structures Vs Sequences

Evaluating Alignment Quality: Collections and Results

Evaluating Alignment Quality: Collections – Homstrad: the oldest – SAB: yet another benchmark – Prefab: the most extensive and automated – BaliBase: the first designed for MSA benchmarks (recently updated)

Homstrad (Mizuguchi, Blundell, Overington, 1998) – Hand Curated Structure Superposition – Not designed for Multiple Alignments – Biased with ClustalW – No CORE annotation – Hom +0, Hom +3, Hom +8

Homstrad: Known issues
Thiored.aln
1aaza mfkvygydsnihkcvycdnakrlltvkk-----qpf
1ego mqtvifgrs----gcpycvrakdlaeklsnerddfqy
1thx skgviti-tdaefesevlkae-qpvlvyfwaswcgpcqlmsplinlaantys---drlkv
2trxa sdkiihl-tddsfdtdvlkad-gailvdfwaewcgpckmiapildeiadeyq---gkltv
3trx --mvkqiesktafqealdaagdklvvvdfsatwcgpckmikpffhslsekys----nvif
3grx anveiytke----tcpyshrakallsskg-----vsf
:.
1aaza efinimpekgvfddekiaelltklgrdtqigltmpqvfapd----gshigg---fdqlre
1ego qyvdirae-----gitkedlqqkagkp---vetvpqifv-d----qqhigg---ytdfaa
1thx vkleid pnpttvkkykve-----gvpalrlvkgeqildstegviskdklls
2trxa aklnid qnpgtapkygir-----giptlllfkngevaatkvgalskgqlke
3trx levdvd dcqdvasecevk-----ctptfqffkkgqkvgefsgan-keklea
3grx qelpidgn-----aakreemikrsgr-----ttvpqifi-d----aqhigg---yddlya
: :. *.. *.:

Homstrad

SAB (Wale, 2003) – Multiple Structural Alignments of distantly related sequences – TWs: very low similarity (250 MSAs) – TWd: low similarity (480 MSAs) – SABs +0, TWs +3, TWs +8

SAB

Prefab (Edgar, 2003) – Automatic Pairwise Structural Alignments – Align Pairs of Structures with Two Methods (CE and FSSP) to define CORES – Add 50 intermediate sequences with PSI-BLAST – Large dataset (1675 MSAs)

Prefab (MUSCLE Reference Dataset)

Who is the Best??? [Table; columns: N. MSAs, T-Coffee, Probcons, Muscle; rows: Hom, SABs, SABf, Prefab]

A Case for reading papers The FFT of MAFFT

G-INS-i, H-INS-i and F-INS-i use pairwise alignment information when constructing a multiple alignment. The two options ([HF]-INS-i) incorporate local alignment information and do NOT USE FFT.

Improving T-Coffee – Ease the Use – Heterogeneous Information: 3DCoffee – Speed up the algorithm: T-CoffeeDPA (Double Progressive Algorithm), Parallel T-Coffee (collaboration with EPFL)

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

T-Coffee-DPA – DPA: Double Progressive ALN – Target: seq – Principle: DC Progressive ALN – Application: Decreasing Redundancy

Who is the Best ??? Most packages claim to be more accurate than T-Coffee, few really are… None of the existing packages is consistently the best: the PERFECT method does not exist

Conclusion. Consistency Based Methods Have an Edge over Conventional Ones – Better management of the data – Better extension possibilities. Hard to Tell Methods Apart – Reference databases are not very precise – Algorithms evolve quickly. Sequence Alignment is NOT a solved problem – Will be solved when Structure Prediction is solved

Fabrice Armougom Sebastien Moretti Olivier Poirot Karsten Sure Chantal Abergel Des Higgins Orla O’Sullivan Iain Wallace

Amazon.co.uk: 12/11/05 Amazon.com: 12/11/05 Barnes&Noble (US): 12/11/05 Dissemination: The right Vector

Cadrie Notredom et Michael Claverie

T-Coffee-DPA – T-Coffee-DPA is about 20 times faster than the standard T-Coffee – Preliminary tests indicate a slightly higher accuracy – Beta-test versions will be available by September but can be sent on request.

3D TCoffeeDPA Vs The Human Kinome… – 521 sequences – 46 structures having 80% or more sequence identity with other kinome structures – Use of 3D-CoffeeDPA (unpublished), developed especially for the kinome analysis

Structure Based Evaluation – Include Sequences with Known Structures – Score 1: Do Not Use Structural Information – Score 2: Use Structural Information – Score 1 Vs Score 2: evaluates the accuracy of the reconstruction strategy and estimates the accuracy of the alignment for sequences without a known structure

How Good is Our Kinome Alignment ???

BaliBase (Thompson, 1999) – Hand Made Structure Superposition – Not all the sequences have Structures – Comparisons are made on CORE blocks – Different categories for different types of problems
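
A comparison restricted to CORE blocks can be sketched as follows. This is a simplified illustration, not the scoring program shipped with any benchmark: it measures the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, optionally counting only reference columns flagged as core. The function names and the toy alignments are hypothetical.

```python
def aligned_pairs(alignment, columns=None):
    """Return the set of residue pairs that share a column.

    Residues are identified as (sequence index, ungapped position).
    If `columns` is given, only those column indices contribute pairs.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                present.append((s, counters[s]))
                counters[s] += 1
        if columns is None or col in columns:
            for i in range(len(present)):
                for j in range(i + 1, len(present)):
                    pairs.add((present[i], present[j]))
    return pairs

def sp_score(reference, test, core_columns=None):
    """Fraction of reference pairs (optionally core only) found in `test`."""
    ref = aligned_pairs(reference, core_columns)
    found = aligned_pairs(test)
    return len(ref & found) / len(ref) if ref else 0.0

# Hypothetical reference and test alignments of the same two sequences.
reference = ["AC-GT", "ACAGT"]
test      = ["ACG-T", "ACAGT"]
print(sp_score(reference, test))                       # all reference columns
print(sp_score(reference, test, core_columns={0, 1}))  # core columns only
```

Restricting the count to core columns ignores regions where the structural superposition itself is unreliable, which is why the two calls can give different values for the same test alignment.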

Most Reference Databases Have problems: BaliBase
Balibase 1abo Reference 1
1aboA -NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN GEW
1ycsB KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDE deIEW
1pht GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPeeIGW
1ihvA -NFRVYYRDSRD------PVWKGPAKLLWKG EGA
* : * :
1aboA CEAQT--KNGQGWVPSNYITPVN
1ycsB WWARL--NDKEGYVPRNLLGLYP
1pht LNGYNETTGERGDFPGTYVEYIGRKKISP
1ihvA VVIQD--NSDIKVVPRRKAKIIRD-----
Balibase 1abo Reference 2
1aboA -NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN GEW
1ycsB KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDEDE IEW
1pht GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIGW
1ihvA -NFRVYYRDSRD------PVWKGPAKLLWKG EGA
* : * :
1aboA CEAQTK--NGQGWVPSNYITPVN
1ycsB WWARL--NDKEGYVPRNLLGLYP
1pht LNGYNeTTGERGDFPGTYVEYIGRKKISP
1ihvA VVIQD--NSDIKVVPRRKAKIIRD-----

3D TCoffeeDPA Vs The Human Kinome… – Sequences in our Kinome MSA dataset have been provided by Aventis – They do not include the Alpha Kinases – Assembling an exhaustive Kinome Dataset remains a target (c.f. Projects)