Recent Progress in Multiple Sequence Alignments: A Survey

Presentation transcript:

Recent Progress in Multiple Sequence Alignments: A Survey Cédric Notredame

Our Scope
-What are the existing methods?
-How do they work: assembly algorithms, weighting schemes.
-When do they work?
-What is their future?

Outline
-Introduction
-A taxonomy of the existing packages
-A few algorithms…
-Performance comparison using BaliBase

Introduction

What Is A Multiple Sequence Alignment? An MSA is a MODEL. It indicates the RELATIONSHIP between residues of different sequences. LIKE ANY MODEL, it REVEALS: -Similarities -Inconsistencies

How Can I Use A Multiple Sequence Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques: Extrapolation, Motifs/Patterns, Profiles, Phylogeny, Structure Prediction.

How Can I Use A Multiple Sequence Alignment? Multiple Alignment is the most INTEGRATIVE method available today. We need MSAs to INCORPORATE existing DATA.

Why Is It Difficult To Compute A Multiple Sequence Alignment? A CROSSROAD PROBLEM. BIOLOGY: What is A Good Alignment? COMPUTATION: What is THE Good Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

Why Is It Difficult To Compute A Multiple Sequence Alignment? BIOLOGY and COMPUTATION form a CIRCULAR PROBLEM: Good Sequences <-> Good Alignment.

A Taxonomy of Multiple Sequence Alignment Methods

Grouping According to the assembly Algorithm

-Simultaneous as opposed to Progressive. Simultaneous methods use all the information at the same time.
-Exact as opposed to Heuristic. Heuristics cut corners (like Blast vs Smith-Waterman) and do not guarantee an optimal solution.
-Stochastic as opposed to Deterministic. Stochastic methods contain an element of randomness (example: Monte Carlo surface estimation).
-Iterative as opposed to Non Iterative. Iterative methods run the same algorithm many times; most stochastic methods are iterative.
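The Monte Carlo surface estimation mentioned as an example of a stochastic method can be sketched as follows. This is only an illustration of the idea of randomness in an estimator, not part of any alignment package; the function name `estimate_disc_area` and the choice of a unit disc are assumptions made for the example.

```python
import random


def estimate_disc_area(samples: int, seed: int = 0) -> float:
    """Estimate the surface of the unit disc by random sampling.

    Points are drawn uniformly in the square [-1, 1] x [-1, 1] (area 4);
    the fraction that falls inside the disc estimates disc_area / 4.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    # More samples give an estimate closer to pi, but never a guarantee.
    for n in (100, 10_000, 100_000):
        print(n, estimate_disc_area(n))
```

Running the estimator again with more samples is also a small instance of an iterative method: the same procedure is repeated and the answer improves without ever being exact.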

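[Figure: taxonomy of the packages by assembly algorithm. Categories: Simultaneous, Progressive, Iterative, Non tree based; sub-groups: GAs, HMMs. Packages shown: Clustal, Dialign, T-Coffee, MSA, POA, DCA, Combalign, Iteralign, Prrp, SAM, HMMer, SAGA, GA, OMA, Praline, MAFFT.]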

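[Figure: the same taxonomy with overlapping categories. Categories: Iterative, Progressive, Simultaneous, Stochastic. Packages shown: Iteralign, Prrp, SAM, HMMer, GA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign, SAGA.]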

NEARLY EVERY OPTIMISATION ALGORITHM HAS BEEN APPLIED TO THE MSA PROBLEM!!!

Grouping According to the Objective Function

Scoring an Alignment: Evolutionary based methods. BIOLOGY: How many events separate my sequences? Such an evaluation relies on a biological model. COMPUTATION: Every position must be independent.

REAL Tree. Model: ALL the sequences evolved from the same ancestor. [Figure: tree relating the residues A and C of one column] Tree: Cost=1. PROBLEM: We do not know the true tree.

Star Tree. Model: ALL the sequences have the same ancestor. [Figure: star tree relating the residues A, A, A, C, C of one column] Star Tree: Cost=2. PROBLEM: the star tree is phylogenetically wrong.

Sums of Pairs. Model: Every sequence is the ancestor of every sequence. [Figure: every pair of residues in the column is connected] Sums of Pairs: Cost=6. [Formula legend: s(a,b): substitution matrix; i: column i; k, l: sequence indices.] PROBLEM: -over-estimation of the mutation costs -requires a weighting scheme
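A minimal sketch of the sum-of-pairs column score, assuming a column made of the residues A, A, A, C, C and a toy cost of 1 per mismatch. Both the column and the cost function `toy_cost` are assumptions chosen because they reproduce the Cost=6 quoted on the slide; a real scheme would take s(a,b) from a substitution matrix.

```python
from itertools import combinations


def toy_cost(a: str, b: str) -> int:
    """Hypothetical unit cost: 0 for identical residues, 1 otherwise."""
    return 0 if a == b else 1


def sum_of_pairs(column: str, cost=toy_cost) -> int:
    """Sum s(a, b) over every pair of sequences k < l in one column."""
    return sum(cost(a, b) for a, b in combinations(column, 2))


if __name__ == "__main__":
    # 3 A's and 2 C's give 3 * 2 = 6 mismatching pairs.
    print(sum_of_pairs("AAACC"))
```

Under the same unit cost a tree needs a single mutation and a star tree two, so counting every pair is what inflates the cost of the column.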

Sums of Pairs: Some of its limitations (Durbin, p140)
A column of N Leucines (L): Cost = $5N(N-1)/2$ [5: Leucine vs Leucine with Blosum50]
Replacing one Leucine with a Glycine (G): Cost = $5N(N-1)/2 - 5(N-1) + (-4)(N-1) = 5N(N-1)/2 - 9(N-1)$ [-4: Leucine vs Glycine, the glycine effect]

Sums of Pairs: Some of its limitations (Durbin, p140)
Relative difference between the two columns: Delta = $\frac{9(N-1)}{5N(N-1)/2} = \frac{2 \cdot 9 \cdot (N-1)}{5N(N-1)} = \frac{18}{5N}$
[Plot: Delta as a function of N]
Conclusion: The more Leucines in the column, the less expensive it gets, in relative terms, to add a Glycine to the column...
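The arithmetic behind this limitation can be checked numerically. The sketch below uses only the two scores quoted on the slide (5 for Leucine vs Leucine, -4 for Leucine vs Glycine); the function names are illustrative.

```python
from fractions import Fraction

LL = 5   # Leucine vs Leucine score quoted on the slide (Blosum50)
LG = -4  # Leucine vs Glycine score quoted on the slide


def sp_all_leucine(n: int) -> int:
    """Sum-of-pairs score of a column of n Leucines: n(n-1)/2 pairs."""
    return LL * n * (n - 1) // 2


def sp_one_glycine(n: int) -> int:
    """Same column with one Leucine replaced by a Glycine.

    (n - 1) L-L pairs are lost and (n - 1) L-G pairs are gained.
    """
    return sp_all_leucine(n) - LL * (n - 1) + LG * (n - 1)


def relative_drop(n: int) -> Fraction:
    """Delta: score lost through the substitution, relative to the full score."""
    full = sp_all_leucine(n)
    return Fraction(full - sp_one_glycine(n), full)


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, relative_drop(n), Fraction(18, 5 * n))
```

The two printed fractions agree for every N, and Delta shrinks as N grows, which is the counter-intuitive behaviour the slide points out.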

Entropy based Functions. Model: Minimize the entropy (variety) in each column. [Formula legend: score of column i; number of Alanines (a) in column i; a ranges over the alphabet; P can incorporate pseudocounts.] S=0 if the column is conserved. PROBLEM: -requires a simultaneous alignment -assumes independent sequences
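The formula itself did not survive in this transcript. The sketch below assumes one common form of the minimum entropy column score, S = -sum over residues a of c(a) * log2 p(a), where c(a) is the count of a in the column and p(a) its estimated frequency. The function name, the base-2 logarithm and the pseudocount handling are assumptions, not the slide's exact definition.

```python
import math
from collections import Counter


def column_entropy_score(column, pseudocount=0.0, alphabet=None):
    """Entropy-like score of one column: 0 when conserved, larger when varied.

    p(a) = (c(a) + pseudocount) / (len(column) + pseudocount * alphabet size).
    Without an explicit alphabet, only the residues seen in the column count.
    """
    counts = Counter(column)
    symbols = list(alphabet) if alphabet is not None else sorted(counts)
    total = len(column) + pseudocount * len(symbols)
    score = 0.0
    for a in symbols:
        c = counts.get(a, 0)
        if c == 0:
            continue  # a residue that never occurs contributes nothing
        p = (c + pseudocount) / total
        score -= c * math.log2(p)
    return score


if __name__ == "__main__":
    print(column_entropy_score("AAAA"))  # conserved column
    print(column_entropy_score("AAAC"))  # one substitution
```

A fully conserved column scores 0, matching the "S=0 if the column is conserved" remark; each column is scored on its own, which is where the independence assumption enters.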

Consistency based Functions. Model: Maximise the consistency (agreement) with a list of constraints (alignments). [Formula legend: k and l are sequences, i is a column; a pair of residues counts when the two residues are found aligned in the list of constraints.] PROBLEM: -requires a list of constraints
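One way to read the consistency objective is as the fraction of aligned residue pairs in the multiple alignment that also appear in the list of constraints. The sketch below makes that reading concrete; the data layout (constraints as tuples of sequence index and 1-based residue index) and the unweighted count are assumptions made for illustration.

```python
from itertools import combinations


def residue_indices(row: str):
    """For each column of a gapped row, the 1-based residue index or None."""
    out, n = [], 0
    for ch in row:
        if ch == "-":
            out.append(None)
        else:
            n += 1
            out.append(n)
    return out


def consistency_score(alignment, constraints) -> float:
    """Fraction of aligned residue pairs found in the constraint list.

    alignment:   list of gapped rows of equal length
    constraints: set of (k, residue_in_k, l, residue_in_l) with k < l
    """
    idx = [residue_indices(row) for row in alignment]
    found = total = 0
    for k, l in combinations(range(len(alignment)), 2):
        for rk, rl in zip(idx[k], idx[l]):
            if rk is None or rl is None:
                continue  # a residue facing a gap is not an aligned pair
            total += 1
            if (k, rk, l, rl) in constraints:
                found += 1
    return found / total if total else 0.0


if __name__ == "__main__":
    msa = ["ACG", "A-G"]
    library = {(0, 1, 1, 1)}  # residue 1 of sequence 0 with residue 1 of sequence 1
    print(consistency_score(msa, library))
```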

-Weighted Sums of Pairs: Prrp, Clustal, POA, MSA, MAFFT, OMA, DCA, SAGA
-Consistency Based: Iteralign, Dialign, T-Coffee, Praline, Combalign
-Entropy: SAM, HMMer, GIBBS

A few Multiple Sequence Alignment Algorithms

A Few Algorithms: MSA and DCA, POA, ClustalW, MAFFT, Dialign II, Prrp, SAGA, GIBBS Sampler

Simultaneous: MSA and DCA

Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the bounded hyperspace
-Limited to a few small, closely related sequences.
-Memory and CPU hungry.
-Does well when it can run.

MSA: the Carrillo and Lipman bounds
The score of the multiple alignment is the sum of the scores of its pairwise projections [pairwise projection of sequences k and l]:
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
=
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
+
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
+ …

MSA: the Carrillo and Lipman bounds
$a(k,l)$ = score of the projection of k and l in the optimal MSA
$\sum_{x<y} a(x,y)$ = score of the complete multiple alignment
$\hat{a}(k,l)$ = score of the optimal pairwise alignment of k and l
Upper bound: $a(k,l) \le \hat{a}(k,l)$. Lower bound: ?

MSA: the Carrillo and Lipman bounds
LM: a lower bound for the score of the complete MSA
$LM \le \sum_{x<y} \hat{a}(x,y) - (\hat{a}(k,l) - a(k,l))$
$a(k,l) \ge LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y)$
Hence $LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y) \le a(k,l) \le \hat{a}(k,l)$

MSA: the Carrillo and Lipman bounds
$\ddot{a}(k,l)$ = score of the projection of k and l in a heuristic alignment
LM can be measured on ANY heuristic alignment: $LM = \sum_{x<y} \ddot{a}(x,y)$
The better LM, the tighter the bounds…

MSA: the Carrillo and Lipman bounds
Forward and backward matrices: the best alignment going through cell (i,j) scores Best(0-i, 0-j) + Best(M-i, N-j).
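The lower bound on each pairwise projection follows directly from the inequality above. The sketch below assumes scores are maximised and that the optimal pairwise scores are already known; the pair labels and numbers are made up for illustration.

```python
def carrillo_lipman_lower_bounds(best_pairwise, heuristic_msa_score):
    """Lower bound on the projection score a(k,l) for every pair.

    best_pairwise:       {(k, l): optimal pairwise alignment score}
    heuristic_msa_score: LM, the sum-of-pairs score of any heuristic MSA
    bound(k,l) = LM + best(k,l) - sum of all optimal pairwise scores
    """
    total = sum(best_pairwise.values())
    return {
        pair: heuristic_msa_score + score - total
        for pair, score in best_pairwise.items()
    }


if __name__ == "__main__":
    # Hypothetical optimal pairwise scores for three sequences A, B, C.
    best = {("A", "B"): 50, ("A", "C"): 40, ("B", "C"): 30}
    lm = 110  # hypothetical score of a heuristic multiple alignment
    for pair, low in carrillo_lipman_lower_bounds(best, lm).items():
        # Only pairwise paths scoring within [low, best] need exploring.
        print(pair, low, best[pair])
```

With LM = 110 and a pairwise total of 120, every projection may lose at most 10 points against its optimal pairwise score. A better heuristic alignment raises LM and narrows that window.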

Simultaneous Alignments: DCA
-Limited to a few small, closely related sequences, but less limited than MSA.
-Does well when it can run.
-Memory and CPU hungry, but less than MSA.

Simultaneous With a New Sequence Representation: POA (Partial Order Graph)

POA makes it possible to represent complex relationships: -domain deletions -domain inversions

Progressive: ClustalW

Progressive Alignment: ClustalW. Feng and Doolittle, 1988; Taylor 198ç. Clustering

Progressive Alignment: ClustalW Dynamic Programming Using A Substitution Matrix

Tree based Alignment: Recursive Algorithm
Align (Node N)
{
  if (N->left_child is a Node) A1 = Align (N->left_child)
  else if (N->left_child is a Sequence) A1 = N->left_child
  if (N->right_child is a Node) A2 = Align (N->right_child)
  else if (N->right_child is a Sequence) A2 = N->right_child
  return dp_alignment (A1, A2)
}
[Figure: guide tree over the sequences A B C D E F G]
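The recursion above can be turned into a small runnable sketch. Everything below is an assumption made for illustration and not ClustalW's actual implementation: the guide tree is a nested tuple, `dp_alignment` is a plain Needleman-Wunsch on columns with a linear gap penalty, and the match, mismatch and gap values are arbitrary.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Average pairwise score between two alignment columns (tuples)."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
    return total / (len(col_a) * len(col_b))


def dp_alignment(A, B, gap=-2):
    """Align two alignments (lists of equal-length rows) column by column."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner, collecting merged columns.
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])
        ):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def align(node):
    """Recursive progressive alignment: a leaf is a str, a node a (left, right) tuple."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return dp_alignment(align(left), align(right))


if __name__ == "__main__":
    for row in align((("ACGT", "ACT"), "AGT")):
        print(row)
```

Gaps placed when the first two sequences are aligned are never revisited when the third one joins, which is the greedy behaviour of progressive methods.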

Progressive Alignment: ClustalW
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS: substitution matrix, penalties (GOP, GEP), sequence weights, tree making algorithm.

Progressive Alignment: ClustalW. Weighting Within ClustalW

Progressive Alignment: ClustalW. Position Specific Gap Opening Penalty (GOP)

Progressive Alignment: ClustalW. ClustalW is the most Popular Method -Greedy Heuristic (No Guarantee). -Fast -Scales Well: N, N L 3 2

Progressive Alignment With a Heuristic DP: MAFFT

Progressive And Consistency Based: Dialign II

Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.

Dialign II -May Align Too Few Residues -No Gap Penalty -Does well with ESTs

Progressive And Consistency Based: T-COFFEE

Mixing Local and Global Alignments: Local Alignment + Global Alignment -> Extension -> Multiple Sequence Alignment

Library Based Multiple Sequence Alignment
What is a library?
A library with 3 sequences:
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
….
A library with 2 sequences:
2
Seq1 MySeq
Seq2 MyotherSeq
#1 2
1 1 25
3 8 70
….
Extension + T-Coffee
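A minimal parser for the library layout shown above, assuming the simplified structure visible on the slide: a sequence count, one "name sequence" line per sequence, then "#i j" headers each followed by "residue_i residue_j weight" lines. This is a sketch of that layout only; the file format used by the actual T-Coffee package carries more fields and is not reproduced here. The name `parse_library` is hypothetical.

```python
def parse_library(text: str):
    """Parse the simplified library layout into sequences and constraints.

    Returns (sequences, constraints) where
      sequences   = {name: sequence}
      constraints = {(i, j): [(residue_i, residue_j, weight), ...]}
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    sequences = {}
    for ln in lines[1:1 + n]:
        name, seq = ln.split()
        sequences[name] = seq
    constraints = {}
    pair = None
    for ln in lines[1 + n:]:
        if ln.startswith("#"):
            # "#1 2" opens the block of constraints between sequences 1 and 2.
            i, j = map(int, ln[1:].split())
            pair = (i, j)
            constraints.setdefault(pair, [])
        else:
            if pair is None:
                raise ValueError("constraint line before any '#i j' header")
            res_i, res_j, weight = map(int, ln.split())
            constraints[pair].append((res_i, res_j, weight))
    return sequences, constraints


EXAMPLE = """\
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
"""

if __name__ == "__main__":
    seqs, cons = parse_library(EXAMPLE)
    print(seqs)
    print(cons)
```

Each constraint reads as "residue 1 of sequence 1 is aligned with residue 1 of sequence 2, with weight 25".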

Iterative

Iterative Methods -HMMs, HMMER, SAM. -Slow, Sometimes Inaccurate -Good Profile Generators

Iterative Methods: Prrp
1) Initial alignment.
2) Outer iteration: tree and weights computation.
3) Inner iteration: realign two sub-groups, and repeat until the alignment has converged.
4) If the weights have converged, end; otherwise go back to step 2.

Iterative Stochastic: SAGA, The Genetic Algorithm

Automatic scheduling of the operators

Weighting Schemes

The Problem: The sequences contain correlated information. Most scoring schemes ignore this correlation.

Weighting Sequence Pairs with a Tree: Carrillo and Lipman Rationale I

QUESTION: Which Weight for a Pair of Sequences?
E = edge; P = evolutionary path from A to X.
E must contribute the same weight to every path P that goes through it.
$N_k$: number of edges meeting at node k.
All the weights using E must sum to 1: $\sum_{P \ni E} W_P = 1$.
$W_P = \frac{1}{\prod_{k \in P} (N_k - 1)}$
[Figure: tree over the sequences A B C D E F G]
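The pair weight can be computed by walking the path between two leaves and dividing by (N_k - 1) at every internal node crossed. The sketch below assumes the tree is given as an adjacency dictionary; the node names X and Y are hypothetical internal nodes.

```python
from fractions import Fraction


def path(tree, start, goal, seen=None):
    """Depth-first search for the unique path between two nodes of a tree."""
    seen = seen or {start}
    if start == goal:
        return [start]
    for nxt in tree[start]:
        if nxt not in seen:
            sub = path(tree, nxt, goal, seen | {nxt})
            if sub:
                return [start] + sub
    return None


def pair_weight(tree, a, b) -> Fraction:
    """W_P = 1 / product over internal nodes k on the path of (N_k - 1)."""
    weight = Fraction(1)
    for node in path(tree, a, b)[1:-1]:  # skip the two leaves themselves
        weight /= len(tree[node]) - 1    # N_k = number of edges at node k
    return weight


if __name__ == "__main__":
    # Four leaves: A and B hang from X, C and D hang from Y, X and Y are joined.
    tree = {
        "A": ["X"], "B": ["X"], "C": ["Y"], "D": ["Y"],
        "X": ["A", "B", "Y"], "Y": ["C", "D", "X"],
    }
    for a, b in [("A", "B"), ("A", "C"), ("A", "D")]:
        print(a, b, pair_weight(tree, a, b))
```

In this example the three paths that use the edge A-X weigh 1/2, 1/4 and 1/4, so they sum to 1 as required. Branch lengths never enter the computation.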

USAGE

PROBLEM: Weight depends only on the tree topology. [Figure: two trees over the sequences A, B and C] Both trees give the same weights: AB: 0.5, AC: 0.5, BC: 0.5.

Weighting Sequences with a Tree Clustal W Weights

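QUESTION: Which Weight for Sequences?
Each edge contributes its length divided by the number of sequences sharing it (for example W = Length * 1/4 or W = Length * 1/2), and the weight of a sequence is the sum of these contributions:
$W_{seq} = \sum_{edges} \frac{\text{Edge Length}}{\text{Number of Sequences Sharing the Edge}}$
[Figure: tree over the sequences A B C D E F G]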
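A minimal sketch of this weighting rule on a rooted tree. The tree encoding (a leaf is a name, an internal node is a list of (subtree, branch length) pairs) and the branch lengths are assumptions made for the example; ClustalW's own code is not reproduced here.

```python
def leaves(node):
    """All leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child, _ in node for leaf in leaves(child)]


def clustal_weights(node, inherited=0.0, weights=None):
    """Sequence weight = sum over edges on the root-to-leaf path of
    edge length / number of sequences sharing that edge."""
    weights = {} if weights is None else weights
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        share = length / len(leaves(child))  # the edge is split among its leaves
        clustal_weights(child, inherited + share, weights)
    return weights


if __name__ == "__main__":
    # Hypothetical tree: A and B are close relatives, C is distant.
    tree = [([("A", 0.2), ("B", 0.2)], 0.4), ("C", 0.6)]
    print(clustal_weights(tree))
```

Here A and B each receive 0.2 + 0.4/2 = 0.4 while the isolated sequence C receives 0.6, so the more distant sequence carries the larger weight.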

USAGE

PROBLEM: Overweight of distant sequences -C Will dominate the Alignment -C Will be very Difficult to align

Performance Comparison Using Collections of Reference Alignments: BaliBase and Ribosomal RNA

What Is BaliBase? BaliBase is a collection of reference Multiple Alignments. The structures of the sequences are known and were used to assemble the multiple alignments. Evaluation is carried out by comparing the structure-based reference alignment with its sequence-based counterpart.

What Is BaliBase? [Diagram: the BaliBase reference alignment (DALI, Sap …) and the alignment produced by Method X both feed into a Comparison.]

What Is BaliBase? Source: BaliBase, Thompson et al, NAR, 1999. PROBLEM Description: -Even Phylogenetic Spread -One Outlier Sequence -Two Distantly Related Groups -Long Internal Indel -Long Terminal Indel

Choosing The Right Method

Choosing The Right Method (POA Evaluation)

Choosing The Right Method (POA Evaluation)

Choosing The Right Method (MAFFT evaluation)

Choosing The Right Method (MAFFT evaluation)

Choosing The Right Method (MAFFT evaluation)

Conclusion

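Which Method? Source: BaliBase, Thompson et al, NAR, 1999.
PROBLEM: Strategy
-Even Phylogenetic Spread: ClustalW, T-Coffee, MSA, DCA
-One Outlier Sequence: T-Coffee
-Two Distantly Related Groups: PrrP, T-Coffee
-Long Internal Indel: Dialign, T-Coffee
-Long Terminal Indel: Dialign, T-Coffee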

Methods / Situations
1-Carrillo and Lipman: -MSA, DCA. -Few small, closely related sequences. -Do well when they can run.
2-Segment Based: -DIALIGN, MACAW. -May align too few residues. -Good for long indels.
3-Iterative: -HMMs, HMMER, SAM. -Slow, sometimes inaccurate. -Good profile generators.
4-Progressive: -ClustalW, Pileup, Multalign… -Fast and sensitive.

Addresses
MAFFT (Progressive): www.biophys.kyoto-u.jp/katoh
POA (Progressive/Simultaneous): www.bioinformatics.ucla.edu/poa
MUSCLE (Progressive/Iterative): www.drive5.com/muscle/