Published by Alexis Pope. Modified over 9 years ago.
1
Cédric Notredame (07/11/2015) Recent Progress in Multiple Sequence Alignments: A Survey Cédric Notredame
2
Our Scope
What Are the Existing Methods?
How Do They Work:
-Assembly Algorithms
-Weighting Schemes
When Do They Work?
Which Future?
3
Outline
-Introduction
-A taxonomy of the existing Packages
-A few algorithms…
-Performance Comparison using BaliBase
4
Introduction
5
What Is A Multiple Sequence Alignment?
An MSA is a MODEL.
It indicates the RELATIONSHIP between residues of different sequences.
LIKE ANY MODEL, it REVEALS:
-Similarities
-Inconsistencies
6
How Can I Use A Multiple Sequence Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
-Extrapolation
-Motifs/Patterns
-Phylogeny
-Profiles
-Struc. Prediction
Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques.
7
How Can I Use A Multiple Sequence Alignment?
Multiple Alignment Is the Most INTEGRATIVE Method Available Today.
We Need MSAs to INCORPORATE Existing DATA.
8
Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CROSSROAD PROBLEM
BIOLOGY: What is A Good Alignment?
COMPUTATION: What is THE Good Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
9
Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CIRCULAR PROBLEM (BIOLOGY and COMPUTATION)...
[Figure: cycle linking Good Sequences and Good Alignment]
10
A Taxonomy of Multiple Sequence Alignment Methods
11
Grouping According to the Assembly Algorithm
12
-Simultaneous, as opposed to Progressive [Simultaneous: they use all the information simultaneously]
-Exact, as opposed to Heuristic [Heuristics: cut corners, like BLAST vs. Smith-Waterman (SW)] [Heuristics: do not guarantee an optimal solution]
-Stochastic, as opposed to Deterministic [Stochastic: contain an element of randomness] [Stochastic: example of a Monte Carlo surface estimation]
-Iterative, as opposed to Non-Iterative [Iterative: most stochastic methods are iterative] [Iterative: run the same algorithm many times]
13
[Figure: packages grouped by assembly algorithm. Categories: Progressive, Iterative, Simultaneous, Non tree based (GAs, HMMs). Packages shown: Iteralign, Prrp, SAM, HMMer, SAGA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign.]
14
[Figure: the same grouping with the Stochastic category highlighted (SAGA, GA). Categories: Progressive, Iterative, Simultaneous, Stochastic. Packages shown: Iteralign, Prrp, SAM, HMMer, SAGA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign.]
15
NEARLY EVERY OPTIMISATION ALGORITHM HAS BEEN APPLIED TO THE MSA PROBLEM!!!
16
Grouping According to the Objective Function
17
Scoring an Alignment: Evolutionary Based Methods
BIOLOGY: How many events separate my sequences? Such an evaluation relies on a biological model.
COMPUTATION: Every position must be independent.
18
REAL Tree
Model: ALL the sequences evolved from the same ancestor
[Figure: the column A, A, A, C, C placed on the true tree]
Tree: Cost=1
PROBLEM: We do not know the true tree
19
STAR Tree
Model: ALL the sequences have the same ancestor
[Figure: the column A, A, A, C, C placed on a star tree with ancestor A]
Star Tree: Cost=2
PROBLEM: the star tree is phylogenetically wrong
20
Sums of Pairs
Model = Every sequence is the ancestor of every sequence
[Figure: the column A, A, A, C, C with every pair of sequences connected]
Sums of Pairs: Cost=6
$S = \sum_i \sum_{k<l} s(A^k_i, A^l_i)$
[s(a,b): substitution matrix] [i: column i] [k, l: sequence indices] [$A^k_i$: residue of sequence k in column i]
PROBLEM:
-Over-estimation of the mutation costs
-Requires a weighting scheme
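The three costs quoted for the example column (A, A, A, C, C) can be checked with a few lines of Python. This is a minimal sketch, assuming a unit cost per mutation. The function names and the particular binary tree are illustrative choices that do not come from the slides, and the tree cost is counted with Fitch parsimony, one standard way of counting events on a fixed tree.

```python
from collections import Counter
from itertools import combinations


def sum_of_pairs_cost(column):
    """Count mismatching pairs: every sequence is compared with every other."""
    return sum(1 for a, b in combinations(column, 2) if a != b)


def star_cost(column):
    """Mutations on a star tree whose central ancestor is the most common residue."""
    return len(column) - max(Counter(column).values())


def tree_cost(tree):
    """Fitch parsimony count on a binary tree given as nested 2-tuples of residues."""
    def fitch(node):
        if isinstance(node, str):          # leaf: a single residue
            return {node}, 0
        left_states, left_cost = fitch(node[0])
        right_states, right_cost = fitch(node[1])
        shared = left_states & right_states
        if shared:                         # children agree: no extra event
            return shared, left_cost + right_cost
        return left_states | right_states, left_cost + right_cost + 1
    return fitch(tree)[1]


column = "AAACC"
# Hypothetical "true" tree that groups the two C residues together.
true_tree = (("A", "A"), ("A", ("C", "C")))

print(tree_cost(true_tree), star_cost(column), sum_of_pairs_cost(column))  # 1 2 6
```

The sums-of-pairs count grows with the number of pairs (3 x 2 = 6 here), which is the over-estimation the slide points at.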
21
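Sums of Pairs: Some of its limitations (Durbin, p140)
[Figure: a column of N leucines (L), and the same column with one leucine replaced by a glycine (G)]
All leucines: $\text{Cost} = 5 \cdot \frac{N(N-1)}{2}$ [5: Leucine vs. Leucine with BLOSUM50]
One glycine: $\text{Cost} = 5 \cdot \frac{N(N-1)}{2} - 5(N-1) + (-4)(N-1) = 5 \cdot \frac{N(N-1)}{2} - 9(N-1)$ [glycine effect: N-1 Leucine/Leucine pairs at 5 become Leucine/Glycine pairs at -4]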
22
Sums of Pairs: Some of its limitations (Durbin, p140)
$\Delta = \frac{9(N-1)}{5 \cdot N(N-1)/2} = \frac{2 \cdot 9 \cdot (N-1)}{5 \cdot N \cdot (N-1)} = \frac{18}{5N}$
[Figure: plot of Delta against N]
Conclusion: The more Leucines, the less expensive it gets to add a Glycine to the column...
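The arithmetic behind this conclusion can be reproduced directly. The sketch below assumes only the two BLOSUM50 values quoted on the slides (5 for Leucine/Leucine, -4 for Leucine/Glycine); the function names are illustrative. It computes the sums-of-pairs value of a column of N leucines, of the same column with one glycine, and the relative difference, which works out to 18/(5N).

```python
LL = 5    # Leucine vs. Leucine, value quoted on the slide (BLOSUM50)
LG = -4   # Leucine vs. Glycine, value quoted on the slide (BLOSUM50)


def sp_all_leucine(n):
    """Sums-of-pairs value of a column of n leucines."""
    return LL * n * (n - 1) // 2


def sp_one_glycine(n):
    """Same column with one leucine replaced by a glycine."""
    leucine_pairs = (n - 1) * (n - 2) // 2   # pairs among the n-1 leucines
    glycine_pairs = n - 1                    # glycine against each leucine
    return LL * leucine_pairs + LG * glycine_pairs


def relative_delta(n):
    """Relative loss caused by the single glycine."""
    return (sp_all_leucine(n) - sp_one_glycine(n)) / sp_all_leucine(n)


for n in (2, 5, 10, 100):
    print(n, relative_delta(n), 18 / (5 * n))
```

The two printed columns agree for every N, and both shrink as N grows: a misplaced glycine matters less in a deep column, although it should arguably matter more.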
23
Entropy Based Functions
Model: Minimize the entropy (variety) in each column
[Figure: example column A, A, A, C, C]
$S_i = -\sum_a c_{ia} \log p_{ia}$
[$S_i$: score of column i] [$c_{ia}$: number of residues a (e.g. Alanine) in column i] [a: alphabet] [$p_{ia}$ can incorporate pseudocounts]
S=0 if the column is conserved
PROBLEM:
-Requires a simultaneous alignment
-Assumes independent sequences
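A column entropy score can be sketched in Python as follows. The sketch assumes the usual minimum-entropy form S_i = -sum_a c_ia log p_ia, with p_ia estimated from the column counts and an optional pseudocount; the function name, the `pseudocount` parameter and the natural logarithm are choices made for this illustration.

```python
import math
from collections import Counter


def column_entropy_score(column, alphabet=None, pseudocount=0.0):
    """Return -sum_a c_a * log(p_a) for one alignment column."""
    counts = Counter(column)
    if alphabet is None:
        alphabet = sorted(counts)            # residues seen in the column
    total = len(column) + pseudocount * len(alphabet)
    score = 0.0
    for residue in alphabet:
        c = counts.get(residue, 0)
        if c > 0:                            # absent residues contribute nothing
            p = (c + pseudocount) / total
            score -= c * math.log(p)
    return score


print(column_entropy_score("AAAAA"))   # conserved column scores zero
print(column_entropy_score("AAACC"))   # mixed column scores above zero
```

With no pseudocount, a fully conserved column has p = 1 for its only residue, so the score is zero, as stated on the slide.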
24
Consistency Based Functions
Model: Maximise the consistency (agreement) with a list of constraints (alignments)
[Figure: example column A, A, A, C, C]
[k and l are sequences, i is a column] [a pair scores when the two residues are found aligned in the list of constraints]
PROBLEM:
-Requires a list of constraints
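One simple way to read this objective is to count, over every column i and every pair of sequences k and l, the aligned residue pairs that also appear in the list of constraints. The sketch below takes that reading; the data layout (a dict of gapped rows, constraints stored as frozensets of `(sequence name, residue index)` pairs) and the unit score per satisfied constraint are assumptions of this example, not the scoring used by any particular package.

```python
from itertools import combinations


def residue_indices(row):
    """For each column, the 0-based index of the residue in the ungapped
    sequence, or None where the row has a gap."""
    indices, position = [], 0
    for char in row:
        if char == "-":
            indices.append(None)
        else:
            indices.append(position)
            position += 1
    return indices


def consistency_score(alignment, constraints):
    """Count aligned residue pairs that are present in the constraint list."""
    index = {name: residue_indices(row) for name, row in alignment.items()}
    n_columns = len(next(iter(alignment.values())))
    score = 0
    for i in range(n_columns):
        for k, l in combinations(alignment, 2):
            rk, rl = index[k][i], index[l][i]
            if rk is None or rl is None:
                continue                      # a gap satisfies no constraint
            if frozenset({(k, rk), (l, rl)}) in constraints:
                score += 1
    return score


alignment = {"s1": "AC-T", "s2": "A-GT", "s3": "ACGT"}
constraints = {
    frozenset({("s1", 0), ("s2", 0)}),   # satisfied in column 0
    frozenset({("s1", 2), ("s2", 2)}),   # satisfied in column 3
    frozenset({("s1", 1), ("s3", 0)}),   # not satisfied by this alignment
}
print(consistency_score(alignment, constraints))  # 2
```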
25
[Figure: packages grouped by objective function. Categories: Consistency Based, Weighted Sums of Pairs, Entropy. Packages shown: Iteralign, Dialign, T-Coffee, Praline, Combalign, Prrp, Clustal, POA, MSA, MAFFT, OMA, DCA, SAGA, SAM, HMMer, GIBBS.]
26
A Few Multiple Sequence Alignment Algorithms
27
A Few Algorithms
-MSA and DCA
-ClustalW
-Dialign II
-Prrp
-SAGA
-GIBBS Sampler
-MAFFT
-POA
28
Simultaneous: MSA and DCA
29
Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment (MALN) within the bounded hyperspace
-Few, small, closely related sequences.
-Does well when it can run.
-Memory and CPU hungry
30
MSA: the Carrillo and Lipman bounds
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

[Pairwise projection of sequences k and l]
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

$S(\text{MSA}) = \sum_{k<l} S(\text{projection}_{k,l})$: the score of the multiple alignment is the sum of the scores of its pairwise projections.
31
MSA: the Carrillo and Lipman bounds
$a(k,l)$ = score of the projection of k and l in the optimal MSA
$\hat{a}(k,l)$ = score of the optimal pairwise alignment of k and l
$\sum_{x<y} a(x,y)$ = score of the complete multiple alignment
Upper bound: $a(k,l) \le \hat{a}(k,l)$, and likewise $a(k,m) \le \hat{a}(k,m)$
Lower bound: $? \le a(k,l)$
32
MSA: the Carrillo and Lipman bounds
LM: a lower bound for the score of the complete MSA
$a(k,l) \ge LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y)$
because $LM \le \sum_{x<y} \hat{a}(x,y) - (\hat{a}(k,l) - a(k,l))$
Hence: $LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y) \le a(k,l) \le \hat{a}(k,l)$
33
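MSA: the Carrillo and Lipman bounds
LM can be measured on ANY heuristic alignment:
$LM = \sum_{x<y} \ddot{a}(x,y)$, with $\ddot{a}(k,l)$ the score of the projection of k and l in the heuristic alignment
$LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y) \le a(k,l) \le \hat{a}(k,l)$
The better LM, the tighter the bounds…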
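The bound itself is a one-line computation once the optimal pairwise scores and a heuristic MSA score are available. The sketch below is an illustration only: the scores are made-up numbers, the function name is hypothetical, and scores are treated as similarities (higher is better), matching the direction of the inequalities on the slides.

```python
def carrillo_lipman_bounds(optimal_pair_scores, heuristic_msa_score):
    """For each pair (k, l), return (lower, upper) bounds on the score of its
    projection in the optimal MSA:
        lower = LM + a_hat(k, l) - sum of all a_hat(x, y)
        upper = a_hat(k, l)
    """
    total = sum(optimal_pair_scores.values())
    return {
        pair: (heuristic_msa_score + score - total, score)
        for pair, score in optimal_pair_scores.items()
    }


# Made-up optimal pairwise scores a_hat(k, l) for three sequences.
a_hat = {("k", "l"): 10, ("k", "m"): 8, ("l", "m"): 7}

loose = carrillo_lipman_bounds(a_hat, heuristic_msa_score=22)
tight = carrillo_lipman_bounds(a_hat, heuristic_msa_score=25)
print(loose[("k", "l")])   # (7, 10)
print(tight[("k", "l")])   # (10, 10)
```

Raising LM from 22 to 25 closes the interval completely in this toy case, which is the sense in which a better heuristic alignment gives tighter bounds.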
34
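MSA: the Carrillo and Lipman bounds
[Figure: forward and backward dynamic programming matrices for two sequences of lengths M and N. The best alignment through cell (i, j) scores Best(0-i, 0-j) (forward) + Best(M-i, N-j) (backward).]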
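The forward plus backward idea can be sketched for a single pair of sequences. The code below assumes a plain Needleman-Wunsch scheme with a linear gap penalty and toy match/mismatch values; these parameters and the function names are assumptions for the illustration, not the scoring used by the MSA program. The backward matrix is obtained by running the same recursion on the reversed sequences.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Forward Needleman-Wunsch matrix: F[i][j] is the best score for
    aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


def best_through_each_cell(a, b):
    """Best score of any global alignment constrained to pass through (i, j):
    forward score of the prefixes plus backward score of the suffixes."""
    forward = nw_matrix(a, b)
    backward = nw_matrix(a[::-1], b[::-1])   # suffix scores, via reversal
    n, m = len(a), len(b)
    return [
        [forward[i][j] + backward[n - i][m - j] for j in range(m + 1)]
        for i in range(n + 1)
    ]


through = best_through_each_cell("GATTACA", "GCATGCA")
optimum = nw_matrix("GATTACA", "GCATGCA")[-1][-1]
print(optimum, max(max(row) for row in through))
```

Cells whose value falls below the lower bound for that pair can be excluded from the search, which is how the bounds shrink the hyperspace.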
36
Simultaneous Alignments: DCA
-Few, small, closely related sequences, but less limited than MSA
-Does well when it can run.
-Memory and CPU hungry, but less than MSA
37
Simultaneous With a New Sequence Representation: POA, Partial Order Graph
40
POA
POA makes it possible to represent complex relationships:
-domain deletions
-domain inversions
41
Progressive: ClustalW
42
Progressive Alignment: ClustalW
Feng and Doolittle, 1988; Taylor 198ç
Clustering
43
Progressive Alignment: ClustalW
Dynamic Programming Using A Substitution Matrix
44
Tree Based Alignment: Recursive Algorithm
Align (Node N)
{
    if (N->left_child is a Node) A1 = Align (N->left_child)
    else if (N->left_child is a Sequence) A1 = N->left_child
    if (N->right_child is a Node) A2 = Align (N->right_child)
    else if (N->right_child is a Sequence) A2 = N->right_child
    return dp_alignment (A1, A2)
}
[Figure: guide tree with leaves A, B, C, D, E, F, G]
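The recursion above can be made runnable with a small stand-in for `dp_alignment`. This is a sketch under stated assumptions: the guide tree is given as nested 2-tuples of sequences, and two sub-alignments are joined by a Needleman-Wunsch pass over their columns with a toy sums-of-pairs column score (match 1, mismatch -1, gap -1). ClustalW itself uses substitution matrices, sequence weights and position-specific gap penalties, none of which are modelled here.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Toy sums-of-pairs score between two columns (tuples of characters)."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue                      # gap against gap costs nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def dp_alignment(aln1, aln2):
    """Globally align two alignments (lists of equal-length rows)."""
    cols1, cols2 = list(zip(*aln1)), list(zip(*aln2))
    gaps1, gaps2 = ("-",) * len(aln1), ("-",) * len(aln2)
    n, m = len(cols1), len(cols2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + column_score(cols1[i - 1], gaps2)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + column_score(gaps1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                F[i - 1][j] + column_score(cols1[i - 1], gaps2),
                F[i][j - 1] + column_score(gaps1, cols2[j - 1]),
            )
    merged, i, j = [], n, m                   # traceback from the last cell
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]):
            merged.append(cols1[i - 1] + cols2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + column_score(cols1[i - 1], gaps2):
            merged.append(cols1[i - 1] + gaps2)
            i -= 1
        else:
            merged.append(gaps1 + cols2[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def align(node):
    """Recursive tree-based alignment: leaves are sequences, nodes are 2-tuples."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return dp_alignment(align(left), align(right))


print(align((("ACGT", "ACGT"), "AGT")))  # ['ACGT', 'ACGT', 'A-GT']
```

Rows come out in left-to-right leaf order of the guide tree, and gaps placed at an inner node are never revisited, which is the greedy behaviour the following slides discuss.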
45
Progressive Alignment: ClustalW
-Depends on the ORDER of the sequences (Tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS:
  -Substitution Matrix.
  -Penalties (Gop: gap opening penalty, Gep: gap extension penalty).
  -Sequence Weight.
  -Tree making Algorithm.
46
Progressive Alignment: ClustalW
Weighting Within ClustalW
47
Progressive Alignment: ClustalW
Position Specific GOP (gap opening penalty)
48
Progressive Alignment: ClustalW
ClustalW is the most Popular Method
-Fast
-Greedy Heuristic (No Guarantee).
-Scales Well: $N^3$, $N^2 L^2$
49
Progressive Alignment With a Heuristic DP: MAFFT
51
Progressive And Consistency Based: Dialign II
52
Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.
53
Dialign II
-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs
54
Progressive And Consistency Based: T-COFFEE
55
Mixing Local and Global Alignments
[Figure: Local Alignment and Global Alignment are combined, followed by Extension, then Multiple Sequence Alignment]
56
What is a library?
Extension + T-Coffee: Library Based Multiple Sequence Alignment
Library with two sequences:
2
Seq1 MySeq
Seq2 MyotherSeq
#1 2
1 1 25
3 8 70
….
Library with three sequences:
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
….
57
Iterative
58
Iterative Methods
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
59
Iterative Methods: Prrp
[Flowchart]
Initial Alignment
-> Tree and weights computation (Outer Iteration)
-> Realign two sub-groups (Inner Iteration)
-> Alignment converged? NO: realign two sub-groups again. YES: continue.
-> Weights converged? NO: recompute the tree and weights. YES: End.
60
Iterative Stochastic: SAGA, The Genetic Algorithm
64
Automatic scheduling of the operators
66
Weighting Schemes
67
The Problem
-The Sequences Contain Correlated Information
-Most Scoring Schemes Ignore this Correlation
68
Weighting Sequence Pairs with a Tree: Carrillo and Lipman Rationale I
69
[Figure: tree with leaves A, B, C, D, E, F, G]
QUESTION: Which weight for a pair of sequences?
E = an edge of the tree; P = the evolutionary path from A to X
-E must contribute the same weight to every path P that goes through it.
-All the weights of the paths using E must sum to 1: $\sum_{P \ni E} W_P = 1$.
-$W_P = \prod_{k \in P} \frac{1}{N_k - 1}$, where $N_k$ is the number of edges meeting at node k.
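The pair weight can be computed from the topology alone. The sketch below assumes the weight of a path is the product of 1/(N_k - 1) over the internal nodes k it crosses, with N_k the number of edges meeting at node k; the adjacency-dict tree layout and the function names are choices made for this example.

```python
def tree_path(tree, start, goal):
    """Unique path between two nodes of an undirected tree (adjacency dict)."""
    stack = [(start, [start])]
    while stack:
        node, trail = stack.pop()
        if node == goal:
            return trail
        for neighbour in tree[node]:
            if neighbour not in trail:        # never walk back along the trail
                stack.append((neighbour, trail + [neighbour]))
    raise ValueError("nodes are not connected")


def pair_weight(tree, a, b):
    """Product of 1 / (N_k - 1) over the internal nodes k between a and b."""
    weight = 1.0
    for node in tree_path(tree, a, b)[1:-1]:
        weight /= len(tree[node]) - 1
    return weight


# Star-like tree with three leaves: one internal node X of degree 3.
three_leaves = {"A": ["X"], "B": ["X"], "C": ["X"], "X": ["A", "B", "C"]}
# Four leaves, two internal nodes U and V joined by an internal edge.
four_leaves = {
    "A": ["U"], "B": ["U"], "C": ["V"], "D": ["V"],
    "U": ["A", "B", "V"], "V": ["C", "D", "U"],
}

print(pair_weight(three_leaves, "A", "B"))   # 0.5
print(pair_weight(four_leaves, "A", "C"))    # 0.25
```

In the four-leaf tree the three paths that use the edge next to A (AB, AC, AD) weigh 0.5 + 0.25 + 0.25 = 1, as the rule on the slide requires. Branch lengths never enter the computation.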
70
USAGE
71
PROBLEM: Weight Depends Only on the Tree Topology
[Figure: two trees on A, B, C with the same topology, both giving AB: 0.5, AC: 0.5, BC: 0.5]
72
Weighting Sequences with a Tree: ClustalW Weights
73
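[Figure: rooted tree with leaves A, B, C, D, E, F, G. An edge shared by four sequences contributes W = Length * 1/4, an edge shared by two sequences W = Length * 1/2, an edge leading to a single sequence W = Length * 1.]
QUESTION: Which weight for each sequence?
$W_{seq} = \sum_{\text{edges from root to seq}} \frac{\text{Edge Length}}{\text{Number of Sequences Sharing Edge}}$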
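The sequence weight described here, each edge on the way from the root contributing its length divided by the number of sequences below it, can be sketched as follows. The nested-list tree layout, the branch lengths and the function names are assumptions for this illustration; any normalisation step a real implementation applies afterwards is left out.

```python
def leaf_names(tree):
    """Names of the sequences below a node. A leaf is a string, an internal
    node is a list of (subtree, branch_length) pairs."""
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaf_names(child)]


def tree_weights(tree, inherited=0.0, weights=None):
    """Weight of a sequence = sum over its root-to-leaf edges of
    edge length / number of sequences sharing that edge."""
    if weights is None:
        weights = {}
    if isinstance(tree, str):
        weights[tree] = inherited
        return weights
    for child, length in tree:
        share = length / len(leaf_names(child))
        tree_weights(child, inherited + share, weights)
    return weights


# Hypothetical rooted tree: A and B are close relatives, C is distant.
tree = [([("A", 0.2), ("B", 0.2)], 0.4), ("C", 0.5)]
print(tree_weights(tree))
```

A and B split the 0.4 edge they share, so each ends at 0.2 + 0.2 = 0.4, while the lone sequence C keeps its whole 0.5 edge and receives the largest weight.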
74
USAGE
75
PROBLEM: Overweight of Distant Sequences
[Figure: tree with leaves D, E, F, G and a distant sequence C]
-C will dominate the alignment
-C will be very difficult to align
76
Performance Comparison Using Collections of Reference Alignments: BaliBase and Ribosomal RNA
77
What Is BaliBase
BaliBase is a collection of reference multiple alignments.
The structures of the sequences are known and were used to assemble the MALN (multiple alignment).
Evaluation is carried out by comparing the structure-based reference alignment with its sequence-based counterpart.
78
What Is BaliBase
[Figure: the BaliBase reference alignment (DALI, Sap …) and the alignment produced by Method X feed into a Comparison]
79
What Is BaliBase: Description
Source: BaliBase, Thompson et al, NAR, 1999
PROBLEM:
-Even Phylogenetic Spread
-One Outlier Sequence
-Two Distantly Related Groups
-Long Internal Indel
-Long Terminal Indel
80
Choosing The Right Method
81
Choosing The Right Method (POA Evaluation)
82
Choosing The Right Method (POA Evaluation)
83
Choosing The Right Method (MAFFT Evaluation)
84
Choosing The Right Method (MAFFT Evaluation)
85
Choosing The Right Method (MAFFT Evaluation)
86
Conclusion
87
Which Method?
Source: BaliBase, Thompson et al, NAR, 1999
Strategy for each BaliBase PROBLEM category, in slide order:
-ClustalW, T-Coffee, MSA, DCA
-PrrP, T-Coffee
-Dialign, T-Coffee
-Dialign, T-Coffee
88
Methods / Situations
1-Carrillo and Lipman:
-MSA, DCA.
-Few, small, closely related sequences.
-Do well when they can run.
2-Segment Based:
-DIALIGN, MACAW.
-May align too few residues
-Good for long indels
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, sometimes inaccurate
-Good profile generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and sensitive
89
Addresses
-MAFFT (Progressive): www.biophys.kyoto-u.jp/katoh
-POA (Progressive/Simultaneous): www.bioinformatics.ucla.edu/poa
-MUSCLE (Progressive/Iterative): www.drive5.com/muscle/