Recent Progress in Multiple Sequence Alignments: A Survey


1 Recent Progress in Multiple Sequence Alignments: A Survey
Cédric Notredame

2 Our Scope
What Are the Existing Methods?
How Do They Work: -Assembly Algorithms -Weighting Schemes.
When Do They Work?
Which Future?

3 Outline -Introduction -A taxonomy of the existing Packages
-A few algorithms… -Performance Comparison using BaliBase

4 Introduction

5 What Is A Multiple Sequence Alignment?
An MSA is a MODEL. It indicates the RELATIONSHIP between residues of different sequences. LIKE ANY MODEL, it REVEALS -Similarities -Inconsistencies

6 How Can I Use A Multiple Sequence Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: : * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :

Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques: Extrapolation, Motifs/Patterns, Profiles, Phylogeny, Structure Prediction.

7 How Can I Use A Multiple Sequence Alignment?
Multiple Alignment Is the Most INTEGRATIVE Method Available Today. We Need MSAs to INCORPORATE Existing DATA.

8 Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CROSSROAD PROBLEM
BIOLOGY: What is A Good Alignment?
COMPUTATION: What is THE Good Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: : * . *: *

9 Why Is It Difficult To Compute A Multiple Sequence Alignment?
BIOLOGY / COMPUTATION
CIRCULAR PROBLEM: Good Sequences <-> Good Alignment

10 A Taxonomy of Multiple Sequence Alignment Methods

11 Grouping According to the assembly Algorithm

12 Simultaneous As opposed to Progressive
[Simultaneous: they simultaneously use all the information]
Exact As opposed to Heuristic
[Heuristics: cut corners, like Blast Vs SW]
[Heuristics: do not guarantee an optimal solution]
Stochastic As opposed to Deterministic
[Stochastic: contain an element of randomness]
[Stochastic: Example of a Monte Carlo Surface estimation]
Iterative As opposed to Non Iterative
[Iterative: run the same algorithm many times]
[Iterative: Most stochastic methods are iterative]

13 Simultaneous Clustal Dialign T-Coffee Progressive MSA POA DCA Combalign Non tree based Iterative Iteralign Prrp SAM HMMer SAGA GA OMA Praline MAFFT GAs HMMs

14 Iterative Iteralign Prrp SAM HMMer GA Clustal Dialign T-Coffee Progressive Simultaneous MSA POA OMA Praline MAFFT DCA Combalign SAGA Stochastic

15 NEARLY EVERY OPTIMISATION ALGORITHM HAS BEEN APPLIED TO THE MSA PROBLEM!!!

16 Grouping According to the Objective Function

17 Scoring an Alignment: Evolutionary based methods
BIOLOGY: How many events separate my sequences? Such an evaluation relies on a biological model.
COMPUTATION: Every position must be independent.

18 Real Tree
Model: ALL the sequences evolved from the same ancestor
[Figure: tree over the residues A, A, A, C, C] Tree: Cost=1
PROBLEM: We do not know the true tree

19 Star Tree
Model: ALL the sequences have the same ancestor
[Figure: star tree over the residues A, A, A, C, C] Star Tree: Cost=2
PROBLEM: the star tree is phylogenetically wrong

20 Sums of Pairs
Model: Every sequence is the ancestor of every sequence
[Figure: residues A, A, A, C, C with every pair connected] Sums of Pairs: Cost=6
[s(a,b): matrix] [i: column i] [k, l: seq index]
PROBLEM: -over-estimation of the mutation costs -requires a weighting scheme
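
The sums-of-pairs score can be sketched in a few lines. This is a minimal illustration, not the deck's own formula (which did not survive in the transcript): it assumes the score is s(a,b) summed over every column i and every pair of sequences k < l, and `unit_mismatch` is a hypothetical cost function introduced only for the example. The example column is assumed to hold three A and two C, which under a unit mismatch cost reproduces the Cost=6 of the slide.

```python
from itertools import combinations

def sum_of_pairs(alignment, s):
    """Sum s(a, b) over every column i and every pair of sequences k < l."""
    total = 0
    for column in zip(*alignment):            # column i
        for a, b in combinations(column, 2):  # residues of sequences k and l
            total += s(a, b)
    return total

def unit_mismatch(a, b):
    # Hypothetical cost function: 0 for identical residues, 1 otherwise.
    return 0 if a == b else 1

# One column holding the residues A, A, A, C, C (one residue per sequence).
column = ["A", "A", "A", "C", "C"]
print(sum_of_pairs(column, unit_mismatch))  # 3 * 2 = 6 mismatching pairs
```

The same substitution is counted once for every A/C pair, which is the over-estimation of mutation costs listed as a problem above.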

21 Sums of Pairs: Some of its limitations (Durbin, p140)
Column of N Leucines (L L L ...): Cost = 5*N*(N-1)/2 [5: Leucine Vs Leucine with Blosum50]
Same column with one Glycine (G): Cost = 5*N*(N-1)/2 - (5)*(N-1) + (-4)*(N-1) [glycine effect]
Cost = 5*N*(N-1)/2 - (9)*(N-1)

22 Sums of Pairs: Some of its limitations (Durbin, p140)
Column: L L L ... G
Delta (relative drop in cost caused by the Glycine) = 2*(9)*(N-1) / (5*N*(N-1)) = 2*(9) / (5*N)
[Plot: Delta as a function of N]
Conclusion: The more Leucines, the less expensive it gets to add a Glycine to the column...
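
The arithmetic behind this conclusion can be checked numerically. The sketch below only uses the two Blosum50 values implied by the slides (5 for Leucine vs Leucine, -4 for Leucine vs Glycine); the function names are hypothetical and the column is assumed to contain at least two residues.

```python
from fractions import Fraction

LL = 5    # Leucine vs Leucine, Blosum50 (value used on the slide)
LG = -4   # Leucine vs Glycine, Blosum50 (value used on the slide)

def all_leucine(n):
    """Sum-of-pairs score of a column of n Leucines."""
    return LL * n * (n - 1) // 2

def one_glycine(n):
    """Same column with one Leucine replaced by a Glycine:
    n-1 Leucine/Leucine pairs become Leucine/Glycine pairs."""
    return all_leucine(n) - LL * (n - 1) + LG * (n - 1)

def relative_drop(n):
    """Delta: fraction of the score lost by introducing the Glycine (n >= 2)."""
    return Fraction(all_leucine(n) - one_glycine(n), all_leucine(n))

for n in (4, 10, 100):
    print(n, all_leucine(n), one_glycine(n), relative_drop(n))
```

For every n the printed Delta equals 18/(5*n), so the relative penalty for the Glycine shrinks as the column gets deeper.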

23 Entropy based Functions
Model: Minimize the entropy (variety) in each column
[Score of column i] [number of Alanine (a) in column i] [a: alphabet] [P can incorporate pseudocounts]
Example column: A A A C
S=0 if the column is conserved
PROBLEM: -requires a simultaneous alignment -assumes independent sequences
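
A column score of this kind can be sketched as follows. The formula itself is missing from the transcript, so this assumes the usual minimum-entropy form, S_i = -sum over a of c_ia * log(p_ia), where c_ia is the number of residues a in column i and p_ia their frequency; the function name, the nucleotide alphabet and the pseudocount handling are illustrative choices.

```python
import math
from collections import Counter

def column_entropy_score(column, alphabet="ACGT", pseudocount=0.0):
    """Assumed form: S_i = -sum_a c_ia * log(p_ia).
    Characters outside the alphabet (e.g. gaps) are ignored in the sum."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    score = 0.0
    for a in alphabet:
        c = counts.get(a, 0)
        if c > 0:
            p = (c + pseudocount) / total  # frequency, optionally smoothed
            score -= c * math.log(p)
    return score

print(column_entropy_score("AAAA"))  # conserved column
print(column_entropy_score("AAAC"))  # one deviating residue
```

With no pseudocounts a fully conserved column scores 0, in line with the slide, and any variety in the column raises the score.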

24 Consistency based Functions
Model: Maximise the consistency (agreement) with a list of constraints (alignments)
[k and l are sequences, i is a column] [the two residues are found aligned in the list of constraints]
Example column: A A A C
PROBLEM: -requires a list of constraints

25 Grouping According to the Objective Function
Weighted Sums of Pairs: Prrp, Clustal, POA, MSA, MAFFT, OMA, DCA, SAGA
Consistency Based: Iteralign, Dialign, T-Coffee, Praline, Combalign
Entropy: SAM, HMMer, GIBBS

26 A few Multiple Sequence Alignment Algorithms

27 A Few Algorithms
MSA and DCA, POA, ClustalW, MAFFT, Dialign II, Prrp, SAGA, GIBBS Sampler

28 Simultaneous: MSA and DCA

29 Simultaneous Alignments : MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment (Maln) within the bounded hyperspace
-Few Small Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.

30 MSA: the Carrillo and Lipman bounds
Score of the multiple alignment
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
= sum of the scores of its pairwise projections:
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
+
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
+ ...
[Pairwise projection of sequences k and l]

31 MSA: the Carrillo and Lipman bounds
a(k,l) = score of the projection of k and l in the optimal MSA
Σ a(x,y) = score of the complete multiple alignment
â(k,l) = score of the optimal alignment of k and l
Upper bound: a(k,m) <= â(k,m) for every pair, in particular a(k,l) <= â(k,l)
Lower bound: ? <= a(k,l)

32 MSA: the Carrillo and Lipman bounds
LM: a lower bound for the complete MSA
LM <= Σ â(x,y) - (â(k,l) - a(k,l))
a(k,l) >= LM + â(k,l) - Σ â(x,y)
LM + â(k,l) - Σ â(x,y) <= a(k,l) <= â(k,l)

33 MSA: the Carrillo and Lipman bounds
LM + â(k,l) - Σ â(x,y) <= a(k,l) <= â(k,l)
ä(k,l) = score of the projection of k and l in a heuristic alignment
LM can be measured on ANY heuristic alignment: LM = Σ ä(x,y)
The better LM, the tighter the bounds…
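
The bound on each pairwise projection is simple arithmetic once the pairwise scores are known. The sketch below assumes two dictionaries keyed by sequence pair: the optimal pairwise scores â(k,l) and the projection scores ä(k,l) taken from any heuristic multiple alignment. The function name and the numbers are made up for illustration.

```python
def carrillo_lipman_lower_bounds(optimal_pairwise, heuristic_projection):
    """Return, for every pair (k, l), the lower bound on a(k, l):
    LM + â(k,l) - Σ â(x,y), with LM = Σ ä(x,y)."""
    lm = sum(heuristic_projection.values())   # LM, from the heuristic alignment
    sum_opt = sum(optimal_pairwise.values())  # Σ â(x,y)
    return {pair: lm + a_hat - sum_opt
            for pair, a_hat in optimal_pairwise.items()}

# Hypothetical scores for three sequences 0, 1, 2.
a_hat = {(0, 1): 10, (0, 2): 8, (1, 2): 9}   # optimal pairwise alignments
a_heur = {(0, 1): 9, (0, 2): 7, (1, 2): 9}   # projections of a heuristic MSA

for pair, low in carrillo_lipman_lower_bounds(a_hat, a_heur).items():
    print(pair, low, "<= a(k,l) <=", a_hat[pair])
```

A better heuristic alignment raises LM and therefore every lower bound, which narrows the region of the hyperspace that has to be explored.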

34 MSA: the Carrillo and Lipman bounds
Forward/backward: score of the best path through cell (i, j) = Best(0-i, 0-j) + Best(M-i, N-j)
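
The forward/backward idea can be sketched for one pair of sequences: a forward dynamic programming pass gives the best score from the start to every cell, a pass on the reversed sequences gives the best score from every cell to the end, and their sum is the best score of any path through that cell. The scoring values (match 1, mismatch -1, linear gap -2) and the function names are assumptions made for the example.

```python
def nw_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment scores: f[i][j] = best score of x[:i] against y[:j]."""
    m, n = len(x), len(y)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i * gap
    for j in range(1, n + 1):
        f[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    return f

def best_through_each_cell(x, y):
    """Best(0-i, 0-j) + Best(M-i, N-j) for every cell (i, j)."""
    m, n = len(x), len(y)
    forward = nw_matrix(x, y)
    backward = nw_matrix(x[::-1], y[::-1])  # suffix scores, via reversed input
    return [[forward[i][j] + backward[m - i][n - j] for j in range(n + 1)]
            for i in range(m + 1)]

through = best_through_each_cell("ACGT", "AGT")
for row in through:
    print(row)
```

Cells whose value falls below the lower bound of the pair cannot belong to the projection of the optimal multiple alignment, so they can be excluded from the search.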

35 Simultaneous Alignments : MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment (Maln) within the bounded hyperspace
-Few Small Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.

36 Simultaneous Alignments : DCA
-Few Small Closely Related Sequences, but less limited than MSA
-Does Well When It Can Run.
-Memory and CPU hungry, but less than MSA

37 Simultaneous With a New Sequence Representation:
POA-Partial Order Graph

40 POA POA makes it possible to represent complex relationships: -domain deletion -domain inversions

41 Progressive: ClustalW

42 Progressive Alignment: ClustalW Feng and Doolittle, 1988; Taylor 198ç
Clustering

43 Progressive Alignment: ClustalW
Dynamic Programming Using A Substitution Matrix

44 Tree based Alignment : Recursive Algorithm
Align (Node N)
{
  if (N->left_child is a Node) A1 = Align (N->left_child)
  else if (N->left_child is a Sequence) A1 = N->left_child
  if (N->right_child is a Node) A2 = Align (N->right_child)
  else if (N->right_child is a Sequence) A2 = N->right_child
  return dp_alignment (A1, A2)
}
[Figure: guide tree over the sequences A B C D E F G]
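
The recursion can be made runnable with a deliberately simple stand-in for dp_alignment. In the sketch below the guide tree is a nested tuple, an alignment is a list of equal-length strings, and two alignments are merged by Needleman-Wunsch over their columns with an unweighted sum-of-pairs column score (match 1, mismatch -1, gap -1). These choices are assumptions for illustration and not ClustalW's actual scoring, which uses a substitution matrix, weights and position specific gap penalties.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Unweighted sum-of-pairs score between two alignment columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def dp_alignment(A1, A2):
    """Align two alignments column against column (global DP)."""
    cols1, cols2 = list(zip(*A1)), list(zip(*A2))
    gaps1, gaps2 = ("-",) * len(A1), ("-",) * len(A2)
    m, n = len(cols1), len(cols2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + column_score(cols1[i - 1], gaps2)
        move[i][0] = "up"
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + column_score(gaps1, cols2[j - 1])
        move[0][j] = "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j], move[i][j] = max(
                (score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]), "diag"),
                (score[i - 1][j] + column_score(cols1[i - 1], gaps2), "up"),
                (score[i][j - 1] + column_score(gaps1, cols2[j - 1]), "left"),
            )
    merged, i, j = [], m, n
    while i > 0 or j > 0:  # traceback from the last cell
        if move[i][j] == "diag":
            merged.append(cols1[i - 1] + cols2[j - 1]); i -= 1; j -= 1
        elif move[i][j] == "up":
            merged.append(cols1[i - 1] + gaps2); i -= 1
        else:
            merged.append(gaps1 + cols2[j - 1]); j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]

def align(node):
    """node is a sequence (str) or a (left_child, right_child) tuple."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return dp_alignment(align(left), align(right))

for row in align((("ACGT", "ACGT"), "AGT")):
    print(row)
```

Gaps introduced at an inner node are never revisited higher up in the tree, which is the greedy behaviour that makes the result depend on the order of the sequences.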

45 Progressive Alignment : ClustalW
-Depends on the CHOICE of the sequences. -Depends on the ORDER of the sequences (Tree). -Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.

46 Progressive Alignment : ClustalW Weighting
Weighting Within ClustalW

47 Progressive Alignment : ClustalW GOP
Position Specific GOP

48 Progressive Alignment : ClustalW
ClustalW is the most Popular Method -Greedy Heuristic (No Guarantee). -Fast -Scales Well: N, N L 3 2

49 Progressive Alignment With a Heuristic DP: MAFFT

51 Progressive And Consistency Based: Dialign II

52 Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.

53 Dialign II -May Align Too Few Residues -No Gap Penalty
-Does well with ESTs

54 Progressive And Consistency Based: T-COFFEE

55 Mixing Local and Global Alignments
Local Alignment + Global Alignment -> Extension -> Multiple Sequence Alignment

56 Library Based Multiple Sequence Alignment
What is a library?
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
#1 3
….
2
Seq1 MySeq
Seq2 MyotherSeq
#1 2
….
Extension + T-Coffee -> Library Based Multiple Sequence Alignment

57 Iterative

58 Progressive Iterative Methods
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate -Good Profile Generators

59 Iterative Methods : Prrp
Initial Alignment
Outer Iteration: Tree and weights computation
Inner Iteration: Realign two sub-groups. Alignment converged? NO: repeat the inner iteration. YES: leave the inner iteration.
Weights converged? NO: repeat the outer iteration. YES: End

60 Iterative Stochastic: SAGA, The Genetic Algorithm

64 Automatic scheduling of the operators

66 Weighting Schemes

67 The Problem The sequences Contain Correlated Information
Most scoring Schemes Ignore this Correlation

68 Weighting Sequence Pairs with a Tree:
Carillo and Lipman Rationale I

69 QUESTION: Which Weight for a Pair of Sequences
E = Edge
P = Evolutive path from A to X
E must contribute the same weight to every path P that goes through it.
Nk: Number of edges meeting on node k.
[Figure: tree over the sequences A B C D E F G]
All the weights using E must sum to 1: Σ W(P,E) = 1.
Wp = 1 / Π (Nk - 1) (product over the nodes k along the path P)
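
The pair weight can be computed directly from the tree topology. The sketch below assumes an unrooted tree stored as an adjacency dictionary and takes Wp as 1 divided by the product of (Nk - 1) over the internal nodes on the path between the two sequences; the function names and the node labels are hypothetical.

```python
from fractions import Fraction

def find_path(tree, start, goal, visited=None):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    visited = visited or {start}
    if start == goal:
        return [start]
    for nxt in tree[start]:
        if nxt not in visited:
            rest = find_path(tree, nxt, goal, visited | {nxt})
            if rest:
                return [start] + rest
    return None

def pair_weight(tree, a, b):
    """Wp = 1 / product of (Nk - 1) over the internal nodes k between a and b."""
    weight = Fraction(1)
    for node in find_path(tree, a, b)[1:-1]:  # internal nodes only
        weight /= len(tree[node]) - 1         # Nk = edges meeting on node k
    return weight

# Three sequences joined at one node X, then four sequences on two nodes X, Y.
star = {"A": ["X"], "B": ["X"], "C": ["X"], "X": ["A", "B", "C"]}
quartet = {"A": ["X"], "B": ["X"], "C": ["Y"], "D": ["Y"],
           "X": ["A", "B", "Y"], "Y": ["C", "D", "X"]}

print(pair_weight(star, "A", "B"))     # 1/2
print(pair_weight(quartet, "A", "B"))  # 1/2
print(pair_weight(quartet, "A", "C"))  # 1/4
```

In the quartet, the paths that use the edge A-X are AB, AC and AD, and their weights 1/2 + 1/4 + 1/4 sum to 1, as required. Branch lengths never enter the computation.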

70 USAGE

71 PROBLEM: Weight Depends only on the Tree topology
[Tree 1 over A, B, C] AB: 0.5 AC: 0.5 BC: 0.5.
[Tree 2 over A, B, C] AB: 0.5 AC: 0.5 BC: 0.5.

72 Weighting Sequences with a Tree
Clustal W Weights

73 QUESTION: Which Weight for Sequences?
[Figure: tree over the sequences A B C D E F G, edges labelled W=Length*1/4, W=Length*1/2, ...]
W(G) = Σ(W)
Wseq = Σ (Edge Length / Number of Sequences Sharing Edge)
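
A sketch of this sequence weighting, assuming a rooted tree written as nested (name, branch_length, children) tuples and a sum taken over the edges between the root and each sequence. The representation, the function name and the branch lengths are illustrative only.

```python
def clustal_style_weights(tree):
    """Weight of a sequence = sum, over the edges from the root down to it,
    of edge length / number of sequences sharing that edge."""
    weights = {}

    def leaves_under(node):
        name, _, children = node
        if not children:
            return [name]
        return [leaf for child in children for leaf in leaves_under(child)]

    def walk(node, inherited):
        name, length, children = node
        share = inherited + length / len(leaves_under(node))
        if not children:
            weights[name] = share
        for child in children:
            walk(child, share)

    walk(tree, 0.0)
    return weights

# Hypothetical tree: A hangs off the root, B and C share an internal edge.
tree = ("root", 0.0, [
    ("A", 0.4, []),
    ("BC", 0.2, [("B", 0.1, []), ("C", 0.3, [])]),
])
print(clustal_style_weights(tree))
```

Here B and C each receive half of the shared edge (0.2 * 1/2), so B gets 0.2 while A and C get 0.4: sequences sitting on long private branches are weighted up.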

74 USAGE

75 PROBLEM: Overweight of distant sequences
-C Will dominate the Alignment -C Will be very Difficult to align

76 Performance Comparison Using Collections of Reference Alignments: BaliBase and Ribosomal RNA

77 What Is BaliBase
BaliBase is a collection of reference Multiple Alignments.
The structures of the sequences are known and were used to assemble the MALN.
Evaluation is carried out by comparing the Structure Based Reference Alignment with its Sequence Based Counterpart.
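
One common way to carry out such a comparison is to count how many residue pairs aligned in the reference are also aligned by the method under test. The sketch below is an assumption about the measure, not a description of the BaliBase scoring program; it expects both alignments to contain the same sequences in the same order, with "-" as the gap character, and a reference with at least one aligned pair.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Set of residue pairs ((k, pos_k), (l, pos_l)) that share a column."""
    position = [0] * len(alignment)  # next residue index in each sequence
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for k, char in enumerate(column):
            if char != "-":
                residues.append((k, position[k]))
                position[k] += 1
        pairs.update(combinations(residues, 2))
    return pairs

def fraction_recovered(reference, test):
    """Share of the reference residue pairs found in the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

reference = ["ACG", "ACG"]
print(fraction_recovered(reference, ["ACG", "ACG"]))    # identical alignment
print(fraction_recovered(reference, ["AC-G", "A-CG"]))  # one pair lost
print(fraction_recovered(reference, ["ACG-", "-ACG"]))  # shifted by one
```

Because residues are identified by their index in the ungapped sequence, the measure does not depend on how many gap columns either alignment contains.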

78 What Is BaliBase BaliBase DALI, Sap … Method X Comparison

79 What Is BaliBase
Source: BaliBase, Thompson et al, NAR, 1999
PROBLEM / Description
-Even Phylogenetic Spread
-One Outlier Sequence
-Two Distantly Related Groups
-Long Internal Indel
-Long Terminal Indel

80 Choosing The Right Method

81 Choosing The Right Method (POA Evaluation)

82 Choosing The Right Method (POA Evaluation)

83 Choosing The Right Method (MAFFT evaluation)

84 Choosing The Right Method (MAFFT evaluation)

85 Choosing The Right Method (MAFFT evaluation)

86 Conclusion

87 Which Method?
Source: BaliBase, Thompson et al, NAR, 1999
PROBLEM / Strategy
-ClustalW, T-Coffee, MSA, DCA
-T-Coffee
-PrrP, T-Coffee
-Dialign, T-Coffee
-Dialign, T-Coffee

88 Methods / Situations
1-Carrillo and Lipman: -MSA, DCA. -Few Small Closely Related Sequences. -Do Well When They Can Run.
2-Segment Based: -DIALIGN, MACAW. -May Align Too Few Residues -Good For Long Indels
3-Iterative: -HMMs, HMMER, SAM. -Slow, Sometimes Inaccurate -Good Profile Generators
4-Progressive: -ClustalW, Pileup, Multalign… -Fast and Sensitive

89 Addresses
MAFFT: Progressive, www.biophys.kyoto-u.jp/katoh
POA: Progressive/Simultaneous
MUSCLE: Progressive/Iterative

