
Cédric Notredame (07/11/2015) Recent Progress in Multiple Sequence Alignments: A Survey Cédric Notredame.


1 Cédric Notredame (07/11/2015) Recent Progress in Multiple Sequence Alignments: A Survey Cédric Notredame

2 Our Scope What Are the Existing Methods? How Do They Work: -Assembly Algorithms -Weighting Schemes. When Do They Work? Which Future?

3 Cédric Notredame (07/11/2015) Outline -Introduction -A taxonomy of the existing Packages -A few algorithms… -Performance Comparison using BaliBase

4 Cédric Notredame (07/11/2015) Introduction

5 What Is A Multiple Sequence Alignment? An MSA is a MODEL. It Indicates the RELATIONSHIP between residues of different sequences. LIKE ANY MODEL, It REVEALS -Similarities -Inconsistencies

6 How Can I Use A Multiple Sequence Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

Uses: Extrapolation, Motifs/Patterns, Phylogeny, Profiles, Struc. Prediction.
Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques.

7 How Can I Use A Multiple Sequence Alignment? Multiple Alignment Is the Most INTEGRATIVE Method Available Today. We Need MSAs to INCORPORATE Existing DATA

8 Why Is It Difficult To Compute A Multiple Sequence Alignment? A CROSSROAD PROBLEM. BIOLOGY: What is A Good Alignment? COMPUTATION: What is THE Good Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

9 Why Is It Difficult To Compute A Multiple Sequence Alignment? BIOLOGY and COMPUTATION: A CIRCULAR PROBLEM.... Good Sequences <-> Good Alignment

10 Cédric Notredame (07/11/2015) A Taxonomy of Multiple Sequence Alignment Methods

11 Cédric Notredame (07/11/2015) Grouping According to the assembly Algorithm

12
-Simultaneous, As Opposed to Progressive [Simultaneous: they simultaneously use all the information]
-Exact, As Opposed to Heuristic [Heuristics: cut corners, like Blast vs SW] [Heuristics: do not guarantee an optimal solution]
-Stochastic, As Opposed to Deterministic [Stochastic: contain an element of randomness] [Stochastic: example of a Monte Carlo surface estimation]
-Iterative, As Opposed to Non Iterative [Iterative: most stochastic methods are iterative] [Iterative: run the same algorithm many times]
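The Monte Carlo surface estimation mentioned above can be illustrated with a minimal sketch. The function name `estimate_disc_area`, the sample count and the fixed seed are assumptions made for illustration; the slide only names the idea. Random points are drawn in a square and the fraction falling inside the unit disc estimates its surface.

```python
import random


def estimate_disc_area(samples: int = 100_000, seed: int = 0) -> float:
    """Estimate the surface of the unit disc by random sampling (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    inside = 0
    for _ in range(samples):
        # Draw a point uniformly in the square [-1, 1] x [-1, 1] (surface 4).
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Fraction of hits times the surface of the square.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_disc_area())  # close to pi; the exact value depends on the seed
```

More samples give a tighter estimate, which is the same trade-off stochastic aligners face between run time and solution quality.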

13 [Figure: methods grouped as Iterative, Progressive, Simultaneous and Non tree based (GAs, HMMs). Methods shown: Iteralign, Prrp, SAM, HMMer, SAGA, GA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign]

14 [Figure: the same methods grouped as Iterative, Progressive and Simultaneous, with SAGA labelled Stochastic]

15 Cédric Notredame (07/11/2015) NEARLY EVERY OPTIMISATION ALGORITHM HAS BEEN APPLIED TO THE MSA PROBLEM!!!

16 Cédric Notredame (07/11/2015) Grouping According to the Objective Function

17 Scoring an Alignment: Evolutionary Based Methods. BIOLOGY: How many events separate my sequences? Such an evaluation relies on a biological model. COMPUTATION: Every position must be independent

18 REAL Tree Model: ALL the sequences evolved from the same ancestor. [Figure: tree over the column A, A, A, C, C with inferred ancestral residues] Tree: Cost=1. PROBLEM: We do not know the true tree

19 STAR Tree Model: ALL the sequences have the same ancestor. [Figure: star tree over the column A, A, A, C, C with ancestor A] Star Tree: Cost=2. PROBLEM: the star tree is phylogenetically wrong

20 Sums of Pairs Model = Every sequence is the ancestor of every sequence. [Figure: column A, A, A, C, C with every pair compared] Sums of Pairs: Cost=6. PROBLEM: -over-estimation of the mutation costs -Requires a weighting scheme. [s(a,b): matrix] [i: column i] [k, l: seq index]
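The star tree and sums of pairs costs quoted on the slides can be checked with a small sketch. It assumes a unit mismatch cost and a column made of the residues A, A, A, C, C; the function names are hypothetical. The star tree ancestor is taken to be the most frequent residue, which is an assumption rather than something the slides state.

```python
from collections import Counter
from itertools import combinations


def unit_cost(a: str, b: str) -> int:
    """Assumed cost model: 0 for identical residues, 1 for a substitution."""
    return 0 if a == b else 1


def sum_of_pairs_cost(column: str) -> int:
    """Compare every pair of sequences (k, l) in the column."""
    return sum(unit_cost(a, b) for a, b in combinations(column, 2))


def star_tree_cost(column: str) -> int:
    """Compare every sequence with a single ancestor (assumed: the most common residue)."""
    ancestor = Counter(column).most_common(1)[0][0]
    return sum(unit_cost(ancestor, residue) for residue in column)


if __name__ == "__main__":
    column = "AAACC"
    print(star_tree_cost(column))     # 2
    print(sum_of_pairs_cost(column))  # 6
```

The same single substitution event is counted once by the real tree, twice by the star tree and six times by the sums of pairs, which is the over-estimation the slide points out.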

21 Sums of Pairs: Some of its limitations (Durbin, p140). Column of N Leucines: Cost = 5*N*(N-1)/2 [5: Leucine vs Leucine with Blosum50]. One Leucine replaced by a Glycine [glycine effect]: Cost = 5*N*(N-1)/2 - (5)*(N-1) + (-4)*(N-1) = 5*N*(N-1)/2 - (9)*(N-1)

22 Sums of Pairs: Some of its limitations (Durbin, p140). Delta = (9)*(N-1) / (5*N*(N-1)/2) = 2*(9)*(N-1) / (5*N*(N-1)) = 18/(5*N). [Plot: Delta against N] Conclusion: The more Leucines, the less expensive it gets to add a Glycine to the column...
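The glycine effect can be reproduced numerically. This is a sketch using the two substitution scores quoted on the slide (5 for Leucine against Leucine, -4 for Leucine against Glycine); the function names are hypothetical.

```python
def score_all_leucine(n: int, s_ll: int = 5) -> int:
    """Sums of pairs score of a column of n Leucines: n*(n-1)/2 identical pairs."""
    return s_ll * n * (n - 1) // 2


def score_one_glycine(n: int, s_ll: int = 5, s_lg: int = -4) -> int:
    """Same column with one Leucine replaced by a Glycine."""
    leu_pairs = (n - 1) * (n - 2) // 2  # pairs among the remaining Leucines
    gly_pairs = n - 1                   # pairs involving the Glycine
    return s_ll * leu_pairs + s_lg * gly_pairs


def relative_drop(n: int) -> float:
    """Delta: score lost by the substitution, relative to the conserved column."""
    full = score_all_leucine(n)
    return (full - score_one_glycine(n)) / full


if __name__ == "__main__":
    for n in (5, 10, 100):
        print(n, relative_drop(n))  # equals 18 / (5 * n)
```

The relative penalty shrinks as 1/N, whereas a single substitution in a large, otherwise conserved family is arguably more significant, not less.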

23 Entropy Based Functions. Model: Minimize the entropy (variety) in each column. [Figure: example column A, A, A, C, C] PROBLEM: -requires a simultaneous alignment -assumes independent sequences. [number of Alanine (a) in column i] [Score of column i] [a: alphabet] [P can incorporate pseudocounts] S=0 if the column is conserved
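The formula itself did not survive in the transcript, so the sketch below assumes a common form of the column entropy score, S_i = -sum over a of c_ia * log2(p_ia), where c_ia is the count of residue a in column i and p_ia its estimated frequency. The function name, the DNA alphabet and the pseudocount handling are assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column: str, pseudocount: float = 0.0, alphabet: str = "ACGT") -> float:
    """Entropy score of one column; residues are assumed to belong to `alphabet`."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    score = 0.0
    for a in alphabet:
        c = counts[a]
        if c == 0:
            continue  # unobserved residues contribute nothing to the sum
        p = (c + pseudocount) / total  # frequency, optionally smoothed
        score -= c * math.log2(p)
    return score


if __name__ == "__main__":
    print(column_entropy("AAAAA"))  # conserved column
    print(column_entropy("AAACC"))  # mixed column scores higher
```

Without pseudocounts a fully conserved column scores 0, matching the remark on the slide; with pseudocounts the score of a conserved column becomes slightly positive.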

24 Consistency Based Functions. Model: Maximise the consistency (agreement) with a list of constraints (alignments). PROBLEM: -requires a list of constraints. [k and l are sequences, i is a column] [the two residues are found aligned in the list of constraints]
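A consistency score of this kind can be sketched as follows. The representation of a constraint as a tuple (k, residue index in k, l, residue index in l), with 0-based indices, is an assumption made for illustration, as are the function names; an unweighted count is used, whereas real packages weight each constraint.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple

Constraint = Tuple[int, int, int, int]  # (seq k, residue in k, seq l, residue in l)


def residue_indices(row: str) -> List[Optional[int]]:
    """Map each alignment column to the residue index it holds (None for a gap)."""
    indices: List[Optional[int]] = []
    position = 0
    for symbol in row:
        if symbol == "-":
            indices.append(None)
        else:
            indices.append(position)
            position += 1
    return indices


def consistency_score(msa: List[str], constraints: Set[Constraint]) -> int:
    """Count aligned residue pairs that also appear in the list of constraints."""
    index = [residue_indices(row) for row in msa]
    score = 0
    for i in range(len(msa[0])):                         # i is a column
        for k, l in combinations(range(len(msa)), 2):    # k and l are sequences
            rk, rl = index[k][i], index[l][i]
            if rk is not None and rl is not None and (k, rk, l, rl) in constraints:
                score += 1
    return score


if __name__ == "__main__":
    msa = ["AC-T", "ACGT"]
    constraints = {(0, 0, 1, 0), (0, 1, 1, 2), (0, 2, 1, 3)}
    print(consistency_score(msa, constraints))  # 2
```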

25 [Figure: methods grouped by objective function. Labels: Consistency Based, Weighted Sums of Pairs, Entropy. Methods shown: Iteralign, Dialign, T-Coffee, Praline, Combalign, Prrp, Clustal, POA, MSA, MAFFT, OMA, DCA, SAGA, SAM, HMMer, GIBBS]

26 Cédric Notredame (07/11/2015) A few Multiple Sequence Alignment Algorithms

27 Cédric Notredame (07/11/2015) A Few Algorithms MSA and DCA ClustalW Dialign II Prrp SAGA GIBBS Sampler MAFFT POA

28 Cédric Notredame (07/11/2015) Simultaneous: MSA and DCA

29 Simultaneous Alignments: MSA 1) Set Bounds on each pair of sequences (Carrillo and Lipman) 2) Compute the multiple alignment within the Hyperspace -Few Small Closely Related Sequences. -Does Well When It Can Run. -Memory and CPU hungry

30 MSA: the Carrillo and Lipman bounds

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

[Pairwise projection of sequences k and l]

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

S(multiple alignment) = S(projection chite/wheat) + S(projection chite/trybr) + …

31 MSA: the Carrillo and Lipman bounds
a(k,l) = score of the projection of k and l in the optimal MSA
â(k,l) = score of the optimal alignment of k and l
Σ(a(x,y)) = score of the complete multiple alignment
Upper bound: a(k,l) <= â(k,l), a(k,m) <= â(k,m). Lower bound: ?

32 MSA: the Carrillo and Lipman bounds
LM: a lower bound for the complete MSA
LM <= Σ(â(x,y)) - (â(k,l) - a(k,l))
a(k,l) >= LM + â(k,l) - Σ(â(x,y))
Bounds: LM + â(k,l) - Σ(â(x,y)) <= a(k,l) <= â(k,l)

33 MSA: the Carrillo and Lipman bounds
LM can be measured on ANY heuristic alignment
ä(k,l) = score of the projection of k and l in the heuristic alignment
LM = Σ(ä(x,y))
Bounds: LM + â(k,l) - Σ(â(x,y)) <= a(k,l) <= â(k,l)
The better LM, the tighter the bounds…
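The bounds can be computed directly from two tables of pairwise scores. This sketch assumes scores where higher is better and uses made-up numbers; the function name and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def carrillo_lipman_bounds(
    optimal: Dict[Pair, float],    # â(k,l): optimal pairwise alignment scores
    heuristic: Dict[Pair, float],  # ä(k,l): projection scores in a heuristic MSA
) -> Dict[Pair, Tuple[float, float]]:
    """Return (lower, upper) bounds on a(k,l) for every pair of sequences."""
    lm = sum(heuristic.values())           # LM = Σ(ä(x,y))
    total_optimal = sum(optimal.values())  # Σ(â(x,y))
    return {
        pair: (lm + best - total_optimal, best)
        for pair, best in optimal.items()
    }


if __name__ == "__main__":
    # Illustrative scores only, not taken from real sequences.
    optimal = {("A", "B"): 10.0, ("A", "C"): 8.0, ("B", "C"): 9.0}
    heuristic = {("A", "B"): 9.0, ("A", "C"): 7.0, ("B", "C"): 8.0}
    for pair, (low, high) in carrillo_lipman_bounds(optimal, heuristic).items():
        print(pair, low, high)
```

Each pairwise projection may lose at most Σ(â(x,y)) - LM relative to its own optimum, so a better heuristic alignment shrinks the region of the hyperspace that has to be explored.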

34 MSA: the Carrillo and Lipman bounds. [Figure: two M x N dynamic programming matrices] Forward: Best(0-i, 0-j) + Backward: Best(M-i, N-j)

35 Simultaneous Alignments: MSA 1) Set Bounds on each pair of sequences (Carrillo and Lipman) 2) Compute the multiple alignment within the Hyperspace -Few Small Closely Related Sequences. -Does Well When It Can Run. -Memory and CPU hungry

36 Simultaneous Alignments: DCA -Few Small Closely Related Sequences, but less limited than MSA -Does Well When It Can Run. -Memory and CPU hungry, but less than MSA

37 Simultaneous With a New Sequence Representation: POA-Partial Order Graph

38 Cédric Notredame (07/11/2015)

39

40 POA POA makes it possible to represent complex relationships: -domain deletion -domain inversions

41 Cédric Notredame (07/11/2015) Progressive: ClustalW

42 Progressive Alignment: ClustalW. Feng and Doolittle, 1987; Taylor, 1988. Clustering

43 Cédric Notredame (07/11/2015) Dynamic Programming Using A Substitution Matrix Progressive Alignment: ClustalW

44 Tree based Alignment: Recursive Algorithm

Align (Node N)
{
    if (N->left_child is a Node) A1 = Align (N->left_child)
    else if (N->left_child is a Sequence) A1 = N->left_child
    if (N->right_child is a Node) A2 = Align (N->right_child)
    else if (N->right_child is a Sequence) A2 = N->right_child
    return dp_alignment (A1, A2)
}

[Figure: guide tree with leaves A, D, E, F, G, C, B]
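The recursion above can be made runnable with a toy `dp_alignment`. This is a sketch under stated assumptions, not ClustalW's implementation: the `Node` class, the +1/-1 column score and the linear gap penalty of -1 are all hypothetical choices, and no sequence weighting or position specific gap penalty is applied.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    left: Union["Node", str]   # a sub-tree or a raw sequence
    right: Union["Node", str]


def column_score(col1: str, col2: str) -> float:
    """Average +1/-1 score between two alignment columns; gaps score 0 (toy scheme)."""
    total = 0
    for a in col1:
        for b in col2:
            if a != "-" and b != "-":
                total += 1 if a == b else -1
    return total / (len(col1) * len(col2))


def dp_alignment(a1: List[str], a2: List[str], gap: float = -1.0) -> List[str]:
    """Global alignment of two alignments (profiles) by dynamic programming."""
    cols1 = ["".join(c) for c in zip(*a1)]
    cols2 = ["".join(c) for c in zip(*a2)]
    n, m = len(cols1), len(cols2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner, collecting merged columns.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1])):
            merged.append(cols1[i - 1] + cols2[j - 1])
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            merged.append(cols1[i - 1] + "-" * len(a2))  # gap column in a2
            i -= 1
        else:
            merged.append("-" * len(a1) + cols2[j - 1])  # gap column in a1
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def align(node: Union[Node, str]) -> List[str]:
    """Post-order traversal of the guide tree: align the children, then merge them."""
    if isinstance(node, str):
        return [node]
    return dp_alignment(align(node.left), align(node.right))


if __name__ == "__main__":
    tree = Node(Node("ACGT", "ACT"), "AGT")
    for row in align(tree):
        print(row)
```

Gaps introduced when two leaves are aligned are never revisited higher up the tree, which is why the result depends on the order of the sequences.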

45 Cédric Notredame (07/11/2015) Progressive Alignment : ClustalW -Depends on the ORDER of the sequences (Tree). -Depends on the CHOICE of the sequences. -Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.

46 Cédric Notredame (07/11/2015) Weighting Within ClustalW Progressive Alignment : ClustalW Weighting

47 Cédric Notredame (07/11/2015) Position Specific GOP Progressive Alignment : ClustalW GOP

48 Progressive Alignment: ClustalW. ClustalW is the most Popular Method -Fast -Greedy Heuristic (No Guarantee). -Scales Well: N^3, N^2 L^2

49 Cédric Notredame (07/11/2015) Progressive Alignment With a Heuristic DP: MAFFT

50 Cédric Notredame (07/11/2015)

51 Progressive And Consistency Based: Dialign II

52 Dialign II 1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair. 2) Re-evaluate each segment pair according to its consistency with the others. 3) Assemble the alignment according to the segment pairs.

53 Cédric Notredame (07/11/2015) Dialign II -May Align Too Few Residues -No Gap Penalty -Does well with ESTs

54 Progressive And Consistency Based: T-COFFEE

55 Mixing Local and Global Alignments: Local Alignment + Global Alignment -> Extension -> Multiple Sequence Alignment

56 What is a library? Library Based Multiple Sequence Alignment: Extension + T-Coffee

Library with 2 sequences:
2
Seq1 MySeq
Seq2 MyotherSeq
#1 2
1 1 25
3 8 70
….

Library with 3 sequences:
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
#1 2
1 1 25
#1 3
3 8 70
….
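A library in the simplified layout shown on the slide can be read with a few lines of code. This sketch assumes that layout only (a sequence count, one "name sequence" line per sequence, "#k l" pair headers, then "residue residue weight" lines); it is not a parser for the full file format used by the T-Coffee package, and `parse_library` is a hypothetical name.

```python
from typing import List, Tuple

# ((seq k, seq l), residue in k, residue in l, weight), all numbered from 1
Constraint = Tuple[Tuple[int, int], int, int, int]


def parse_library(text: str) -> Tuple[List[str], List[str], List[Constraint]]:
    """Parse the simplified library layout; elided '….' lines must be removed first."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    count = int(lines[0][0])
    names = [fields[0] for fields in lines[1:1 + count]]
    sequences = [fields[1] for fields in lines[1:1 + count]]
    constraints: List[Constraint] = []
    pair = (0, 0)
    for fields in lines[1 + count:]:
        if fields[0].startswith("#"):
            # Header such as "#1 2": the following lines relate sequences 1 and 2.
            pair = (int(fields[0][1:]), int(fields[1]))
        else:
            res_k, res_l, weight = (int(value) for value in fields)
            constraints.append((pair, res_k, res_l, weight))
    return names, sequences, constraints


if __name__ == "__main__":
    library = "2\nSeq1 MySeq\nSeq2 MyotherSeq\n#1 2\n1 1 25\n3 8 70\n"
    print(parse_library(library))
```

Each constraint line says that one residue of the first sequence is aligned with one residue of the second, with the given weight.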

57 Cédric Notredame (07/11/2015) Iterative

58 Progressive Iterative Methods -HMMs, HMMER, SAM. -Slow, Sometimes Inaccurate -Good Profile Generators

59 Iterative Methods: Prrp. Initial Alignment -> Tree and weights computation -> Realign two sub-groups -> Alignment converged? (NO: realign two sub-groups again, Inner Iteration) -> Weights converged? (NO: recompute tree and weights, Outer Iteration; YES: End)

60 Iterative Stochastic: SAGA, The Genetic Algorithm

61 Cédric Notredame (07/11/2015)

62

63

64 Automatic scheduling of the operators

65 Cédric Notredame (07/11/2015)

66 Weighting Schemes

67 Cédric Notredame (07/11/2015) The Problem The sequences Contain Correlated Information Most scoring Schemes Ignore this Correlation

68 Cédric Notredame (07/11/2015) Weighting Sequence Pairs with a Tree: Carillo and Lipman Rationale I

69 [Figure: tree with leaves A, D, E, F, G, C, B] E = Edge. P = Evolutive path from A to X. QUESTION: Which weight for a pair of sequences? E must contribute the same weight to every path P that goes through it. All the weights using E must sum to 1: Σ(W P,E) = 1. Wp = Π 1/(Nk-1), with Nk: number of edges meeting on node k.

70 Cédric Notredame (07/11/2015) USAGE

71 Cédric Notredame (07/11/2015) PROBLEM: Weight Depends only on the Tree topology B A C AB: 0.5 AC: 0.5 BC: 0.5. B A C AB: 0.5 AC: 0.5 BC: 0.5.
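The pair weights in this example can be reproduced with a short sketch. It assumes the weight of a pair is the product of 1/(Nk-1) over the internal nodes on the path between the two sequences, with the tree given as an adjacency dictionary; the function names are hypothetical. Branch lengths are deliberately absent, since only the topology enters the formula.

```python
from math import prod
from typing import Dict, List, Optional, Set


def find_path(tree: Dict[str, List[str]], start: str, goal: str,
              seen: Optional[Set[str]] = None) -> Optional[List[str]]:
    """Depth-first search for the unique path between two nodes of a tree."""
    if seen is None:
        seen = {start}
    if start == goal:
        return [start]
    for neighbour in tree[start]:
        if neighbour not in seen:
            seen.add(neighbour)
            rest = find_path(tree, neighbour, goal, seen)
            if rest is not None:
                return [start] + rest
    return None


def pair_weight(tree: Dict[str, List[str]], a: str, b: str) -> float:
    """Product of 1/(Nk - 1) over the internal nodes k between leaves a and b."""
    nodes = find_path(tree, a, b)
    if nodes is None:
        raise ValueError("the two sequences are not connected")
    return prod(1.0 / (len(tree[k]) - 1) for k in nodes[1:-1])


if __name__ == "__main__":
    star = {"A": ["n"], "B": ["n"], "C": ["n"], "n": ["A", "B", "C"]}
    print(pair_weight(star, "A", "B"))  # 0.5
```

On a four-leaf tree ((A,B),(C,D)) the same function gives 0.5 for A-B and 0.25 for A-C, and the weights of all paths crossing any one edge sum to 1.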

72 Cédric Notredame (07/11/2015) Weighting Sequences with a Tree Clustal W Weights

73 [Figure: rooted tree with leaves A, D, E, F, C, B, G] QUESTION: Which weight for sequences? Each edge contributes its length divided by the number of sequences sharing it: W=Length*1/4, W=Length*1/2, W=Length*1. W seq = Σ (Edge Length / Number of Sequences Sharing Edge)
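The rule on this slide can be sketched as a recursive walk down a rooted tree. The tree encoding (a leaf is a name, an internal node is a list of (subtree, edge length) pairs), the function names and the branch lengths are assumptions made for illustration; any normalisation of the final weights is left out.

```python
from typing import Dict, List, Tuple, Union

Tree = Union[str, List[Tuple["Tree", float]]]


def leaf_names(tree: Tree) -> List[str]:
    """All sequences found below a node."""
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaf_names(child)]


def sequence_weights(tree: Tree, inherited: float = 0.0) -> Dict[str, float]:
    """Sum, from root to leaf, of edge length / number of sequences sharing the edge."""
    if isinstance(tree, str):
        return {tree: inherited}
    weights: Dict[str, float] = {}
    for child, length in tree:
        share = length / len(leaf_names(child))  # split the edge among its leaves
        weights.update(sequence_weights(child, inherited + share))
    return weights


if __name__ == "__main__":
    # ((A:1, B:1):2, C:3) with made-up branch lengths.
    tree = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]
    print(sequence_weights(tree))  # {'A': 2.0, 'B': 2.0, 'C': 3.0}
```

Closely related sequences share most of their edges and so split the weight, while an isolated sequence on a long branch keeps all of it.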

74 Cédric Notredame (07/11/2015) USAGE

75 Cédric Notredame (07/11/2015) PROBLEM: Overweight of distant sequences DE F G C -C Will dominate the Alignment -C Will be very Difficult to align

76 Cédric Notredame (07/11/2015) Performance Comparison Using Collections of Reference Alignments: BaliBase and Ribosomal RNA

77 What Is BaliBase? BaliBase is a collection of reference multiple alignments. The structures of the sequences are known and were used to assemble the multiple alignments. Evaluation is carried out by comparing the structure based reference alignment with its sequence based counterpart

78 Cédric Notredame (07/11/2015) What Is BaliBase BaliBase DALI, Sap …  Method X Comparison

79 What Is BaliBase: BaliBase Description. PROBLEM: Even Phylogenetic Spread. One Outlier Sequence. Two Distantly Related Groups. Long Internal Indel. Long Terminal Indel. Source: BaliBase, Thompson et al, NAR, 1999

80 Cédric Notredame (07/11/2015) Choosing The Right Method

81 Cédric Notredame (07/11/2015) Choosing The Right Method (POA Evaluation)

82 Cédric Notredame (07/11/2015) Choosing The Right Method (POA Evaluation)

83 Cédric Notredame (07/11/2015) Choosing The Right Method (MAFFT evaluation)

84 Cédric Notredame (07/11/2015) Choosing The Right Method (MAFFT evaluation)

85 Cédric Notredame (07/11/2015) Choosing The Right Method (MAFFT evaluation)

86 Cédric Notredame (07/11/2015) Conclusion

87 Cédric Notredame (07/11/2015) What Is BaliBase Which Method ? PROBLEM Source: BaliBase, Thompson et al, NAR, 1999, Strategy ClustalW, T-coffee, MSA, DCA PrrP, T-Coffee Dialign T-Coffee Dialign T-Coffee

88 Methods / Situations
1-Carrillo and Lipman: -MSA, DCA. -Few Small Closely Related Sequences.
2-Segment Based: -DIALIGN, MACAW. -May Align Too Few Residues -Good For Long Indels -Do Well When They Can Run.
3-Iterative: -HMMs, HMMER, SAM. -Slow, Sometimes Inaccurate -Good Profile Generators
4-Progressive: -ClustalW, Pileup, Multalign… -Fast and Sensitive

89 Addresses
MAFFT (Progressive): www.biophys.kyoto-u.jp/katoh
POA (Progressive/Simultaneous): www.bioinformatics.ucla.edu/poa
MUSCLE (Progressive/Iterative): www.drive5.com/muscle/

