http://creativecommons.org/licenses/by-sa/2.0/
CIS786, Lecture 3, Usman Roshan
Maximum Parsimony
–Character-based method
–NP-hard (equivalent to the Steiner tree problem under Hamming distance)
–Widely used in phylogenetics
–Slower than NJ but more accurate
–Faster than ML
–Assumes sites evolve i.i.d.
Maximum Parsimony
Input: a set S of n aligned sequences of length k
Output: a phylogenetic tree T
–leaf-labeled by the sequences in S
–with additional sequences of length k labeling the internal nodes of T
such that the parsimony score, the sum over all edges (u,v) of T of the Hamming distance between the labels of u and v, is minimized.
Maximum Parsimony (example)
Input: four sequences
–ACT
–ACA
–GTT
–GTA
Question: which of the three possible trees has the best MP score?
Maximum Parsimony (example, continued)
The three candidate topologies (figures omitted) have MP scores 5, 7, and 4. The tree pairing ACT with ACA and GTT with GTA achieves score 4 (edge costs 1 + 2 + 1) and is the optimal MP tree.
Maximum Parsimony: computational complexity
Finding the optimal MP tree is NP-hard, but for a fixed tree the optimal internal labeling can be computed in linear time, O(nk).
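The linear-time labeling is computed by Fitch's algorithm: a post-order pass assigns each internal node a set of candidate states per site, counting one change whenever the two children's sets are disjoint. A minimal sketch (the nested-tuple tree representation is an assumption for illustration, not from the slides):

```python
def fitch(node):
    """Return (candidate state sets per site, parsimony score) for a rooted
    binary tree given as nested 2-tuples with sequence strings at the leaves."""
    if isinstance(node, str):                 # leaf: singleton set per site
        return [{c} for c in node], 0
    left, right = node
    lsets, lscore = fitch(left)
    rsets, rscore = fitch(right)
    sets, score = [], lscore + rscore
    for a, b in zip(lsets, rsets):
        if a & b:                             # intersection nonempty: no change
            sets.append(a & b)
        else:                                 # disjoint: take union, count one change
            sets.append(a | b)
            score += 1
    return sets, score

# The quartet example from the slides: the optimal topology scores 4.
print(fitch((("ACT", "ACA"), ("GTT", "GTA")))[1])   # 4
print(fitch((("ACT", "GTT"), ("ACA", "GTA")))[1])   # 5
```

The parsimony score is invariant under the choice of root, so running this rooted pass on any rooting of an unrooted tree gives the same score.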
Local search strategies
(Figure omitted: the space of phylogenetic trees plotted against cost, showing a global optimum and many local optima.)
Local search for MP
–Determine a candidate solution s
–While s is not a local minimum: find a neighbor s’ of s such that MP(s’) < MP(s); if found, set s = s’; else return s and exit
–Time complexity: unknown; it could finish quickly or take a very long time, depending on the starting tree and the local move
–We need to specify how to construct the starting tree and the local move
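The loop above can be sketched as a generic first-improvement local search; `score` and `neighbors` are placeholders standing in for the MP score and a tree move such as NNI:

```python
def local_search(s, score, neighbors):
    """First-improvement local search: repeatedly move to any strictly
    better neighbor; return s once it is a local minimum."""
    while True:
        better = next((t for t in neighbors(s) if score(t) < score(s)), None)
        if better is None:          # no improving neighbor: local minimum
            return s
        s = better

# Toy usage on integers (illustrative only): neighbors of x are x - 1 and
# x + 1, and the score is the distance to 7.
print(local_search(0, lambda x: abs(x - 7), lambda x: [x - 1, x + 1]))  # 7
```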
Starting tree for MP
–Random phylogeny: O(n) time
–Greedy-MP
Greedy-MP takes O(n^3 k) time: each of the n insertion steps tries O(n) candidate edges, and scoring each placement costs O(nk).
Faster Greedy-MP: 3-way labeling
If we can assign optimal labels to each internal node with the tree rooted in every possible way, we can speed up the computation by a factor of n.
Optimal 3-way labeling:
–Sort all 3n subtrees using bucket sort in O(n)
–Starting from the smallest subtrees, compute optimal labelings
–For each subtree rooted at v, the optimal labelings of the children nodes are already computed
–Total time: O(nk)
With the optimal labelings it takes constant time to compute the MP score for each edge, so the total Greedy-MP time is O(n^2 k).
Local moves for MP: NNI
For each internal edge we get two alternative topologies; the neighborhood size is 2n − 6.
Local moves for MP: SPR
The neighborhood size is quadratic in the number of taxa. Computing the minimum number of SPR moves between two rooted phylogenies is NP-hard.
Local moves for MP: TBR
The neighborhood size is cubic in the number of taxa. Computing the minimum number of TBR moves between two rooted phylogenies is NP-hard.
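For an unrooted binary tree on n taxa, the NNI neighborhood has exactly 2(n − 3) trees (the 2n − 6 above), and the SPR neighborhood has 2(n − 3)(2n − 7) trees; the SPR count is Allen and Steel's formula, an addition not stated on the slides. TBR has no comparably simple closed form but grows cubically. A quick sanity check:

```python
def nni_size(n):
    # Two rearrangements per internal edge; an unrooted binary tree
    # on n taxa has n - 3 internal edges.
    return 2 * (n - 3)

def spr_size(n):
    # Allen & Steel's count for unrooted binary trees (assumed here).
    return 2 * (n - 3) * (2 * n - 7)

print(nni_size(10))  # 14
print(spr_size(10))  # 182
```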
Tree Bisection and Reconnection (TBR)
1. Delete an edge, bisecting the tree into two subtrees.
2. Reconnect the two subtrees with a new edge that bifurcates an edge in each subtree.
Local optima are a problem.
Iterated local search (ILS): escape local optima by perturbation. Run local search to a local optimum, perturb the resulting tree, then run local search again from the perturbed tree.
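The perturbation picture above corresponds to this loop; `local_search` and `perturb` are placeholders (e.g., TBR hill climbing and a ratchet step), and the toy usage below is purely illustrative:

```python
import random

def iterated_local_search(s0, score, local_search, perturb, iters=10):
    """Keep the best local optimum seen; each round restarts local search
    from a perturbation of the current best rather than from scratch."""
    best = local_search(s0)
    for _ in range(iters):
        candidate = local_search(perturb(best))
        if score(candidate) < score(best):
            best = candidate
    return best

# Toy usage: minimize f(x) = |x - 42| with unit-step hill climbing plus
# random jump perturbations (not a tree search).
f = lambda x: abs(x - 42)

def hill(x):
    # simple first-improvement descent on the integers
    while f(x - 1) < f(x) or f(x + 1) < f(x):
        x = x - 1 if f(x - 1) < f(x) else x + 1
    return x

print(iterated_local_search(0, f, hill, lambda x: x + random.randint(-5, 5)))  # 42
```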
ILS methods for MP: Ratchet, Iterative-DCM3, TNT
Ratchet
Perturbation input: an alignment and a phylogeny
–Sample with replacement p% of the sites and reweight them to w
–Perform local search on the modified dataset, starting from the input phylogeny
–After completion, reset the alignment to the original and output the local minimum
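Only the site weights change during the perturbation; a sketch of that reweighting step, with hypothetical default values for p and w:

```python
import random

def ratchet_weights(num_sites, p=0.25, w=2):
    """Sample with replacement p% of the sites and boost their weight to w;
    all other sites keep weight 1. The weighted MP search runs on these
    weights, after which all weights are reset to 1."""
    weights = [1] * num_sites
    for _ in range(int(p * num_sites)):
        # sampling WITH replacement: the same site may be drawn twice
        weights[random.randrange(num_sites)] = w
    return weights

weights = ratchet_weights(100)
print(len(weights), set(weights) <= {1, 2})  # 100 True
```

Because the draws are with replacement, at most p% of the sites end up upweighted, and usually slightly fewer.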
Ratchet: escaping local minima by data perturbation. But how well does this perform? We have to examine it experimentally on real data.
Experimental methodology for MP on real data
–Collect alignments of real datasets: usually constructed using ClustalW, followed by manual (by-eye) adjustments; the alignment must be reliable to get a sensible tree!
–Run methods for a fixed time period
–Compare MP scores as a function of time: examine how scores improve over time and the rate of convergence of the different methods (as a function of time, not sequence length)
Experimental methodology for MP on real data
–We use rRNA and DNA alignments obtained from researchers and public databases
–We run iterative improvement and the ratchet, each for 24 hours, beginning from a randomized greedy-MP tree
–Each method was run five times and average scores were plotted
–We use PAUP*, a very widely used software package for many types of phylogenetic analysis
Results: MP score vs. time (plots omitted) on five datasets:
–500 aligned rbcL sequences (Zilla dataset)
–854 aligned rbcL sequences
–2000 aligned Eukaryote sequences
–7180 aligned 3domain sequences
–13921 aligned Proteobacteria sequences
Comparison of MP heuristics
What about other techniques for escaping local minima?
TNT: a combination of divide-and-conquer, simulated annealing, and genetic algorithms
–Sectorial search (random): construct ancestral sequence states using parsimony; randomly select a subset of nodes; compute iterative-improvement trees, and if a better tree is found, replace the current one
–Genetic algorithm (fuse): exchange subtrees between two trees to see if better ones are found
–Default search: (1) do sectorial search starting from five randomized greedy-MP trees; (2) apply the genetic algorithm to find better trees; (3) output the best tree
How does this compare to PAUP*-ratchet?
Experimental methodology for MP on real data
–We use rRNA and DNA alignments obtained from researchers and public databases
–We run PAUP*-ratchet, TNT-default, and TNT-ratchet, each for 24 hours, beginning from randomized greedy-MP trees
–Each method was run five times on each dataset and average scores were plotted
Results (plots omitted) on the same five datasets: 500 aligned rbcL (Zilla), 854 aligned rbcL, 2000 aligned Eukaryotes, 7180 aligned 3domain, and 13921 aligned Proteobacteria.
Can we do even better? Yes! But first let’s look at Disk-Covering Methods.
Disk Covering Methods (DCMs)
DCMs are divide-and-conquer booster methods: they divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree.
DCMs to date:
–DCM1: for improving the statistical performance of distance-based methods
–DCM2: for improving heuristic search for MP and ML
–DCM3: the latest, fastest, and best (in accuracy and optimality) DCM
DCM2 technique for speeding up MP searches
1. Decompose sequences into overlapping subproblems
2. Compute subtrees using a base method
3. Merge subtrees using the Strict Consensus Merger (SCM)
4. Refine to make the tree binary
DCM2 decomposition
Input: distance matrix d, threshold q, sequences S
Algorithm:
1a. Compute a threshold graph G using q and d
1b. Perform a minimum-weight triangulation of G
2. Find a separator X in G which minimizes max_i |X ∪ C_i|, where C_1, ..., C_t are the connected components of G − X
3. Output the subproblems X ∪ C_1, ..., X ∪ C_t
Threshold graph
–Add edges in order of increasing distance until the graph is connected
–Perform a minimum-weight triangulation: NP-hard
–A triangulated graph has a perfect elimination ordering (PEO), and its maximal cliques can be determined in linear time
–Use a greedy triangulation heuristic: compute a PEO by adding vertices that minimize the largest edge added
–Worst case is O(n^3), but fast in practice
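The edge-adding step can be sketched with a union-find: sort candidate edges by distance and stop once the graph becomes connected (a sketch of the slide's description; ties at the final threshold distance are ignored here):

```python
def threshold_graph(d):
    """d: symmetric distance matrix (list of lists). Add edges in order of
    increasing distance until the graph is connected; return the edge list
    and the threshold q (the largest distance admitted)."""
    n = len(d)
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges_by_dist = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    graph, components, q = [], n, 0
    for dist, i, j in edges_by_dist:
        graph.append((i, j))
        ri, rj = find(i), find(j)
        if ri != rj:                  # edge joins two components
            parent[ri] = rj
            components -= 1
        if components == 1:
            q = dist                  # the distance that made the graph connected
            break
    return graph, q

g, q = threshold_graph([[0, 1, 4], [1, 0, 2], [4, 2, 0]])
print(g, q)  # [(0, 1), (1, 2)] 2
```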
Finding the DCM2 separator
1. Find a separator X in G which minimizes max_i |X ∪ C_i|, where C_1, ..., C_t are the connected components of G − X
2. Output the subproblems X ∪ C_1, ..., X ∪ C_t
3. This takes O(n^3) worst-case time: perform depth-first search on each component (O(n^2)) for each of the O(n) candidate separators
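The quantity being minimized can be evaluated for one candidate separator by depth-first search over G − X; a sketch (the example path graph is hypothetical, and real DCM2 restricts candidates to clique separators of the triangulated graph):

```python
def separator_cost(adj, X):
    """Return max_i |X ∪ C_i| over the connected components C_i of G − X.
    adj: dict mapping each vertex to its set of neighbors."""
    removed = set(X)
    seen, sizes = set(removed), []
    for v in adj:
        if v in seen:
            continue
        stack, comp = [v], 0          # DFS over one component of G − X
        seen.add(v)
        while stack:
            u = stack.pop()
            comp += 1
            for w in adj[u] - removed:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(comp + len(removed))   # subproblem is X ∪ C_i
    return max(sizes) if sizes else len(removed)

# Path 0-1-2-3-4: removing the middle vertex {2} splits it into {0,1} and
# {3,4}, so the worst subproblem has size 2 + 1 = 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(separator_cost(path, {2}))  # 3
```

Minimizing this cost over the O(n) candidate separators gives the O(n^3) bound quoted on the slide.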
DCM2 subsets
DCM3 decomposition: example
DCM1 vs. DCM2
–DCM1 decomposition: NJ gets better accuracy on small-diameter subproblems (which we shall return to later)
–DCM2 decomposition: getting a smaller number of smaller subproblems speeds up the solution
We saw how decomposition takes place; now on to supertree methods
1. Decompose sequences into overlapping subproblems
2. Compute subtrees using a base method
3. Merge subtrees using the Strict Consensus Merger (SCM)
4. Refine to make the tree binary
Supertree Methods
Optimization problems
–Subtree Compatibility: given a set of trees T_1, ..., T_k, does there exist a tree T such that each T_i is an induced subtree of T (we say T contains T_i)? NP-hard (Steel 1992); special cases are poly-time (rooted trees, DCM)
–MRP: also NP-hard
Direct supertree methods: strict consensus supertrees, MinCutSupertrees
Indirect supertree methods: MRP, average consensus
MRP: Matrix Representation with Parsimony (very popular)
Strict Consensus Merger: faster, and used in DCMs (example figure omitted)
Strict Consensus Merger cases (figures omitted):
–compatible subtrees
–compatible subtrees, but with a collision
–incompatible subtrees
–incompatible subtrees, with a collision
–difference from Gordon’s strict consensus method
Tree Refinement
Challenge: given an unresolved tree, find a refinement (a binary resolution) with optimal parsimony score. NP-hard.
Tree Refinement: example (figure omitted)
Comparing DCM decompositions
Study of DCM decompositions: DCM2 is faster and better than DCM1 (plots comparing MP scores and running times omitted)
Best DCM (DCM2) vs. random decomposition: DCM2 is better than RANDOM w.r.t. MP scores and running times (plots comparing MP scores and running times omitted)
DCM2, comparing two different thresholds (plots comparing MP scores and running times omitted)
Threshold selection techniques
Biological dataset of 503 rRNA sequences: the threshold value at which we get two subproblems gives the best MP score.
Comparing supertree methods
MRP vs. SCM: SCM is better than MRP (plots comparing MP scores and running times omitted)
Comparing tree refinement techniques
Study of tree refinement techniques: the constrained tree search had the best MP scores but is slower than the other methods (plots comparing MP scores and running times omitted)
Next time
–DCM1 for improving NJ
–Recursive-Iterative-DCM3: the state of the art in solving MP and ML