Published by Russell Hensley. Modified over 9 years ago.
1
Disk-Covering Methods for phylogeny reconstruction Tandy Warnow The University of Texas at Austin
2
Phylogeny [Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human] From the Tree of Life Website, University of Arizona
3
Evolution informs about everything in biology
Big genome sequencing projects just produce data – so what?
Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans
4
Reconstructing the “Tree” of Life Handling large datasets: millions of species NSF funds many projects towards this goal, under the Assembling the Tree of Life (ATOL) program
5
Cyber Infrastructure for Phylogenetic Research Purpose: to create a national infrastructure of hardware, algorithms, database technology, etc., necessary to infer the Tree of Life. Group: 40 biologists, computer scientists, and mathematicians from 13 institutions. Funding: $11.6 M (large ITR grant from NSF).
6
CIPRES Members
University of New Mexico: Bernard Moret, David Bader
UCSD/SDSC: Fran Berman, Alex Borchers, Phil Bourne, John Huelsenbeck, Terri Liebowitz, Mark Miller
University of Connecticut: Paul O Lewis
University of Pennsylvania: Junhyong Kim, Susan Davidson, Sampath Kannan, Val Tannen
Texas A&M: Tiffani Williams
UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker
University of Arizona: David R. Maddison
University of British Columbia: Wayne Maddison
North Carolina State University: Spencer Muse
American Museum of Natural History: Ward C. Wheeler
NJIT: Usman Roshan
UC Berkeley: Satish Rao, Steve Evans, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell
Rice: Luay Nakhleh
SUNY Buffalo: William Piel
Florida State University: David L. Swofford, Mark Holder
Yale: Michael Donoghue, Paul Turner
7
DNA Sequence Evolution [Figure: the ancestral sequence AAGACTT evolves down a tree along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today), through intermediate sequences such as TGGACTT and AAGGCCT, into the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT]
8
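Phylogeny Problem [Figure: five input sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and the unknown tree with leaves U, V, W, X, Y that is to be reconstructed]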
9
Steps in a phylogenetic analysis
1. Gather data
2. Align sequences
3. Estimate phylogeny on the multiple alignment
4. Estimate the reliable aspects of the evolutionary history (using bootstrapping, consensus trees, or other methods)
5. Perform post-tree analyses
10
CIPRES research in algorithms
– Multiple alignment and genomic alignment
– Heuristics for NP-hard problems (e.g. MP and ML)
– Statistical performance aspects of phylogeny reconstruction methods under stochastic models of evolution (including novel models)
– Whole genome phylogeny reconstruction
– Gene family evolution
– Reticulate evolution (e.g. horizontal gene transfer and hybridization) detection and reconstruction
– Data mining on sets of trees, and compact representations of these sets
11
CIPRES Algorithms Group Tandy Warnow (focus leader) David Bader, Steve Evans, John Huelsenbeck, Warren Hunt, Sampath Kannan, Dick Karp, Paul Lewis, Bernard Moret, Elchanan Mossel, Luay Nakhleh, Christos Papadimitriou, Satish Rao, Usman Roshan, Jijun Tang, Li-San Wang, Tiffani Williams
12
Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc.
3. Bayesian methods
13
Performance criteria Running time. Space. Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution. “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation. Accuracy with respect to a particular criterion (e.g. tree length or likelihood score), on real data.
14
Markov models of site evolution
Simplest (Jukes-Cantor):
– The model tree is a pair (T, {e, p(e)}), where T is a rooted binary tree and p(e) is the probability of a substitution on the edge e.
– The state at the root is random.
– If a site changes on an edge, it changes with equal probability to each of the remaining states.
– The evolutionary process is Markovian.
More complex models (such as the General Markov model) are also considered, with little change to the theory. Variation between different sites is either prohibited or minimized, in order to ensure identifiability of the model.
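The Jukes-Cantor process described on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the nested-tuple tree encoding, the function names `evolve_edge` and `simulate`, and the edge probabilities in the example are all made up for this sketch and do not come from the talk.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_edge(seq, p, rng):
    """Copy seq across one edge; each site substitutes independently with probability p."""
    out = []
    for state in seq:
        if rng.random() < p:
            # A substitution picks uniformly among the three remaining states.
            state = rng.choice([s for s in NUCLEOTIDES if s != state])
        out.append(state)
    return "".join(out)

def simulate(tree, k, rng):
    """Evolve k sites down a rooted binary tree.

    Assumed encoding: a leaf is its name (str); an internal node is a pair
    ((child, p), (child, p)), where p is the substitution probability on the
    edge leading to that child. Returns {leaf_name: sequence}.
    """
    root_seq = "".join(rng.choice(NUCLEOTIDES) for _ in range(k))  # random root state
    leaves = {}

    def walk(node, seq):
        if isinstance(node, str):
            leaves[node] = seq
            return
        for child, p in node:
            walk(child, evolve_edge(seq, p, rng))

    walk(tree, root_seq)
    return leaves

# Hypothetical model tree ((A,B),(C,D)) with made-up edge probabilities.
left = (("A", 0.05), ("B", 0.05))
right = (("C", 0.05), ("D", 0.05))
model_tree = ((left, 0.10), (right, 0.10))

if __name__ == "__main__":
    for name, seq in sorted(simulate(model_tree, k=20, rng=random.Random(1)).items()):
        print(name, seq)
```

Because every site uses the same p(e) on each edge, the sketch has no rate variation across sites, matching the restriction stated on the slide.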
15
Distance-based Phylogenetic Methods
16
Maximum Parsimony Input: Set S of n aligned sequences of length k Output: –A phylogenetic tree T leaf-labeled by sequences in S –additional sequences of length k labeling the internal nodes of T such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where H(i,j) denotes the Hamming distance between sequences at nodes i and j
17
Maximum Likelihood Input: Set S of n aligned sequences of length k, and a specified parametric model Output: –A phylogenetic tree T leaf-labeled by sequences in S –With additional model parameters (e.g. edge “lengths”) such that Pr[S|(T, params)] is maximized.
18
Approaches for “solving” MP/ML
1. Hill-climbing heuristics (which can get stuck in local optima)
2. Randomized algorithms for getting out of local optima
3. Approximation algorithms for MP (based upon Steiner Tree approximation algorithms)
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
19
Theoretical results Neighbor Joining is polynomial time, and statistically consistent under typical models of evolution. Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent under typical models. Maximum Likelihood is NP-hard and statistically consistent under typical models.
20
Theoretical convergence rates Atteson: Let T be a General Markov model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length at least $O(\lg n \cdot e^{\max D_{ij}})$. Proof: Show NJ accurate on input matrix d such that $\max_{i,j} |D_{ij} - d_{ij}| < f/2$, for f equal to the minimum edge “length”.
21
Problems with NJ Theory: The convergence rate is exponential: the number of sites needed to obtain an accurate reconstruction of the tree with high probability grows exponentially in the evolutionary diameter. Empirical: NJ has poor performance on datasets with some large leaf-to-leaf distances.
22
Quantifying Error FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: example trees with the FN and FP edges marked; 50% error rate]
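As a rough sketch of how FN and FP rates can be computed, each tree can be represented by the set of bipartitions (splits) induced by its internal edges. The encoding, the helper names, and the six-taxon example below are assumptions made for illustration, not taken from the slides.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that does NOT contain the smallest taxon name."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side

def error_rates(true_splits, inferred_splits):
    """Return (FN rate, FP rate).

    FN: edges of the true tree missing from the inferred tree.
    FP: edges of the inferred tree that are not in the true tree.
    """
    fn = len(true_splits - inferred_splits) / len(true_splits)
    # An unresolved (star) inferred tree has no internal edges and hence no false positives.
    fp = len(inferred_splits - true_splits) / len(inferred_splits) if inferred_splits else 0.0
    return fn, fp

# Hypothetical example on six taxa; each tree has three internal edges.
taxa = "ABCDEF"
true_tree = {canonical_split(s, taxa) for s in ("AB", "ABC", "EF")}
inferred_tree = {canonical_split(s, taxa) for s in ("AB", "ABD", "EF")}

if __name__ == "__main__":
    print(error_rates(true_tree, inferred_tree))
```

For two binary trees on the same taxa the two rates coincide, since both trees have the same number of internal edges.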
23
Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Plot: NJ error rate (0 to 0.8) versus number of taxa (0 to 1600)]
24
Other standard polynomial time methods don’t improve substantially on NJ (and have the same problem with large diameter datasets). What about trying to “solve” maximum parsimony or maximum likelihood?
25
Solving NP-hard problems exactly is … unlikely
Number of (unrooted) binary trees on n leaves is (2n-5)!!
If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in about $10^{2890}$ millennia
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       4.5 x 10^190
1000      2.7 x 10^2900
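The double factorial (2n-5)!! = 1 · 3 · 5 · … · (2n-5) is easy to evaluate exactly with arbitrary-precision integers. The function name below is chosen for this sketch only.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the factor 1 is implicit).
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

if __name__ == "__main__":
    for n in range(4, 11):
        print(n, num_unrooted_binary_trees(n))
    for n in (20, 100, 1000):
        digits = len(str(num_unrooted_binary_trees(n)))
        print(n, "has", digits, "decimal digits")
```

Each added leaf multiplies the count by the next odd number, which is why exhaustive search stops being an option after a few dozen taxa.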
26
How good an MP analysis do we need? Our research shows that we need to get within 0.01% of optimal (or better even, on large datasets) to return reasonable estimates of the true tree’s “topology”
27
Problems with current techniques for MP Average MP scores above optimal of best methods at 24 hours across 10 datasets Best current techniques fail to reach 0.01% of optimal at the end of 24 hours, on large datasets
28
Problems with current techniques for MP Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. Performance of TNT with time
29
Empirical problems with existing methods Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (take too long!) – we need new heuristics for MP/ML that can analyze large datasets Polynomial time methods have poor topological accuracy on large diameter datasets – we need better polynomial time methods
30
Using divide-and-conquer Conjecture: better (more accurate) solutions will be found if we analyze a small number of smaller subsets and then combine solutions Note: different “base” methods will need potentially different decompositions. Alert: the subtree compatibility problem is NP-complete!
33
DCMs: Divide-and-conquer for improving phylogeny reconstruction
34
Strict Consensus Merger (SCM)
35
“Boosting” phylogeny reconstruction methods DCMs “boost” the performance of phylogeny reconstruction methods. [Diagram: a DCM applied to a base method M yields the boosted method DCM-M]
36
DCMs (Disk-Covering Methods) DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution. DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation).
37
Absolute fast convergence vs. exponential convergence
38
DCM-Boosting [Warnow et al. 2001] DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods. [Diagram: exponentially converging method → DCM → SQS → absolute fast converging method]
39
DCM1 Decompositions
Input: Set S of sequences, distance matrix d, threshold value
1. Compute threshold graph
2. Perform minimum weight triangulation
DCM1 decomposition: compute the maximal cliques
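A partial sketch of this decomposition follows. It builds the threshold graph and enumerates maximal cliques, but it deliberately omits the minimum weight triangulation step and simply assumes the threshold graph is already triangulated. The threshold symbol `q`, the rule "connect i and j when d(i,j) ≤ q", the function names, and the toy distance matrix are all assumptions for illustration.

```python
from itertools import combinations

def threshold_graph(taxa, d, q):
    """Adjacency sets of the graph with an edge {a, b} whenever d[(a, b)] <= q.

    d is assumed to map sorted pairs (a, b), a < b, to distances.
    """
    adj = {t: set() for t in taxa}
    for a, b in combinations(sorted(taxa), 2):
        if d[(a, b)] <= q:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Toy distances for four taxa lying roughly along a path A - B - C - D.
toy_d = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1,
         ("A", "C"): 2, ("B", "D"): 2, ("A", "D"): 3}
toy_adj = threshold_graph("ABCD", toy_d, q=2)

if __name__ == "__main__":
    # Each maximal clique would become one overlapping subproblem.
    for clique in maximal_cliques(toy_adj):
        print(sorted(clique))
```

In this toy case the only missing edge is A to D, so the two subproblems {A, B, C} and {B, C, D} overlap in B and C, which is what later allows subtrees to be merged.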
40
DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] DCM1-boosting makes distance-based methods more accurate. Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences. [Plot: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ]
41
Major challenge: MP and ML Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets
42
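Maximum Parsimony Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T –leaf-labeled by sequences in S –additional sequences of length k labeling the internal nodes of T such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where H(i,j) denotes the Hamming distance between the sequences at nodes i and j.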
43
Maximum parsimony (example) Input: Four sequences –ACT –ACA –GTT –GTA Question: which of the three trees has the best MP score?
44
Maximum Parsimony [Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA]
45
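Maximum Parsimony [Figure: the three trees with labeled internal nodes and edge costs]
– Tree pairing ACT with GTT and ACA with GTA: edge costs 1, 2, 2; MP score = 5
– Tree pairing ACT with GTA and ACA with GTT, internal nodes labeled ACA and ACT: edge costs 3, 1, 3; MP score = 7 as labeled (an optimal labeling of this tree gives 6)
– Tree pairing ACT with ACA and GTT with GTA, internal nodes labeled ACA and GTA: edge costs 1, 2, 1; MP score = 4 (optimal MP tree)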
46
Maximum Parsimony: computational complexity [Figure: the optimal tree on ACT, ACA, GTT, GTA with internal nodes ACA and GTA; edge costs 1, 2, 1; MP score = 4] Finding the optimal MP tree is NP-hard. Optimal labeling of a given tree can be computed in linear time, O(nk).
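The linear-time scoring of a fixed tree is usually done with Fitch's algorithm; the slide does not name the algorithm, so treat that attribution as an assumption. The sketch below scores the three trees of the four-sequence example, encoding each unrooted tree as a nested pair rooted on its internal edge (the parsimony score does not depend on where the root is placed).

```python
def fitch_site(node, site):
    """Return (candidate state set, substitutions needed) for one site of a subtree.

    A leaf is its sequence (str); an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return {node[site]}, 0
    left_states, left_cost = fitch_site(node[0], site)
    right_states, right_cost = fitch_site(node[1], site)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # Otherwise one substitution is charged and both options stay open.
    return left_states | right_states, left_cost + right_cost + 1

def mp_score(tree, k):
    """Parsimony score of a fixed tree over k aligned sites; O(nk) for a fixed alphabet."""
    return sum(fitch_site(tree, site)[1] for site in range(k))

example_trees = {
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
}

if __name__ == "__main__":
    for name, tree in example_trees.items():
        print(name, mp_score(tree, 3))  # prints scores 5, 6 and 4, in that order
```

The second tree scores 6 under an optimal labeling; the internal labels drawn on the example slide for that tree give length 7, so they are not optimal for it. The ranking is unaffected: the tree pairing ACT with ACA remains the optimal MP tree with score 4.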
47
Problems with current techniques for MP Best methods are a combination of simulated annealing, divide-and-conquer and genetic algorithms, as implemented in the software package TNT. However, they do not reach 0.01% of optimal on large datasets in 24 hours. Performance of TNT with time
48
Observations The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets. Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. Apparent convergence can be misleading.
49
How can we improve upon existing techniques?
50
Our objective: speed up the best MP heuristics [Illustrative plot (fake study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance]
51
Divide-and-conquer technique for speeding up MP/ML searches
52
DCM Decompositions
Input: Set S of sequences, distance matrix d, threshold value
1. Compute threshold graph
2. Perform minimum weight triangulation
DCM1 decomposition: maximal cliques
DCM2 decomposition: clique-separator plus component
53
But: it didn’t work! A simple divide-and-conquer was insufficient for the best performing MP heuristics -- TNT by itself was as good as DCM(TNT).
54
Empirical observation DCM1 not as good as DCM2 for MP DCM2 decompositions too large, too slow to compute. Neither improved the best MP heuristics.
55
How can we improve upon existing techniques?
56
Tree Bisection and Reconnection (TBR)
57
Delete an edge
58
Tree Bisection and Reconnection (TBR)
59
Reconnect the trees with a new edge that bifurcates an edge in each tree
60
A conjecture as to why current techniques are poor: Our studies suggest that trees with near optimal scores tend to be topologically close (RF distance less than 15%) to the other near optimal trees. The standard technique (TBR) for moving around tree space explores $O(n^3)$ trees, which are mostly topologically distant. So TBR may be useful initially (to reach near optimality) but then more “localized” searches are more productive.
61
Using DCMs differently Observation: DCMs make small local changes to the tree New algorithmic strategy: use DCMs iteratively and/or recursively to improve heuristics on large datasets However, the initial DCMs for MP –produced large subproblems and –took too long to compute We needed a decomposition strategy that produces small subproblems quickly.
64
New DCM3 decomposition Input: Set S of sequences, and guide-tree T 1. Compute short subtree graph G(S,T), based upon T 2. Find clique separator in the graph G(S,T) and form subproblems DCM3 decompositions (1) can be obtained in O(n) time (2) yield small subproblems (3) can be used iteratively
65
Iterative-DCM3 [Diagram: T → DCM3 → Base method → T’]
66
New DCMs
DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield subtrees
3. Merge subtrees using the Strict Consensus Merger technique
4. Randomly refine to make it binary
Recursive-DCM3
Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1
Recursive-Iterative DCM3
67
Comparison of DCM decompositions (Maximum subset size) DCM2 subproblems are almost as large as the full dataset size on datasets 1 through 4. On datasets 5-10 DCM2 was too slow to compute a decomposition within 24 hours.
68
Datasets (number of sequences, description)
1. 1322 lsu rRNA of all organisms
2. 2000 Eukaryotic rRNA
3. 2594 rbcL DNA
4. 4583 Actinobacteria 16s rRNA
5. 6590 ssu rRNA of all Eukaryotes
6. 7180 three-domain rRNA
7. 7322 Firmicutes bacteria 16s rRNA
8. 8506 three-domain+2org rRNA
9. 11361 ssu rRNA of all Bacteria
10. 13921 Proteobacteria 16s rRNA
Obtained from various researchers and online databases
69
Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 tree takes almost 10 hours to produce a tree and is too slow to run on larger datasets. Rec-I-DCM3 is the best method at all times.
70
Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration. On very large datasets Rec-I-DCM3 gives significant improvements over unboosted TNT.
71
Rec-I-DCM3 significantly improves performance Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset Current best techniques DCM boosted version of best techniques
72
Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at 24 hours) Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.
73
Observations Rec-I-DCM3 improves upon the best performing heuristics for MP. The improvement increases with the difficulty of the dataset.
74
DCMs DCM for NJ and other distance methods produces absolute fast converging (afc) methods DCMs for MP heuristics DCMs for use with the GRAPPA software for whole genome phylogenetic analysis; these have been shown to let GRAPPA scale from its maximum of about 15-20 genomes to 1000 genomes. Current projects: DCM development for maximum likelihood and multiple sequence alignment.
75
Part II: Whole-Genome Phylogenetics [Figure: genomes A, B, C, D, E, F at the leaves of a tree with internal nodes X, Y, Z, W]
76
Genomes Evolve by Rearrangements
Original: 1 2 3 4 5 6 7 8 9 10
Inversion (Reversal): 1 2 3 –8 –7 –6 –5 –4 9 10
Transposition: 1 2 3 9 4 5 6 7 8 10
Inverted Transposition: 1 2 3 9 –8 –7 –6 –5 –4 10
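The three rearrangement events can be sketched on a genome stored as a list of signed integers. The function names and the index convention (Python slices `g[i:j]`, with the moved segment reinserted just before position `k`) are choices made for this illustration.

```python
def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign (strand) of each gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]

def transposition(g, i, j, k):
    """Move the segment g[i:j] so that it sits just before original position k (k >= j)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]

def inverted_transposition(g, i, j, k):
    """Transposition in which the moved segment is also inverted."""
    return g[:i] + g[j:k] + [-x for x in reversed(g[i:j])] + g[k:]

genome = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10

if __name__ == "__main__":
    print(inversion(genome, 3, 8))
    print(transposition(genome, 3, 8, 9))
    print(inverted_transposition(genome, 3, 8, 9))
```

With the segment 4 to 8 selected, the three calls reproduce the three rearranged genomes shown on the slide.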
77
Genome Rearrangement Has A Huge State Space DNA sequences: 4 states per site. Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site. Circular genomes (1 site) –with 37 genes: about $2.56 \times 10^{52}$ states –with 120 genes: about $3.70 \times 10^{232}$ states
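The size of this state space can be checked with exact integer arithmetic. The sketch assumes the standard count of signed circular genomes on n genes, $2^{n-1}(n-1)!$: there are $2^n n!$ signed orderings, and each circular genome is counted 2n times (n rotations, each readable from either strand). The function name is chosen for this illustration.

```python
from math import factorial

def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes (assumed formula)."""
    return 2 ** (n - 1) * factorial(n - 1)

if __name__ == "__main__":
    for n in (37, 120):
        digits = len(str(signed_circular_genomes(n)))
        print(n, "genes:", digits, "decimal digits")
```

By contrast, a single nucleotide site has only 4 states, which is the comparison the slide is drawing.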
78
Why use gene orders? “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate. Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!
79
Maximum Parsimony on Rearranged Genomes (MPRG) The leaves are rearranged genomes. Find the tree that minimizes the total number of rearrangement events (NP-hard). [Figure: a tree over genomes A, B, C, D with inferred ancestral genomes E and F; edge lengths 3, 6, 2, 3, 4; total length = 18]
80
Optimization problems for gene order phylogeny Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes) Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)
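For the breakpoint criterion, the quantity summed over tree edges is the breakpoint distance between two genomes. A minimal sketch for signed circular genomes with equal gene content follows; it uses the common definition that an adjacency (a, b) of the first genome is a breakpoint unless the second genome contains (a, b) or (-b, -a). The function names are chosen for this illustration.

```python
def adjacencies(genome):
    """All adjacencies of a signed circular genome, read from both strands."""
    adj = set()
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]  # wrap around: the genome is circular
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency seen from the other strand
    return adj

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not preserved in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum(1 for idx in range(n) if (g1[idx], g1[(idx + 1) % n]) not in adj2)

identity = list(range(1, 11))
inverted = [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]  # one inversion of the segment 4..8

if __name__ == "__main__":
    print(breakpoint_distance(identity, inverted))  # prints 2
```

Computing this distance between two given genomes is cheap; the hardness noted on the slide lies in choosing the tree and the ancestral genomes, where even the median of three genomes is NP-hard.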
81
“Solving” the inversion phylogeny [Figure: MP score plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked] Usual issue of getting stuck in local optima, since the optimization problems are NP-hard. Additional problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).
82
Benchmark gene order dataset: Campanulaceae 12 genomes + 1 outgroup (Tobacco), 105 gene segments NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree) 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.) 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor) 2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor)
83
GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) http://www.cs.unm.edu/~moret/GRAPPA/ Heuristics for NP-hard optimization problems. Fast polynomial time distance-based methods. Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy. Freely available in source code at this site. Project leader: Bernard Moret (UNM) (moret@cs.unm.edu)
84
Limitations and ongoing research Current methods are mostly limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications). We have made some progress on developing a reliable distance-based method for chromosomes with unequal gene content (tests on real and simulated data show high accuracy) Handling the multiple chromosome case is harder
85
Other research projects in molecular phylogenetics Many equally good solutions for a given dataset - how can we figure out “truth”? Not all evolution is tree-like - how can we detect and infer reticulate evolution? How can we visualize large trees, or enable the visualization of differences/similarities between different trees? Phyloinformatics -- what database capabilities do we need to utilize phylogenies in biological research?
86
Questions Tree shape (including branch lengths) has an impact on phylogeny reconstruction - but what model of tree shape to use? What is the sequence length requirement for Maximum Likelihood? (Result by Szekely and Steel is worse than that for Neighbor Joining.) Why is MP not so bad?
87
General comments There is interesting computer science research to be done in computational phylogenetics, with a tremendous potential for impact. Algorithm development must be tested on both real and simulated data. The interplay between data, stochastic models of evolution, optimization problems, and algorithms, is important and instructive.
88
Reconstructing the “Tree” of Life Handling large datasets: millions of species The “Tree of Life” is not really a tree: reticulate evolution
89
Acknowledgements NSF; The David and Lucile Packard Foundation; The Program in Evolutionary Dynamics at Harvard; The Institute for Cellular and Molecular Biology at UT-Austin. Collaborators: Usman Roshan, Bernard Moret, and Tiffani Williams. See http://www.phylo.org and http://www.cs.utexas.edu/~tandy for more info.