CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow
Phylogeny (figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human) From the Tree of Life Website, University of Arizona
Reconstructing the “Tree” of Life Handling large datasets: millions of species
Data
– Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment
– Molecular markers (e.g., SNPs, RFLPs, etc.)
– Morphology
– Gene order and content
These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character.
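As a minimal sketch of the "character as a function" view, the snippet below treats one alignment column as a map from taxa to states and groups the taxa into equivalence classes by state. The taxon names (`u`, `v`, `w`, `x`), the sequences, and the helper names `character` and `state_classes` are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical aligned sequences for four taxa (illustrative only).
alignment = {
    "u": "ACT",
    "v": "ACA",
    "w": "GTT",
    "x": "GTA",
}


def character(alignment, site):
    """Return the character for one alignment column: a map taxon -> state."""
    return {taxon: seq[site] for taxon, seq in alignment.items()}


def state_classes(char):
    """Group taxa into equivalence classes, one class per state."""
    classes = defaultdict(set)
    for taxon, state in char.items():
        classes[state].add(taxon)
    return dict(classes)


if __name__ == "__main__":
    # Column 0 splits the taxa into the class with state A and the class with state G.
    print(state_classes(character(alignment, 0)))
```

Each column of the alignment gives one such character, so an alignment of length k yields k characters on the same taxon set.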
Standard Phylogenetic Analyses
Step 1: Gather sequence data, and estimate the multiple alignment of the sequences.
Step 2: Estimate the evolutionary history from the multiple alignment. (This can result in many trees.)
Step 3: Apply consensus methods (and other techniques) to the set of trees to figure out what is reliable.
DNA Sequence Evolution (greatly simplified!) (figure: the ancestral sequence AAGACTT evolves down a tree, with time levels marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today; substitutions accumulate along the edges, and the sequences at the leaves today are TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT)
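Phylogeny Problem (figure: five sequences, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT, for the taxa U, V, W, X, and Y; the task is to find the tree relating U, V, W, X, and Y)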
Issues in reconstructing evolutionary histories In almost all cases, we don’t know the true evolutionary history of a dataset; how can we tell if we have the right answer? Biostatisticians have addressed this difficulty by modelling the evolutionary process as an unknown stochastic process operating on an unknown tree. This modelling allows us to study phylogeny reconstruction as a statistical inverse problem.
The Jukes-Cantor model of site evolution
– Each “site” is a position in a sequence.
– The state (i.e., nucleotide) of each site at the root is chosen uniformly at random.
– The sites evolve independently and identically (i.i.d.).
– If a site changes its state on an edge, it changes with equal probability to each of the other three states.
– For every edge e, p(e) is defined as the probability that a random site changes on the edge e.
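The model above can be sketched as a small simulation: draw a random root sequence, then evolve it along one edge with change probability p(e). The function names `random_root` and `evolve_along_edge`, and the choice to pass an explicit random generator, are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def random_root(length, rng):
    """Draw each site of the root sequence uniformly and independently."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def evolve_along_edge(seq, p_change, rng):
    """Evolve a sequence along one edge.

    Each site changes independently with probability p_change; a site that
    changes moves to one of the three other nucleotides with equal probability.
    """
    out = []
    for state in seq:
        if rng.random() < p_change:
            others = [x for x in NUCLEOTIDES if x != state]
            out.append(rng.choice(others))
        else:
            out.append(state)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed so the run is repeatable
    root = random_root(20, rng)
    child = evolve_along_edge(root, 0.1, rng)
    print(root)
    print(child)
```

Applying `evolve_along_edge` once per edge, from the root down to the leaves, generates a dataset on a model tree, which is how simulated data for this kind of study can be produced.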
Quantifying Topological Error (figure: a true tree and an estimated tree compared edge by edge; the example shown has a 50% error rate)
– FN: false negative (missing edge)
– FP: false positive (incorrect edge)
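One common way to compute FN and FP rates is to compare the non-trivial bipartitions (internal edges) of the true and estimated trees. The sketch below assumes trees are written as nested tuples of leaf names; that representation and the names `bipartitions` and `error_rates` are illustrative assumptions.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the same edge is always represented the same way.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(c) for c in node))

    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c) for c in node))
        side = all_leaves - below if anchor in below else below
        # Keep only edges with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def error_rates(true_tree, est_tree):
    """Return (FN rate, FP rate) of est_tree with respect to true_tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits)  # true edges missing from the estimate
    fp = len(est_splits - true_splits)  # estimated edges not in the true tree
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate


if __name__ == "__main__":
    true_tree = (("a", "b"), "c", ("d", "e"))
    est_tree = (("a", "c"), "b", ("d", "e"))
    print(error_rates(true_tree, est_tree))  # (0.5, 0.5)
```

In this five-leaf example each tree has two internal edges and they share one, so both rates are 50%. When both trees are binary they have the same number of internal edges, and the two rates coincide.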
Statistical consistency and convergence rates
Statistical performance issues
– Statistical consistency: an estimation method is statistically consistent under a model if the probability that the method returns the true tree goes to 1 as the sequence length goes to infinity.
– Convergence rate: the amount of data that a method needs to return the true tree with high probability, as a function of the model tree.
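The two definitions above can be written symbolically. The notation is an assumption introduced here for illustration: \Phi is the estimation method, T is the model tree, S_k is a set of sequences of length k generated on T, and \varepsilon is the allowed failure probability.

```latex
% Statistical consistency of the method \Phi under the model:
\lim_{k \to \infty} \Pr\bigl[\Phi(S_k) = T\bigr] = 1

% Sequence length that suffices for accuracy at least 1 - \varepsilon,
% viewed as a function of the model tree T:
k_{\Phi}(T, \varepsilon) =
  \min\bigl\{\, k_0 : \Pr[\Phi(S_k) = T] \ge 1 - \varepsilon
  \text{ for all } k \ge k_0 \,\bigr\}
```

The convergence rate describes how the second quantity grows with the model tree, for example with the number of leaves and the edge parameters p(e).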
Absolute fast convergence vs. exponential convergence
Performance criteria for phylogeny reconstruction methods
– Speed
– Space
– Optimality criterion accuracy
– “Topological accuracy” (specifically statistical consistency, convergence rate, and performance on finite data)
These criteria can be established theoretically, or evaluated on real or simulated data.
Statistical research in phylogenetics
– For a given stochastic model of evolution, prove theorems about the statistical consistency and convergence rates of phylogeny reconstruction methods.
– Propose a new stochastic model of evolution, and study the theoretical performance of new or existing phylogeny reconstruction methods.
– For each of the above, the same issues can be studied in simulation.
Statistical performance
– Standard distance-based methods and Maximum Likelihood (solved exactly) are statistically consistent under standard models.
– Maximum Parsimony is not always statistically consistent, even for the (simplest) DNA model.
– But under more complex models, statistical consistency is impossible!
– Simulations tell a more complex story…
Methods for phylogenetic inference
– Statistically consistent methods: some polynomial time distance-based methods, exact solutions to maximum likelihood, and Bayesian MCMC methods, provided that the stochastic model is simple enough.
– Statistically inconsistent methods: heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood), and exact solutions to maximum parsimony.
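Distance-based methods take a matrix of pairwise distances as input. As a minimal sketch of one common way to build such distances under the Jukes-Cantor model, the snippet below applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), to the observed fraction p of differing sites. The function names are assumptions, and the lecture does not say which distance estimate or tree-building method it has in mind.

```python
import math


def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(a, b):
    """Jukes-Cantor corrected distance (expected substitutions per site).

    Returns infinity when the observed difference reaches the 3/4
    saturation level, where the correction is undefined.
    """
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    seqs = ["AAGACTT", "TGGACTT", "AAGGCCT"]  # three sequences from the slides
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            print(seqs[i], seqs[j], round(jc_distance(seqs[i], seqs[j]), 3))
```

The corrected distance is always at least the raw fraction of differences, because it accounts for sites that changed more than once.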
Empirical and experimental research shows…
– Statistical inconsistency does not necessarily imply that a method won’t produce a highly accurate estimate of evolution: performance on finite data is different from performance on infinite data.
– “Real” molecular evolution isn’t as simple as the models used in either estimating or simulating evolution.
– All the “best” methods can be extremely hard to solve on large real datasets!
Grand Challenges
– The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets.
– MCMC methods are increasingly used (often as a surrogate for a decent ML analysis), but it is not clear how to evaluate MCMC methods.
Combinatorial algorithms research in phylogenetics
– For a given optimization problem, determine its computational complexity.
– Since most problems in this research area are NP-hard, seek approximation algorithms and prove error bounds.
– Since approximation algorithms aren’t that useful in practice, develop heuristics and study their performance on real or simulated data.
– Reconsider whether your optimization problem was worth solving!
Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– with additional sequences of length k labeling the internal nodes of T
such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling nodes u and v.
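The quantity being minimized, the total Hamming distance over the edges of a fully labeled tree, can be computed directly. This is a minimal sketch: the node names (`l1` to `l4`, `i1`, `i2`), the edge-list representation, and the function names are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels):
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


if __name__ == "__main__":
    # Four leaves (l1..l4) and two internal nodes (i1, i2), all labeled.
    labels = {
        "l1": "ACT", "l2": "ACA", "l3": "GTT", "l4": "GTA",
        "i1": "ACA", "i2": "GTA",
    }
    edges = [("l1", "i1"), ("l2", "i1"), ("i1", "i2"), ("l3", "i2"), ("l4", "i2")]
    print(tree_length(edges, labels))  # 4
```

Maximum parsimony asks for the tree and the internal labels that make this sum as small as possible; the function above only scores one given candidate.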
Maximum parsimony (example)
Input: Four sequences
– ACT
– ACA
– GTT
– GTA
Question: which of the three trees has the best MP score?
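Maximum Parsimony (figure: the three possible unrooted trees on the four sequences, written as the two pairs separated by the internal edge: ACT, GTT | ACA, GTA; ACT, GTA | ACA, GTT; and ACT, ACA | GTT, GTA)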
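Maximum Parsimony (figure: the three trees with their internal-node labels and scores)
– Tree ACT, GTT | ACA, GTA, internal nodes labeled GTT and GTA: MP score = 5
– Tree ACT, GTA | ACA, GTT, internal nodes labeled ACT and ACA: MP score = 7
– Tree ACT, ACA | GTT, GTA, internal nodes labeled ACA and GTA: MP score = 4 (optimal MP tree)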
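Maximum Parsimony: computational complexity (figure: the optimal tree from the example, ACT, ACA | GTT, GTA, with internal nodes labeled ACA and GTA; MP score = 4)
– Finding the optimal MP tree is NP-hard (and real datasets can take a very long time).
– The optimal labeling of a given tree can be computed in linear time, O(nk).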
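The slide does not name an algorithm for the fixed-tree problem; Fitch's algorithm is one standard choice, and the sketch below uses it as an assumption. It computes the parsimony score of a given binary tree, rooted at an arbitrary internal edge, by one bottom-up pass per site. Trees are written as nested pairs of leaf sequences, and the function names are illustrative.

```python
def first_leaf(tree):
    """Return the sequence at the leftmost leaf of a nested-pair tree."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def fitch_site(tree, site):
    """Fitch's bottom-up pass for one site.

    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on an edge below this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree):
    """Minimum total number of changes on the given tree, over all sites."""
    k = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(k))


if __name__ == "__main__":
    best = (("ACT", "ACA"), ("GTT", "GTA"))
    print(parsimony_score(best))  # 4
```

Each node is visited once per site and the state sets have at most four elements, which gives the O(nk) running time. The hard part of maximum parsimony is the search over tree topologies, not the scoring of any one tree.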
Evaluating MP heuristics with respect to MP scores (figure: plot of the MP score of the best trees found against time, comparing the performance of Heuristic 1 and Heuristic 2; marked “Fake study”)
A sample of some algorithmic questions we will discuss
– Large-scale phylogeny reconstruction (thousands of sequences)
– Multiple sequence alignment, and especially simultaneous estimation of trees and alignments
– Reticulate evolution (e.g., detection and reconstruction of horizontal gene transfer)
– Whole genome phylogenetics (e.g., using rearrangement events)
– New consensus methods
– Supertree methods
– Historical linguistics (i.e., reconstructing the evolutionary history of a language family)
Summary (?) Research in this area combines mathematical modelling, probability theory, combinatorial optimization, and a lot of graph theory, in order to develop algorithms with good performance. Performance is evaluated theoretically, on real data (for optimization criteria), and on simulated data (for topological accuracy).
Other stuff…
– No biology, probability, or statistics background required.
– Basic background in algorithm design and analysis is expected. However, if you don’t have that background (e.g., if you are a biology student), please come see me.
– Please look at my class homepage for office hours and other announcements: