1 Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics

2 Phylogeny [Figure: phylogeny of Orangutan, Gorilla, Chimpanzee, and Human, from the Tree of Life Website, University of Arizona.]

3 DNA Sequence Evolution [Figure: an ancestral sequence AAGACTT at -3 million years evolves down a tree, accumulating substitutions through intermediate sequences (e.g. AAGGCCT, TGGACTT) to the sequences observed today.]

4 Molecular Systematics [Figure: sequences observed at the leaves U, V, W, X, Y, and the unrooted tree relating them that we want to infer.]

5 Quantifying Error. FN: false negative (an edge of the true tree missing from the estimated tree). FP: false positive (an edge of the estimated tree not present in the true tree). [Figure: example true and estimated trees with one FN and one FP, giving a 50% error rate.]
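To make the error measure concrete, here is a minimal sketch (not from the original slides) that computes FN and FP rates from the two trees' sets of internal-edge bipartitions; the function name and the split representation are illustrative:

```python
def fn_fp_rates(true_splits, estimated_splits):
    """FN rate: fraction of the true tree's internal edges (bipartitions)
    missing from the estimated tree.  FP rate: fraction of the estimated
    tree's internal edges not present in the true tree.
    Each split is assumed to be a hashable summary of one internal edge,
    e.g. a frozenset of the taxon names on one side of that edge."""
    fn = len(true_splits - estimated_splits) / len(true_splits)
    fp = len(estimated_splits - true_splits) / len(estimated_splits)
    return fn, fp
```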

6 Methods and Conjectures. Popular methods: Neighbor-Joining (polynomial time, distance-based), and heuristics for Maximum Parsimony and Maximum Likelihood. Big debates about which is better, and when.

7 Methods and Conjectures. Popular methods: Neighbor-Joining (polynomial time, distance-based), and heuristics for Maximum Parsimony and Maximum Likelihood. Big debates about which is better, and when. Our research shows big differences between NJ and MP on large enough trees.

8 Methods and Conjectures. Popular methods: Neighbor-Joining (polynomial time, distance-based), and heuristics for Maximum Parsimony and Maximum Likelihood. Big debates about which is better, and when. Our research shows big differences between NJ and MP on large enough trees. Our research also shows that current techniques (in the best software packages) can be sped up to solve MP and ML faster.

9 Computational challenges for Assembling the Tree of Life. 8 million species for the Tree of Life -- we cannot currently analyze more than a few hundred (and even this takes years). We need new methods for inferring large phylogenies (hard optimization problems!). We need new software for visualizing large trees. We need new database technology. Not all phylogenies are trees, so we need methods for inferring phylogenetic networks.

10 Our research projects: DCM-boosting phylogenetic reconstruction methods (improving the accuracy of NJ and speeding up MP and ML); phylogenetic reconstruction from gene orders; reticulate evolution detection and phylogenetic network reconstruction; visualization of large trees.

11 DCM-boosting NJ. Outline: convergence rates (how long do the sequences need to be for methods to reconstruct the true tree with high probability?); DCM-boosting Neighbor-Joining; experimental study comparing DCM-NJ to NJ on large trees.

12 The Jukes-Cantor model of DNA sequence evolution. A random DNA sequence evolves down the tree from the root. The positions within the sequence evolve independently and identically. If the nucleotide at a particular position changes on an edge, it changes with equal probability to each of the other three nucleotides.
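As an illustration (not part of the original slides), here is a short Python sketch of one step of this process: evolving a sequence along a single edge under Jukes-Cantor. The per-edge substitution probability p_change is a hypothetical parameter.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_site_jc(base, p_change):
    """Jukes-Cantor on one edge: with probability p_change the site
    substitutes, and the new base is chosen uniformly from the three
    other nucleotides."""
    if random.random() < p_change:
        return random.choice([n for n in NUCLEOTIDES if n != base])
    return base

def evolve_sequence_jc(seq, p_change):
    # Sites evolve independently and identically (i.i.d.) along the edge.
    return "".join(evolve_site_jc(b, p_change) for b in seq)
```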

13 The General Markov model of DNA sequence evolution. A random DNA sequence evolves down the tree from the root. The positions within the sequence evolve independently and identically (or under a distribution of rates across sites). Each edge has a 4x4 stochastic substitution matrix governing the evolution of a random site on the edge.
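For contrast with Jukes-Cantor, a small sketch of one edge under the General Markov model; the matrix entries below are made-up illustrative values (any row-stochastic matrix is allowed), and Jukes-Cantor is the special case where the off-diagonal entries in each row are equal.

```python
import random

# Hypothetical 4x4 stochastic matrix for one edge: row = current base,
# column = next base; each row sums to 1.
EDGE_MATRIX = {
    "A": {"A": 0.94, "C": 0.02, "G": 0.03, "T": 0.01},
    "C": {"A": 0.02, "C": 0.93, "G": 0.01, "T": 0.04},
    "G": {"A": 0.03, "C": 0.01, "G": 0.95, "T": 0.01},
    "T": {"A": 0.01, "C": 0.04, "G": 0.01, "T": 0.94},
}

def evolve_site_gm(base, matrix=EDGE_MATRIX):
    """Sample the next state of one site from the edge's substitution matrix."""
    r, cumulative = random.random(), 0.0
    for nxt, prob in matrix[base].items():
        cumulative += prob
        if r < cumulative:
            return nxt
    return base  # guard against floating-point round-off
```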

14 Statistical Performance Issues. Statistical consistency: does the reconstruction method return the true tree with high probability from long enough sequences? Convergence rate: at what sequence length will the reconstruction method return the true tree with high probability? Robustness: if we violate the model conditions, what can we say about the performance of the method?

15 Absolute fast convergence vs. exponential convergence

16 Theoretical Comparison of Methods. Theorem 1 [Warnow et al. 2001]: DCM-NJ is absolute fast converging (afc) for the GM model. Theorem 3 [Atteson 1999]: NJ is exponentially converging for the GM model (but is not known to be afc).

17 DCM1: a divide-and-conquer strategy to improve NJ's accuracy. Phase I (basic step): divide the dataset into many small-diameter subproblems, construct NJ trees on each subproblem, merge the subtrees using the "Strict Consensus Merger", and refine the resultant tree using PAUP*'s constrained search. Do the basic step for each way of setting the diameter. Phase II: pick the "best tree" out of the set of O(n^2) trees.
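The control flow of the two phases can be sketched as follows. Every step is supplied as a callable because the real decomposition, merger, and refinement are the components named on the slide; this is an outline under those assumptions, not the actual implementation.

```python
def dcm1_boost(thresholds, decompose, build_tree, merge, refine, score):
    """Outline of DCM1.  Phase I: for every diameter threshold, split the
    taxa into small-diameter subsets, build a tree on each (e.g. with NJ),
    merge the subtrees (e.g. with the Strict Consensus Merger), and refine
    the possibly unresolved result.  Phase II: return the best-scoring of
    the O(n^2) candidate trees."""
    candidates = []
    for q in thresholds:
        subsets = decompose(q)                       # small-diameter subproblems
        subtrees = [build_tree(s) for s in subsets]  # e.g. NJ on each subset
        candidates.append(refine(merge(subtrees)))   # merge, then refine
    return max(candidates, key=score)                # pick the "best tree"
```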

18 Strict Consensus Merger [Figure: a worked example merging two overlapping subtrees -- one on leaves 1-6, one on leaves 1-4 and 7 -- into a single tree on leaves 1-7 via the Strict Consensus Merger.]

19 DCM-Boosting [Warnow et al. 2001]. DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods. [Diagram: an exponentially converging method, passed through DCM and then SQS, becomes an absolute fast converging method.] DCM-NJ+SQS is the result of DCM-boosting NJ. We can replace SQS by MP or ML and get better empirical performance (though not provably afc).

20 DCM-boosting Neighbor Joining. DCM-boosting makes distance-based methods more accurate (we have established this for other distance-based methods, too). [Plot: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ.]

21 Summary of DCM-NJ. These are the first polynomial-time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ. The advantage obtained with DCM-NJ+MP and DCM-NJ+ML increases with the number of taxa, the deviation from a molecular clock, and the rate of evolution. In practice these new methods are slower than NJ (minutes vs. seconds), but still much faster than MP and ML (which can take days).

22 Time is a bottleneck for MP and ML. [Figure: search landscape with phylogenetic trees on the x-axis and MP score on the y-axis, marking a global optimum and a local optimum.] Systematists tend to prefer trees with the optimal maximum parsimony score or optimal maximum likelihood score; however, both problems are hard to solve. (Our experimental studies show that NJ doesn't do as well as MP when trees are big and have high rates of evolution, so NJ and other fast methods aren't sufficiently reliable.)

23 MP/ML heuristics [Plot (illustrative "fake study"): MP score of the best trees found versus time for a hill-climbing heuristic.]

24 DCM-boosting: speeding up MP/ML heuristics [Plot (illustrative "fake study"): MP score of the best trees found versus time, showing the hill-climbing heuristic's curve and the desired performance.]

25 DCM-boosting MP and ML. Idea: it is better to run a computationally expensive method on two subproblems of somewhat smaller size. This DCM differs from DCM1: we decompose the dataset into just two (larger) subproblems, at only one threshold, but we use the same merger technique and the same refinement stage. Challenge: how to pick the best decomposition? This depends upon the base method.

26 Addressing the accuracy/time issues: Disk-Covering Methods DCM1 decomposition: lots of small diameter subproblems. (Used for NJ.) DCM2 decomposition: Very few subproblems, each somewhat smaller. (Used for MP or ML.)

27 Maximum Parsimony [Figure: the four sequences ACT, ACA, GTT, GTA placed at the leaves of the three possible unrooted trees on four taxa.]

28 Maximum Parsimony [Figure: the three candidate trees on ACT, ACA, GTT, GTA with internal nodes labeled and per-edge changes counted, giving MP scores 5, 7, and 4. The tree grouping ACT with ACA and GTT with GTA has MP score 4 and is the optimal MP tree.]

29 Maximum Parsimony: computational complexity [Figure: the optimal tree from the previous slide, with MP score 4.] Finding the optimal MP tree is NP-hard. The optimal labeling of a fixed tree can be computed in linear time, O(nk).
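The linear-time labeling referred to here is the small-parsimony problem on a fixed tree. Below is a compact sketch of Fitch's algorithm for one site, plus the sum over sites; the nested-tuple tree encoding is an illustrative representation, not the data structure used in any particular package.

```python
def fitch_site_score(tree, leaf_state):
    """Minimum number of changes at one site for a fixed (binary) tree.
    `tree` is a nested tuple of leaf names; `leaf_state` maps each leaf
    name to its character at this site."""
    def post_order(node):
        if isinstance(node, str):                    # leaf
            return {leaf_state[node]}, 0
        (lset, lcost), (rset, rcost) = post_order(node[0]), post_order(node[1])
        if lset & rset:                              # children can agree
            return lset & rset, lcost + rcost
        return lset | rset, lcost + rcost + 1        # one extra change needed

    return post_order(tree)[1]

def mp_score(tree, sequences):
    """MP score of a tree: sum of per-site Fitch scores over the alignment."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site_score(tree, {name: seq[i] for name, seq in sequences.items()})
        for i in range(length)
    )
```

On the four sequences from the previous slide, calling mp_score((("u", "v"), ("w", "x")), {"u": "ACT", "v": "ACA", "w": "GTT", "x": "GTA"}) returns 4, matching the optimal MP tree shown above.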

30 The DCM technique for speeding up MP/ML searches

31 DCM2-MP/ML. Step 1: pick a threshold at which the threshold graph is connected, and divide the dataset into two overlapping subsets. Step 2: compute trees on each subset using a heuristic for MP or ML. Step 3: merge the subtrees using the Strict Consensus Merger. Step 4: refine the resultant tree using PAUP*'s constrained search.
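Step 1 can be sketched as a search for the smallest threshold whose threshold graph is connected. The taxa/distance representation below is illustrative, and the actual decomposition in the software is more involved.

```python
def smallest_connecting_threshold(taxa, dist):
    """Return the smallest threshold q such that the threshold graph --
    an edge between taxa i and j whenever dist[frozenset({i, j})] <= q --
    is connected, using a simple BFS connectivity check."""
    def connected(q):
        seen, stack = {taxa[0]}, [taxa[0]]
        while stack:
            u = stack.pop()
            for v in taxa:
                if v not in seen and dist[frozenset((u, v))] <= q:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == len(taxa)

    for q in sorted(set(dist.values())):
        if connected(q):
            return q
    return None  # unreachable if every pairwise distance is finite
```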

32 Phase I of DCM-NJ. For each value q in the distance matrix, compute a tree t_q as follows: divide the dataset into subsets of "diameter" q; construct trees on each subset using NJ; merge the trees using the Strict Consensus Merger technique; refine the (probably unresolved) tree into a bifurcating tree.

33 Study of hill-climbing heuristics Biological dataset of 500 rbcL sequences (benchmark dataset). Previous best known trees have MP score 16531.

34 Current best DCM2 technique: pick a threshold to get two subproblems; use an expensive but accurate base method on each; use the SCM (Strict Consensus Merger) to merge the subtrees; use PAUP*'s constrained search with a moderately expensive hill-climbing heuristic.

35 DCM2 vs hill-climbing Biological dataset of 388 rRNA sequences. Maximum subproblem size = 70%

36 DCM2 vs hill-climbing Biological dataset of 503 rRNA sequences. Maximum subproblem size = 64%

37 DCM2 vs hill-climbing Biological dataset of 816 rRNA sequences. Maximum subproblem size = 55%

38 What we see. Some datasets decompose well, and there DCM gives a real advantage. The bigger the dataset and the more careful the heuristic search, the less good the decomposition has to be for DCM to give an advantage. Outlier identification may help.

39 Other projects (briefly). Gene-order phylogeny: GRAPPA (our free software) is the fastest and most accurate software for reconstructing phylogenies from gene order and content data. Joint project with Bob Jansen (UT), Bernard Moret (UNM), and others. Reticulate evolution inference: our research shows that no existing method for reconstructing networks works, and that methods (such as ILD) for detecting reticulation fail. Joint project with Randy Linder (UT) and Bernard Moret.

40 Acknowledgements Funding: The David and Lucile Packard Foundation, and The National Science Foundation. Collaborators: Bernard Moret (UNM), Daniel Huson (Tubingen), Lisa Vawter (Aventis), Katherine St. John (CUNY), Randy Linder (UT), Bob Jansen (UT) Students: Luay Nakhleh, Usman Roshan, Jerry Sun, and Li-San Wang

41 Phylolab, U. Texas Please visit us at http://www.cs.utexas.edu/users/phylo/

