Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas at Austin

The real title: Phylogeny Estimation: Why It Is “Hard”, but not how to design methods with good performance (talk to me separately about that; there is no time in this lecture!)

This talk
– Intro to phylogenetic estimation (using some terms to be defined later: polynomial time and NP-hard)
– Computational problems and what it means to solve them exactly
– Computational problems and what it means to “solve them” heuristically

Phylogeny (evolutionary tree)
[Figure: evolutionary tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Evolutionary History
[Figure: Tree of Life, from the AToL website]
Helps us
– predict gene function
– develop drugs and vaccines
– understand disease spread
– understand human origins
Phylogenetics: estimating evolutionary histories

DNA Sequence Evolution
[Figure: the sequence AAGACTT evolving down a tree along a timeline marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today, with intermediate and present-day sequences shown at the nodes]

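Phylogeny Problem
[Figure: five present-day sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y to be inferred from them]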

How can we infer evolution?

Two types of phylogenetic reconstruction methods
1. Heuristics for hard optimization problems (Maximum Parsimony and Maximum Likelihood)
2. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.
[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]

Maximum Parsimony
Input: Set S of n aligned sequences of length k
Output:
– A phylogenetic tree T leaf-labeled by sequences in S
– Additional sequences of length k labeling the internal nodes of T
such that the total number of changes is minimized

Maximum parsimony (example)
Input: Four sequences
– ACT
– ACA
– GTT
– GTA
Question: which of the three trees has the best MP score?

Maximum Parsimony
[Figure: the three possible unrooted trees on the four sequences, pairing ACT with GTT, ACT with GTA, and ACT with ACA, respectively]

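Maximum Parsimony
[Figure: the three trees with labeled internal nodes and the number of changes on each edge]
– ACT paired with GTT (and ACA with GTA): edge costs 1, 2, 2; MP score = 5
– ACT paired with GTA (and ACA with GTT), internal nodes labeled ACA and ACT: edge costs 3, 1, 3; score = 7 for the labeling shown (the best labeling of this tree costs 6)
– ACT paired with ACA (and GTT with GTA), internal nodes labeled ACA and GTA: edge costs 1, 2, 1; MP score = 4 (optimal MP tree)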
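The scores on this slide are sums of the number of changes along each edge. A minimal Python sketch of that bookkeeping follows; the edge-list representation and the function names are assumptions made for illustration, not something taken from the talk.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def labeled_tree_score(edges):
    """Total number of changes on a tree whose nodes are all labeled.

    `edges` is a list of (sequence, sequence) pairs, one per tree edge.
    This representation is hypothetical, chosen only for the example.
    """
    return sum(hamming(u, v) for u, v in edges)


# ACT paired with ACA, GTT paired with GTA; internal nodes labeled ACA and GTA.
optimal_tree = [
    ("ACT", "ACA"), ("ACA", "ACA"),  # two leaves attached to internal node ACA
    ("ACA", "GTA"),                  # internal edge
    ("GTT", "GTA"), ("GTA", "GTA"),  # two leaves attached to internal node GTA
]

# ACT paired with GTA, ACA paired with GTT; internal nodes labeled ACT and ACA.
other_tree = [
    ("ACT", "ACT"), ("GTA", "ACT"),
    ("ACT", "ACA"),
    ("ACA", "ACA"), ("GTT", "ACA"),
]

print(labeled_tree_score(optimal_tree), labeled_tree_score(other_tree))
```

This only scores a tree whose internal labels are already given; choosing the best internal labels is a separate step.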

Maximum Parsimony: computational complexity
[Figure: the optimal tree, with internal nodes labeled ACA and GTA and edge costs 1, 2, 1; MP score = 4]
Optimal labeling of a given tree can be computed in linear time, O(nk).
But how do we find the best tree?
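The talk does not name the linear-time labeling method. One standard choice is Fitch's algorithm, sketched below under the assumption that a tree is encoded as nested pairs with sequences (strings) at the leaves; that encoding and the function names are illustrative only. Each node does a constant amount of work per site, which gives the O(nk) bound for a fixed alphabet.

```python
def fitch(tree):
    """Return (candidate state sets per site, minimum number of changes).

    A tree is either a leaf (an aligned sequence, as a str) or a pair
    (left, right) of subtrees. The unrooted tree is rooted on an arbitrary
    edge; the parsimony score does not depend on where the root is placed.
    """
    if isinstance(tree, str):
        return [{c} for c in tree], 0
    left_sets, left_cost = fitch(tree[0])
    right_sets, right_cost = fitch(tree[1])
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)   # children can agree: no change at this site
        else:
            sets.append(a | b)    # children must differ: one change at this site
            cost += 1
    return sets, cost


def parsimony_score(tree):
    """Minimum number of changes over all labelings of the internal nodes."""
    return fitch(tree)[1]


three_trees = [
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
for t in three_trees:
    print(t, parsimony_score(t))
```

Under this sketch the three topologies score 5, 6, and 4. The middle value is one lower than the 7 shown for that tree on the earlier slide, because the internal labels drawn there (ACA and ACT) are not an optimal labeling of that tree.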

Exhaustive Search
For every tree on the set of sequences, do:
– Score the tree (compute optimal sequences for each internal node, and record the score)
– Keep track of the tree with the best score
How expensive is this?
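The loop above can be sketched in Python for very small inputs. The sketch assumes trees are nested pairs of sequences, enumerates every unrooted binary tree by inserting leaves one at a time into every edge, and scores each tree with a Fitch-style pass; all names here are hypothetical, and it needs at least two sequences.

```python
def parsimony_score(tree):
    """Fitch-style minimum number of changes for a tree of nested pairs."""
    def visit(node):
        if isinstance(node, str):
            return [{c} for c in node], 0
        (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
        sets, cost = [], lc + rc
        for a, b in zip(ls, rs):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                cost += 1
        return sets, cost
    return visit(tree)[1]


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                     # attach on the edge above this node
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_unrooted_trees(leaves):
    """Yield each unrooted binary tree once, drawn as rooted next to leaves[0]."""
    def rooted(ls):
        if len(ls) == 1:
            yield ls[0]
            return
        for t in rooted(ls[:-1]):
            yield from insert_leaf(t, ls[-1])
    for rest in rooted(leaves[1:]):
        yield (leaves[0], rest)


def exhaustive_search(sequences):
    """Score every tree and keep the best one; feasible only for tiny inputs."""
    best_tree, best_score = None, None
    for tree in all_unrooted_trees(sequences):
        score = parsimony_score(tree)
        if best_score is None or score < best_score:
            best_tree, best_score = tree, score
    return best_tree, best_score


print(exhaustive_search(["ACT", "ACA", "GTT", "GTA"]))
```

On the four example sequences this visits three trees and returns a score of 4. The number of trees visited grows so fast with the number of leaves that the approach stops being practical almost immediately.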

Don’t try “exhaustive search”
Number of (unrooted) binary trees on n leaves is (2n-5)!! = (2n-5) x (2n-7) x (2n-9) x … x 3
If each tree on 1000 taxa could be analyzed in 0.001 seconds, finding the best tree would take roughly 6 x 10^2846 millennia.

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
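The formula (2n-5)!! is easy to evaluate exactly with Python's arbitrary-precision integers. The sketch below is illustrative: the function names are hypothetical, a year is assumed to be 365.25 days, and the rate of 0.001 seconds per tree is the one quoted on the slide.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = (2n-5) x (2n-7) x ... x 3, for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):
        count *= factor
    return count


SECONDS_PER_MILLENNIUM = 1000 * 31_557_600  # 1000 years of 365.25 days each


def millennia_to_score_all(n, trees_per_second=1000):
    """Whole millennia needed to score every tree on n leaves at the given rate."""
    return num_unrooted_binary_trees(n) // (trees_per_second * SECONDS_PER_MILLENNIUM)


for n in range(4, 11):
    print(n, num_unrooted_binary_trees(n))
for n in (20, 100, 1000):
    digits = len(str(num_unrooted_binary_trees(n)))
    print(n, "leaves: the number of trees has", digits, "digits")
print("millennia for 1000 leaves:", len(str(millennia_to_score_all(1000))), "digits")
```

The large counts are reported by their number of digits because they are far outside the range of floating-point numbers.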

Maximum Parsimony: computational complexity
[Figure: the optimal tree, with internal nodes labeled ACA and GTA and edge costs 1, 2, 1; MP score = 4]
Optimal labeling of a given tree can be computed in linear time, O(nk).
Finding the optimal MP tree is NP-hard.

NP-hard(ness)
– What does this mean?
– What are the consequences of a problem being NP-hard?
– What kind of methods are used to “solve” NP-hard problems?
– How should you interpret the output of a software program, when the problem is NP-hard?

“Real” problem: your brother’s birthday party
– Your brother is turning 10 and you need to arrange his birthday party.
– He wants all his friends to come.
– But some of them hate each other.
Your objective: have as few parties as you can, but invite everyone to at least one party (while not having people who hate each other at the same party).

Your brother’s party
Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben
Sally and Alice hate each other, as do Henry and Sally, Henry and Tommy, Alice and Jimmy, Ben and Sally, and Ben and Henry.

Graph representation of your brother’s friends
– A graph has vertices and edges
– Vertices = your brother’s friends
– Edges between vertices indicate they hate each other

Coloring vertices to assign friends to parties
Given graph G with vertices and edges, assign colors to the vertices so that no edge connects vertices of the same color, using a minimum number of colors.
– Vertices = your brother’s friends
– Edges between vertices indicate they hate each other
– Colors = parties

Assigning friends to parties: graph coloring!
Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben
We can’t do this with two parties. Why?
What about three?

Your brother’s parties
Solution: three parties!
– Sally, Tommy, and Jimmy
– Henry and Alice
– Ben
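Checking a proposed party assignment is straightforward: no "hate" edge may join two friends at the same party. A small Python sketch, with variable and function names chosen here for illustration:

```python
# Pairs of friends who hate each other (the edges of the graph).
HATES = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]


def is_proper_coloring(edges, color_of):
    """True if no edge connects two vertices with the same color."""
    return all(color_of[u] != color_of[v] for u, v in edges)


parties = [["Sally", "Tommy", "Jimmy"], ["Henry", "Alice"], ["Ben"]]
party_of = {friend: i for i, group in enumerate(parties) for friend in group}

print(is_proper_coloring(HATES, party_of))
```

Verifying a given assignment takes one pass over the edges; the hard part is finding an assignment that uses as few parties as possible.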

What is the minimum number of colors that a graph needs? Remember: no edge between vertices of the same color!

A graph that needs 3 colors

2-colored graph

A computational problem
2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

Can we 2-color this graph?

Can we 2-color these graphs?

Solving 2-colorability
2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
Greedy algorithm: Start with one vertex and make it red, then make all its neighbors blue, and keep going, alternating colors (starting again from an uncolored vertex whenever one connected piece of the graph is finished). If you succeed in coloring the graph without giving two adjacent vertices the same color, the graph can be 2-colored; if you are ever forced to give two adjacent vertices the same color, it cannot.
Running time: O(n+m), where n is the number of vertices and m is the number of edges.
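The greedy procedure can be written as a breadth-first traversal. The sketch below is one possible rendering in Python (the function name and the vertex-list plus edge-list input format are assumptions); it visits every vertex and edge a constant number of times, which matches the O(n+m) bound.

```python
from collections import deque


def two_color(vertices, edges):
    """Return a red/blue coloring as a dict, or None if the graph is not 2-colorable."""
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    color = {}
    for start in vertices:               # restart in every connected component
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in neighbors[u]:
                if w not in color:
                    color[w] = "blue" if color[u] == "red" else "red"
                    queue.append(w)
                elif color[w] == color[u]:
                    return None          # two adjacent vertices forced to match
    return color


friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]
print(two_color(friends, hates))
print(two_color(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]))
```

On the birthday-party graph the function returns None: Sally, Henry, and Ben all hate each other, so two parties cannot be enough.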

What about this?
3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

A 3-colored graph

Can you 3-color these graphs?

How about this graph?

Testing 3-colorability
3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
The “greedy algorithm” will work correctly in some, but not all, cases.

Exhaustive search for 3-colorability
– Look at all possible vertex colorings
– See if any is “legal” (no edge between vertices of the same color)
Problem: there are 3^n vertex colorings of a graph on n vertices
Question to students: how many vertex colorings are there for a graph with 10 vertices? 20 vertices? 100 vertices?
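A brute-force check along these lines can be sketched with `itertools.product`, which generates every assignment of three colors to n vertices. The function name and input format are assumptions for illustration.

```python
from itertools import product


def three_color_exhaustive(vertices, edges, colors=("red", "blue", "green")):
    """Try all 3^n colorings; return the first legal one, or None if there is none."""
    for assignment in product(colors, repeat=len(vertices)):
        color_of = dict(zip(vertices, assignment))
        if all(color_of[u] != color_of[v] for u, v in edges):
            return color_of
    return None


friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]
print(three_color_exhaustive(friends, hates))

# Number of colorings to examine in the worst case.
for n in (10, 20, 100):
    print(n, 3 ** n)
```

For 10 vertices there are 59,049 colorings and for 20 vertices 3,486,784,401; for 100 vertices the count has 48 digits, which is why this approach only works on small graphs.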

What about this?
3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
This problem is NP-hard. What does this mean?

Some decision problems can be solved in polynomial time:
– Can graph G be 2-colored?
– Does graph G have a 3-clique (three vertices that are all adjacent)?
Some decision problems seem to not be solvable in polynomial time:
– Can graph G be 3-colored?
– What is the size of the largest clique in the graph G?
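To see why the 3-clique question is polynomial, note that a graph on n vertices has fewer than n^3 triples of vertices, and each triple can be checked in constant time. A minimal Python sketch (names are hypothetical):

```python
from itertools import combinations


def find_3_clique(vertices, edges):
    """Return three mutually adjacent vertices, or None if no such triple exists."""
    adjacent = {frozenset(e) for e in edges}
    for a, b, c in combinations(vertices, 3):
        if (frozenset((a, b)) in adjacent
                and frozenset((a, c)) in adjacent
                and frozenset((b, c)) in adjacent):
            return (a, b, c)
    return None


friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]
print(find_3_clique(friends, hates))
```

The same idea does not extend to finding the largest clique, because the size of the clique is then part of the question rather than a fixed constant.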

P vs. NP, continued
The “big” question in theoretical computer science is:
– Is it possible to solve an NP-hard problem in polynomial time?
If the answer is “yes” for even one NP-hard problem, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Minimum coloring
Since 3-colorability is NP-hard, finding the minimum number of colors for a graph is NP-hard. That means the problem will be very hard on some graphs, even if others are easy.
So if your brother has a lot of friends, arranging the minimum number of parties could take you a very very very very very long time.
So forget solving this problem exactly!

Solving NP-hard optimization problems (like min coloring)
Options:
– Solve the problem exactly (but use lots of time on some inputs)
– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)
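As an illustration of the second option, here is a sketch of a simple greedy heuristic for min coloring, written in Python. It is not a method from the talk: it processes vertices in a given order and gives each one the smallest color not already used by a neighbor. The result is always a legal coloring, but the number of colors depends on the order.

```python
def greedy_coloring(order, edges):
    """Color vertices in the given order with the smallest color (0, 1, 2, ...) available."""
    neighbors = {v: set() for v in order}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    color = {}
    for v in order:
        used = {color[w] for w in neighbors[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# A path a - b - c - d can be colored with two colors.
path = [("a", "b"), ("b", "c"), ("c", "d")]
lucky = greedy_coloring(["a", "b", "c", "d"], path)
unlucky = greedy_coloring(["a", "d", "b", "c"], path)
print(len(set(lucky.values())), len(set(unlucky.values())))
```

With the first order the heuristic uses two colors, which is optimal; with the second it uses three. The heuristic is fast, but nothing in its output tells you which of the two cases you are in.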

Phylogeny estimation is NP-hard, so
– Most methods that are used for maximum parsimony (or maximum likelihood) are heuristics that are not guaranteed to solve the problems exactly.
– Even the best methods can take a very long time (months or more) on some inputs, without being guaranteed to solve their problems well.
– You do not know how poor the solution is.

Hill-climbing for phylogeny estimation
Start with some tree and score it
Repeat
– Change the tree slightly, and see if the new tree has a better score.
until no neighbor of your best tree has a better score (i.e., stop at a local optimum)
Return the best tree you found
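The loop above can be sketched in Python as follows. Several choices here are assumptions made to keep the example short: trees are nested pairs of distinct sequences, the score is the Fitch parsimony score, and a "slight change" is a swap of two leaves. Production heuristics typically use richer rearrangements such as NNI, SPR, or TBR.

```python
from itertools import combinations


def parsimony_score(tree):
    """Fitch-style minimum number of changes for a tree of nested pairs."""
    def visit(node):
        if isinstance(node, str):
            return [{c} for c in node], 0
        (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
        sets, cost = [], lc + rc
        for a, b in zip(ls, rs):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                cost += 1
        return sets, cost
    return visit(tree)[1]


def leaves_of(tree):
    """List the leaf sequences of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves_of(tree[0]) + leaves_of(tree[1])


def swap_leaves(tree, a, b):
    """Return a copy of the tree with leaves a and b exchanged."""
    if isinstance(tree, str):
        return b if tree == a else a if tree == b else tree
    return (swap_leaves(tree[0], a, b), swap_leaves(tree[1], a, b))


def neighbors(tree):
    """All trees reachable by swapping one pair of leaves (illustrative move set)."""
    for a, b in combinations(leaves_of(tree), 2):
        yield swap_leaves(tree, a, b)


def hill_climb(start):
    """Move to a better-scoring neighbor until none exists (a local optimum)."""
    best, best_score = start, parsimony_score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            score = parsimony_score(candidate)
            if score < best_score:
                best, best_score = candidate, score
                improved = True
                break
    return best, best_score


start = (("ACT", "GTT"), ("ACA", "GTA"))   # scores 5
print(hill_climb(start))
```

On this four-sequence example the search reaches a tree with score 4. With only three possible trees it cannot get stuck, but on larger inputs the tree it stops at depends on the starting tree and on the move set, and nothing guarantees that it is the global optimum.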

Exploring “tree space”

“Solving” NP-hard phylogenetic estimation problems
Two problems:
1. Getting “stuck” in local optima
2. Taking too long to get to good solutions
[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]

Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time]

Observations
– The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.
– Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
– Apparent convergence can be misleading.

If a problem is NP-hard
– Some inputs you can solve correctly and quickly, using simple algorithms.
– Some inputs you can solve correctly, but it will take a long time.
– Some algorithms will give incorrect answers on some inputs.
– You may not know if your answer is correct or not.

Lessons
– Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
– Therefore we still need better heuristics.
– Biologists should be cautious in believing that the trees found are actually “optimal”.

Reconstructing the “Tree” of Life
– Handling large datasets: millions of species
– The “Tree of Life” is not really a tree: reticulate evolution

Phylolab, U. Texas Please visit us at http://www.cs.utexas.edu/users/phylo/

Acknowledgements Funding: NSF and the David and Lucile Packard Foundation Collaborators and students: Bernard Moret, Luay Nakhleh, Usman Roshan, and Tiffani Williams

General comments for NP-hard optimization problems
– Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.
– You may not know when you have an optimal solution, if you use a heuristic.
– Sometimes exact solutions may not be necessary, and approximate solutions may suffice.
– But, how good an approximation do you need?
