Tandy Warnow Department of Computer Sciences

1 Combinatorial and graph-theoretic problems in evolutionary tree reconstruction
Tandy Warnow Department of Computer Sciences University of Texas at Austin

2 Reconstructing the “Tree” of Life
Handling large datasets: millions of species. The “Tree of Life” is not really a tree: reticulate evolution.

3 Evolution informs about everything in biology
Big genome sequencing projects just produce data -- so what? Evolutionary history relates all organisms and genes, and helps us understand and predict: interactions between genes (genetic networks); drug design; predicting functions of genes; influenza vaccine development; origins and spread of disease; origins and migrations of humans.

4 Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

5 Challenges in estimating phylogenies
Computational: almost all “good” approaches for estimating phylogenies involve solving NP-hard problems. Statistical. Data.

6 Major methods for phylogeny reconstruction
Biology: heuristics for NP-hard optimization problems. Linguistics: an exact algorithm for an NP-hard optimization problem.

7 Outline for the rest of the talk
NP-hard and polynomial time problems. Phylogeny reconstruction in biology: the challenge is to develop better heuristics for NP-hard problems. Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly.

8 A polynomial-time problem
2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
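The greedy procedure above can be sketched as a breadth-first search. This is a minimal illustration, not code from the talk: the function name `two_color` and the input format (vertices numbered 0..n-1, edges as pairs) are assumptions, and the search is restarted in every connected component so that disconnected graphs are also handled.

```python
from collections import deque


def two_color(n, edges):
    """Return a list of colors per vertex, or None if the graph is not 2-colorable.

    Hypothetical input format: vertices are 0..n-1, edges is a list of (u, v) pairs.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    color = [None] * n
    for start in range(n):              # restart in every connected component
        if color[start] is not None:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    # give each uncolored neighbor the opposite color
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:
                    return None         # an edge joins two same-colored vertices
    return color
```

Every vertex is enqueued once and every edge is examined twice, which matches the O(n+m) bound stated on the slide.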

11 What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

12 What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. A brute-force solution seems to require O(3^n) time, where n is the number of vertices.
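To make the exponential cost concrete, here is a small brute-force sketch that simply tries every assignment of three colors to n vertices. The function name and input format are assumptions made for illustration; the loop may visit all 3^n assignments before giving up.

```python
from itertools import product


def three_color_brute_force(n, edges):
    """Try all 3**n colorings; return the first proper one, or None.

    Hypothetical input format: vertices are 0..n-1, edges is a list of (u, v) pairs.
    """
    for coloring in product(("red", "blue", "green"), repeat=n):
        # a coloring is proper if no edge joins two vertices of the same color
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None
```

Already at n = 40 there are more than 10^19 candidate colorings, which is why this approach is only usable on very small graphs.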

13 Some decision problems can be solved in polynomial time:
Can graph G be 2-colored? Does graph G have an Eulerian tour? Some decision problems seem not to be solvable in polynomial time: Can graph G be 3-colored? Does graph G have a Hamiltonian cycle?

14 What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. This problem is provably NP-hard. What does this mean?

15 P vs. NP, continued The “big” question in theoretical computer science is: Is it possible to solve an NP-hard problem such as 3-colorability in polynomial time? If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

16 Coping with NP-hard problems
Since NP-hard problems may not be solvable in polynomial time, the options are: solve the problem exactly (but use lots of time on some inputs); or use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway).

17 General comments for NP-hard optimization problems
Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time. You may not know when you have an optimal solution, if you use a heuristic. Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?

18 DNA Sequence Evolution
[Figure: a tree of DNA sequences (AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA) drawn against a time axis labelled -3 mil yrs, -2 mil yrs, -1 mil yrs, today.]

19 Molecular Systematics
[Figure: taxa U, V, W, X, Y with the sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and a tree on those taxa.]

20 Maximum Parsimony Given a set S of sequences of the same length over the nucleotide alphabet {A,C,T,G}, find a tree leaf-labelled by S, with other DNA sequences (of the same length) labelling the internal nodes, so as to minimize the “length” of the tree (the sum of the Hamming distances on the edges). NP-hard!
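The objective being minimized is easy to evaluate once a tree and all its node labels are fixed; the hard part is searching over trees and internal labels. The sketch below only scores one candidate. The names `hamming` and `tree_length`, and the representation of a tree as an edge list plus a label dictionary, are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, label):
    """Parsimony length: sum of Hamming distances over the tree's edges.

    edges: list of (u, v) node-name pairs; label: dict node name -> sequence
    (leaves and internal nodes alike). Both formats are hypothetical.
    """
    return sum(hamming(label[u], label[v]) for u, v in edges)


# Made-up example: four leaves, two internal nodes x and y.
example_label = {"l1": "ACT", "l2": "ACA", "l3": "GTT", "l4": "GTA",
                 "x": "ACT", "y": "GTT"}
example_edges = [("l1", "x"), ("l2", "x"), ("x", "y"), ("l3", "y"), ("l4", "y")]
example_score = tree_length(example_edges, example_label)
```

In the made-up example the edge costs are 0, 1, 2, 0 and 1, so the tree has length 4.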

21 Solving NP-hard problems exactly is … unlikely
Number of (unrooted) binary trees on n leaves is (2n-5)!!
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
If each tree on 1000 taxa could be analyzed in seconds, we would find the best tree in 2890 millennia
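The double factorial (2n-5)!! is the product of the odd numbers 1, 3, 5, ..., 2n-5, so the counts can be reproduced with a few lines of code. The function name below is an assumption; Python integers are arbitrary precision, so the function also works for large n.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!!, the number of unrooted binary trees on n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):    # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```

For example, `len(str(num_unrooted_binary_trees(100)))` gives the number of decimal digits of the count for 100 leaves.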

22 Research: we try to develop better heuristics
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. Legend: current best techniques; DCM boosted version of best techniques.]

23 Summary (so far) Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima. The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

24 Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

25 Phylogenies of Languages
Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear). The result can be modelled as a rooted tree. The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

26 “Homoplasy-Free” Evolution (perfect phylogenies)
[Figure: two example trees, one labelled YES (homoplasy-free) and one labelled NO.]

27 Historical Linguistic Data
A character is a function that maps a set of languages, L, to a set of states. Three kinds of characters: phonological (sound changes); lexical (meanings based on a wordlist); morphological (grammatical features).

28 Cognate Classes Two words w1 and w2 are in the same cognate class if they evolved from the same word through sound changes. French “champ” and Italian “campo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class. Spanish “mucho” and English “much” are not in the same cognate class.

29 The Ringe-Warnow Model of Language Evolution
The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree. The model is known as Character Compatibility, or Perfect Phylogeny.

30 Perfect Phylogeny A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution).

31 Perfect phylogenies, cont.
A=(0,0), B=(0,1), C=(1,3), D=(1,2) has a perfect phylogeny! A=(0,0), B=(0,1), C=(1,0), D=(1,1) does not have a perfect phylogeny!

32 A perfect phylogeny A = (0,0), B = (0,1), C = (1,3), D = (1,2). [Figure: a tree with leaves A, C, D, B.]

33 A perfect phylogeny A = (0,0), B = (0,1), C = (1,3), D = (1,2), E = (0,3), F = (1,3). [Figure: the tree on A, B, C, D with E and F as the additional labelled nodes.]
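The defining condition (each state of each character occupies a connected subtree) can be checked directly once every node of a tree carries a label. The sketch below is illustrative only. The tree shape used in the example (A and B attached to E, C and D attached to F, and an edge E-F) is an assumption, since the transcript does not preserve the drawing; the function name and data layout are also hypothetical.

```python
from collections import defaultdict


def is_perfect_phylogeny(edges, states):
    """Check that, for every character, each state induces a connected subtree.

    edges: list of (u, v) tree edges; states: dict node -> tuple of character states.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    num_chars = len(next(iter(states.values())))
    for c in range(num_chars):
        nodes_by_state = defaultdict(set)
        for node, vec in states.items():
            nodes_by_state[vec[c]].add(node)
        for nodes in nodes_by_state.values():
            # depth-first search restricted to the nodes sharing this state
            start = next(iter(nodes))
            seen = {start}
            stack = [start]
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if v in nodes and v not in seen:
                        seen.add(v)
                        stack.append(v)
            if seen != nodes:
                return False            # this state is split across the tree
    return True


slide_states = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2),
                "E": (0, 3), "F": (1, 3)}
assumed_edges = [("A", "E"), ("B", "E"), ("E", "F"), ("C", "F"), ("D", "F")]
```

With the assumed edges, `is_perfect_phylogeny(assumed_edges, slide_states)` returns True; moving B so that it hangs off F instead of E breaks the condition for the first character.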

34 The Perfect Phylogeny Problem
Given a set S of taxa (species, languages, etc.), determine if a perfect phylogeny T exists for S. The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

35 Triangulated Graphs A graph is triangulated (chordal) if it has no induced simple cycles of size four or more, i.e., every cycle of four or more vertices has a chord.
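One standard way to test whether a graph is triangulated uses the fact that such graphs are exactly those whose vertices can be deleted one at a time, each deleted vertex being simplicial (its remaining neighbors are pairwise adjacent). The sketch below implements that elimination test naively; it is not a linear-time method, and the function name and input format are assumptions.

```python
def is_triangulated(vertices, edges):
    """Return True if the graph is triangulated (chordal).

    Repeatedly deletes a simplicial vertex; fails if none exists.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    remaining = set(vertices)
    while remaining:
        simplicial = None
        for v in remaining:
            nbrs = adj[v] & remaining
            # v is simplicial if its remaining neighbors form a clique
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                simplicial = v
                break
        if simplicial is None:
            return False                # some chordless cycle of length >= 4 remains
        remaining.remove(simplicial)
    return True
```

A 4-cycle fails the test, while the same cycle with one diagonal added passes.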

36 Triangulating Colored Graphs: An Example
A graph that can be c-triangulated

38 Triangulating Colored Graphs: An Example
A graph that cannot be c-triangulated

39 Triangulating Colored Graphs (TCG)
Triangulating Colored Graphs: given a vertex-colored graph G, determine if G can be c-triangulated.

40 The PP and TCG Problems Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated. The PP and TCG problems are polynomially equivalent and NP-hard.

41 A no-instance of Perfect Phylogeny
A = (0,0), B = (0,1), C = (1,0), D = (1,1). An input to perfect phylogeny (left) of four sequences described by two characters, and its partition intersection graph. Note that the partition intersection graph is 2-colored.
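For two characters the graph is bipartite, and adding edges only between differently colored vertices keeps it bipartite; a bipartite graph is triangulated only if it has no cycles at all. So, in the two-character case, a perfect phylogeny exists exactly when the partition intersection graph is acyclic. The sketch below builds the graph (one vertex per character state, an edge when some taxon exhibits both states) and tests for a cycle with union-find. Function names and the input format are assumptions.

```python
def partition_intersection_graph(taxa):
    """Edges between (character index, state) vertices that co-occur in a taxon.

    taxa: dict name -> tuple of character states (hypothetical format).
    """
    edges = set()
    for vec in taxa.values():
        for i in range(len(vec)):
            for j in range(i + 1, len(vec)):
                edges.add(((i, vec[i]), (j, vec[j])))
    return edges


def is_acyclic(edges):
    """Union-find cycle test on an undirected simple graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False                    # this edge closes a cycle
        parent[root_u] = root_v
    return True


no_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
yes_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2)}
```

On the no-instance the graph is a 4-cycle, so the test fails; on the yes-instance from the earlier slide the graph is a forest.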

42 Solving the PP Problem Using Buneman’s Theorem
“Yes” Instance of PP: [Table: four taxa (rows s) described by three characters (columns c1, c2, c3); the entries are not preserved.]

44 Some special cases are easy
Binary character perfect phylogeny: solvable in linear time. r-state characters: solvable in polynomial time for each r (combinatorial algorithm). Two character perfect phylogeny: solvable in polynomial time (produces 2-colored graph). k-character perfect phylogeny: solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory).
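For binary characters, a simple (though not linear-time) check is the pairwise four-gamete test: an unrooted perfect phylogeny exists exactly when no pair of characters exhibits all four combinations 00, 01, 10, 11 across the taxa. The sketch below is this quadratic test, not the linear-time algorithm the slide refers to; the function name and the row-per-taxon matrix format are assumptions.

```python
from itertools import combinations


def binary_perfect_phylogeny_exists(matrix):
    """Four-gamete test on a 0/1 matrix (rows = taxa, columns = characters)."""
    num_chars = len(matrix[0]) if matrix else 0
    for i, j in combinations(range(num_chars), 2):
        # collect the distinct state pairs seen for characters i and j
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return False                # characters i and j are incompatible
    return True
```

The running time is O(n k^2) for n taxa and k characters, since every pair of columns is scanned once.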

45 Constructing trees in historical linguistics
Maximum Compatibility: given the input matrix for the set S of languages described by the set C of characters, find a tree T leaf-labelled by S on which a maximum number of the characters in C are compatible (i.e., evolve without homoplasy). NP-hard.

46 The Indo-European (IE) Dataset
24 languages; 22 phonological characters, 15 morphological characters, and 333 lexical characters. Total number of working characters is 390 (multiple character coding, and parallel development). A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow): T is compatible with all but 16 characters. It resolves most of the significant controversies in Indo-European evolution; it shows, however, that Germanic is a problem (not treelike).

47 Phylogenetic Tree of the IE Dataset (Ringe, Warnow, and Taylor)

48 Explaining remaining incompatibilities
We modelled the remaining incompatibilities as undetected borrowing between languages. This leads to the mathematical model of “perfect phylogenetic networks”.

49 Modelling borrowing: Networks and Trees within Networks

50 “Perfect Phylogenetic Network” for IE Nakhleh et al., Language 2005

51 Summary NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions. Many real problems have beautiful and natural combinatorial and graph-theoretic formulations.

52 Acknowledgements Funding: NSF and the David and Lucile Packard Foundation. Collaborators: Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics). Students: Usman Roshan and Luay Nakhleh.

53 Phylolab, U. Texas Please visit us at

