Tandy Warnow Department of Computer Sciences


Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin

Reconstructing the “Tree” of Life. Handling large datasets: millions of species. The “Tree of Life” is not really a tree: reticulate evolution.

Evolution informs about everything in biology. Big genome sequencing projects just produce data -- so what? Evolutionary history relates all organisms and genes, and helps us understand and predict: interactions between genes (genetic networks), drug design, functions of genes, influenza vaccine development, origins and spread of disease, origins and migrations of humans.

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Challenges in estimating phylogenies. Computational: almost all “good” approaches for estimating phylogenies involve solving NP-hard problems. Statistical. Data.

Major methods for phylogeny reconstruction. Biology: heuristics for NP-hard optimization problems. Linguistics: an exact algorithm for an NP-hard optimization problem.

Outline for the rest of the talk: (1) NP-hard and polynomial time problems. (2) Phylogeny reconstruction in biology: the challenge is to develop better heuristics for NP-hard problems. (3) Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly.

A polynomial-time problem 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
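
The greedy procedure on this slide can be sketched as a breadth-first search. This is a minimal illustration, not code from the talk: the function name `two_color` and the adjacency-dictionary input format are assumptions made for the example. It restarts from every uncolored vertex so that disconnected graphs are handled, and it visits each vertex and edge a constant number of times, which gives the O(n+m) bound.

```python
from collections import deque

def two_color(adj):
    """Return a dict vertex -> 'red'/'blue' if the graph is 2-colorable, else None.

    adj maps each vertex to a list of its neighbors (undirected graph).
    The name and input format are hypothetical, chosen for this sketch.
    """
    color = {}
    for start in adj:                        # restart in every connected component
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:           # give the neighbor the opposite color
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:   # an edge with both ends the same color
                    return None
    return color

path = {1: [2], 2: [1, 3], 3: [2]}            # a path on three vertices
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}  # an odd cycle
print(two_color(path))
print(two_color(triangle))
```

The path is 2-colorable, while the triangle (an odd cycle) is not, so the second call returns `None`.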

What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. A brute-force solution seems to require O(3^n) time, where n is the number of vertices.
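
To make the brute-force cost concrete, here is a small sketch that simply tries every assignment of three colors to n vertices, so it examines up to 3^n colorings. The function name and the vertex-list/edge-list input format are assumptions for illustration only.

```python
from itertools import product

def three_colorable_brute_force(vertices, edges):
    """Try all 3^n colorings; return a valid one as a dict, or None.

    vertices: list of vertex names; edges: list of (u, v) pairs.
    Hypothetical helper written for this example.
    """
    colors = ("red", "blue", "green")
    for assignment in product(colors, repeat=len(vertices)):
        coloring = dict(zip(vertices, assignment))
        # A coloring is valid if no edge has both ends the same color.
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None

triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
k4 = ([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
print(three_colorable_brute_force(*triangle) is not None)
print(three_colorable_brute_force(*k4) is not None)
```

The triangle can be 3-colored; the complete graph on four vertices cannot, so the search exhausts all 81 assignments before giving up.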

Some decision problems can be solved in polynomial time: Can graph G be 2-colored? Does graph G have an Eulerian tour? Some decision problems do not seem to be solvable in polynomial time: Can graph G be 3-colored? Does graph G have a Hamiltonian cycle?

What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. This problem is provably NP-hard. What does this mean?

P vs. NP, continued. The “big” question in theoretical computer science is: Is it possible to solve an NP-hard problem in polynomial time? If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Coping with NP-hard problems Since NP-hard problems may not be solvable in polynomial time, the options are: Solve the problem exactly (but use lots of time on some inputs) Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

General comments for NP-hard optimization problems Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time. You may not know when you have an optimal solution, if you use a heuristic. Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?

DNA Sequence Evolution [Figure: a tree of evolving DNA sequences with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA.]

Molecular Systematics [Figure: the sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, labelled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y.]

Maximum Parsimony: Given a set S of sequences of the same length over the nucleotide alphabet {A,C,T,G}, find a tree leaf-labelled by S, with other DNA sequences (of the same length) labelling the internal nodes, so as to minimize the “length” of the tree (the sum of the Hamming distances on the edges). NP-hard!
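
The “length” being minimized is easy to compute once a tree and all of its node labels are fixed; the hard part is searching over trees and internal labels. The sketch below only evaluates the length of one fully labelled tree. The tree, the node names (`L1`..`L4`, `I1`, `I2`), and the three-letter sequences are made up for illustration and do not come from the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_length(edges, label):
    """Parsimony length: sum of Hamming distances over the edges of the tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)

# Hypothetical example: four leaves (L1..L4) and two internal nodes (I1, I2).
label = {"L1": "ACT", "L2": "ACA", "L3": "GTT", "L4": "GTA",
         "I1": "ACT", "I2": "GTT"}
edges = [("L1", "I1"), ("L2", "I1"), ("I1", "I2"), ("L3", "I2"), ("L4", "I2")]
print(tree_length(edges, label))
```

For this labelling the edge costs are 0, 1, 2, 0 and 1, so the tree has length 4.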

Solving NP-hard problems exactly is … unlikely
The number of (unrooted) binary trees on n leaves is (2n-5)!!
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
If each tree on 1000 taxa could be analyzed in 0.001 seconds, an exhaustive search would still need about 1.9 x 10^2857 seconds, vastly longer than the age of the universe.
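
The counts on this slide follow directly from the double-factorial formula (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). A short sketch, with a function name chosen for this example, evaluates it exactly using Python's arbitrary-precision integers, so the growth can be checked for any n.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))

# For larger n, report only the number of decimal digits.
for n in (20, 100, 1000):
    print(n, len(str(num_unrooted_binary_trees(n))), "digits")
```

Even at n = 20 the count already exceeds what any exhaustive search could enumerate in practice.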

Research: we try to develop better heuristics. [Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. Legend: current best techniques; DCM boosted version of best techniques.]

Summary (so far) Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima. The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Phylogenies of Languages Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear) The result can be modelled as a rooted tree The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

“Homoplasy-Free” Evolution (perfect phylogenies) [Figure: two examples of character evolution on a tree, one labelled YES and one labelled NO.]

Historical Linguistic Data A character is a function that maps a set of languages, L, to a set of states. Three kinds of characters: Phonological (sound changes) Lexical (meanings based on a wordlist) Morphological (grammatical features)

Cognate Classes: Two words w1 and w2 are in the same cognate class if they evolved from the same word through sound changes. French “champ” and Italian “campo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class. Spanish “mucho” and English “much” are not in the same cognate class.

The Ringe-Warnow Model of Language Evolution: The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree. This condition is known as Character Compatibility, or Perfect Phylogeny.

Perfect Phylogeny: A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution).

Perfect phylogenies, cont. A=(0,0), B=(0,1), C=(1,3), D=(1,2) has a perfect phylogeny! A=(0,0), B=(0,1), C=(1,0), D=(1,1) does not have a perfect phylogeny!
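
Both four-taxon examples on this slide can be checked by brute force. The sketch below relies on a standard fact: a character with r observed states is compatible with a tree exactly when its parsimony score on that tree is r - 1. It tries the three binary quartet topologies and scores each character with Fitch's algorithm. The function names are hypothetical, and the approach is only meant for four taxa, not as a general perfect phylogeny algorithm.

```python
def fitch_score(tree, states):
    """Parsimony score of one character on a rooted binary tree (nested tuples)."""
    cost = 0

    def visit(node):
        nonlocal cost
        if not isinstance(node, tuple):      # leaf: a taxon name
            return {states[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:                           # children agree: no change needed
            return common
        cost += 1                            # children disagree: one change
        return left | right

    visit(tree)
    return cost

def has_perfect_phylogeny_4(taxa):
    """taxa: dict of exactly four names -> tuple of states. Return a quartet or None."""
    a, b, c, d = taxa
    quartets = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    num_chars = len(taxa[a])
    for tree in quartets:
        compatible = True
        for i in range(num_chars):
            states = {t: taxa[t][i] for t in taxa}
            # Compatible iff the score equals (number of observed states - 1).
            if fitch_score(tree, states) != len(set(states.values())) - 1:
                compatible = False
                break
        if compatible:
            return tree
    return None

yes_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2)}
no_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
print(has_perfect_phylogeny_4(yes_instance))
print(has_perfect_phylogeny_4(no_instance))
```

The first input is compatible with the quartet that groups A with B and C with D; the second is compatible with none of the three quartets, so the function returns `None`.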

A perfect phylogeny for A = (0,0), B = (0,1), C = (1,3), D = (1,2). [Figure: a tree with leaves A, B, C, D.]

A perfect phylogeny for A = (0,0), B = (0,1), C = (1,3), D = (1,2), with additional nodes E = (0,3) and F = (1,3). [Figure: the tree with all nodes labelled.]

The Perfect Phylogeny Problem Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S. The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

Triangulated Graphs: A graph is triangulated if it has no induced (chordless) simple cycles of size four or more.
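
In the standard sense, a triangulated (chordal) graph is one in which every cycle of length four or more has a chord. One simple way to test this, sketched below, uses the fact that a graph is chordal exactly when its vertices can be removed one at a time, always removing a vertex whose remaining neighbors form a clique (a simplicial vertex). This is a straightforward polynomial-time check written for illustration, not an optimized algorithm; the function name and input format are assumptions.

```python
def is_triangulated(adj):
    """Return True if the undirected graph is triangulated (chordal).

    adj: dict vertex -> set of neighbors. Hypothetical helper for this sketch.
    """
    graph = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    while graph:
        simplicial = None
        for v, nbrs in graph.items():
            # v is simplicial if every pair of its neighbors is adjacent.
            if all(u in graph[w] for u in nbrs for w in nbrs if u != w):
                simplicial = v
                break
        if simplicial is None:        # no simplicial vertex: a chordless cycle exists
            return False
        for u in graph[simplicial]:   # remove the simplicial vertex
            graph[u].discard(simplicial)
        del graph[simplicial]
    return True

four_cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
four_cycle_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_triangulated(four_cycle))
print(is_triangulated(four_cycle_with_chord))
```

The plain 4-cycle is not triangulated; adding the chord between vertices 1 and 3 makes it so.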

Triangulating Colored Graphs: An Example A graph that can be c-triangulated

Triangulating Colored Graphs: An Example A graph that cannot be c-triangulated

Triangulating Colored Graphs (TCG): given a vertex-colored graph G, determine if G can be c-triangulated, that is, whether edges can be added to G to make it triangulated without ever joining two vertices of the same color.

The PP and TCG Problems. Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated. The PP and TCG problems are polynomially equivalent and NP-hard.

A no-instance of Perfect Phylogeny: A = (0,0), B = (0,1), C = (1,0), D = (1,1). [Figure: an input to perfect phylogeny (left) of four sequences described by two characters, and its partition intersection graph.] Note that the partition intersection graph is 2-colored.
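
The partition intersection graph for a small input can be built directly: there is one vertex per (character, state) pair, colored by its character, and two vertices are joined whenever some taxon exhibits both states. The sketch below uses the four-taxon no-instance A = (0,0), B = (0,1), C = (1,0), D = (1,1); the function name and data layout are assumptions for illustration.

```python
from itertools import combinations

def partition_intersection_graph(matrix):
    """Build the partition intersection graph of a character matrix.

    matrix: dict taxon -> tuple of states, one state per character.
    Returns (vertices, edges); a vertex is (character index, state),
    and its color is the character index.
    """
    vertices, edges = set(), set()
    for states in matrix.values():
        nodes = [(i, s) for i, s in enumerate(states)]
        vertices.update(nodes)
        for u, v in combinations(nodes, 2):   # states that co-occur in one taxon
            edges.add(frozenset((u, v)))
    return vertices, edges

no_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
vertices, edges = partition_intersection_graph(no_instance)
degree = {v: sum(v in e for e in edges) for v in vertices}
print(sorted(vertices))
print(sorted(tuple(sorted(e)) for e in edges))
print(degree)
```

Here the graph has four vertices, four edges, and every vertex has degree two, so it is a single 4-cycle. The only possible chords would join two states of the same character, that is, two vertices of the same color, which is why this input cannot be c-triangulated.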

Solving the PP Problem Using Buneman’s Theorem. “Yes” Instance of PP:
      c1  c2  c3
s1     3   2   1
s2     1   2   2
s3     1   1   3
s4     2   1   1

Some special cases are easy. Binary character perfect phylogeny is solvable in linear time. r-state characters are solvable in polynomial time for each r (combinatorial algorithm). Two character perfect phylogeny is solvable in polynomial time (produces a 2-colored graph). k-character perfect phylogeny is solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory).
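
For binary characters, one simple way to decide whether a perfect phylogeny exists is the pairwise "four-gamete" test: a set of binary characters is compatible exactly when no pair of characters shows all four combinations 00, 01, 10 and 11 across the taxa. The sketch below implements that pairwise test. It is quadratic in the number of characters, so it is not the linear-time algorithm mentioned on the slide; the function name and matrix layout are assumptions.

```python
from itertools import combinations

def binary_perfect_phylogeny_exists(matrix):
    """Pairwise four-gamete test for binary characters (unrooted version).

    matrix: list of rows, one per taxon, each a sequence of 0/1 states.
    """
    if not matrix:
        return True
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:          # 00, 01, 10 and 11 all present
            return False
    return True

print(binary_perfect_phylogeny_exists([[0, 0], [0, 1], [1, 0], [1, 1]]))
print(binary_perfect_phylogeny_exists([[0, 0], [0, 1], [1, 1]]))
```

The first matrix contains all four combinations for its two characters and fails the test; the second passes.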

Constructing trees in historical linguistics Maximum Compatibility: given the input matrix for the set S of languages described by the set C of characters, find a tree T leaf-labelled by S on which a maximum number of the characters in C are compatible (i.e., evolve without homoplasy). NP-hard.

The Indo-European (IE) Dataset: 24 languages; 22 phonological characters, 15 morphological characters, and 333 lexical characters. The total number of working characters is 390 (multiple character coding, and parallel development). A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow): T is compatible with all but 16 characters. It resolves most of the significant controversies in Indo-European evolution; it shows, however, that Germanic is a problem (not treelike).

Phylogenetic Tree of the IE Dataset (Ringe, Warnow, and Taylor)

Explaining remaining incompatibilities: We modelled the remaining incompatibilities as undetected borrowing between languages. This leads to the mathematical model of “perfect phylogenetic networks”.

Modelling borrowing: Networks and Trees within Networks

“Perfect Phylogenetic Network” for IE Nakhleh et al., Language 2005

Summary: NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions. Many real problems have beautiful and natural combinatorial and graph-theoretic formulations.

Acknowledgements NSF and the David and Lucile Packard Foundation (funding) Collaborators Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics) Students: Usman Roshan and Luay Nakhleh

Phylolab, U. Texas Please visit us at http://www.cs.utexas.edu/users/phylo/