NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Phylogeny
[Figure: phylogenetic tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]

DNA Sequence Evolution
[Figure: a tree drawn over a timeline (-3 mil yrs, -2 mil yrs, -1 mil yrs, today) in which the ancestral sequence AAGACTT evolves by point mutations into the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT.]

Evolution informs about everything in biology
Big genome sequencing projects just produce data. So what?
Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans

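Molecular Systematics
[Figure: five sequences, U = TAGCCCA, V = TAGACTT, W = TGCACAA, X = TGCGCTT, Y = AGGGCAT, and a tree with leaves U, V, W, X, Y to be estimated from them.]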

Major methods for phylogeny reconstruction Biology: Polynomial time methods (good enough for small datasets), and local search heuristics for NP-hard optimization problems Linguistics: an exact algorithm for an NP-hard optimization problem

Outline for the rest of the talk NP-hard and polynomial time problems Phylogeny reconstruction in biology: the NP-hard maximum parsimony problem, and how we can solve it better Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly An open problem from whole genome phylogeny Thoughts about computational biology, and the role of mathematics in this field

Polynomial-time problems
Shortest path: Given edge-weighted graph G = (V,E) and two vertices, v and w, find shortest path from v to w (O(n^2) time)
2-colorability: Given graph G = (V,E), determine if we can assign two colors to the vertices of G so that no edge connects vertices of the same color (O(n+m) time)
3-clique: Given graph G = (V,E), determine if G contains a 3-clique (O(n^3) time)
For all these, n=|V| and m=|E|.
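A minimal sketch of the O(n+m) 2-colorability test, assuming vertices are numbered 0..n-1 and the graph is given as an edge list. The function name `two_color` is illustrative and does not come from the talk; the method is plain breadth-first search.

```python
from collections import deque

def two_color(n, edges):
    """Return a list of colors (0 or 1) for vertices 0..n-1, or None if impossible."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for start in range(n):              # handle every connected component
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # neighbor gets the other color
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # an edge joins two same-colored vertices
    return color

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
triangle = [(0, 1), (1, 2), (2, 0)]
print(two_color(4, square))    # [0, 1, 0, 1]
print(two_color(3, triangle))  # None
```

Each vertex and each edge is examined a constant number of times, which is where the O(n+m) bound comes from.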

NP-hard problems Some problems seem “hard” to solve: Hamilton path: Given graph G, determine if G has a simple path going through every vertex 3-colorability: Given graph G, determine if G can be properly 3-colored Max-clique: Given graph G, find a largest clique in the graph

Technical definition of NP-hard
NP is the class of decision problems for which “yes” instances can be “proven” (verified from a short certificate) in polynomial time. (Example: I can prove to you that a graph has a 3-coloring by presenting that 3-coloring to you. So 3-coloring is in NP.)
Definition: A problem X is NP-hard if every problem in NP can be reduced to X in polynomial time (yes-instances mapped to yes-instances, and no-instances mapped to no-instances). For example, 3-coloring is NP-hard and 2-coloring is in NP, so 2-coloring can be reduced to 3-coloring.
Definition: A decision problem X is in P if it can be solved in polynomial time; every problem in P is also in NP.
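The "proof" of a yes-instance can be made concrete. The sketch below, with the hypothetical name `is_proper_coloring`, checks a claimed coloring in time linear in the size of the graph; it assumes the coloring is a dictionary from vertex to a color in 0..num_colors-1. Verifying is easy even though finding a 3-coloring is believed to be hard.

```python
def is_proper_coloring(edges, coloring, num_colors=3):
    """Verify a claimed coloring: every endpoint colored, no edge monochromatic."""
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False                      # certificate is incomplete
        if coloring[u] == coloring[v]:
            return False                      # edge joins two same-colored vertices
    return all(c in range(num_colors) for c in coloring.values())

five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
good = {0: 0, 1: 1, 2: 0, 3: 1, 4: 2}
bad = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
print(is_proper_coloring(five_cycle, good))  # True
print(is_proper_coloring(five_cycle, bad))   # False
```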

NP-hard optimization problems Graph-theoretic examples: –Travelling Salesperson (TSP): find minimum cost tour visiting every vertex –Maximum Clique: find maximum sized subset of vertices which are all pairwise adjacent –Minimum Vertex Coloring: find minimum number of colors so that every vertex can be assigned a color, and no edge connects vertices of the same color.

NP-hard decision problems Each optimization problem has corresponding decision problem. For example, the max clique optimization problem corresponds to the decision problem: Input: Graph G=(V,E), positive integer B Question: Does there exist a subset V’ of V such that |V’|=B and V’ is a clique?

NP-hard problems and polynomial time problems (P vs. NP) Some decision problems can be solved in polynomial time: –Can graph G be 2-colored? –Does graph G have a 5-clique? Some decision problems seem to not be solvable in polynomial time: –Can graph G be 3-colored? –Does graph G have a k-clique?

P vs. NP, continued
The “big” question in theoretical computer science is:
– Is it possible to solve an NP-hard problem in polynomial time?
If the answer is “yes” for even one NP-hard problem, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Coping with NP-hard problems Since NP-hard problems may not be solvable in polynomial time, the options are: –Solve the problem exactly (but use lots of time on some inputs) –Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

Example: Maximum Clique
Exact solution: find largest k so that some subset of size k is a clique. Runs in O(n^k) time.
Heuristic: Pick a vertex at random, and greedily assemble a set which is a clique, and stop when you can’t add any more vertices. Repeat until tired (or bored, or running out of time, or …).
How do we evaluate the running time, or accuracy?
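One way the greedy heuristic might look in code. This is a sketch under assumptions: the graph is a dictionary mapping each vertex to its set of neighbors, and the names `greedy_clique` and `best_clique`, the restart count, and the seed are all illustrative choices rather than anything specified in the talk.

```python
import random

def greedy_clique(adj, rng):
    """Grow one maximal clique, scanning the vertices in random order."""
    order = list(adj)
    rng.shuffle(order)                 # the first vertex is the random start
    clique = []
    for v in order:
        if all(v in adj[u] for u in clique):
            clique.append(v)           # v is adjacent to everything chosen so far
    return clique

def best_clique(adj, restarts=100, seed=0):
    """Repeat the greedy pass and keep the largest clique seen."""
    rng = random.Random(seed)
    best = []
    for _ in range(restarts):
        candidate = greedy_clique(adj, rng)
        if len(candidate) > len(best):
            best = candidate
    return best

# A 4-clique {0, 1, 2, 3} with a path 3 - 4 - 5 attached.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
print(sorted(best_clique(adj)))
```

The result is always a maximal clique, but nothing certifies that it is a maximum one, which is exactly the evaluation question the slide raises.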

General comments for NP-hard optimization problems Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time. You may not know when you have an optimal solution, if you use a heuristic. Sometimes exact solutions may not be necessary, and approximate solutions may suffice. (But this may not be true for biology.)

Polynomial time methods
Quartet-based methods:
– Construct trees on all 4-leaf subsets
– Combine quartet trees into tree on full dataset
Distance-based methods:
– Estimate pairwise distance matrix d_ij
– Find tree T and edge-weights w(e) so that d^T_ij approximates d_ij
For both methods, if there are no errors (in quartet trees or pairwise distances) then the correct tree can be obtained in polynomial time. Otherwise, optimization problems are NP-hard. Polytime heuristics along these lines are popular.
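The two families meet in the four-point method, which is one standard way to build a quartet tree from distances; the talk does not say which quartet method it has in mind, so this is an illustrative choice. When the distances fit a tree exactly, the true split of four leaves is the pairing with the smallest sum of within-pair distances. The function name and the nested-dictionary distance format are assumptions.

```python
def quartet_from_distances(d, a, b, c, e):
    """Return the pairing of {a, b, c, e} with the smallest within-pair distance sum."""
    sums = {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }
    return min(sums, key=sums.get)

# Distances from the tree ((U,V),(W,X)): leaf edges of length 1, internal edge of length 2.
d = {
    "U": {"V": 2, "W": 4, "X": 4},
    "V": {"U": 2, "W": 4, "X": 4},
    "W": {"U": 4, "V": 4, "X": 2},
    "X": {"U": 4, "V": 4, "W": 2},
}
print(quartet_from_distances(d, "U", "V", "W", "X"))  # (('U', 'V'), ('W', 'X'))
```

With noisy distances the smallest sum can point to the wrong split, which is the error case the slide refers to.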

Phylogeny reconstruction
In biology, the most popular approaches for reconstructing phylogenetic trees are heuristics for Maximum Parsimony (NP-hard) or Maximum Likelihood (conjectured to be NP-hard)
In historical linguistics, a new approach based upon exactly solving the NP-hard Perfect Phylogeny problem has been useful.

Maximum Parsimony
Given a set S of strings of the same length over a fixed alphabet, find a tree T leaf-labelled by S and with all internal nodes labelled by strings of the same length over the same alphabet which minimizes the sum of the edge lengths, where the length of an edge is the Hamming distance between the strings at its endpoints.
Motivation: seeks to minimize the total number of point mutations needed to explain the data
NP-hard

Major phylogeny reconstruction methods In biology: mostly hill-climbing heuristics that attempt to solve NP-hard optimization problems (maximum parsimony or maximum likelihood) In historical linguistics: much less is established, but an exact solution to an NP-hard problem looks very promising.

Maximum Parsimony
[Figure: the three possible unrooted trees on the four sequences ACT, ACA, GTT, and GTA.]

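Maximum Parsimony
[Figure: the same three trees with their internal nodes labelled, giving MP score = 5, MP score = 7, and MP score = 4. The tree pairing ACT with ACA and GTT with GTA, with internal nodes labelled ACA and GTA, has MP score 4 and is the optimal MP tree.]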

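Maximum Parsimony: computational complexity
[Figure: the optimal tree on ACT, ACA, GTT, GTA with internal nodes ACA and GTA; MP score = 4.]
Finding the optimal MP tree is NP-hard
For a fixed tree, an optimal labeling of the internal nodes can be computed in linear time, O(nk), for n sequences of length k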
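The fixed-tree case can be sketched with Fitch's algorithm, which is one standard way to obtain the O(nk) bound for a fixed alphabet; the talk does not name the algorithm, so treat that choice as an assumption. The tree is written as nested pairs of leaf names, rooted arbitrarily on the internal edge, and the function names are illustrative. Only the bottom-up pass that yields the score is shown.

```python
def fitch_site(node, seqs, site):
    """Return (candidate state set, number of changes) for one site below node."""
    if isinstance(node, str):                     # leaf: its observed state, no changes
        return {seqs[node][site]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, seqs, site)
    right_states, right_cost = fitch_site(right, seqs, site)
    common = left_states & right_states
    if common:                                    # children agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def fitch_score(tree, seqs):
    """Parsimony score of a fixed binary tree, summed over all sites."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, site)[1] for site in range(length))

seqs = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA"}
tree = (("a", "b"), ("c", "d"))                   # ACT with ACA, GTT with GTA
print(fitch_score(tree, seqs))                    # 4
```

The score of 4 matches the optimal tree on the slide. The hard part of maximum parsimony is the search over trees, not this scoring step.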

Exact solutions: fixed-parameter approaches
Fixed-parameter approaches restrict some parameter and solve the problem exactly for those cases.
Examples:
– Does graph G=(V,E) have a k-clique? Solvable in O(n^k) time (n=|V|).
– Does graph G=(V,E) have a k-coloring? Solvable in O(k^n) time for general k, and in O(n+m) time for k=2 (n=|V|, and m=|E|).
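The k-clique bound corresponds to the most direct exact method: try every size-k subset of vertices. A minimal sketch, assuming the graph is a dictionary from vertex to neighbor set; the name `has_k_clique` is illustrative.

```python
from itertools import combinations

def has_k_clique(adj, k):
    """Exhaustively test every k-subset of vertices; about n^k subsets for fixed k."""
    for subset in combinations(adj, k):
        if all(v in adj[u] for u, v in combinations(subset, 2)):
            return True                # every pair in the subset is adjacent
    return False

# A 4-clique {0, 1, 2, 3} with a path 3 - 4 - 5 attached.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
print(has_k_clique(adj, 4))  # True
print(has_k_clique(adj, 5))  # False
```

For a fixed small k this is polynomial in n, but the exponent grows with k, so it does not give a polynomial-time algorithm when k is part of the input.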

Solving MP (maximum parsimony) and ML (maximum likelihood)
[Figure: MP score plotted across the space of phylogenetic trees, with a global optimum and a local optimum marked.]
Why are MP and ML hard? The search space is huge (there are (2n-5)!! unrooted binary trees on n leaves), it is easy to get stuck in local optima, and there can be many optimal trees.
Why try to solve MP or ML? Our experimental studies show that polynomial time algorithms don’t do as well as MP or ML when trees are big and have high rates of evolution.
Why solve MP and ML well? Because trees can change in biologically significant ways with small changes in objective criterion.
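To get a feel for how quickly (2n-5)!! grows, the short sketch below evaluates the double factorial 1 * 3 * 5 * ... * (2n-5) directly. The function name is illustrative.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
# 4 3
# 5 15
# 10 2027025
# 20 221643095476699771875
```

Four leaves give the three quartet trees used in the parsimony example; twenty leaves already give more than 10^20 trees, so exhaustive search is out of reach for datasets of realistic size.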

MP/ML heuristics
[Figure (fake study): MP score of best trees plotted against time, showing the performance of a hill-climbing heuristic.]

Speeding up MP/ML heuristics
[Figure (fake study): MP score of best trees plotted against time, showing the performance of a hill-climbing heuristic alongside the desired performance.]

Divide-and-Conquer Approach
Step 1: Get good starting tree:
1. Decompose the dataset into smaller, overlapping subsets.
2. Construct phylogenetic trees on the subsets using a “base” method.
3. Merge the subtrees into a single tree on the entire dataset.
4. Refine the resultant tree to produce a binary tree.
Step 2: Follow with usual heuristic (hill-climbing or other such strategy) to improve tree.

Divide-and-conquer approaches:
Step 1: Get good starting tree:
– Divide dataset into overlapping subsets
– Construct trees on each subset
– Combine subtrees into tree on full dataset
– Refine into binary tree if needed
Step 2: Apply favored heuristic to improve tree.

Using divide-and-conquer for MP and ML
Conjecture: better (more accurate) solutions will be found in less time, if we analyze a small number of smaller subsets and then combine solutions
Need:
1. techniques for decomposing datasets,
2. base methods for subproblems, and
3. techniques for combining subtrees

The DCM3 technique for speeding up MP/ML searches

DCM Decompositions
[Figure panels: DCM1 decomposition; DCM2 decomposition]
Input: Set S of sequences, distance matrix d, threshold value
1. Compute threshold graph
2. Perform minimum weight triangulation

DCM3 Decompositions
[Figure panels: the graph G(S,T); DCM3 decomposition]
Input: Set S of sequences, and estimate T of the true tree
1. Compute “short subtree” graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T), and form subproblems

Strict Consensus Merger (SCM)

DCM3-boosting a base method
1. Decompose the dataset into smaller, overlapping subsets, using DCM3
2. Construct phylogenetic trees on the subsets using a base method
3. Merge the subtrees into a single tree using the Strict Consensus Merger
4. Use PAUP* constrained search to refine the resultant tree

Iterative-DCM3 vs Ratchet

Comments Developing heuristics with good performance takes mathematical insights, but may not involve proofs. Even so, it’s really important. Extracting information from the set of optimal (and near-optimal) solutions is a major open problem. Other types of data (gene orders, morphology) present novel challenges. Reticulate evolution detection and reconstruction is a major open problem.

Ringe-Warnow Phylogenetic Tree of Indo-European

Cognate Classes
Two words w1 and w2 are in the same cognate class if they evolved from the same word through sound changes.
French “champ” and Italian “campo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class.
Spanish “mucho” and English “much” are not in the same cognate class.

Phylogenies of Languages Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear) The result can be modelled as a rooted tree The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

Historical Linguistic Data A character is a function that maps a set of languages, L, to a set of states. Three kinds of characters: –Phonological (sound changes) –Lexical (meanings based on a wordlist) –Morphological (grammatical features)

Perfect Phylogeny A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution)
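The definition can be checked mechanically once every node of the tree, internal nodes included, carries a state for each character. The sketch below makes that assumption explicit: the tree is an edge list, `states` maps every node to a tuple of character states, and the function name is illustrative. It verifies the condition on a given labelled tree only; it does not search for a tree, which is the hard part.

```python
from collections import defaultdict

def is_perfect_phylogeny(edges, states):
    """True if, for every character, each state occupies a connected part of the tree."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    num_chars = len(next(iter(states.values())))
    for c in range(num_chars):
        by_state = defaultdict(set)
        for node, vector in states.items():
            by_state[vector[c]].add(node)
        for nodes in by_state.values():
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:                          # search only through same-state nodes
                u = stack.pop()
                for v in adj[u]:
                    if v in nodes and v not in seen:
                        seen.add(v)
                        stack.append(v)
            if seen != nodes:                     # the state is split into several pieces
                return False
    return True

edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
ok = {"a": (0,), "b": (0,), "x": (0,), "y": (1,), "c": (1,), "d": (1,)}
homoplasy = {"a": (0,), "b": (1,), "x": (0,), "y": (0,), "c": (0,), "d": (1,)}
print(is_perfect_phylogeny(edges, ok))         # True
print(is_perfect_phylogeny(edges, homoplasy))  # False
```

In the second labelling, state 1 appears at leaves b and d on opposite sides of the tree, which is parallel evolution.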

“Homoplasy-Free” Evolution (perfect phylogenies)
[Figure: two example trees, one marked YES (homoplasy-free) and one marked NO.]

The Comparative Method (Hoenigswald 1960) Used to verify relatedness between languages and to infer features of the ancestral languages of a group of related languages Step 1: establish sound correspondence in a set of related languages Step 2: establish cognate classes

The Ringe-Warnow Model of Language Evolution
The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree
The model is known as Character Compatibility or Perfect Phylogeny.

Character Compatibility and Perfect Phylogeny
Ringe and Warnow postulated that all properly encoded characters for the Indo-European languages should be compatible on the true tree, if such a tree existed
A tree T on which all characters are compatible is called a perfect phylogeny

The Perfect Phylogeny Problem Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S. The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

Triangulated Graphs
A graph is triangulated if it has no induced simple cycles of size four or more (that is, every cycle of length at least four has a chord).

Triangulating Colored Graphs: An Example A graph that can be c-triangulated

Triangulating Colored Graphs: An Example A graph that cannot be c-triangulated

Triangulating Colored Graphs (TCG)
Triangulating Colored Graphs: given a vertex-colored graph G, determine if G can be c-triangulated, that is, triangulated by adding edges without ever joining two vertices of the same color.

The PP and TCG Problems Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated. The PP and TCG problems are polynomially equivalent and NP-hard.

Solving the PP Problem Using Buneman’s Theorem
[Table: a “Yes” instance of PP, with characters c1, c2, c3 and four taxa; the state values were lost in extraction.]

Some special cases are easy
– Binary character perfect phylogeny solvable in linear time
– r-state characters solvable in polynomial time for each r (combinatorial algorithm)
– Two character perfect phylogeny solvable in polynomial time (produces 2-colored graph)
– k-character perfect phylogeny solvable in polynomial time for each k (produces k-colored graphs; connections to Robertson-Seymour graph minor theory)
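For the binary case, a simple test is available, though it is not the linear-time algorithm the slide refers to. A set of binary characters admits an unrooted perfect phylogeny exactly when no pair of characters shows all four state combinations 00, 01, 10, 11 (the pairwise compatibility, or four-gamete, condition). The sketch below assumes a matrix with one row per taxon and one 0/1 column per character; the function name is illustrative, and the running time is quadratic in the number of characters.

```python
from itertools import combinations

def binary_characters_compatible(matrix):
    """True if no pair of binary characters exhibits all four state combinations."""
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        observed = {(row[i], row[j]) for row in matrix}
        if len(observed) == 4:         # 00, 01, 10 and 11 all occur: incompatible pair
            return False
    return True

compatible = ["100", "110", "001", "000"]
incompatible = ["00", "01", "10", "11"]
print(binary_characters_compatible(compatible))    # True
print(binary_characters_compatible(incompatible))  # False
```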

The Indo-European (IE) Dataset
24 languages
22 phonological characters, 15 morphological characters, and 333 lexical characters
Total number of working characters is 390 (multiple character coding, and parallel development)
A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow)
T is compatible with all but 22 characters: 16 (18) monomorphic and 6 polymorphic
Resolves most of the significant controversies in Indo-European evolution; shows however that Germanic is a problem (not treelike)

Phylogenetic Tree of the IE Dataset

An open problem to take home… computing the “transposition” distance between two genomes (important in whole genome phylogeny reconstruction)

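Genomes As Signed Permutations
[Figure: a genome written as a signed permutation of its genes; the example permutations were mostly lost in extraction, and only the fragment “1 – or –3 5 –1 etc.” survives.]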

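Genomes Evolve by Rearrangements
[Figure: three rearrangement operations applied to a signed permutation: Inversion (Reversal), Transposition, and Inverted Transposition; the example permutations were lost in extraction.]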
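The three operations are easy to state on a signed permutation stored as a list of nonzero integers. The sketch below uses the usual textbook definitions with Python slice indices; the function names, the index conventions, and the eight-gene example are illustrative and are not taken from the slide.

```python
def inversion(perm, i, j):
    """Reverse the segment perm[i:j] and flip the sign of every gene in it."""
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]

def transposition(perm, i, j, k):
    """Move the segment perm[i:j] so that it follows perm[j:k] (requires i < j < k)."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def inverted_transposition(perm, i, j, k):
    """Move the segment perm[i:j] past perm[j:k] and invert it on arrival."""
    return perm[:i] + perm[j:k] + [-g for g in reversed(perm[i:j])] + perm[k:]

genome = [1, 2, 3, 4, 5, 6, 7, 8]
print(inversion(genome, 4, 8))                  # [1, 2, 3, 4, -8, -7, -6, -5]
print(transposition(genome, 1, 3, 6))           # [1, 4, 5, 6, 2, 3, 7, 8]
print(inverted_transposition(genome, 1, 3, 6))  # [1, 4, 5, 6, -3, -2, 7, 8]
```

Applying one operation is trivial; the distance problems ask for the fewest operations that turn one genome into another.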

An open problem to play with
Given two permutations on 1, 2, …, n, compute the minimum “transposition” distance (unknown computational complexity)
(The corresponding problem for inversion distances involves very beautiful graph theory and algorithms.)

Summary NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions Many real problems have beautiful and natural combinatorial and graph-theoretic formulations

Acknowledgements NSF and the David and Lucile Packard Foundation (funding) Collaborators Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics) Students: Usman Roshan and Luay Nakhleh

Phylolab, U. Texas Please visit us at