Phylogenetic networks: recent questions and results (or: constructing a level-2 phylogenetic network from a dense set of input triplets in polynomial time)


Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1,2)
(1) Technische Universiteit Eindhoven (TU/e)
(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam
Web:

Part 1: Context

Phylogenetic tree reconstruction
[Figure: an evolutionary tree on Orangutan, Gorilla, Chimpanzee and Human, borrowed from a presentation by Tandy Warnow]
Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we only have limited input data about the ‘species’ concerned. It sits at the intersection of biology, bioinformatics, computer science and mathematics.

Dominant methods in phylogenetic reconstruction
- Character-based methods
  - Maximum Parsimony (= Minimum Steiner Tree)
  - Maximum Likelihood
  - Bayesian methods (Markov Chain Monte Carlo)
- Distance-based methods
  - Neighbour Joining
  - UPGMA
- Quartet/triplet-based methods

Triplet-based methods (1)
Quartet-based methods are used for constructing unrooted evolutionary trees: there is no root (= most distant ancestor) and edges have no direction (e.g. an edge between species X and Y does not say whether X evolved into Y, or vice versa).
Triplet-based methods are used for constructing rooted evolutionary trees: there is a root and edges are directed.
The central idea: build a single, ‘big’ evolutionary tree for a set L of species by combining smaller evolutionary trees on subsets of L, such that the big tree respects the structure of the smaller trees. In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set L (and are called rooted triplets).

Triplet-based methods (2)
For example, suppose I want to reconstruct a plausible evolution for the species set {W,X,Y,Z}. I am given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.)
[Figure: the four input triplets are fed to the algorithm, which outputs a single tree on the leaves w, z, x, y as the solution]


From trees to networks…
The algorithm of Aho et al. (1981) can be used to construct trees from rooted triplets. But… what if the algorithm fails? Why might the algorithm fail?
Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors.
Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like!
Response: try to construct not phylogenetic trees, but phylogenetic networks.
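To make the tree-building step concrete, here is a minimal Python sketch in the spirit of the Aho et al. procedure mentioned above. It is an illustration under stated assumptions, not the authors' implementation: a triplet xy|z is encoded as the tuple (x, y, z), a tree is a nested tuple of leaf names, and the names `components` and `build` are hypothetical. The sketch returns None exactly when the auxiliary graph stays connected, which is the failure case discussed on this slide.

```python
from typing import Iterable, List, Optional, Set, Tuple, Union

Triplet = Tuple[str, str, str]   # (x, y, z) encodes the rooted triplet xy|z (assumed encoding)
Tree = Union[str, tuple]         # a leaf name, or a tuple of subtrees


def components(leaves: Set[str], triplets: Iterable[Triplet]) -> List[Set[str]]:
    """Connected components of the auxiliary graph with one edge x-y per triplet xy|z."""
    parent = {v: v for v in leaves}

    def find(v: str) -> str:
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for x, y, _ in triplets:
        parent[find(x)] = find(y)

    groups: dict = {}
    for v in leaves:
        groups.setdefault(find(v), set()).add(v)
    # Sorting is only there to make the output deterministic.
    return sorted(groups.values(), key=min)


def build(leaves: Iterable[str], triplets: Iterable[Triplet]) -> Optional[Tree]:
    """Return a tree consistent with all triplets, or None if no such tree exists."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Only triplets whose three leaves all lie in the current leaf set matter here.
    local = [t for t in triplets if set(t) <= leaves]
    parts = components(leaves, local)
    if len(parts) == 1:
        return None                      # graph is connected: the triplets cannot fit a tree
    subtrees = []
    for part in parts:
        subtree = build(part, local)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)
```

On the four triplets from the earlier example (zw|x, yx|w, xy|z, wz|y) the sketch groups w with z and x with y; on the conflicting pair {xy|z, xz|y} it returns None, which is the situation that motivates networks.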

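From trees to networks (2)
For example, suppose the input is {xy|z, xz|y}. No tree can be consistent with both triplets, but a network can.
[Figure: the two input triplets xy|z and xz|y, and a network on the leaves x, y, z that is consistent with both]
(Note that there are cases when, even if there is at most one triplet per 3 species, a tree is not possible.)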


Level-k phylogenetic networks
[Figure: a network on the leaves x, y, z with its vertex types labelled: root (only one!), split-vertex, recombination-vertex, leaf-vertex]
A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices. The network shown is a very simple example of a level-1 network. In a level-1 network, the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”.


What Jansson & Sung (& Nguyen) did…
A set of input triplets is dense iff, for every subset of 3 species, there is at least one triplet corresponding to those 3 species. A dense set of input triplets for n species thus contains O(n^3) triplets.
Jansson & Sung (2006) showed the following: given a dense set of triplets T for a set L of species, it is possible to determine in polynomial time whether a level-1 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
They later showed, together with Nguyen, how to do this in time linear in |T|. They also showed that, in the non-dense case, the problem is NP-hard.
But what about level-2 networks, and higher?
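The density condition is easy to check directly. The following is a small sketch, assuming a triplet xy|z is encoded as the tuple (x, y, z); the function name `is_dense` is hypothetical.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """True iff every 3-element subset of the leaves is covered by at least one triplet."""
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered for c in combinations(sorted(leaves), 3))
```

Since there are n(n-1)(n-2)/6 three-element subsets and at most three distinct triplets per subset, a dense set without repeats holds between n(n-1)(n-2)/6 and n(n-1)(n-2)/2 triplets.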

Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
[Figure: an example of a level-2 network]

Part 2: The algorithm

Algorithm, high-level idea
The algorithm is conceptually (fairly) simple, but the proof of correctness and the technical details are rather complex. The high-level idea is as follows:
1. PARTITION the set of leaves (i.e. species) into a ‘correct’ partition P;
2. INDUCE a new set of triplets T’ where every block of the partition P becomes a single leaf (a kind of ‘meta-leaf’ if you like);
3. SOLVE a simpler version of the problem for T’ to get a network N’;
4. RECURSE inside each leaf of N’.
Step 3 is the critical part of the algorithm. It brings together two issues: (a) why is it sufficient to only solve a simpler version of the problem? (b) how do we solve this simpler version of the problem?

Definition: inducing new triplet sets from partitions of the leaf set
Suppose I have a partition P = {P_1, P_2, …, P_t} of the leaf set L, and a dense set of triplets T on L. Let T’ be a new triplet set on the leaf set {q_1, q_2, …, q_t}, defined as follows: q_i q_j | q_k is in T’ if and only if i, j and k are pairwise distinct and there exists a triplet xy|z in T such that x is in P_i, y is in P_j and z is in P_k.
Then we say that T’ is the triplet set induced by the partition P of L. Critically: if T is dense, then T’ is also dense. In some sense this can be perceived as a ‘coarsening’ of the input set.
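As an illustration of this definition, the sketch below computes an induced triplet set. It assumes triplets are tuples (x, y, z) meaning xy|z, and it represents the meta-leaf q_i simply by the index i of its block in the partition list; the function name `induce` is hypothetical.

```python
def induce(triplets, partition):
    """Triplet set induced by a partition; meta-leaf q_i is represented by the block index i."""
    block_of = {leaf: i for i, block in enumerate(partition) for leaf in block}
    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        if len({i, j, k}) == 3:          # the three blocks must be pairwise distinct
            # q_i q_j | q_k equals q_j q_i | q_k, so store the pair in sorted order
            induced.add((min(i, j), max(i, j), k))
    return induced
```

Triplets with two or more leaves in the same block contribute nothing, which is what makes the induced set a coarsening of the input.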

Definition: simple level-2 networks
[Figure: the basic structures from which simple level-2 networks are built]
A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures. Simple level-2 networks capture in some sense the essence of the complexity of general level-2 networks.

An example of a simple level-2 network
Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.

Definition: SN-set
Jansson & Sung introduced the idea of the SN-set. SN-sets are special subsets of the leaves L, and are defined with respect to triplet sets. All sets containing just a single leaf are SN-sets. More generally, an SN-set is any subset of leaves obtained by taking the closure of the following operation on some subset S of the leaves L: if there is some pair of leaves x, z in S such that xy|z is a triplet and y is not in S, add y to S; repeat until no more leaves can be added.
[Figure: a subset S of the leaves containing x and z, with y outside S]
An SN-set is any set that can be constructed this way.
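The closure operation can be sketched as a simple fixed-point loop. This is a naive illustration rather than the efficient procedure of Jansson & Sung; it assumes triplets are tuples (x, y, z) meaning xy|z, and the name `sn_closure` is hypothetical.

```python
def sn_closure(seed, triplets):
    """Close a leaf subset under the rule: x, z in S and xy|z a triplet implies y in S."""
    s = set(seed)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:
            # ab|c equals ba|c, so the rule fires when c and exactly one of a, b are in S.
            if c in s and (a in s) != (b in s):
                s.update((a, b))
                changed = True
    return frozenset(s)
```

With the four triplets zw|x, yx|w, xy|z, wz|y from the earlier example, the seed {w, z} is already closed, while the seed {w, x} grows to the full leaf set.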

Definition: maximal SN-set
The SN-set that is equal to the total leaf set L is called the trivial SN-set. An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set. Jansson and Sung proved that the maximal SN-sets partition the leaf set L: no two maximal SN-sets overlap, and together they completely cover the set of input leaves. All the SN-sets, and all the maximal SN-sets, can be found in polynomial time.
Jansson & Sung solved the level-1 problem by observing that they could treat the maximal SN-sets like ‘meta-leaves’, thus reducing the problem to recursively solving the problem on the triplets induced by the maximal SN-sets. Our idea is similar, but SN-sets in level-2 networks are (unfortunately) rather more complex creatures than in level-1 networks.
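A brute-force way to list the maximal SN-sets is sketched below. It rests on an assumption not stated on the slide: that for a dense triplet set it is enough to close single leaves and pairs of leaves to obtain every SN-set. The encoding of xy|z as (x, y, z) and the function names are likewise hypothetical, and no attempt is made to match the running time of Jansson & Sung's method.

```python
from itertools import combinations


def sn_closure(seed, triplets):
    """Close a leaf subset under the rule: x, z in S and xy|z a triplet implies y in S."""
    s = set(seed)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:
            if c in s and (a in s) != (b in s):
                s.update((a, b))
                changed = True
    return frozenset(s)


def maximal_sn_sets(leaves, triplets):
    """Non-trivial SN-sets that are not strictly contained in another non-trivial SN-set."""
    leaves = frozenset(leaves)
    # Assumption: seeds of size one and two generate all SN-sets of a dense triplet set.
    seeds = [(v,) for v in leaves] + list(combinations(sorted(leaves), 2))
    candidates = {sn_closure(seed, triplets) for seed in seeds}
    candidates.discard(leaves)           # drop the trivial SN-set
    return {s for s in candidates if not any(s < t for t in candidates)}
```

For the four-triplet example on {w, x, y, z}, this yields the two blocks {w, z} and {x, y}, which together partition the leaf set.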

Definition: (highest) cut-edges
In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the (underlying) graph. A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf. A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
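These two definitions translate directly into code. The sketch below is a deliberately naive illustration (it removes each edge in turn and re-checks connectivity), assuming the network is given as a list of directed edges (parent, child); the function names are hypothetical. A path of length zero is counted, so a cut-edge (p, x) sitting directly above (x, y) prevents (x, y) from being highest.

```python
from collections import defaultdict


def reachable(start, adj):
    """All vertices reachable from start (start included) in the adjacency map adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def cut_edges(edges):
    """Edges whose removal disconnects the underlying undirected graph."""
    nodes = {u for e in edges for u in e}
    result = []
    for i, (x, y) in enumerate(edges):
        adj = defaultdict(set)
        for j, (u, v) in enumerate(edges):
            if j != i:                   # skip only this one copy of the edge
                adj[u].add(v)
                adj[v].add(u)
        if reachable(x, adj) != nodes:
            result.append((x, y))
    return result


def highest_cut_edges(edges):
    """Cut-edges (x, y) with no other cut-edge (p, q) such that q reaches x."""
    down = defaultdict(set)
    for u, v in edges:
        down[u].add(v)
    cuts = cut_edges(edges)
    return [(x, y) for (x, y) in cuts
            if not any(x in reachable(q, down) for (p, q) in cuts if (p, q) != (x, y))]
```

The leaves below each highest cut-edge then form the blocks of the partition used in the argument that follows.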

Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set.
Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. Why? Because it is not possible for there to be leaves p,q in L’ and r outside L’ such that pr|q is in the set of triplets: the edge (x,y) forms a bottleneck and would have to be used twice. Any SN-set L* strictly inside L’ is therefore strictly contained in another SN-set, namely L’, and so (provided L’ is itself non-trivial) cannot be maximal.
[Figure: a cut-edge (x,y) with leaves p and q in L’ below it and a leaf r outside L’; the triplet pr|q cannot be embedded]
So: each maximal SN-set can be expressed as the union of the leaves reachable by one or more highest cut-edges.

A first attempt at reducing the problem to simple level-2 networks
Now, suppose we have a dense set of triplets T and there exists a level-2 network N such that all the triplets in T are consistent with N. (Of course we don’t know what N is yet…)
Suppose we construct a partition P of L as follows. The blocks of P are the sets of leaves reachable from highest cut-edges in N. (Each maximal SN-set of N thus corresponds to one or more blocks in P.) Let T’ be the new set of triplets induced by the partition P. In other words, if we collapse the set of leaves below highest cut-edges into ‘meta-leaves’, T’ is the new set of triplets we get. (Nice property: the maximal SN-sets of T’ are in 1:1 correspondence with the maximal SN-sets of T.)
Critical fact 1: the only level-2 networks where all cut-edges are trivial are simple level-2 networks.
Critical fact 2: there exists some simple level-2 network N’ such that the triplets in T’ are consistent with N’. Furthermore, if we find such an N’, and then recursively construct networks within each meta-leaf, we obtain a network consistent with T!

But… that’s a non-deterministic argument
So, it looks like we can indeed reduce the problem – in some sense – to finding simple level-2 networks. But that analysis was based on knowing where the highest cut-edges are in a hypothetical solution N. And we don’t know N… this is precisely what we’re looking for!
We can, however, compute the maximal SN-sets of the input triplet set T. We need to be able to say something more about how maximal SN-sets of T relate to highest cut-edges in hypothetical solutions. Then we can base the recursion on maximal SN-sets, instead of highest cut-edges.

Central Theorem (simplified). Suppose there is a dense triplet set T consistent with some simple level-2 network N. Then there exists a level-2 network N’ (not necessarily simple) such that, with the exception of perhaps one maximal SN-set with respect to T, every maximal SN-set appears below a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it exists) will be equal to the union of leaves below two cut-edges.

Observe how SN-set {C,G,F} has been ‘pushed’ below a single cut-edge.
[Figure: a network before and after the transformation]

An existence argument
If some solution N exists for T, then a simple level-2 solution N’ exists for T’ (induced by the highest cut-edges of N), where the maximal SN-sets of T’ are tightly correlated with the maximal SN-sets of T. Finding N’ gives the starting point for a solution to T.
But by the Central Theorem, all (except maybe one) of the maximal SN-sets of N’ can be ‘pushed’ below highest cut-edges to give a solution N’’ for T’. If we re-expand all the meta-leaves of N’’, we obtain a new solution N* for T. Crucially, all (except maybe one) of the maximal SN-sets of T will be beneath single cut-edges in N*. The odd-one-out will be beneath two cut-edges.
So if we substitute N* as N in the first step, we come to the following conclusion: we can find a solution for T by finding a simple level-2 solution for the set of triplets induced by the maximal SN-sets of T, and recursing. We need to correctly guess the ‘odd-one-out’ maximal SN-set, however, and split that into two meta-leaves. Fortunately we can just try splitting each maximal SN-set in turn.

[Figure sequence: the subnetwork below a highest cut-edge is transformed step by step for the SN-set S = {C,G,F}, with leaves F, G and C relocated; after the transformation the whole maximal SN-set is below a cut-edge]

Finding simple level-2 networks
So we know that, if we analyse the maximal SN-sets carefully, and construct an appropriate new set of triplets, we can recursively reduce the entire problem to finding simple level-2 networks. But how do we algorithmically construct a simple level-2 network that is consistent with a given dense set of triplets?

Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination node. If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination nodes are needed.)


Suppose we subsequently guess that leaf h now hangs below a recombination node in the new network. If we remove h, and all triplets that contain h, then we know that a level-0 network (i.e. a tree) must be possible on this new set of triplets (because now even fewer recombination nodes are needed). In such a case the resulting tree is unique (J&S).
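The two removal guesses can be explored exhaustively. The sketch below covers only that first half of the procedure: it tries every ordered pair (g, h), discards the triplets mentioning either leaf, and keeps the pairs for which a tree exists on the remaining leaves. It does not check the intermediate level-1 condition and does not re-insert h and g. The tree builder is a plain Aho-style routine; the encoding of xy|z as (x, y, z) and all function names are assumptions made for illustration.

```python
from itertools import permutations


def components(leaves, triplets):
    """Connected components of the graph with one edge x-y per triplet xy|z."""
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in triplets:
        parent[find(x)] = find(y)
    groups = {}
    for v in leaves:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)   # sorted for deterministic output


def build(leaves, triplets):
    """Aho-style tree construction; returns a nested tuple, or None if no tree fits."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    local = [t for t in triplets if set(t) <= leaves]
    parts = components(leaves, local)
    if len(parts) == 1:
        return None
    subtrees = [build(part, local) for part in parts]
    return None if any(s is None for s in subtrees) else tuple(subtrees)


def tree_candidates(leaves, triplets):
    """Yield (g, h, tree) for each ordered pair whose removal leaves tree-like triplets."""
    leaves = set(leaves)
    if len(leaves) < 3:
        return
    for g, h in permutations(sorted(leaves), 2):
        # build() ignores every triplet that mentions g or h
        tree = build(leaves - {g, h}, triplets)
        if tree is not None:
            yield g, h, tree
```

Each surviving candidate would then be handed to the re-insertion step described next, where all positions for h and then g are tried.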

So now we have a tree. We are going to guess how to add leaf h back in, and then guess how to add leaf g back in. This guessing is not a problem because we can simply try all possibilities.

Adding leaf h back in.

And finally adding leaf g back in.

Conclusions & open problems
So we know how to efficiently construct level-2 networks from dense triplet sets. What’s next?
- Applicability: how useful is it?
- Initial implementation: programming and fine-tuning
- Improving running time: in the spirit of the “SN-tree” of J&S&N
- Complexity: what about level-3 and higher?
- Bounds: worst-case, best-case scenarios
- Building all networks
- Properties of output networks as function of input
- Different triplet restrictions
- Confidence: how good are the solutions?
- Exponential-time exact algorithms for NP-hard problems

Thank you for your attention!