Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks Jaroslaw Byrka 1,2, Steven Kelk 2, Katharina.

Presentation transcript:

Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks Jaroslaw Byrka 1,2, Steven Kelk 2, Katharina T. Hüber 3 (1) Technische Universiteit Eindhoven (TU/e) (2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam (3) University of East Anglia (UEA), England Web:

Phylogenetic tree reconstruction [Figure: evolutionary tree on Orangutan, Gorilla, Chimpanzee and Human; borrowed from a presentation by Tandy Warnow.] Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we only have limited input data about the ‘species’ concerned. It sits at the intersection of biology, bioinformatics, computer science and mathematics.

Dominant methods in phylogenetic reconstruction
- Character-based methods
  - Maximum Parsimony (= Minimum Steiner Tree)
  - Maximum Likelihood
  - Bayesian methods (Markov Chain Monte Carlo, MCMC)
- Distance-based methods
  - Neighbour Joining
  - UPGMA
- Triplet-based methods

Triplet-based methods (1) Triplet-based methods are used for constructing rooted evolutionary trees: there is a root (a hypothetical most-distant ancestor) and edges are directed, explicitly denoting the direction of evolution. The central idea: build a single, ‘big’ evolutionary tree for a set S of species by combining smaller evolutionary trees on subsets of S such that the big tree respects the structure of the smaller trees. In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set S (and are called rooted triplets.)

Triplet-based methods (2) For example, suppose I want to reconstruct a plausible evolution for the species set {w,x,y,z}. I am given the set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note that zw|x = wz|x.) [Figure: the four input triplets zw|x, xy|z, yx|w and wz|y are fed to the algorithm, which outputs a solution tree with leaf order w, z, x, y.]

When trees fail The algorithm of Aho et al. (1981) constructs a tree that is consistent with all the input rooted triplets, if one exists. But what if the algorithm fails? Why might it fail? Possible reason 1: the underlying evolution is tree-like, but the input triplets contain errors. Possible reason 2: the triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like. Responses: try to construct a phylogenetic tree that maximises the number of input triplets it is consistent with, and/or try to construct phylogenetic networks instead of phylogenetic trees.
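The tree-building step attributed to Aho et al. above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build`, the encoding of ab|c as the tuple `(a, b, c)`, and the nested-tuple tree output are all assumptions made for this sketch. The idea is that a triplet ab|c forces a and b into the same child subtree of the root; if that constraint merges all species into one group, no consistent tree exists.

```python
from collections import defaultdict

def build(species, triplets):
    """Return a nested-tuple tree consistent with every triplet, or None.

    A triplet (a, b, c) stands for ab|c. Hypothetical helper for illustration.
    """
    species = sorted(species)
    if len(species) == 1:
        return species[0]
    present = set(species)
    # Only triplets whose three species are all present constrain this subtree.
    local = [t for t in triplets if set(t) <= present]

    # Union-find: ab|c forces a and b below the same child of the current root.
    parent = {s: s for s in species}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b, _ in local:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in species:
        groups[find(s)].append(s)
    if len(groups) == 1:
        return None  # every species is forced together: no consistent tree

    children = []
    for members in groups.values():
        subtree = build(members, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)

example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(build("wxyz", example))  # (('w', 'z'), ('x', 'y'))
```

On the input {xy|z, xz|y} the same function returns `None`, which is the failure case that motivates the rest of the talk.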

Networks instead of trees For example, suppose the input is {xy|z, xz|y}. No tree is consistent with both triplets, but a network can be. [Figure: the triplets xy|z and xz|y, and a network on the leaves z, x, y that is consistent with both.]

Level-k phylogenetic networks A level-k phylogenetic network is a rooted, directed acyclic graph in which every biconnected component (of the underlying undirected graph) contains at most k recombination vertices. [Figure: a network on the leaves z, x, y, annotated with its root (only one!), a split-vertex, a recombination-vertex and a leaf-vertex (labelled with species).] The network shown is a very simple example of a level-1 network. In a level-1 network the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”.
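The definition above can be turned into a small level calculator. The sketch below is an illustration under assumptions: the function name `network_level` and the arc-list encoding are invented here, the graph is assumed to have no parallel arcs, and the example network (with internal vertices named `r`, `a`, `b`, `h`) is only one plausible network for {xy|z, xz|y}, since the slide's figure is not preserved. It finds the biconnected components of the underlying undirected graph with the standard edge-stack depth-first search and counts recombination vertices (indegree at least 2) in each.

```python
from collections import defaultdict

def network_level(arcs):
    """Maximum number of recombination vertices in any biconnected component."""
    adj = defaultdict(set)
    indegree = defaultdict(int)
    for u, v in arcs:
        adj[u].add(v)
        adj[v].add(u)
        indegree[v] += 1

    disc, low, stack, comps = {}, {}, [], []

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates v's subtree: pop one biconnected component.
                    comp = set()
                    while True:
                        edge = stack.pop()
                        comp.update(edge)
                        if edge == (u, v):
                            break
                    comps.append(comp)
            elif disc[v] < disc[u]:
                stack.append((u, v))  # back edge to an ancestor
                low[u] = min(low[u], disc[v])

    for u in list(adj):
        if u not in disc:
            dfs(u, None)

    # Single edges (bridges) are trivial components and do not affect the level.
    return max((sum(1 for v in comp if indegree[v] >= 2)
                for comp in comps if len(comp) > 2), default=0)

# Assumed example: x hangs below recombination vertex h, whose parents are a and b.
galled = [("r", "a"), ("r", "b"), ("a", "y"), ("a", "h"),
          ("b", "h"), ("b", "z"), ("h", "x")]
tree = [("r", "p"), ("r", "q"), ("p", "w"), ("p", "z"), ("q", "x"), ("q", "y")]
```

Under these assumptions `network_level(galled)` evaluates to 1 and `network_level(tree)` to 0, matching the convention that trees are level-0 networks.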

The complexity of “LEVEL-k”
LEVEL-k
Input: a set of rooted triplets T.
Output: a level-k network N consistent with all the triplets in T, or a statement that no such network exists.
Complexity:
k=0: in P (Aho et al. 1981)
k=1: NP-hard, but in P when T is “dense” (Jansson & Sung 2005)
k=2: NP-hard, but in P when T is “dense” (Van Iersel, Keijsper, Kelk, Stougie 2007)
k>2: open; possibly the same pattern (the general case is almost certainly NP-hard, but what about dense T?)

What about maximization? Gasieniec et al. (1999) showed how to find in polynomial time a tree that is consistent with at least 1/3 of the input triplets T. Is it always possible to find a tree that is consistent with more than 1/3 of the input triplets? No. Let T_1(n) be the full triplet set on n species. It contains $3\binom{n}{3}$ triplets. For example, T_1(4) = {ab|c, ac|b, cb|a, ab|d, ad|b, bd|a, ac|d, dc|a, ad|c, bc|d, bd|c, dc|b}. For any three given species, a tree is consistent with at most one triplet on those three species. So at most 1/3 of the triplets in T_1(n) can be consistent with a tree. So for trees, and comparing with the upper bound |T|, 1/3 is worst-case optimal.
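The counting argument can be checked on small cases. The sketch below is illustrative only: the names `full_triplet_set`, `leaves` and `consistent`, and the nested-tuple encoding of trees, are assumptions made here. It builds the full triplet set and counts how many of its triplets a given tree is consistent with.

```python
from itertools import combinations
from math import comb

def full_triplet_set(species):
    """All three rooted triplets on every 3-subset; (a, b, c) stands for ab|c."""
    out = []
    for a, b, c in combinations(sorted(species), 3):
        out += [(a, b, c), (a, c, b), (b, c, a)]
    return out

def leaves(tree):
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def consistent(tree, triplet):
    """True if ab|c holds in the tree: a and b split from c first."""
    a, b, c = triplet
    node = tree
    while True:
        # Record which child of the current node holds each of a, b, c.
        where = {s: i for i, child in enumerate(node)
                 for s in leaves(child) if s in triplet}
        if where[a] == where[b] == where[c]:
            node = node[where[a]]  # all three still together: descend
        else:
            return where[a] == where[b] != where[c]

T4 = full_triplet_set("abcd")
caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
hits = [sum(consistent(t, trip) for trip in T4) for t in (caterpillar, balanced)]
```

For n = 4 both binary tree shapes hit exactly 4 of the 12 triplets, i.e. exactly 1/3, in line with the argument on the slide.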

Formalising the question Assuming that we restrict the set of phylogenetic networks to some subclass, what is the maximum value of p, with 0 ≤ p ≤ 1, such that for every input set T of rooted triplets there exists some network N(T) from the subclass that is consistent with at least p|T| of the triplets? For level-0 networks (trees), p = 1/3. This can be trivially converted into a 3-approximation algorithm for the problem MAX-LEVEL-0, where MAX-LEVEL-k is defined as “Given a set of triplets T, what is the maximum number of triplets from T that some level-k network can be consistent with?” In general, an algorithm that always achieves a fraction q of the input triplets is a (1/q)-approximation for the MAX variant. (Better approximation factors for the MAX variant are probably possible, but none are known yet!)
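The conversion from a guaranteed fraction q to a (1/q)-approximation is a one-line argument, spelled out here for completeness. The symbols ALG(T) (the number of triplets the algorithm's network is consistent with) and OPT(T) (the optimum of the MAX variant) are names introduced for this sketch; the only fact used is that OPT(T) can never exceed |T|.

```latex
\[
\mathrm{ALG}(T) \;\ge\; q\,|T| \;\ge\; q\,\mathrm{OPT}(T)
\quad\Longrightarrow\quad
\frac{\mathrm{OPT}(T)}{\mathrm{ALG}(T)} \;\le\; \frac{1}{q}.
\]
```

With q = 1/3 for trees this gives the factor 3 quoted above.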

Determining the p-fraction for level-1 and higher For level-1, Jansson, Nguyen and Sung (2005) showed how to find in polynomial time a level-1 network consistent with at least 5/12 ≈ 0.416… of the input triplets. So for level-1, p ≥ 5/12 ≈ 0.416… They also showed, given the full triplet set T_1(n) on n leaves, how to build an optimal level-1 network for those triplets, i.e. no other level-1 network can be consistent with a higher fraction of T_1(n). By counting they show that such optimal level-1 networks are consistent with a fraction approaching (from above) ≈ 0.488… of the input triplets, showing that, for level-1, p ≤ 0.488… Obvious questions: What is the true value of p for level-1? What about higher-level networks? Are networks achieving the p-fraction always constructible in polynomial time? What is the role of the full triplet set in determining p? How about p as a function of n, the number of species?


Our result: p is defined by the full triplet set! Let N be a network that is consistent with a fraction p’ of the full triplet set T_1(n). Then, for any triplet input set T on n species, we can convert N in polynomial time into an isomorphic network N’(T) that is consistent with a fraction ≥ p’ of T. (The result also holds for weighted triplet sets.) Consequences: All tree shapes (not just caterpillars) can be made consistent with 1/3 of the input triplets, because every tree is consistent with 1/3 of T_1(n). We get a polynomial-time worst-case optimal algorithm for level-1 networks (for the |T| upper bound); this means that we can always get at least 0.48… of the input triplets, and with a customized derandomization we can do this in time O(|T|n^2). For level-2, we can in polynomial time always get at least 0.61 of the input. Is this bad news for the biological relevance of triplet methods and/or the level-k hierarchy?


Method: labelling an unlabelled network Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T_1(n). Let T be the input set of triplets, on n species. Note that if the species on the leaves of N are arbitrarily permuted, the resulting network is still consistent with a fraction p’ of T_1(n), because all species in T_1(n) are indistinguishable. Hence we can view N as an unlabelled network, i.e. a network without species on the leaves. Only the shape of N is important. We argue that we can label the leaves of N with species in such a way that the resulting network N’, which is isomorphic to N, is consistent with a fraction ≥ p’ of T. We use a probabilistic argument to show that such a labelling exists. We then use the method of conditional expectation to derandomize this, so that the labelling can be found in polynomial time.

Choosing the labelling u.a.r. is good enough Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T_1(n). Let T be the input set of triplets, on n species. If we choose a labelling of the leaves of N uniformly at random (i.e. randomly assign the n species from T to the n leaves of N) to get a network N’, then the expected fraction of T that N’ is consistent with is p’.

Consider an arbitrary triplet xy|z from T. What is the probability that N’, created by randomly labelling N, is consistent with xy|z? It is the probability that the species x, y, z get mapped to leaves t_1, t_2, t_3 such that t_1 t_2|t_3 is consistent with N. [Figure: different ways of placing x, y and z on the leaves of N.]

For each “leaf triplet” t_1 t_2|t_3 in N, there are 2(n-3)! labellings that map xy|z to that leaf triplet. A labelling that maps xy|z to one leaf triplet cannot map xy|z to another leaf triplet. So the probability that the labelled network N’ is consistent with xy|z is the probability that xy|z gets mapped to one of the leaf triplets in N. N has $p' \cdot 3\binom{n}{3}$ leaf triplets and there are n! labellings in total, so (hand-waving) the probability is $\frac{p' \cdot 3\binom{n}{3} \cdot 2(n-3)!}{n!} = p'$.

So, for any triplet t in T, the network N’ made by randomly labelling N is consistent with t with probability p’. Summing over all triplets (linearity of expectation), the expected fraction of T consistent with N’ is also p’. We conclude that there exists some labelling of N that achieves a fraction ≥ p’. This proves that, for a subclass of networks, the p-fraction is indeed defined by the full triplet set, and that any network attaining the p-fraction for the full triplet set can be relabelled to attain the p-fraction for an arbitrary input set T. But how do we find, in polynomial time, a suitable labelling for a given input set T? Derandomization by the method of conditional expectation.
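The expectation argument can be verified exactly on a small instance by enumerating all n! labellings. The sketch below restricts itself to tree shapes to keep the consistency test short; that restriction, the nested-tuple encoding with integer leaf slots, and the names `slots`, `displays`, `fraction_of_full_set` and `expected_fraction` are all assumptions of this illustration, not part of the talk.

```python
from fractions import Fraction
from itertools import combinations, permutations

def slots(shape):
    """Leaf slots (integers) below a node of an unlabelled tree shape."""
    if not isinstance(shape, tuple):
        return {shape}
    return set().union(*(slots(child) for child in shape))

def displays(shape, a, b, c):
    """True if the leaf triplet ab|c is consistent with the tree shape."""
    node = shape
    while True:
        where = {s: i for i, child in enumerate(node)
                 for s in slots(child) & {a, b, c}}
        if where[a] == where[b] == where[c]:
            node = node[where[a]]
        else:
            return where[a] == where[b] != where[c]

def fraction_of_full_set(shape):
    """p': the fraction of the full triplet set the shape is consistent with."""
    triples = list(combinations(sorted(slots(shape)), 3))
    good = sum(displays(shape, a, b, c) + displays(shape, a, c, b)
               + displays(shape, b, c, a) for a, b, c in triples)
    return Fraction(good, 3 * len(triples))

def expected_fraction(shape, species, triplets):
    """Exact average, over all labellings, of the fraction of triplets hit."""
    total = Fraction(0)
    count = 0
    for perm in permutations(sorted(slots(shape))):
        pos = dict(zip(species, perm))
        hit = sum(displays(shape, pos[x], pos[y], pos[z]) for x, y, z in triplets)
        total += Fraction(hit, len(triplets))
        count += 1
    return total / count

shape = ((0, 1, 2), (3, 4))  # a non-binary shape, so p' differs from 1/3
T = [("a", "b", "c"), ("a", "b", "d"), ("c", "d", "e"), ("b", "e", "a")]
```

For this shape p’ = 3/10, and the exact expectation over all 120 labellings is also 3/10 regardless of which triplets are put in `T`.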

Derandomizing: a sketch An appropriate labelling can be found in time O(m^4 n^3), where m is the number of vertices in the unlabelled network N. We do this by labelling the leaves of N one at a time. General idea: at a given iteration of the algorithm, let F be the set of leaves of N which have already been labelled with species. We then arbitrarily pick an unlabelled leaf t and add it to F by labelling it. But how do we choose the species that labels it? We choose the species that maximises the expected fraction of T that the finished labelled network N’ will be consistent with, assuming that the labelling of the leaves in F ∪ {t} is fixed and that the remaining leaves are labelled uniformly at random. The main point to observe is how the probabilities can be computed in polynomial time.

We compute the probability for each triplet independently. E.g. consider a triplet xy|z. Suppose x and y have already been assigned to leaves. What is the probability that xy|z will be consistent with the final network, given that the remaining leaves are labelled u.a.r.? Simply try all possible ways of mapping z to the remaining leaves, and count the successful mappings. [Figure: the network with x and y already placed; the remaining leaves are marked as good or bad leaves for z.]
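The greedy labelling described above can be sketched as follows. As before this is a simplified illustration for tree shapes only, with hypothetical names (`cond_prob`, `derandomized_labelling`) and a nested-tuple encoding chosen for this sketch; it makes no attempt to match the running times quoted in the talk. For each triplet, the conditional probability is obtained by trying every placement of its not-yet-placed species on the free leaves and counting the successful ones.

```python
from fractions import Fraction
from itertools import permutations

def slots(shape):
    if not isinstance(shape, tuple):
        return {shape}
    return set().union(*(slots(child) for child in shape))

def displays(shape, a, b, c):
    """True if the leaf triplet ab|c is consistent with the tree shape."""
    node = shape
    while True:
        where = {s: i for i, child in enumerate(node)
                 for s in slots(child) & {a, b, c}}
        if where[a] == where[b] == where[c]:
            node = node[where[a]]
        else:
            return where[a] == where[b] != where[c]

def cond_prob(shape, triplet, placed, free):
    """P(triplet is hit | placed species fixed, the rest uniform on free slots)."""
    unplaced = [s for s in triplet if s not in placed]
    good = total = 0
    for choice in permutations(free, len(unplaced)):
        pos = {**placed, **dict(zip(unplaced, choice))}
        total += 1
        good += displays(shape, *(pos[s] for s in triplet))
    return Fraction(good, total)

def derandomized_labelling(shape, species, triplets):
    """Label leaves one at a time, never lowering the conditional expectation."""
    free = sorted(slots(shape))
    remaining = list(species)
    placed = {}
    for slot in list(free):
        free.remove(slot)
        # Pick the species for this leaf that maximises the expected number
        # of triplets hit when all later leaves are labelled at random.
        best = max(remaining, key=lambda s: sum(
            cond_prob(shape, t, {**placed, s: slot}, free) for t in triplets))
        placed[best] = slot
        remaining.remove(best)
    return placed

shape = ((0, 1, 2), (3, 4))  # p' = 3/10 for this shape
T = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a"), ("d", "e", "a"), ("a", "d", "e")]
labelling = derandomized_labelling(shape, "abcde", T)
hits = sum(displays(shape, labelling[x], labelling[y], labelling[z]) for x, y, z in T)
```

Because the best choice at each step is at least as good as the average over all choices, the final labelling is guaranteed to hit at least p’|T| triplets, here at least 3/10 of `T`.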

Worst-case optimal algorithm for level-1 Jansson, Nguyen & Sung (2005) showed how to construct the galled caterpillar on n leaves, denoted C(n). This level-1 network C(n) has the property that no other level-1 network is consistent with a higher fraction of the full triplet set T_1(n); it is thus in some sense optimal. It is easy to construct C(n) in time polynomial in n. Combining this with our generic derandomized labelling algorithm, we obtain a polynomial-time worst-case optimal algorithm for level-1. For level-1 networks, let us parameterize the p-fraction as a function p(n) of n, the number of species. Combining our result with that of Jansson, Nguyen & Sung, we get: [formula for p(n) not preserved in the transcript]

The value p(n) seems to smoothly approach a horizontal asymptote of ≈ 0.4880… from above. With help from Mathematica and some insights into ‘good’ values of a, we have bounded p(n) below by 0.48 for all n.


The galled caterpillar C(17) Galled caterpillars have a very regular structure, and this allows us to do a faster, customized derandomization, in time O(|T|n^2).

Level-2 Using a combination of our relabelling technique, Java programming, and Mathematica, we were easily (in one afternoon) able to prove a lower bound on p of 0.61 for level-2 networks. The real value of p for level-2 is probably somewhere around 2/3. But to prove that conclusively we need to know what optimal level-2 networks look like for the full triplet set! A nice challenge for someone...

Conclusions and open problems We have shown that all tree shapes are worst-case optimal; we have identified p(n) for level-1 networks, and given a lower bound on p for level-2. More generally, we show how, for any given subclass of networks, the p-fraction can be obtained by studying only the full triplet set, and that (generic or customised) polynomial-time algorithms can be constructed around this. Obtaining (bounds on) p can also be a first step on the road to good approximation algorithms for the MAX variants; it gives a (1/p)-approximation for the MAX variant. Open problems: What is the significance for biology, for the triplet method, and for the level-k hierarchy? Our result is probably bad news for the field (not much discriminatory power). What is the real value of p for level-2, for higher-level networks, and for other subclasses of networks? Are approximation factors better than 1/p achievable in polynomial time for the MAX variants?