Plgw03, 17/12/07 1 On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities Ilan Gronau Shlomo Moran Technion – Israel Institute of Technology.

Presentation transcript:

Plgw03, 17/12/07 1 On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities Ilan Gronau Shlomo Moran Technion – Israel Institute of Technology Haifa, Israel

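Plgw03, 17/12/07 2 Pairwise-Distance Based Reconstruction
Input sequences: Butterfly (B) …AAGT…, Eagle (E) …CAGA…, Gorilla (G) …CCGT…, Human (H) …AACG…, Lion (L) …AATA…, Mouse (M) …CGCG….
Step 1 (calculate): compute the pairwise distance matrix D over the taxa B, E, G, H, L, M.
Step 2 (reconstruct): build an edge-weighted tree T over the same taxa; T induces the tree-metric D_T.
[Figure: example matrix D, tree T and matrix D_T; the numeric entries are not preserved.]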

Plgw03, 17/12/07 3 Optimization Criteria
We wish the tree-metric D_T to approximate simultaneously all the pairwise distances in D, i.e. D_T should be "close" to D.
Two "closeness" measures are studied here: Maximal Difference (l∞) and Maximal Distortion.
[Figure: example matrices D and D_T over the taxa B, E, G, H, L, M.]

Plgw03, 17/12/07 4 Maximal Difference (l∞) vs. Maximal Distortion
Goal: find an optimal tree T, i.e. one which minimizes the maximal difference (or the maximal distortion) between D and D_T.
[Figure: example matrices D and D_T over the taxa B, E, G, H, L, M.]

Plgw03, 17/12/07 5 Previous Works on Approximating Dissimilarities by Tree Distances
Negative results (NP-hardness):
- Closest tree-metric (even ultrametric) to a dissimilarity matrix under l1 and l2 [Day '87].
- Closest tree-metric to a dissimilarity matrix under l∞ [ABFPT99]. Hard to approximate better than [factor missing from the transcript]. Implicit: hard to approximate the closest MaxDist tree within any constant factor.
Positive results:
- Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek '88].
- 3-approximation of the closest additive metric to a given metric [ABFPT99] (implicitly, a 6-approximation for general dissimilarity matrices).

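Plgw03, 17/12/07 6 This Work: Triplet-Distances (Distances to Triplet Midpoints)
For taxa i, j, k in a tree T, let C(i,j,k) be the midpoint of the triplet, i.e. the point where the paths connecting i, j and k meet. τ_T(i ; jk) is the distance in T from i to C(i,j,k).
Properties: τ_T(i ; jk) = τ_T(i ; kj); τ_T(i ; ij) = 0; τ_T(i ; jj) = D_T(i, j).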
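The three properties listed on this slide can be checked on a small tree. The sketch below is only an illustration: the tree (a star with pendant edges 1, 2, 3) and the helper names `tree_distances` and `tau_tree` are hypothetical, and it relies on the standard fact that in a tree the distance from i to C(i,j,k) equals ½[D_T(i,j) + D_T(i,k) − D_T(j,k)].

```python
from collections import defaultdict

def tree_distances(edges):
    """All-pairs path lengths in an edge-weighted tree given as (u, v, w) triples."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {}
    for src in adj:
        dist[src] = {src: 0.0}
        stack = [src]
        while stack:                      # depth-first walk; paths in a tree are unique
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist[src]:
                    dist[src][v] = dist[src][u] + w
                    stack.append(v)
    return dist

def tau_tree(dist, i, j, k):
    """Distance from i to the midpoint C(i,j,k) of the triplet."""
    return 0.5 * (dist[i][j] + dist[i][k] - dist[j][k])

# Hypothetical 3-leaf star: centre c with pendant edges 1, 2 and 3.
edges = [("i", "c", 1.0), ("j", "c", 2.0), ("k", "c", 3.0)]
d = tree_distances(edges)
```

Here `tau_tree(d, "i", "j", "k")` is the length of the pendant edge of i, since c is the midpoint of the triplet.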

Plgw03, 17/12/07 7 Triplet-Distances Defined by 2-Distances
Each distance matrix D defines a 3-tree for every three taxa i, j, k, with midpoint C(i,j,k) and
τ(i ; jk) = ½ [ D(i,j) + D(i,k) − D(j,k) ].
Any metric on 3 taxa is realizable by a 3-tree.
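A minimal sketch of this formula, assuming D is stored as a nested dictionary; the function name `triplet_table` and the three example distances are hypothetical. It also shows the realizability claim: the three values τ(i ; jk), τ(j ; ik), τ(k ; ij) are the pendant edge lengths of a 3-tree whose path lengths reproduce D.

```python
from itertools import permutations

def triplet_table(D):
    """tau[(i, j, k)] = 0.5 * (D[i][j] + D[i][k] - D[j][k]) for ordered triples of distinct taxa."""
    taxa = list(D)
    return {(i, j, k): 0.5 * (D[i][j] + D[i][k] - D[j][k])
            for i, j, k in permutations(taxa, 3)}

# Hypothetical symmetric distances on three taxa (they satisfy the triangle inequality).
D = {"i": {"i": 0, "j": 5, "k": 7},
     "j": {"i": 5, "j": 0, "k": 6},
     "k": {"i": 7, "j": 6, "k": 0}}
tau = triplet_table(D)

# Pendant edge lengths of the 3-tree on i, j, k.
a, b, c = tau[("i", "j", "k")], tau[("j", "i", "k")], tau[("k", "i", "j")]
```

For a metric the three values are non-negative, which is what makes the 3-tree realizable.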

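Plgw03, 17/12/07 8 Triplet-Distance Based Reconstruction
Input sequences for the taxa B, E, G, H, L, M: …AAGT…, …CAGA…, …CCGT…, …AACG…, …AATA…, …CGCG….
From these a 3-dissimilarity table τ is computed, with one row per taxon (B, E, G, H, L, M) and one column per taxon pair (BB, BE, BG, …, LL, LM, MM).
Reconstruct: from τ, build a tree T, which induces a table τ_T with the same layout.
Tables induced by a pairwise matrix D satisfy τ(i ; jk) = ½[D(i,j)+D(i,k)−D(j,k)].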

Plgw03, 17/12/07 9 Why use Triplet-Distances? 1. They enable more accurate estimations of 2-distances. 2. They are used (de facto) by known reconstruction algorithms

Plgw03, 17/12/07 10 Improved Estimations of Pairwise Distances
Input: Butterfly …AAGT…, Eagle …CAGA…, Gorilla …CCGT…, Human …AACG…, Lion …AATA…, Mouse …CGCG…, from which the matrix D over B, E, G, H, L, M is computed.
Calculating D(H,E): a maximum-likelihood estimate uses only Human …AACG… and Eagle …CAGA….
"Information loss": in calculating D(H,E), all other taxa are ignored.
[Figure: the two-taxon tree on H and E; the label "13" appears in the original slide.]

Plgw03, 17/12/07 11 Improved Estimations (cont.)
Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, for X ≠ H, E.
(Or: calculate just one 3-tree, for a "trusted" 3rd taxon X: V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11) 1952–1963 (2002).)
[Figure: four 3-trees on H = (..AACG..), E = (..CAGA..) and a third taxon X, each with an unknown midpoint sequence (..****..). Edge labels shown: X = B (..AAGT..): 3, 2; X = M (..CGCG..): 3, 3; X = G (..CCGT..): 1, 5; X = L (..AATA..): 2, 4.]
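A rough sketch of how the 3-trees could be turned into estimates of D(H,E). Two things here are assumptions, not statements from the slide: that the two numbers shown for each third taxon are the lengths of the edges leading to H and to E, and that the per-taxon estimates are combined by a plain average. The slide does not say how the estimates are combined.

```python
from statistics import mean

# Edge labels of the four 3-trees as they appear in the slide's figure.
# Assumption: each pair is (edge towards H, edge towards E); the sum used
# below does not depend on which of the two labels belongs to which edge.
three_trees = {"B": (3, 2), "M": (3, 3), "G": (1, 5), "L": (2, 4)}

def pairwise_estimates(trees):
    """One estimate of D(H,E) per third taxon X: the path H -> midpoint -> E."""
    return {x: to_h + to_e for x, (to_h, to_e) in trees.items()}

per_taxon = pairwise_estimates(three_trees)

# Assumption: combine the estimates by averaging (other choices are possible).
combined = mean(per_taxon.values())
```

With a single "trusted" third taxon X, as in the Ranwez and Gascuel variant, the estimate would simply be `per_taxon[X]`.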

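Plgw03, 17/12/07 12 (Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
The pairwise matrix D over B, E, G, H, L, M implicitly defines a triplet table τ (rows B, E, G, H, L, M; columns BB, BE, BG, …, LL, LM, MM) via τ(i ; jk) = ½[D(i,j)+D(i,k)−D(j,k)], and the tree T is reconstructed from it.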

Plgw03, 17/12/07 13 1st use: "Triplet Distances from a Single Source"
Fix a taxon r, and construct a tree T which minimizes an objective over the triplet distances from r to the pairs i, j (the formula is missing from the transcript).
An optimal solution can be computed in O(n²) time, and is used e.g. in:
- (FKW95): optimal approximation of distances by ultrametric trees.
- (ABFPT99): the best known approximation of distances by general trees.
- (BB99): fast construction of Buneman trees.

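Plgw03, 17/12/07 14 2nd use: Saitou & Nei Neighbor Joining
The neighbor-selection criterion of NJ selects a taxon-pair i, j which maximizes a sum over the remaining taxa r (the formula is missing from the transcript).
[Figure: the pair i, j and the other taxa, each labelled r.]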
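Since the slide's own formula is not preserved, the sketch below does not try to reproduce it. It only checks a relation that follows by direct algebra from the textbook Q-criterion of NJ, Q(i,j) = (n−2)·D(i,j) − Σ_k D(i,k) − Σ_k D(j,k), which NJ minimizes: −Q(i,j) = 2·Σ_{r≠i,j} τ(r ; ij) + 2·D(i,j). The example matrix and the function names are hypothetical.

```python
from itertools import combinations

def q_value(D, i, j):
    """Textbook NJ Q-criterion; NJ joins the pair with the smallest value."""
    n = len(D)
    return (n - 2) * D[i][j] - sum(D[i].values()) - sum(D[j].values())

def triplet_sum(D, i, j):
    """Sum over all other taxa r of tau(r ; ij) = 0.5 * (D[r][i] + D[r][j] - D[i][j])."""
    return sum(0.5 * (D[r][i] + D[r][j] - D[i][j]) for r in D if r not in (i, j))

# Hypothetical additive distances on four taxa (tree with cherries {a,b} and {c,d}).
D = {"a": {"a": 0, "b": 3, "c": 5, "d": 6},
     "b": {"a": 3, "b": 0, "c": 6, "d": 7},
     "c": {"a": 5, "b": 6, "c": 0, "d": 7},
     "d": {"a": 6, "b": 7, "c": 7, "d": 0}}

# Difference between the two sides of the identity; it is 0 for every pair.
gaps = {(i, j): -q_value(D, i, j) - (2 * triplet_sum(D, i, j) + 2 * D[i][j])
        for i, j in combinations(D, 2)}
```

So minimizing Q is the same as maximizing the triplet sum plus D(i,j); whether this is the exact form shown on the slide cannot be told from the transcript.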

Plgw03, 17/12/07 15 Previous Works on Triplet-Dissimilarities/Distances
I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1) (2007).
Works which use the total weights of 3-trees:
- S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12 (1995).
- L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17 (2004).
- D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3) 491–498 (2006).

Plgw03, 17/12/07 16 Summary of Results
Results for Maximal Difference (l∞):
1. The decision problem is NP-hard: is there a tree T s.t. ||τ, τ_T||∞ ≤ Δ?
2. Hardness of approximation of the optimization problem: finding a tree T s.t. ||τ, τ_T||∞ ≤ 1.4 ||τ, τ_OPT||∞.
3. A 15-approximation algorithm, using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99].
Result for Maximal Distortion: hardness of approximation within any constant factor.

Plgw03, 17/12/07 17 NP-Hardness of the Decision Problem
We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).
[Figure: an example 3CNF formula with its clauses and literals marked, and a satisfying assignment.]
We show: if one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.

Plgw03, 17/12/07 18 The Reduction
Given a 3CNF formula φ, we define triplet distances τ and an error bound Δ which force the output tree to imply a satisfying assignment to φ.
The set of taxa:
- Taxa T, F.
- A taxon for every literal (x_i, ¬x_i).
- 3 taxa for every clause C_j (y_j^1, y_j^2, y_j^3).
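As a small illustration of the size of the instance, the taxa set can be generated as follows. The naming scheme (`x1`, `~x1`, `y1_1`, …) and the function name `taxa_for` are hypothetical; only the three groups of taxa come from the slide.

```python
def taxa_for(n_vars, n_clauses):
    """Taxa of the reduction: T, F, two literal taxa per variable, three taxa per clause."""
    taxa = ["T", "F"]
    for i in range(1, n_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]                    # one taxon per literal
    for j in range(1, n_clauses + 1):
        taxa += [f"y{j}_1", f"y{j}_2", f"y{j}_3"]      # one taxon per path between two literals of C_j
    return taxa

taxa = taxa_for(n_vars=3, n_clauses=2)
```

A formula with n variables and m clauses thus yields 2 + 2n + 3m taxa, so the reduction is polynomial in the size of φ.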

Plgw03, 17/12/07 19 Properties Enforced by the Input (τ, Δ)
One of the following can be enforced on each taxa triplet (u, v, w):
1. taxon u is close to Path(v, w), or
2. taxon u is far from Path(v, w).
[Figure: taxon u and the path between v and w.]

Plgw03, 17/12/07 20 Enforcing Truth Assignment
A truth assignment to φ is implied by the following:
1. T is far from F.
2. For each i, x_i is far from ¬x_i, and both x_i and ¬x_i are close to Path(T, F).
Thus we set x_i = True iff the taxon x_i is close to T.

Plgw03, 17/12/07 21 Enforcing Clause Satisfaction
A clause C = (l_1 ∨ l_2 ∨ l_3) is satisfied iff at least one literal l_i is true, i.e. is close to T.
Equivalently, (l_1 ∨ l_2 ∨ l_3) is satisfied iff the tree does not look like the figure, in which l_1, l_2 and l_3 are all attached near F.
We need to guarantee, by means of the close/far relations, that all clauses avoid this configuration.

Plgw03, 17/12/07 22 Clause Satisfaction (cont.)
(l_1 ∨ l_2 ∨ l_3) is satisfied iff, out of the three paths Path(l_1, l_2), Path(l_1, l_3), Path(l_2, l_3), at least two are close to T.
But we don't know which two paths.
[Figure: T, F and the literal taxa l_1, l_2, l_3.]
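The counting claim on this slide can be checked exhaustively under a simple abstraction, which is an assumption of this sketch rather than part of the slide: a literal taxon is close to T exactly when the literal is true, and Path(a, b) is close to T exactly when at least one of its endpoints is.

```python
from itertools import product

def paths_close_to_T(l1, l2, l3):
    """Number of the paths Path(l1,l2), Path(l1,l3), Path(l2,l3) that pass close to T.

    Abstraction (assumed): a path is close to T iff at least one endpoint is true.
    """
    return sum([l1 or l2, l1 or l3, l2 or l3])

# One entry per truth assignment to the three literals of a clause.
table = {bits: paths_close_to_T(*bits) for bits in product([False, True], repeat=3)}
```

In every row the clause is satisfied exactly when the count is at least 2: one true literal already lies on two of the three paths, while an all-false clause gives a count of 0.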

Plgw03, 17/12/07 23 Clause Satisfaction (cont.)
We attach a taxon to each such path: y_1 is close to Path(l_2, l_3); y_2 is close to Path(l_1, l_3); y_3 is close to Path(l_1, l_2).
(l_1 ∨ l_2 ∨ l_3) is satisfied iff at least two of the y_i's can be located close to T …
[Figure: T, F, the literal taxa l_1, l_2, l_3 and the attached taxa y_1, y_2, y_3.]

Plgw03, 17/12/07 24 Clause Satisfaction (end)
… and at least two of the y_i's can be located close to T iff Path(y_2, y_3), Path(y_1, y_3) and Path(y_1, y_2) are all close to T.
So, (l_1 ∨ l_2 ∨ l_3) is satisfied iff all the above paths are close to T.
[Figure: T, F, the literal taxa l_1, l_2, l_3 and the attached taxa y_1, y_2, y_3.]

Plgw03, 17/12/07 25 Construction Example
φ is satisfiable ⇒ there is a tree T which satisfies all bounds:
A1: τ_T(T, F) ≥ 2α+2β
A2: for i = 1..n: τ_T(T ; x_i ¬x_i) ≤ α ; τ_T(F ; x_i ¬x_i) ≤ α
B1: for j = 1..m: τ_T(y_j^1 ; l_j^2 l_j^3) ≤ α ; τ_T(y_j^2 ; l_j^1 l_j^3) ≤ α ; τ_T(y_j^3 ; l_j^1 l_j^2) ≤ α
B2: for j = 1..m: τ_T(y_j^1 ; T F) ≥ α ; τ_T(y_j^2 ; T F) ≥ α ; τ_T(y_j^3 ; T F) ≥ α
B3: for j = 1..m: τ_T(T ; y_j^2 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^2) ≤ α
[Figure: the constructed tree, with vertices v_F and v_T, the taxa T and F, the clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3, and edge labels 2β and α.]

Plgw03, 17/12/07 26 Hardness of Approximation Results
By "stretching" the close/far restrictions, the following problems are also shown to be NP-hard:
- Approximating Maximal Difference: finding a tree T s.t. ||τ, τ_T||∞ ≤ 1.4 ||τ, τ_OPT||∞.
- Approximating Maximal Distortion: finding a tree T s.t. MaxDist(τ, τ_T) ≤ C · MaxDist(τ, τ_OPT), for any constant C.
Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007.

Plgw03, 17/12/07 27 Open Problems / Further Research
- Extending the hardness results to 3-diss tables induced by 2-diss matrices (τ(i ; jk) = ½[D(i,j)+D(i,k)−D(j,k)]).
- Extending the hardness results to "natural-looking" trees (binary trees with constant-bounded edge weights).
- Checking the performance of NJ when the neighbor-selection formula is computed from "real" 3-distances.
- Devising algorithms which use 3-distances as input.
- Does optimization of 3-diss lead to good topological accuracy under accepted models of sequence evolution? (It is known that optimization of 2-diss doesn't lead to good topological accuracy.)

Plgw03, 17/12/07 28 Thank You

Distance-Based Phylogenetic Reconstruction
1. Compute distances between all taxon-pairs.
2. Find an edge-weighted tree which best describes the distances.

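Plgw03, 17/12/07 30 Optimization Criteria
Known measures of closeness between D and D_T (the formulas are missing from the transcript):
- l∞ (maximal difference)
- l_p
- MaxDist (maximal distortion), where 0/0 ≡ 1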
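Because the formulas did not survive, the sketch below uses the usual textbook definitions, which should be read as assumptions: l∞ as the largest absolute difference, l_p as the p-th root of the sum of p-th powers of the differences, and MaxDist as the product of the largest ratio D/D_T and the largest ratio D_T/D (with 0/0 ≡ 1, as on the slide). The example values and function names are hypothetical.

```python
def l_inf(D, DT):
    """Maximal difference: largest |D - D_T| over all entries."""
    return max(abs(D[p] - DT[p]) for p in D)

def l_p(D, DT, p):
    """l_p distance between the two tables (assumed textbook definition)."""
    return sum(abs(D[q] - DT[q]) ** p for q in D) ** (1.0 / p)

def _ratio(a, b):
    if a == 0 and b == 0:
        return 1.0                      # convention from the slide: 0/0 = 1
    return float("inf") if b == 0 else a / b

def max_dist(D, DT):
    """Maximal distortion (assumed): max expansion times max contraction."""
    return max(_ratio(D[q], DT[q]) for q in D) * max(_ratio(DT[q], D[q]) for q in D)

# Hypothetical dissimilarities and tree distances on three taxon pairs.
D  = {("B", "E"): 4, ("B", "G"): 6, ("E", "G"): 8}
DT = {("B", "E"): 5, ("B", "G"): 6, ("E", "G"): 6}
```

With this definition MaxDist is at least 1 and is invariant under scaling D_T by a constant, which is one way it differs from the additive l∞ measure.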

Plgw03, 17/12/07 31 The Reduction
3CNF formula φ → (τ, Δ), where τ is a 3-diss table.
φ is satisfiable ⇔ there is a tree T s.t. ||τ, τ_T||∞ ≤ Δ.
If one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.

Plgw03, 17/12/07 32 The Reduction
[Diagram: φ is mapped to lower bounds τ_l and upper bounds τ_u, with the label 2Δ between them.]
Define a set of lower and upper bounds:
A1: τ_T(T, F) ≥ 2α+2β
A2: for i = 1..n: τ_T(T ; x_i ¬x_i) ≤ α ; τ_T(F ; x_i ¬x_i) ≤ α
B1: for j = 1..m: τ_T(y_j^1 ; l_j^2 l_j^3) ≤ α ; τ_T(y_j^2 ; l_j^1 l_j^3) ≤ α ; τ_T(y_j^3 ; l_j^1 l_j^2) ≤ α
B2: for j = 1..m: τ_T(y_j^1 ; T F) ≥ α ; τ_T(y_j^2 ; T F) ≥ α ; τ_T(y_j^3 ; T F) ≥ α
B3: for j = 1..m: τ_T(T ; y_j^2 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^2) ≤ α

Plgw03, 17/12/07 33 The Reduction
3CNF formula φ → (τ_l, τ_u) [diagram label between the two bounds: 2Δ].
φ is satisfiable ⇔ there is a tree T s.t. τ_l ≤ τ_T ≤ τ_u.
If one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.

Plgw03, 17/12/07 34 The Reduction
[Diagram: φ is mapped to lower bounds τ_l and upper bounds τ_u, with the label 2Δ between them.]
1. Define the set of taxa.
2. Define a set of lower and upper bounds on some entries of τ_T. [φ is satisfiable ⇔ there is a tree T which satisfies all bounds]
3. Define Δ according to the slackness required for the proof (the arrow naming the direction is missing from the transcript).

Plgw03, 17/12/07 35 The Reduction
[Diagram: φ is mapped to lower bounds τ_l and upper bounds τ_u, with the label 2Δ between them.]
Define the set of taxa:
- Taxa T, F.
- A taxon for every literal (x_i, ¬x_i).
- 3 taxa for every clause (y_j^1, y_j^2, y_j^3).

Plgw03, 17/12/07 36 The Analysis
A1: τ_T(T, F) ≥ 2α+2β
A2: for i = 1..n: τ_T(T ; x_i ¬x_i) ≤ α ; τ_T(F ; x_i ¬x_i) ≤ α
⇒ Trees satisfying A1 and A2 imply a truth-assignment to x_1, ..., x_n.
[Figure: vertices v_φ, v_F, v_T and the taxa T, F, with edge labels β, β, ≥α and ≤α.]

Plgw03, 17/12/07 37 The Analysis
There is a tree T which satisfies all bounds ⇒ φ is satisfiable.
B1: for j = 1..m: τ_T(y_j^1 ; l_j^2 l_j^3) ≤ α ; τ_T(y_j^2 ; l_j^1 l_j^3) ≤ α ; τ_T(y_j^3 ; l_j^1 l_j^2) ≤ α
B2: for j = 1..m: τ_T(y_j^1 ; T F) ≥ α ; τ_T(y_j^2 ; T F) ≥ α ; τ_T(y_j^3 ; T F) ≥ α
B3: for j = 1..m: τ_T(T ; y_j^2 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^2) ≤ α
- B1 and B2 imply that y_j^a = l_j^b ∨ l_j^c for {a,b,c} = {1,2,3}.
- B3 implies that at least two of y_j^1, y_j^2, y_j^3 are satisfied.
[Figure: v_F, F, l_1, l_2, y_3 and v_φ.]

Plgw03, 17/12/07 38 The Reduction: τ(φ)
Bounds on τ_T:
A1: τ_T(T, F) ≥ 2α+2β
A2: for i = 1..n: τ_T(T ; x_i ¬x_i) ≤ α ; τ_T(F ; x_i ¬x_i) ≤ α
B1: for j = 1..m: τ_T(y_j^1 ; l_j^2 l_j^3) ≤ α ; τ_T(y_j^2 ; l_j^1 l_j^3) ≤ α ; τ_T(y_j^3 ; l_j^1 l_j^2) ≤ α
B2: for j = 1..m: τ_T(y_j^1 ; T F) ≥ α ; τ_T(y_j^2 ; T F) ≥ α ; τ_T(y_j^3 ; T F) ≥ α
B3: for j = 1..m: τ_T(T ; y_j^2 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^3) ≤ α ; τ_T(T ; y_j^1 y_j^2) ≤ α
Corresponding entries of τ(φ):
A1: τ(T, F) = 2α+3β
A2: for i = 1..n: τ(T ; x_i ¬x_i) = α−β ; τ(F ; x_i ¬x_i) = α−β
B1: for j = 1..m: τ(y_j^1 ; l_j^2 l_j^3) = α−β ; τ(y_j^2 ; l_j^1 l_j^3) = α−β ; τ(y_j^3 ; l_j^1 l_j^2) = α−β
B2: for j = 1..m: τ(y_j^1 ; T F) = α+β ; τ(y_j^2 ; T F) = α+β ; τ(y_j^3 ; T F) = α+β
B3: for j = 1..m: τ(T ; y_j^2 y_j^3) = α−β ; τ(T ; y_j^1 y_j^3) = α−β ; τ(T ; y_j^1 y_j^2) = α−β
Other 2-distances: τ(s, t) = 2α+2β
Other 3-distances: τ(s ; tu) = α+2β
In our constructed tree: all 2-distances are in [2α, 2α+2β] and all 3-distances are in [α, α+2β] ⇒ Δ = β.
[Figure: the constructed tree, with vertices v_F and v_T, the taxa T and F, the clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3, and edge labels 2β and α.]
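The table on this slide can be generated mechanically from a formula. The sketch below is illustrative only: the key convention (s, t, u) for τ(s ; tu), the taxon names, and the function names are hypothetical, and the pair in A2 is assumed to be (x_i, ¬x_i), since the literal symbols are missing from the transcript. The values themselves are copied from the slide.

```python
def tau_phi(n_vars, clauses, alpha, beta):
    """Constrained entries of tau(phi); key (s, t, u) stands for tau(s ; t u).

    clauses: list of 3-tuples of literal names, e.g. ("x1", "~x2", "x3").
    Returns the table and the error bound Delta = beta.
    """
    tau = {("T", "F"): 2 * alpha + 3 * beta}                   # A1 (a 2-distance)
    for i in range(1, n_vars + 1):                             # A2 (assumed pair: x_i, ~x_i)
        for s in ("T", "F"):
            tau[(s, f"x{i}", f"~x{i}")] = alpha - beta
    for j, (l1, l2, l3) in enumerate(clauses, start=1):
        y1, y2, y3 = (f"y{j}_{a}" for a in (1, 2, 3))
        for y, pair in ((y1, (l2, l3)), (y2, (l1, l3)), (y3, (l1, l2))):
            tau[(y, *pair)] = alpha - beta                     # B1: y close to its path
            tau[(y, "T", "F")] = alpha + beta                  # B2: y far from Path(T, F)
        for pair in ((y2, y3), (y1, y3), (y1, y2)):
            tau[("T", *pair)] = alpha - beta                   # B3: T close to each y-path
    return tau, beta

def default_entry(key, alpha, beta):
    """Value the slide gives for every entry not listed in A1..B3."""
    return 2 * alpha + 2 * beta if len(key) == 2 else alpha + 2 * beta

tau, delta = tau_phi(3, [("x1", "~x2", "x3")], alpha=10, beta=1)
```

Adding or subtracting `delta` from an entry recovers the corresponding bound, e.g. `tau[("T", "F")] - delta` equals 2α+2β as in A1.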