Plgw03, 17/12/07
On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities
Ilan Gronau, Shlomo Moran
Technion – Israel Institute of Technology, Haifa, Israel
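Pairwise-Distance Based Reconstruction
Input sequences:
  Butterfly (B) …AAGT…
  Eagle (E) …CAGA…
  Gorilla (G) …CCGT…
  Human (H) …AACG…
  Lion (L) …AATA…
  Mouse (M) …CGCG…
Step 1 (calculate): compute the pairwise distance matrix D over the taxa B, E, G, H, L, M.
Step 2 (reconstruct): build an edge-weighted tree T on the same taxa, with induced tree metric D_T.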
Optimization Criteria
We wish the tree metric D_T to approximate all pairwise distances in D simultaneously.
Two "closeness" measures are studied here:
  Maximal Difference (l_∞)
  Maximal Distortion
[Figure: the matrix D over B, E, G, H, L, M should be "close" to the matrix D_T.]
Maximal Difference (l_∞) vs. Maximal Distortion
Goal: find an optimal tree T, one which minimizes the maximal difference (or the maximal distortion) between D and D_T.
[Figure: the matrices D and D_T over B, E, G, H, L, M.]
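The two closeness measures can be made concrete with a small sketch. The maximal difference is the standard l_∞ distance between the matrices. The slides' formula for maximal distortion is not preserved in this transcript, so the version below assumes the usual "expansion times contraction" definition of distortion with the convention 0/0 ≡ 1; treat it as an assumption rather than the authors' exact definition. The matrices `D` and `DT` are made-up illustrative values.

```python
from itertools import combinations

def max_difference(D, DT):
    """l_inf distance: largest absolute gap between corresponding entries."""
    n = len(D)
    return max(abs(D[i][j] - DT[i][j]) for i, j in combinations(range(n), 2))

def _ratio(a, b):
    """a / b with the convention 0/0 = 1 (and x/0 = infinity for x > 0)."""
    if a == 0 and b == 0:
        return 1.0
    if b == 0:
        return float("inf")
    return a / b

def max_distortion(D, DT):
    """ASSUMED definition: (max DT/D) * (max D/DT) over all taxon pairs."""
    pairs = list(combinations(range(len(D)), 2))
    expansion = max(_ratio(DT[i][j], D[i][j]) for i, j in pairs)
    contraction = max(_ratio(D[i][j], DT[i][j]) for i, j in pairs)
    return expansion * contraction

# Hypothetical 3-taxon example (values invented for illustration).
D = [[0, 4, 6], [4, 0, 8], [6, 8, 0]]
DT = [[0, 5, 6], [5, 0, 6], [6, 6, 0]]
print(max_difference(D, DT))   # largest entrywise gap
print(max_distortion(D, DT))   # expansion * contraction
```

Under this assumed definition, scaling D_T by a constant changes the maximal difference but not the maximal distortion, which is one reason the two criteria behave differently.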
Previous Works on Approximating Dissimilarities by Tree Distances
Negative results (NP-hardness):
  Closest tree-metric (even ultrametric) to a dissimilarity matrix under l_1 and l_2 [Day '87]
  Closest tree-metric to a dissimilarity matrix under l_∞ [ABFPT99]
    Hard to approximate better than …
    Implicit: hard to approximate the closest MaxDist tree within any constant factor
Positive results:
  Closest ultrametric to a dissimilarity matrix under l_∞ [Krivanek '88]
  3-approximation of the closest additive metric to a given metric [ABFPT99] (implicit 6-approximation for general dissimilarity matrices)
This Work: Triplet-Distances (Distances to Triplet Midpoints)
τ_T(i; jk) is the distance in T from taxon i to C(i,j,k), the midpoint of the triplet {i, j, k}.
  τ_T(i; jk) = τ_T(i; kj)
  τ_T(i; ij) = 0
  τ_T(i; jj) = D_T(i, j)
[Figure: a 3-tree on i, j, k with center C(i,j,k).]
Triplet-Distances Defined by 2-Distances
Each distance matrix D defines 3-trees:
  τ(i; jk) = ½[D(i,j) + D(i,k) − D(j,k)]
Any metric on 3 taxa i, j, k is realizable by a 3-tree with center C(i,j,k).
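The formula above can be checked numerically. The sketch below computes τ(i; jk) from a distance matrix and builds the 3-tree whose three edge lengths are the three triplet distances. The function names and the distances between H, E and B are hypothetical, chosen only for illustration.

```python
def triplet_distance(D, i, j, k):
    """tau(i; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])

def three_tree(D, i, j, k):
    """Edge lengths from each taxon to the center C(i,j,k) of the 3-tree."""
    return {
        i: triplet_distance(D, i, j, k),
        j: triplet_distance(D, j, i, k),
        k: triplet_distance(D, k, i, j),
    }

# Hypothetical distances on three taxa (not taken from the slides).
D = {
    "H": {"H": 0, "E": 5, "B": 6},
    "E": {"H": 5, "E": 0, "B": 7},
    "B": {"H": 6, "E": 7, "B": 0},
}
edges = three_tree(D, "H", "E", "B")
print(edges)  # edge lengths to the center for H, E, B

# The 3-tree realizes D: each pairwise distance is the sum of two edges.
for a, b in [("H", "E"), ("H", "B"), ("E", "B")]:
    print(a, b, edges[a] + edges[b], D[a][b])
```

The same function also reproduces the boundary identities τ(i; ij) = 0 and τ(i; jj) = D(i, j).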
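Triplet-Distance Based Reconstruction
Input sequences: …AAGT…, …CAGA…, …CCGT…, …AACG…, …AATA…, …CGCG… (taxa B, E, G, H, L, M).
Step 1: compute the triplet-distance table τ, with one row per taxon (B, E, G, H, L, M) and one column per taxon pair (BB, BE, BG, …, LL, LM, MM).
Step 2 (reconstruct): build a tree T whose table τ_T, indexed the same way, approximates τ.
A table induced by a distance matrix D satisfies τ(i; jk) = ½[D(i,j) + D(i,k) − D(j,k)].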
Why Use Triplet-Distances?
1. They enable more accurate estimations of 2-distances.
2. They are used (de facto) by known reconstruction algorithms.
Improved Estimations of Pairwise Distances
The matrix D over B, E, G, H, L, M is computed from the sequences:
  Butterfly …AAGT…, Eagle …CAGA…, Gorilla …CCGT…, Human …AACG…, Lion …AATA…, Mouse …CGCG…
"Information loss": in calculating D(H,E), all other taxa are ignored.
  Human …AACG… and Eagle …CAGA… → (maximum likelihood) → D(H,E)
[Figure: the tree on H and E alone, labeled 13.]
Improved Estimations (cont.)
Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, for every X ≠ H, E.
(Or: calculate just one 3-tree, for a "trusted" third taxon X: V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), 1952–1963 (2002).)
[Figure: four 3-trees on H = (..AACG..), E = (..CAGA..) and a third taxon, each with an unknown center sequence (..****..) and two edge labels:
  B = (..AAGT..): 3, 2
  M = (..CGCG..): 3, 3
  G = (..CCGT..): 1, 5
  L = (..AATA..): 2, 4]
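(Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
[Figure: the matrix D over B, E, G, H, L, M is converted to a triplet-distance table (columns BB, BE, BG, …, LL, LM, MM) via τ(i; jk) = ½[D(i,j) + D(i,k) − D(j,k)], and the tree T is reconstructed from that table.]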
1st Use: "Triplet Distances from a Single Source"
Fix a taxon r, and construct a tree T which minimizes: …
[Figure: taxa i, j and the fixed source r.]
An optimal solution can be computed in O(n²) time, and is used e.g. in:
  (FKW95): optimal approximation of distances by ultrametric trees.
  (ABFPT99): the best known approximation of distances by general trees.
  (BB99): fast construction of Buneman trees.
2nd Use: Saitou & Nei Neighbor Joining
The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum: …
[Figure: the pair i, j together with the remaining taxa r.]
Previous Works on Triplet-Dissimilarities/Distances
  I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1) (2007).
Works which use the total weights of 3-trees:
  S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12 (1995).
  L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17 (2004).
  D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), 491–498 (2006).
Summary of Results
Results for Maximal Difference (l_∞):
1. The decision problem is NP-hard: is there a tree T s.t. ||τ, τ_T||_∞ ≤ Δ?
2. Hardness of approximation of the optimization problem: finding a tree T s.t. ||τ, τ_T||_∞ ≤ 1.4 ||τ, τ_OPT||_∞.
3. A 15-approximation algorithm, using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99].
Result for Maximal Distortion:
  Hardness of approximation within any constant factor.
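To make the decision problem concrete, the following sketch checks a candidate tree against a triplet table: it computes the tree's leaf-to-leaf distances, derives τ_T from them, and tests ||τ, τ_T||_∞ ≤ Δ on the entries supplied. The tree encoding (a list of weighted edges), the function names and the example values are all assumptions made for illustration; this only verifies a given tree and does not search for one.

```python
from collections import defaultdict

def tree_distances(edges, taxa):
    """Path lengths between taxa in an edge-weighted tree (edges: (u, v, w))."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {}
    for s in taxa:
        d = {s: 0.0}
        stack = [s]
        while stack:  # iterative DFS; paths in a tree are unique
            u = stack.pop()
            for v, w in adj[u]:
                if v not in d:
                    d[v] = d[u] + w
                    stack.append(v)
        dist[s] = d
    return dist

def within_delta(tau, edges, taxa, delta):
    """True iff |tau(i; jk) - tau_T(i; jk)| <= delta for every given entry."""
    DT = tree_distances(edges, taxa)
    for (i, j, k), value in tau.items():
        tau_T = 0.5 * (DT[i][j] + DT[i][k] - DT[j][k])
        if abs(value - tau_T) > delta:
            return False
    return True

# Hypothetical star tree with center v and leaves a, b, c.
edges = [("a", "v", 1.0), ("b", "v", 2.0), ("c", "v", 3.0)]
tau = {("a", "b", "c"): 1.5, ("b", "a", "c"): 2.0, ("c", "a", "b"): 2.5}
print(within_delta(tau, edges, ["a", "b", "c"], 0.5))
print(within_delta(tau, edges, ["a", "b", "c"], 0.4))
```

Checking a given tree is easy in this way; the hardness results concern finding such a tree.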
NP-Hardness of the Decision Problem
We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).
[Figure: an example 3CNF formula, with its clauses and literals marked, and a satisfying assignment.]
We show: if one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||_∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
The Reduction
Given a 3CNF formula φ, we define triplet distances and an error bound Δ which force the output tree to imply a satisfying assignment to φ.
The set of taxa:
  Taxa T, F.
  A taxon for every literal (x_i, ¬x_i).
  3 taxa for every clause C_j (y_j^1, y_j^2, y_j^3).
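As a minimal sketch of the first step of the reduction, the taxon set can be generated from the formula's size alone. The string labels and the clause encoding (triples of signed variable indices) are hypothetical conventions chosen here, not taken from the slides.

```python
def reduction_taxa(num_vars, clauses):
    """Taxa of the reduction: T, F, two literal taxa per variable,
    and three taxa per clause (labels are illustrative only)."""
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]          # one taxon per literal
    for j in range(1, len(clauses) + 1):
        taxa += [f"y{j}^{a}" for a in (1, 2, 3)]  # three taxa for clause C_j
    return taxa

# Hypothetical formula: (x1 v ~x2 v x3) ^ (~x1 v x2 v x3)
clauses = [(1, -2, 3), (-1, 2, 3)]
taxa = reduction_taxa(3, clauses)
print(len(taxa), taxa)
```

For n variables and m clauses this yields 2 + 2n + 3m taxa, so the triplet table has size polynomial in the size of φ.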
Properties Enforced by the Input (τ, Δ)
One of the following can be enforced on each taxon triplet (u, v, w):
1. taxon u is close to Path(v, w), or
2. taxon u is far from Path(v, w).
[Figure: taxon u and the path between v and w.]
Enforcing a Truth Assignment
A truth assignment to φ is implied by the following:
1. T is far from F.
2. For each i, x_i is far from ¬x_i, and both x_i and ¬x_i are close to Path(T, F).
Thus we set x_i = T iff x_i is close to T.
Enforcing Clause Satisfaction
A clause C = (l_1 ∨ l_2 ∨ l_3) is satisfied iff at least one literal l_i is true, i.e. is close to T.
[Figure: the literals l_1, l_2, l_3 all located near F.]
(l_1 ∨ l_2 ∨ l_3) is satisfied iff the tree does not look like this.
We need to guarantee, by means of the close/far relations, that all clauses avoid the above configuration.
Clause Satisfaction (cont.)
(l_1 ∨ l_2 ∨ l_3) is satisfied iff, out of the three paths Path(l_1, l_2), Path(l_1, l_3), Path(l_2, l_3), at least two are close to T.
But we don't know which two paths.
[Figure: T, F and the literals l_1, l_2, l_3.]
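The "at least two paths" characterization can be checked by brute force over the eight truth assignments to a clause's literals. The sketch assumes a simplified model that is not stated explicitly on the slides: a literal is close to T exactly when it is true, and Path(l_a, l_b) is close to T exactly when at least one of its endpoints is.

```python
from itertools import combinations, product

def clause_satisfied(values):
    """A clause is satisfied iff at least one of its three literals is true."""
    return any(values)

def paths_close_to_T(values):
    """Number of the three literal-pair paths that are close to T, under the
    ASSUMED model: a path is close to T iff one of its endpoints is true."""
    return sum(1 for a, b in combinations(range(3), 2) if values[a] or values[b])

table = {v: paths_close_to_T(v) for v in product([False, True], repeat=3)}
for values, count in table.items():
    print(values, clause_satisfied(values), count)
```

In this model the count is never exactly 1: a single true literal already lies on two of the three paths, which is why "at least two" separates satisfied from unsatisfied clauses.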
Clause Satisfaction (cont.)
We attach a taxon to each such path:
  y_1 is close to Path(l_2, l_3)
  y_2 is close to Path(l_1, l_3)
  y_3 is close to Path(l_1, l_2)
(l_1 ∨ l_2 ∨ l_3) is satisfied iff at least two of the y_i's can be located close to T …
[Figure: T, F, the literals l_1, l_2, l_3 and the attached taxa y_1, y_2, y_3.]
Clause Satisfaction (end)
… and at least two of the y_i's can be located close to T iff Path(y_2, y_3), Path(y_1, y_3) and Path(y_1, y_2) are all close to T.
So (l_1 ∨ l_2 ∨ l_3) is satisfied iff all the above paths are close to T.
[Figure: T, F, the literals l_1, l_2, l_3 and the attached taxa y_1, y_2, y_3.]
Construction Example
[Figure: example tree with internal nodes v_T and v_F, taxa T and F, edge labels α and 2β, and clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3.]
φ is satisfiable ⇒ there is a tree T which satisfies all bounds:
A1  τ_T(T, F) ≥ 2α + 2β
A2  for i = 1..n: τ_T(T; x_i ¬x_i) ≤ α; τ_T(F; x_i ¬x_i) ≤ α
B1  for j = 1..m: τ_T(y_j^1; l_j^2 l_j^3) ≤ α; τ_T(y_j^2; l_j^1 l_j^3) ≤ α; τ_T(y_j^3; l_j^1 l_j^2) ≤ α
B2  for j = 1..m: τ_T(y_j^1; T F) ≥ α; τ_T(y_j^2; T F) ≥ α; τ_T(y_j^3; T F) ≥ α
B3  for j = 1..m: τ_T(T; y_j^2 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^2) ≤ α
Hardness of Approximation Results
By "stretching" the close/far restrictions, the following problems are also shown NP-hard:
  Approximating Maximal Difference: finding a tree T s.t. ||τ, τ_T||_∞ ≤ 1.4 ||τ, τ_OPT||_∞.
  Approximating Maximal Distortion: finding a tree T s.t. MaxDist(τ, τ_T) ≤ C · MaxDist(τ, τ_OPT), for any constant C.
Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007.
Open Problems / Further Research
  Extend the hardness results to 3-dissimilarity tables induced by 2-dissimilarity matrices (τ(i; jk) = ½[D(i,j) + D(i,k) − D(j,k)]).
  Extend the hardness results to "natural-looking" trees (binary trees with constant-bounded edge weights).
  Check the performance of NJ when the neighbor-selection formula is computed from "real" 3-distances.
  Devise algorithms which use 3-distances as input.
  Does optimization of 3-dissimilarities lead to good topological accuracy under accepted models of sequence evolution? (It is known that optimization of 2-dissimilarities does not.)
Thank You
Distance-Based Phylogenetic Reconstruction
1. Compute distances between all taxon pairs.
2. Find an edge-weighted tree which best describes the distances.
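Optimization Criteria
Known measures of closeness between D and D_T:
  l_∞: the maximum over i, j of |D(i,j) − D_T(i,j)|
  l_p: (Σ over i, j of |D(i,j) − D_T(i,j)|^p)^(1/p)
  MaxDist: defined through ratios of corresponding entries (where 0/0 ≡ 1)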
The Reduction
3CNF formula φ → 3-dissimilarity table τ and error bound Δ
φ is satisfiable ⇔ there is a tree T s.t. ||τ, τ_T||_∞ ≤ Δ
If one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||_∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
The Reduction
φ → lower bounds τ_l and upper bounds τ_u, separated by 2Δ.
Define a set of lower and upper bounds:
A1  τ_T(T, F) ≥ 2α + 2β
A2  for i = 1..n: τ_T(T; x_i ¬x_i) ≤ α; τ_T(F; x_i ¬x_i) ≤ α
B1  for j = 1..m: τ_T(y_j^1; l_j^2 l_j^3) ≤ α; τ_T(y_j^2; l_j^1 l_j^3) ≤ α; τ_T(y_j^3; l_j^1 l_j^2) ≤ α
B2  for j = 1..m: τ_T(y_j^1; T F) ≥ α; τ_T(y_j^2; T F) ≥ α; τ_T(y_j^3; T F) ≥ α
B3  for j = 1..m: τ_T(T; y_j^2 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^2) ≤ α
The Reduction
3CNF formula φ → lower bounds τ_l and upper bounds τ_u, separated by 2Δ.
φ is satisfiable ⇔ there is a tree T s.t. τ_l ≤ τ_T ≤ τ_u
If one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ, τ_T||_∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
The Reduction
φ → lower bounds τ_l and upper bounds τ_u, separated by 2Δ.
1. Define the set of taxa.
2. Define a set of lower and upper bounds on some entries of τ_T. [φ is satisfiable ⇔ there is a tree T which satisfies all bounds]
3. Define Δ according to the slackness required for the proof of ⇒.
The Reduction
φ → lower bounds τ_l and upper bounds τ_u, separated by 2Δ.
Define the set of taxa:
  Taxa T, F.
  A taxon for every literal (x_i, ¬x_i).
  3 taxa for every clause (y_j^1, y_j^2, y_j^3).
The Analysis
A1  τ_T(T, F) ≥ 2α + 2β
A2  for i = 1..n: τ_T(T; x_i ¬x_i) ≤ α; τ_T(F; x_i ¬x_i) ≤ α
[Figure: tree with internal nodes v_φ, v_T, v_F and taxa T, F; edge labels β, β, ≥ α and ≤ α.]
Trees satisfying A1 and A2 imply a truth assignment to x_1, ..., x_n.
The Analysis
B1  for j = 1..m: τ_T(y_j^1; l_j^2 l_j^3) ≤ α; τ_T(y_j^2; l_j^1 l_j^3) ≤ α; τ_T(y_j^3; l_j^1 l_j^2) ≤ α
B2  for j = 1..m: τ_T(y_j^1; T F) ≥ α; τ_T(y_j^2; T F) ≥ α; τ_T(y_j^3; T F) ≥ α
B3  for j = 1..m: τ_T(T; y_j^2 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^2) ≤ α
B1 and B2 imply that y_j^a = l_j^b ∨ l_j^c for {a, b, c} = {1, 2, 3}.
B3 implies that at least two of y_j^1, y_j^2, y_j^3 are satisfied.
[Figure: the part of the tree near v_F and F, with literals l_1, l_2, the taxon y_3 and the node v_φ.]
There is a tree T which satisfies all bounds ⇒ φ is satisfiable.
The Reduction – τ(φ)
Bounds to be satisfied by the tree:
A1  τ_T(T, F) ≥ 2α + 2β
A2  for i = 1..n: τ_T(T; x_i ¬x_i) ≤ α; τ_T(F; x_i ¬x_i) ≤ α
B1  for j = 1..m: τ_T(y_j^1; l_j^2 l_j^3) ≤ α; τ_T(y_j^2; l_j^1 l_j^3) ≤ α; τ_T(y_j^3; l_j^1 l_j^2) ≤ α
B2  for j = 1..m: τ_T(y_j^1; T F) ≥ α; τ_T(y_j^2; T F) ≥ α; τ_T(y_j^3; T F) ≥ α
B3  for j = 1..m: τ_T(T; y_j^2 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^3) ≤ α; τ_T(T; y_j^1 y_j^2) ≤ α
[Figure: example tree with internal nodes v_T and v_F, taxa T and F, edge labels α and 2β, and clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3.]
The corresponding entries of the table τ(φ):
A1  τ(T, F) = 2α + 3β
A2  for i = 1..n: τ(T; x_i ¬x_i) = α − β; τ(F; x_i ¬x_i) = α − β
B1  for j = 1..m: τ(y_j^1; l_j^2 l_j^3) = α − β; τ(y_j^2; l_j^1 l_j^3) = α − β; τ(y_j^3; l_j^1 l_j^2) = α − β
B2  for j = 1..m: τ(y_j^1; T F) = α + β; τ(y_j^2; T F) = α + β; τ(y_j^3; T F) = α + β
B3  for j = 1..m: τ(T; y_j^2 y_j^3) = α − β; τ(T; y_j^1 y_j^3) = α − β; τ(T; y_j^1 y_j^2) = α − β
Other 2-distances: τ(s, t) = 2α + 2β
Other 3-distances: τ(s; tu) = α + 2β
In our constructed tree:
  All 2-distances are in [2α, 2α + 2β].
  All 3-distances are in [α, α + 2β].
Δ = β.
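The relation between the table entries and the bounds A1 to B3 is simple interval arithmetic: any tree value within Δ = β of a table entry must satisfy the corresponding bound. The sketch below checks this for the five constrained entry types; the function name and the sample values of α and β are illustrative assumptions, and the "other" entries are deliberately left out.

```python
from fractions import Fraction

def implied_bounds(alpha, beta):
    """For each constrained entry type, check that every value within
    delta = beta of the table entry satisfies the stated bound."""
    delta = beta
    # name: (table entry, bound direction, bound value)
    entries = {
        "A1": (2 * alpha + 3 * beta, ">=", 2 * alpha + 2 * beta),
        "A2": (alpha - beta, "<=", alpha),
        "B1": (alpha - beta, "<=", alpha),
        "B2": (alpha + beta, ">=", alpha),
        "B3": (alpha - beta, "<=", alpha),
    }
    result = {}
    for name, (tau, direction, bound) in entries.items():
        lowest, highest = tau - delta, tau + delta
        if direction == ">=":
            result[name] = lowest >= bound   # worst case is the low end
        else:
            result[name] = highest <= bound  # worst case is the high end
    return result

# Illustrative parameter values (exact arithmetic avoids rounding issues).
print(implied_bounds(Fraction(10), Fraction(1)))
```

Each bound is met with equality at the worst case, so Δ = β is exactly the slack these entries allow.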