Phylogenetic Trees Lecture 2


Phylogenetic Trees Lecture 2. Based on: Durbin et al., Sections 7.3, 7.4, 7.8.

The Four Points Condition. Theorem: A set M of L objects is additive iff any subset of four objects can be labeled i, j, k, l so that d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l). We call {{i,j},{k,l}} the "split" of {i,j,k,l}. The four points condition does not by itself provide an algorithm to construct a tree from a distance matrix, or even to decide whether such a tree exists. The first methods for constructing trees for additive sets were neighbor joining methods:
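The condition itself is easy to test directly. The following Python sketch (illustrative only; the helper names are our own) checks every quadruple of a symmetric distance matrix, using the equivalent formulation that the two largest of the three pairings' sums must coincide:

```python
from itertools import combinations

def four_point_ok(d, i, j, k, l, tol=1e-9):
    # Among d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k),
    # the two largest sums must be equal (up to numerical tolerance).
    sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol

def is_additive(d, tol=1e-9):
    """d is a symmetric matrix (nested lists or similar). Returns True iff
    every quadruple of indices satisfies the four points condition."""
    n = len(d)
    return all(four_point_ok(d, i, j, k, l, tol)
               for i, j, k, l in combinations(range(n), 4))
```

This is an O(L^4) test; it certifies additivity but, as the slide notes, does not by itself produce the tree.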

Constructing additive trees: the neighbor joining problem. Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex. Since d(i,m) = d(i,k) + d(k,m) and d(j,m) = d(j,k) + d(k,m), we get d(k,m) = ½ (d(i,m) + d(j,m) - d(i,j)), so we can compute the distance of k to every other leaf. This suggests the following method to construct a tree from a distance matrix: find neighboring leaves i, j in the tree; replace i, j by their parent k and recursively construct a tree T for the smaller set; finally add i, j as children of k in T.
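As a one-line illustration of that reduction step (a sketch; the function name is ours):

```python
def parent_distance(d_im, d_jm, d_ij):
    """Distance from the parent k of neighboring leaves i, j to another vertex m,
    obtained by adding d(i,m) = d(i,k)+d(k,m) and d(j,m) = d(j,k)+d(k,m)
    and subtracting d(i,j) = d(i,k)+d(j,k)."""
    return 0.5 * (d_im + d_jm - d_ij)
```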

Neighbor Finding. How can we find, from distances alone, a pair of nodes which are neighboring leaves? Note that the closest pair of leaves is not necessarily a pair of neighboring leaves: in a four-leaf tree on A, B, C, D, two leaves that hang on short edges on opposite sides of the internal edge can be the closest pair without being neighbors. Next we show one way to find neighbors from distances.

Neighbor Finding: the Saitou & Nei method. Theorem [Saitou & Nei]: Assume all edge weights are positive. If D(i, j) (the corrected distance defined below) is minimal among all pairs of leaves, then i and j are neighboring leaves in the tree. The proof is rather involved!

Neighbor Joining Algorithm. Initialization: set L to contain all leaves. Iteration: choose i, j such that D(i, j) is minimal; create a new node k and set d(k, m) = ½ (d(i, m) + d(j, m) - d(i, j)) for every other node m; remove i, j from L, and add k. Termination: when |L| = 2, connect the two remaining nodes.
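A minimal Python sketch of this loop is given below. It is illustrative, not the authors' code: it assumes the Saitou & Nei corrected distance of the next slide as the selection criterion, uses the d(k, m) update above, and returns only the order in which leaves are merged (branch lengths are omitted).

```python
import numpy as np

def neighbor_joining(d, names):
    """d: symmetric array of pairwise leaf distances; names: leaf labels.
    Repeatedly merges the pair minimizing D(i,j) = d(i,j) - (r_i + r_j)/(L-2)
    and returns the resulting topology as nested tuples."""
    d = np.asarray(d, dtype=float)
    nodes = list(names)
    while len(nodes) > 2:
        L = len(nodes)
        r = d.sum(axis=1)                      # row sums r_i = sum_k d(i, k)
        D = d - np.add.outer(r, r) / (L - 2)   # corrected distances
        np.fill_diagonal(D, np.inf)            # never pair a node with itself
        i, j = np.unravel_index(np.argmin(D), D.shape)
        dk = 0.5 * (d[i, :] + d[j, :] - d[i, j])   # distances to the new node k
        keep = [m for m in range(L) if m not in (i, j)]
        new_d = np.zeros((L - 1, L - 1))
        new_d[:-1, :-1] = d[np.ix_(keep, keep)]
        new_d[-1, :-1] = new_d[:-1, -1] = dk[keep]
        d = new_d
        nodes = [nodes[m] for m in keep] + [(nodes[i], nodes[j])]
    return (nodes[0], nodes[1])
```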

Saitou & Nei’s idea: Let D(i, j) = d(i, j) - (r_i + r_j)/(L - 2), where r_i = Σ_k d(i, k); the divisor L - 2 is crucial! Example: take a five-leaf tree with leaf edges a, b, d, f, g (leading to leaves 1, 3, 2, 4, 5 respectively) and internal edges c and e. Then D12 = (a+c+d) - (1/3)(a+b + a+c+d + a+c+e+f + a+c+e+g + d+c+a + d+c+b + d+e+f + d+e+g) and D13 = (a+b) - (1/3)(a+b + a+c+d + a+c+e+f + a+c+e+g + b+a + b+c+d + b+c+e+f + b+c+e+g). Hence D12 - D13 = (4/3)c.
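The identity D12 - D13 = (4/3)c is easy to verify numerically. The sketch below is our own illustration: it assigns arbitrary positive lengths to the seven edges a..g, writes out the ten pairwise leaf distances of the example tree, and compares the two quantities.

```python
import random

# Arbitrary positive lengths for the seven edges of the example tree.
a, b, c, d_, e, f, g = (random.uniform(0.5, 5.0) for _ in range(7))

# Pairwise distances between the five leaves 1..5 of the example tree.
dist = {
    (1, 2): a + c + d_,      (1, 3): a + b,
    (1, 4): a + c + e + f,   (1, 5): a + c + e + g,
    (2, 3): d_ + c + b,      (2, 4): d_ + e + f,   (2, 5): d_ + e + g,
    (3, 4): b + c + e + f,   (3, 5): b + c + e + g,
    (4, 5): f + g,
}

def d(i, j):
    return 0.0 if i == j else dist[(min(i, j), max(i, j))]

L = 5
r = {i: sum(d(i, k) for k in range(1, 6)) for i in range(1, 6)}

def D(i, j):
    return d(i, j) - (r[i] + r[j]) / (L - 2)

print(D(1, 2) - D(1, 3), (4 / 3) * c)   # the two values agree up to rounding
```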

Saitou & Nei’s proof. Notation used in the proof: p(i, j) denotes the path from vertex i to vertex j; for example, in the figure's four-leaf tree, P(D, C) = (e1, e2, e3) passes through D, E, F, C. For a vertex i and an edge e, N_i(e) = |{k : e is on p(i, k), k is a leaf}|; e.g., N_D(e1) = 3, N_D(e2) = 2, N_D(e3) = 1, N_C(e1) = 1.

Saitou & Nei’s proof: crucial observation (given as a figure, showing leaves i and j, the vertices k and l on the path between them, and the rest of T).

Saitou & Nei’s proof. Notation: |T| = #(leaves in T). Proof of the theorem: assume, for contradiction, that D(i, j) is minimized for leaves i, j which are not neighboring leaves. Let (i, l, ..., k, j) be the path from i to j, and let T1 and T2 be the subtrees rooted at k and l which do not contain edges of P(i, j) (see figure). Recall that r_i = Σ_m d(i, m) (summing over all leaves m), so (L-2)D(i, j) = (L-2)d(i, j) - (r_i + r_j); since L-2 > 0, minimizing D(i, j) is the same as minimizing (L-2)D(i, j), and the inequalities below are written in this scaled form.

Saitou & Nei’s proof, Case 1: i or j has a neighboring leaf. WLOG j has a neighboring leaf m.
A. (L-2)[D(i,j) - D(m,j)] = (L-2)(d(i,j) - d(j,m)) - (r_i + r_j) + (r_m + r_j) = (L-2)(d(i,k) - d(k,m)) + r_m - r_i.
B. r_m - r_i ≥ (L-2)(d(k,m) - d(i,l)) + (4-L)d(k,l) (since for each edge e ∈ P(k,l), N_m(e) ≥ 2 and N_i(e) ≤ L-2).
Substituting B in A: (L-2)[D(i,j) - D(m,j)] ≥ (L-2)(d(i,k) - d(i,l)) + (4-L)d(k,l) = 2d(k,l) > 0 (using d(i,k) - d(i,l) = d(l,k)), contradicting the minimality assumption.

Saitou & Nei’s proof, Case 2: not Case 1. Then both T1 and T2 contain a pair of neighboring leaves. WLOG |T2| ≥ |T1|. Let n, m be neighboring leaves in T1, with common parent p. We shall prove that D(m, n) < D(i, j), which again contradicts the minimality assumption.

Saitou & Nei’s proof, Case 2 (continued):
A. 0 ≤ (L-2)[D(m,n) - D(i,j)] = (L-2)(d(m,n) - d(i,j)) + (r_i + r_j) - (r_m + r_n).
B. r_j - r_m < (L-2)(d(j,k) - d(m,p)) + (|T1| - |T2|)d(k,p).
C. r_i - r_n < (L-2)(d(i,k) - d(n,p)) + (|T1| - |T2|)d(l,p).
Adding B and C, and noting that d(l,p) > d(k,p):
D. (r_i + r_j) - (r_m + r_n) < (L-2)(d(i,j) - d(n,m)) + 2(|T1| - |T2|)d(l,p).
Substituting D in the right-hand side of A: (L-2)[D(m,n) - D(i,j)] < 2(|T1| - |T2|)d(l,p) ≤ 0, as claimed. QED

A simpler neighbor finding method: select an arbitrary node r. For each pair of labeled nodes (i, j), let C(i, j) be the distance from r to the vertex at which the paths from r to i and from r to j diverge; equivalently, C(i, j) = ½ (d(r, i) + d(r, j) - d(i, j)). Claim: let i, j be such that C(i, j) is maximized; then i and j are neighboring leaves.
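A small sketch of this criterion (our own helper names; r is taken to be one of the labeled nodes and d is indexed as d[i][j]):

```python
def meeting_depth(d, r, i, j):
    """C(i, j): how far from r the paths r -> i and r -> j stay together,
    in an additive (tree) metric d."""
    return 0.5 * (d[r][i] + d[r][j] - d[i][j])

def deepest_pair(d, r, nodes):
    """Return the pair (i, j), with i, j != r, maximizing C(i, j).
    By the claim above, such i and j are neighboring leaves."""
    cand = [x for x in nodes if x != r]
    pairs = [(i, j) for a, i in enumerate(cand) for j in cand[a + 1:]]
    return max(pairs, key=lambda p: meeting_depth(d, r, p[0], p[1]))
```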

Neighbor Joining Algorithm (using C): set M to contain all leaves, and select a root r; |M| = L. If L = 2, return the tree on the two remaining vertices. Iteration: choose i, j such that C(i, j) is maximal; create a new vertex k and set d(k, m) = ½ (d(i, m) + d(j, m) - d(i, j)) for every other node m; remove i, j, and add k to M. Recursively construct a tree on the smaller set, then add i, j as children of k, at distances d(i, k) = d(r, i) - C(i, j) and d(j, k) = d(r, j) - C(i, j) (k is the vertex where the paths from r to i and to j diverge).
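Below is a recursive sketch of this variant, again only illustrative: it assumes a dict-of-dicts distance table keyed by node labels, keeps r out of every merge, and returns the reconstructed tree as a list of (parent, child, length) edges. All helper names are ours.

```python
from itertools import combinations, count

def nj_with_root(d, nodes, r, ids=count()):
    """Recursive, C-based neighbor joining. Modifies d in place by adding
    rows/columns for the internal vertices it creates ("u0", "u1", ...)."""
    if len(nodes) == 2:
        a, b = nodes
        return [(a, b, d[a][b])]
    def C(i, j):
        return 0.5 * (d[r][i] + d[r][j] - d[i][j])
    cand = [x for x in nodes if x != r]
    i, j = max(combinations(cand, 2), key=lambda p: C(p[0], p[1]))
    k = f"u{next(ids)}"                      # new internal vertex
    d[k] = {}
    for m in nodes:
        if m not in (i, j):
            d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
    d_ik = d[r][i] - C(i, j)                 # edge lengths to the two children
    d_jk = d[r][j] - C(i, j)
    smaller = [x for x in nodes if x not in (i, j)] + [k]
    return nj_with_root(d, smaller, r, ids) + [(k, i, d_ik), (k, j, d_jk)]
```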

Complexity of the Neighbor Joining Algorithm, naive implementation. Initialization: Θ(L²) to compute the C(i, j)’s. Each iteration: O(L) to update {C(i, k) : i ∈ L} for the new node k, and O(L²) to find the maximal C(i, j). Total: O(L³).

Complexity of the Neighbor Joining Algorithm, using a heap to store the C(i, j)’s. Initialization: Θ(L²) to compute and heapify the C(i, j)’s. Each iteration: O(1) to find the maximal C(i, j), and O(L log L) to delete {C(m, i), C(m, j)} and add C(m, k) for all vertices m. Total: O(L² log L). (Implementation details are omitted.)

Ultrametric trees as special weighted trees. Definition: an ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth (e.g., the tree produced by UPGMA). The edge weights can be represented by the distances (heights) of the internal vertices from the leaves. Note: each internal vertex has at least two children.

Ultrametric trees give a more recent (and more efficient) way of constructing and identifying additive trees. Idea: reduce the problem to constructing trees from the "heights" of the internal nodes. For leaves i, j, D(i, j) represents the "height" of the common ancestor of i and j.

Ultrametric Trees. Definition: T is an ultrametric tree for a symmetric positive real matrix D (called an ultrametric matrix) if: (1) the leaves of T correspond to the rows and columns of D; (2) internal nodes have at least two children, and the least common ancestor of leaves i and j is labeled by D(i, j); (3) the labels decrease along every path from the root to a leaf.

Centrality of Ultrametric Trees. We will study later the following question: given a symmetric positive real matrix D, is there an ultrametric tree T for D? But first we show how ultrametric trees can be used to construct trees for additive sets and to solve other related problems.

Transforming Ultrametric Trees to Weighted Trees. Use the labels to define weights for the edges in the natural way: the weight of an edge is the difference between the labels of its endpoints (taking the labels of the leaves to be 0). We get an additive ultrametric tree whose height is the label of the root. Note that in this tree all leaves are at the same depth; this is why it is called ultrametric.

Transforming Weighted Trees to Ultrametric Trees. A weighted tree T can be transformed into an ultrametric tree T' as follows. Step 1: pick a node k as a root, and "hang" the tree at k. (In the running example, T has leaves a, b, c, d and we pick k = a.)

Transforming Weighted Trees to Ultrametric Trees. Step 2: let M = max_i d(i, k); M is taken to be the height of T'. Label the root k by M, and label each internal node j by M - d(k, j). In the example, k = a and M = 9.

Transforming Weighted Trees to Ultrametric Trees. Step 3: "stretch" the edge of each leaf i by M - d(k, i), so that all leaves end up at distance M from the root k. In the example (M = 9), the stretch amounts are a: 9, b: 6, c: 0, d: 2.

Re-constructing Weighted Trees from Ultrametric Trees. The weight of an internal edge is the difference between the labels (heights) of its endpoints. Assume that the distance matrix D = [d(i, j)] of the original unrooted tree is given. The weight of the edge to leaf i is obtained by subtracting the stretch amount M - d(k, i) from its current weight; this is correct because, for an internal node m on the path from k to i, (M - d(k, m)) - (M - d(k, i)) = d(i, m).

How D' is constructed from D. D'(i, j) should be the height of the least common ancestor of i and j in T', the ultrametric tree hung at k. Let M = max_i d(i, k), and let m be the LCA of i and j in T'. Then D'(i, j) = M - d(k, m), where d(k, m) = ½ (d(k, i) + d(k, j) - d(i, j)). Note that this can be computed from the distance matrix alone, without the additive tree!
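A small sketch of this transformation (function name and indexing are our own; the 4x4 matrix in the usage lines reproduces the running example's distances as far as they can be read off the slides):

```python
import numpy as np

def to_ultrametric(d, k):
    """D'(i, j) = M - 1/2 (d(k,i) + d(k,j) - d(i,j)), with M = max_i d(i, k);
    d is a symmetric additive distance matrix, k the index of the chosen root."""
    d = np.asarray(d, dtype=float)
    M = d[k].max()
    Dp = M - 0.5 * (np.add.outer(d[k], d[k]) - d)
    np.fill_diagonal(Dp, 0.0)   # diagonal set to 0 by convention
    return Dp

D = [[0, 3, 9, 7],
     [3, 0, 8, 6],
     [9, 8, 0, 4],
     [7, 6, 4, 0]]            # leaves a, b, c, d
print(to_ultrametric(D, k=0))  # pairs with a -> 9; {b,c}, {b,d} -> 7; {c,d} -> 3
```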

The transformation of D to D' in the running example (k = a, M = 9):

Distance matrix D:
      a  b  c  d
  a   0  3  9  7
  b   3  0  8  6
  c   9  8  0  4
  d   7  6  4  0

Ultrametric matrix D':
      a  b  c  d
  a   0  9  9  9
  b   9  0  7  7
  c   9  7  0  3
  d   9  7  3  0

Identifying Ultrametric Trees. Definition: a distance matrix D is ultrametric if for every three indices i, j, k: D(i, j) ≤ max {D(i, k), D(j, k)} (equivalently, among the three pairwise values the maximum is attained at least twice, i.e., there is a tie for the maximum). Theorem (U): D has an ultrametric tree iff D is ultrametric (to be proved later).
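This three-point condition is straightforward to test; a sketch (our own helper):

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """True iff for every triple i, j, k the maximum of D(i,j), D(i,k), D(j,k)
    is attained at least twice (the three-point condition)."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        x, y, z = sorted([D[i][j], D[i][k], D[j][k]])
        if z - y > tol:
            return False
    return True
```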

Theorem: D is an additive distance matrix if and only if D' is an ultrametric matrix. Note that the construction of D' is independent of the additive tree. Proof. (⇒) Use the conversion from an additive tree to an ultrametric tree, and Theorem (U). (⇐) Use Theorem (U) and the conversion from an ultrametric tree to an additive tree, and check that the additive tree indeed realizes the distance matrix.

Solving the Additive Tree Problem via the Ultrametric Problem: Outline. We solve the additive tree problem by reducing it to the ultrametric problem as follows. Given an input matrix D = D(i, j) of distances, transform it to a matrix D' = D'(i, j), where D'(i, j) is the height of the least common ancestor of i and j in the corresponding ultrametric tree T' (if D' is not ultrametric, then the input matrix is not additive!). Construct the ultrametric tree T' for D'. Reconstruct the additive tree T from T'.

LCA and distances in an ultrametric tree. Let LCA(i, j) denote the lowest common ancestor of leaves i and j, let D(i, j) be the height (label) of LCA(i, j), and let dist(i, j) be the distance from i to j in the tree. Claim: for any pair of leaves i, j in an ultrametric tree, D(i, j) = ½ dist(i, j).

Identifying Ultrametric Distances. Definition: a distance matrix D of dimension L×L is ultrametric iff for every three indices i, j, k: D(i, j) ≤ max {D(i, k), D(j, k)}. Theorem (U): the following conditions are equivalent for an L×L symmetric matrix D: (1) D is ultrametric; (2) there is an ultrametric tree on L leaves such that for each pair of leaves i, j: D(i, j) = height(LCA(i, j)) = ½ dist(i, j). Note: the condition D(i, j) ≤ max {D(i, k), D(j, k)} is easier to check than the four points condition; therefore the theorem implies that ultrametric sets are easier to characterize than additive sets.

Properties of ultrametric matrices used in the proof of Theorem (U). Definition: let D be an L×L matrix, and let S ⊆ {1, ..., L}; D[S] is the submatrix of D consisting of the rows and columns with indices from S. Claim 1: D is ultrametric iff for every S ⊆ {1, ..., L}, D[S] is ultrametric. Claim 2: if D is ultrametric and max_{i,j} D(i, j) = m, then m appears in every row of D: if D(i, j) = m then, for any other index k, at least one of D(i, k), D(j, k) must equal m.

Ultrametric tree Ultrametric matrix There is an ultrametric tree s.t. D(i, j) = ½ dist(i, j).  D is an ultrametric matrix: By properties of Least Common Ancestors in trees D(k, i) = D(j, i) ≥ D(k, j) i k j

Ultrametric matrix ⇒ ultrametric tree. Proof that if D is an ultrametric matrix then D has an ultrametric tree, by induction on L, the size of D. Basis: L = 1: T is a single leaf. L = 2: T is a tree with two leaves i, j whose root is labeled D(i, j).

Induction step. Inductive hypothesis: assume the claim holds for sizes 1, 2, ..., L-1. Induction step, L > 2: let m = m1 be the maximum distance in D, let m1 > m2 > ... > mk be the distinct values appearing in row 1, and let Si = {l : D(1, l) = mi}, so that {S1, S2, ..., Sk} is a partition of the leaves other than 1 into k classes (note: |S1| > 0, since by Claim 2 the maximum m appears in every row, in particular in row 1). By Claim 1, the submatrices D[Si], i = 1, 2, ..., k, are all ultrametric, and hence by the inductive hypothesis we can construct a tree T1 for S1, attached at the root labeled m, and trees Ti for Si attached at nodes labeled mi < m, for i = 2, ..., k (if mi = 0 then Ti is a single leaf).

Notice that on any ultrametric tree for D, the path from the root to the leaf "1" must have exactly k+1 nodes, where k is the number of classes. Each node on this path must be labeled by one of the distinct entries of row 1, and those labels must appear in decreasing order along the path. (The figure shows an example with leaves 1–8 and three classes, with the corresponding subtrees T1, T2, T3 hanging off the path.)

Correctness proof. By the inductive hypothesis, the Ti’s are all ultrametric trees, and we assemble them along the path from the root to leaf "1" to form the tree T. To prove that T is an ultrametric tree for D, we need to check that D(i, j) is the label of the LCA of i and j in T. If i and j are in the same subtree, this holds by induction; otherwise, the LCA is the path node at which the higher of the two subtrees attaches, and its label is indeed D(i, j). QED
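The induction translates directly into a recursive construction. The sketch below is our own rendering of it (0-based integer leaf ids; an internal node is a (label, children) tuple, a bare id is a leaf); it assumes D is a valid ultrametric matrix.

```python
def build_ultrametric_tree(D, leaves=None):
    """Build an ultrametric tree for an ultrametric matrix D (nested lists,
    D[i][j] symmetric). Splits the leaves other than the first one into
    classes by their distance to it, recurses on each class, and hangs the
    resulting subtrees on the path from the root down to that first leaf."""
    if leaves is None:
        leaves = list(range(len(D)))
    first, rest = leaves[0], leaves[1:]
    if not rest:
        return first                                   # a single leaf
    classes = {}
    for l in rest:                                     # group by D(first, l)
        classes.setdefault(D[first][l], []).append(l)
    node = first                                       # start at the leaf itself
    for value in sorted(classes):                      # build the path bottom-up
        subtree = build_ultrametric_tree(D, classes[value])
        # if the subtree's root carries the same label, merge it into the path node
        same = isinstance(subtree, tuple) and subtree[0] == value
        children = [node] + (list(subtree[1]) if same else [subtree])
        node = (value, children)
    return node                                        # root, labeled by the maximum
```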

Complexity analysis. Let f(L) be the time complexity for an L×L matrix; f(1) = f(2) = constant. For L > 2: constructing S1 and S2 takes O(L); let |S1| = k and |S2| = L-k. Constructing T1 and T2 takes f(k) + f(L-k), and joining T1 and T2 into T takes constant time. Thus f(L) ≤ max_k [ f(k) + f(L-k) ] + cL, for 0 < k < L, and f(L) = cL² satisfies this recurrence, so the construction runs in O(L²) time. An appropriate data structure is needed to achieve this bound!

Recall: identifying Additive Trees via Ultrametric trees We solve the additive tree problem by reducing it to the ultrametric problem as follows: Given an input matrix D = D(i, j) of distances, transform it to a matrix D’= D’(i, j), where D’(i, j) is the height of the LCA of i and j in the corresponding ultrametric tree T’. Construct the ultrametric tree, T’, for D’. Reconstruct the additive tree T from T’.

How D' is constructed from D (recalled). D'(i, j) should be the height of the least common ancestor of i and j in T', the ultrametric tree hung at k. Thus D'(i, j) = M - d(k, m), where m is the LCA of i and j and d(k, m) = ½ (d(k, i) + d(k, j) - d(i, j)).

The transformation D → D' → T' → T, illustrated on the running example (k = a, M = 9), with the distance matrix D and the ultrametric matrix D' as shown above.