1
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology
Bioinformatics Algorithms and Data Structures
Chapter 17.4-6: Strings and Evolutionary Trees
Lecturer: Dr. Rose
Slides by: Dr. Rose
April 10, 2007
2
Ultrametric Problem Centrality
Four related tree problems:
1. Ultrametric
2. Additive
3. Binary perfect phylogeny
4. Tree compatibility
All can be solved as ultrametric tree problems. Recall that tree compatibility reduces to perfect phylogeny. We now reduce the additive tree and (binary) perfect phylogeny problems to the ultrametric tree problem.
3
Ultrametric Problem: Additive Trees
Goal: reduce the additive tree problem to the ultrametric problem.
Complexity: O(n^2) reduction.
Approach: create a matrix D' that is ultrametric if and only if D is additive.
We start by describing a reduction that involves a tree T for D and a tree T' for D'. We then describe a direct reduction of D to D'.
4
Ultrametric Problem: Additive Trees
Assume that D is additive.
Assume that we know an additive tree T for D.
Assume that each of the n taxa in D labels a leaf of T.
Idea: label the nodes of T to create an ultrametric tree T'.
Q: How can we do this?
5
Ultrametric Problem: Additive Trees
A: We will do the following:
– Select one node as the root.
– Stretch the leaf edges so that all leaves are equidistant from the root.
Let v be the row of D containing the largest entry. Let m_v denote the value of this entry. Select node v as the root of T. This creates a directed tree.
6
Ultrametric Problem: Additive Trees
Example: A is the row of D containing the largest entry. Select node A as the root of T.
7
Ultrametric Problem: Additive Trees
Stretch leaf edges:
– For each leaf i, add m_A - D(A, i) to the leaf edge.
– All leaves are now equidistant from A.
8
Ultrametric Problem: Additive Trees
The resulting tree T' is:
– a rooted, edge-weighted tree
– at distance m_v from the root to every leaf
– such that each internal node is equidistant from the leaves in its subtree.
9
Ultrametric Problem: Additive Trees
Since each internal node is equidistant from the leaves in its subtree:
– Label each internal node by this unique distance.
– These labels can be used to define an ultrametric matrix D'.
– D'(i, j) is the label at the least common ancestor of leaves i and j in T'.
Q: How can we go directly from matrix D to matrix D' without involving T and T'?
10
Ultrametric Problem: Additive Trees
Consider leaves i and j in T:
– Let node w be their least common ancestor.
– Let x be the distance from the root v to w.
– Let y be the distance from node w to leaf i.
11
Ultrametric Problem: Additive Trees
Q: What is the distance from w to i in T'?
A: y + m_v - D(v, i).
Q: Where does m_v - D(v, i) come from?
A: Recall that we add m_v - D(v, i) to stretch the leaf edge into i.
12
Ultrametric Problem: Additive Trees
Gusfield presents the following lemma: without knowing T or T' explicitly, we can deduce that
D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
Q: Is this equation correct?
Let z be the distance from w to leaf j in T, so that D(i, j) = y + z, D(v, i) = x + y, and D(v, j) = x + z. Then
D'(i, j) = m_v + ((y + z) - (x + y) - (x + z))/2 = m_v - 2x/2 = m_v - x
Q: Should it instead be D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j), i.e., D'(i, j) = 2m_v - 2x?
A: No. 2m_v - 2x is the length of the path between leaves i and j in T', which equals 2D'(i, j). With D'(i, j) defined as the label at the least common ancestor (slide 9), m_v - x is correct. Either way, the factor of two is not necessary for the reduction.
13
Ultrametric Problem: Additive Trees
This brings us to the following theorem.
Theorem: If D is an additive matrix, then D' is ultrametric, where
D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
Proof. We use:
D'(i, j) = y + m_v - D(v, i)
y = D(v, i) - x
x = (D(v, i) + D(v, j) - D(i, j))/2
Putting it all together establishes the equation in the theorem. D' therefore satisfies the ultrametric requirement.
14
Ultrametric Problem: Additive Trees
Q: What is the value of y?
A: y = D(v, i) - x.
Q: What is the value of x in terms of values in D?
A: x = (D(v, i) + D(v, j) - D(i, j))/2
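The direct reduction from D to D' can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the function names `to_ultrametric` and `is_ultrametric`, the 4-taxon matrix, and the convention of putting 0 on the diagonal of D' are all assumptions made for the example. The ultrametric test used is the three-point condition (in every triple, the largest distance occurs at least twice).

```python
from itertools import combinations


def to_ultrametric(D):
    """Build D' from D using D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2."""
    n = len(D)
    # v is the (first) row containing the largest entry; m_v is that entry.
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    Dp = [[0.0] * n for _ in range(n)]  # diagonal kept at 0 (assumed convention)
    for i in range(n):
        for j in range(n):
            if i != j:
                Dp[i][j] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return Dp, v, m_v


def is_ultrametric(M):
    """Three-point condition: the two largest values in every triple are equal."""
    for i, j, k in combinations(range(len(M)), 3):
        _, b, c = sorted((M[i][j], M[i][k], M[j][k]))
        if b != c:
            return False
    return True


# Hypothetical additive matrix for taxa A, B, C, D (from a small quartet tree).
D = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
Dp, v, m_v = to_ultrametric(D)
print(v, m_v)
print(Dp)
print(is_ultrametric(D), is_ultrametric(Dp))
```

Both loops are quadratic in the number of taxa, which matches the O(n^2) cost claimed for the reduction; the ultrametric check is cubic and is included only as a sanity test.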
15
Ultrametric Problem: Additive Trees
So: D additive ⇒ D' ultrametric.
By contraposition: D' not ultrametric ⇒ D not additive.
Q: Does D' ultrametric ⇒ D additive?
A: Theorem: D' ultrametric ⇒ D additive.
Proof (constructive). Let T'' be the ultrametric tree for D'. Assign weights to the edges of T'':
– Note: the sum of the edge weights from a leaf to an ancestor must match the ancestor's label.
– For each edge (p, q), assign the weight |p - q|, where p and q denote the labels of the edge's endpoints (a leaf has label 0).
16
Ultrametric Problem: Additive Trees
Assign weights to the edges of T'' (continued):
– Note that the path distance between leaves i and j is twice the value labeling their least common ancestor.
– Hence, 2D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j).
– Now shrink the edge into each leaf i by m_v - D(v, i).
– The path from leaf i to leaf j now has length D(i, j).
The result is an additive tree for matrix D, obtained from the ultrametric tree for D'. Putting all of this together yields a method for constructing an additive tree for an additive matrix.
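The shrinking step can be checked directly from the formula for D'. The short derivation below is a worked check using only the symbols already defined on these slides; it subtracts the two shrink amounts from the leaf-to-leaf path length in the ultrametric tree.

```latex
\begin{align*}
2D'(i,j) &= 2m_v + D(i,j) - D(v,i) - D(v,j) \\
2D'(i,j) - \bigl(m_v - D(v,i)\bigr) - \bigl(m_v - D(v,j)\bigr)
  &= D(i,j)
\end{align*}
```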
17
Ultrametric Problem: Additive Trees
Additive Tree Algorithm
– Create matrix D' from D.
– Create the ultrametric tree T'' from D'.
– Create T from T'':
  – Label each edge (p, q) with the value |p - q|.
  – For each leaf i, shrink the leaf edge by m_v - D(v, i).
Note: no step takes more than O(n^2) time.
Thm. An additive tree for an additive matrix can be constructed in O(n^2) time.
18
Ultrametric Problem: Additive Trees
Example: Given D, first find D'.
Recall: D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
19
Ultrametric Problem: Additive Trees
Example: From D', find T''.
Recall: label each inner edge (p, q) with |p - q|.
20
Ultrametric Problem: Additive Trees
Example: From T'', find T.
Recall: shrink the edge into leaf i by m_v - D(v, i).
21
Ultrametric Problem: Additive Trees
Example: Finally, compare the derived T with the original tree as a sanity check.
22
Ultrametric Problem: Perfect Phylogeny
We now recast perfect phylogeny as an ultrametric tree problem.
Defn. D_M: the n by n matrix of shared characters.
More formally: given the n by m character matrix M, define the n by n matrix D_M as follows: for each pair of objects p and q, set D_M(p, q) to the number of characters that p and q both possess.
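The definition of D_M translates directly into code. The sketch below assumes M is given as a list of 0/1 rows (one row per object, one column per character); the function name `shared_character_matrix` and the sample matrix are hypothetical and chosen only for illustration.

```python
def shared_character_matrix(M):
    """Return D_M, where D_M[p][q] counts the characters that p and q both possess."""
    n = len(M)
    return [
        [sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
        for p in range(n)
    ]


# Hypothetical character matrix: 4 objects (rows) by 3 binary characters (columns).
M = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
D_M = shared_character_matrix(M)
for row in D_M:
    print(row)
```

With this definition the diagonal entry D_M(p, p) is simply the number of characters that object p possesses.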
23
Ultrametric Problem: Perfect Phylogeny
Lemma: If M has a perfect phylogeny, then D_M is a min-ultrametric matrix.
Proof: convert M's perfect phylogeny T into a min-ultrametric tree for D_M.
– Let T be the perfect phylogeny for M.
– Label T's root with zero.
– Traverse T from top to bottom. For each node v:
  – Let p_v be the number labeling node v's parent.
  – Let e_v be the number of characters labeling the edge into v.
  – Label node v with the sum p_v + e_v.
24
Ultrametric Problem: Perfect Phylogeny
– The label of node v is the number of characters common to all leaves in the subtree rooted at v.
– If v is the immediate parent of leaves p and q, then the label of v is D_M(p, q).
– The numbers labeling the nodes on any path from the root are strictly increasing.
The result is a min-ultrametric tree for matrix D_M.
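One way to test the conclusion of the lemma on a small input is the three-point form of the min-ultrametric condition: in every triple of objects, the smallest of the three pairwise values occurs at least twice. The sketch below assumes that characterization; the function name `is_min_ultrametric` and both sample matrices are hypothetical.

```python
from itertools import combinations


def is_min_ultrametric(D):
    """Check that in every triple the two smallest pairwise values are equal."""
    for p, q, r in combinations(range(len(D)), 3):
        a, b, _ = sorted((D[p][q], D[p][r], D[q][r]))
        if a != b:
            return False
    return True


# D_M for a character matrix whose character sets are nested or disjoint.
D_good = [
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 0],
]
# D_M for M = [[1, 1], [1, 0], [0, 1]], whose two characters overlap without nesting.
D_bad = [
    [2, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
print(is_min_ultrametric(D_good), is_min_ultrametric(D_bad))
```

A failed check shows, by the lemma, that M has no perfect phylogeny; a passed check is not sufficient on its own, so an attempt to place the characters on the tree's edges is still needed.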
25
Ultrametric Problem: Perfect Phylogeny
Algorithm: perfect phylogeny via ultrametrics
1. Create matrix D_M from M.
2. Attempt to create a min-ultrametric tree T' from D_M. If this is not possible, then M has no perfect phylogeny.
3. If T' was successfully created in step 2, attempt to label its edges with the m characters of M. If this is not possible, then M has no perfect phylogeny. Otherwise the modified T' is the perfect phylogeny T.
Note: D_M may be min-ultrametric even though M has no perfect phylogeny, hence the check in step 3.
26
Ultrametric Problem: Perfect Phylogeny
Final notes on the centrality of the ultrametric problem.
We can see that the following problems can be cast as ultrametric problems:
1. perfect phylogeny
2. tree compatibility
This is not an efficient way to address these problems.
27
Maximum Parsimony
Maximum parsimony:
– Perfect phylogeny is a special instance.
– Can be viewed as a Steiner tree problem on a hypercube.
Presentation approach:
– Introduce Steiner trees
– Hypercube graphs
– Maximum parsimony as a Steiner tree problem
28
Maximum Parsimony
Definitions:
– Let N be a set of nodes.
– Let E be a set of undirected edges with non-negative weights.
– Let G = (N, E) be an undirected graph.
– Let X ⊆ N be a subset of nodes.
A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N - X.
Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.
29
Maximum Parsimony
More definitions:
A hypercube of dimension d is an undirected graph with 2^d nodes, labeled 0..2^d - 1. Adjacent nodes differ in exactly one bit position of their labels.
The weighted Steiner tree problem on hypercubes: G must be a hypercube.
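The adjacency rule for hypercube nodes is easy to express with bit operations. The following sketch assumes nodes are represented as Python integers in the range 0..2^d - 1; the helper names `hypercube_neighbors` and `adjacent` are made up for this example.

```python
def hypercube_neighbors(node, d):
    """Return the d neighbors of a node: flip each of its d label bits in turn."""
    return [node ^ (1 << bit) for bit in range(d)]


def adjacent(a, b):
    """Two nodes are adjacent iff their labels differ in exactly one bit."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


print(hypercube_neighbors(0b000, 3))
print(adjacent(0b101, 0b111), adjacent(0b101, 0b110))
```

Every node has exactly d neighbors, so the graph has d * 2^(d-1) edges and is never built explicitly in practice.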
30
Maximum Parsimony
More definitions:
Maximum parsimony: Occam's razor applied to phylogenetic reconstruction, i.e., a preference for trees requiring fewer evolutionary events to explain the data.
Gusfield's definition: The Maximum Parsimony problem is the unweighted Steiner tree problem on a d-dimensional hypercube.
31
Maximum Parsimony
More about the hypercube formulation of MP:
– The input taxa in X are described as binary vectors of length d.
– Recall: adjacent nodes differ in exactly one bit position of their labels.
– Correspondingly, taxa that differ by a single mutation are adjacent.
There is a Steiner tree for X with l edges iff there is a corresponding phylogenetic tree that entails l character-state mutations.
32
Steiner interpretation of Perfect Phylogeny
Define a nontrivial binary character to be a character possessed by some taxa but not all.
Consider an MP dataset of d nontrivial binary characters.
Q: What is the minimum number of mutations in the MP tree?
A: At least d, since each nontrivial character must change state at least once.
33
Steiner interpretation of Perfect Phylogeny
Q: What is the relation to binary perfect phylogeny?
A: The binary perfect phylogeny problem is equivalent to asking whether there is an MP solution with a cost of exactly d.
Q: What about generalized perfect phylogeny?
A: It is similar. The lower bound must reflect:
– the number of character states in the input taxa.
– that a character having r states in the input taxa is allowed only r - 1 transitions.
34
Steiner interpretation of Perfect Phylogeny
Complexity:
– There is no known efficient solution for the Steiner tree problem on unweighted graphs.
– There is a polynomial-time solution for the generalized perfect phylogeny problem when r is fixed.
– Hence this particular Steiner tree problem can be answered in polynomial time.
35
Steiner interpretation of Perfect Phylogeny
MP approximations:
– The weighted Steiner tree problem on hypercubes is NP-hard.
– There is an approximation method with an error bound of a factor of 11/6.
– A minimum spanning tree (MST) can also be used to find a Steiner tree whose weight is less than twice that of the optimal Steiner tree.
36
Phylogenetic Alignment
Recall: phylogenetic alignment was discussed in section 14.8.
– The focus was on deriving a multiple alignment informed by evolutionary history.
– The tree placed emphasis on specific alignment groupings.
– Internal node sequences were a secondary artifact.
37
Phylogenetic Alignment
Phylogenetic alignment as a parsimony problem:
– In contrast, we are now interested in the internal sequences.
– These sequences are waypoints in the evolutionary trajectory leading to the extant taxa.
– Phylogenetic alignment is thus a parsimony problem.
38
Phylogenetic Alignment
Hypothesis: the optimal phylogenetic alignment describes evolutionary history.
Assumptions:
– Edit distance realistically models evolutionary distance.
– The globally optimal phylogenetic alignment captures the essence of the evolutionary process.
We will look at minimum mutation, a variant of phylogenetic alignment.
39
Fitch-Hartigan minimum mutation problem
Defn. Minimum mutation problem: a variant of the phylogenetic alignment problem.
The input comprises:
1. A tree
2. Strings labeling the leaves
3. A multiple alignment of those strings
40
Fitch-Hartigan minimum mutation problem
Q: If you are given the tree and the multiple alignment, what is left to compute?
A: The mutations that account for the input data.
These mutations should be:
1. a minimum sequence of site mutations that is
2. compatible with the given tree and
3. compatible with the given multiple alignment.
41
Fitch-Hartigan minimum mutation problem
Q: How is the input data used to determine the minimum sequence of mutations?
1. The multiple alignment associates each amino acid with a specific position.
2. The evolutionary history of the sequences is then treated as a combined but independent evolutionary history of each position.
3. The tree guides the order of mutations for each position.
42
Fitch-Hartigan minimum mutation problem
Assumptions:
– Each column of the alignment can be solved separately.
– The strings labeling inner nodes adhere to the same alignment.
The problem reduces to a computation at a single position.
43
Fitch-Hartigan minimum mutation problem
Minimum mutation for a single position:
Input:
1. A rooted tree with n nodes
2. Each leaf is labeled by a single character
Output:
1. Each interior node is labeled by a single character
2. The labeling minimizes the number of edges between nodes with different labels.
44
Fitch-Hartigan minimum mutation problem
Algorithmic approach: dynamic programming
– Let T_v denote the subtree rooted at node v.
– Let C(v) be the cost of the optimal solution for T_v.
– Let C(v, x) be the cost of the best solution for T_v when v must be labeled by x.
– Let v_i denote the i-th child of node v.
Base case: for each leaf v, specify C(v) and C(v, x) for all x ∈ Σ:
– C(v) = 0 and C(v, x) = 0 if leaf v is labeled by x.
– C(v, x) = ∞ if leaf v is not labeled by x.
45
Fitch-Hartigan minimum mutation problem
When v is an internal node:
C(v, x) = Σ_i min[C(v_i) + 1, C(v_i, x)]
C(v) = min_x C(v, x)
– The recurrence relations start from the base cases.
– They are evaluated bottom-up from the leaves.
– After all C(v, x) have been computed, backtracking is used to extract the solution.
46
Fitch-Hartigan minimum mutation problem
Backtracking process:
– The root r is labeled by a character x such that C(r) = C(r, x).
– The traversal is then top-down.
– If v is labeled x, then v_i is labeled:
  – character x if C(v_i) + 1 > C(v_i, x)
  – otherwise, a character y such that C(v_i) = C(v_i, y)
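The bottom-up costs and the top-down backtracking can be sketched together for a single alignment column. The code below is an illustrative sketch, not the lecture's implementation: it assumes the recurrence C(v, x) = sum over children of min(C(v_i) + 1, C(v_i, x)), which is the one implied by the backtracking rule, and it uses a hypothetical tree encoding (a `children` dictionary plus a `leaf_label` dictionary). Ties are broken by alphabet order.

```python
import math


def min_mutation(children, leaf_label, root, alphabet):
    """Return (minimum number of mutations, labeling of every node) for one site."""
    C, Cx = {}, {}

    def bottom_up(v):
        kids = children.get(v, [])
        if not kids:
            # Base case: a leaf costs 0 for its own character, infinity otherwise.
            Cx[v] = {x: 0 if leaf_label[v] == x else math.inf for x in alphabet}
        else:
            for k in kids:
                bottom_up(k)
            # Each child either keeps x, or switches at the price of one mutation.
            Cx[v] = {x: sum(min(C[k] + 1, Cx[k][x]) for k in kids) for x in alphabet}
        C[v] = min(Cx[v].values())

    label = {}

    def top_down(v, x):
        label[v] = x
        for k in children.get(v, []):
            if C[k] + 1 > Cx[k][x]:
                top_down(k, x)  # keeping the parent's character is optimal
            else:
                top_down(k, min(alphabet, key=lambda y: Cx[k][y]))

    bottom_up(root)
    top_down(root, min(alphabet, key=lambda x: Cx[root][x]))
    return C[root], label


# Hypothetical rooted tree: r -> (u, w), u -> (a, b), w -> (c, d).
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
leaf_label = {"a": "A", "b": "C", "c": "A", "d": "A"}
cost, label = min_mutation(children, leaf_label, "r", "AC")
print(cost, label)
```

Running the function once per alignment column and summing the costs gives the total for the whole alignment, under the independence assumption stated earlier.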
47
Fitch-Hartigan minimum mutation problem
Let's evaluate an example:
C(v) = 0 and C(v, x) = 0 if leaf v is labeled by x; otherwise C(v, x) = ∞.
48
Fitch-Hartigan minimum mutation problem
Time complexity:
Bottom-up portion:
– Let σ = |Σ|.
– Each node is evaluated with respect to each character x.
– For n nodes this gives O(nσ).
The backtracking portion is O(n).
Overall: O(nσ).
49
Maximum Parsimony
Most widely used tree-building algorithm.
Differs from distance-based algorithms:
– Does not actually build trees from distances.
– Parsimony is used to compute the cost of a tree.
– A search strategy is used to search through the tree topologies.
– Goal: find the tree topology with the overall minimum cost.
50
Traditional Parsimony
Algorithm: Traditional parsimony [Fitch 1971]
Goal: count the number of substitutions at a site.
Method: recursion, keeping track of
– C, the current cost
– R_k, the set of residues at k, the current node
51
Traditional Parsimony
Algorithm: Traditional parsimony [Fitch 1971]
C = 0, k = root  // initialize the cost and the current node
TP(k) {
  if k is a leaf then return {x_k}
  R_left = TP(k.left)
  R_right = TP(k.right)
  if R_left ∩ R_right ≠ ∅
    return R_left ∩ R_right
  else {
    C = C + 1
    return R_left ∪ R_right
  }
}
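The recursion above can be written as a short runnable function. In this sketch the tree encoding is an assumption made for illustration: a leaf is a one-character string and an internal node is a `(left, right)` tuple. Instead of a global counter C, the function returns the cost alongside the residue set.

```python
def fitch(node):
    """Return (residue set, substitution count) for the subtree rooted at node."""
    if isinstance(node, str):
        return {node}, 0  # a leaf contributes its own residue and no cost
    left, right = node
    r_left, c_left = fitch(left)
    r_right, c_right = fitch(right)
    common = r_left & r_right
    if common:
        # The children agree on at least one residue: no substitution needed here.
        return common, c_left + c_right
    # The children disagree: take the union and count one substitution.
    return r_left | r_right, c_left + c_right + 1


# Hypothetical site over two small trees.
print(fitch((("A", "C"), ("A", "A"))))
print(fitch((("A", "C"), ("G", "G"))))
```

Returning the cost rather than mutating a shared variable keeps the function safe to call on several sites in a row.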
52
Traditional Parsimony
Let's evaluate an example:
if R_left ∩ R_right ≠ ∅, return R_left ∩ R_right; else set C = C + 1 and return R_left ∪ R_right.
53
Traditional Parsimony
There is a traceback procedure for finding ancestral assignments.
Q: How do you think the traceback works?
A: Start from the root:
1. Pick a residue from the root's set.
2. Pick the same residue for each child set if possible.
3. If a child set does not contain the parent's residue, randomly select a residue from its set.
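The traceback steps can be sketched as a second pass over the residue sets. The encoding is again hypothetical (leaves are one-character strings, internal nodes are `(left, right)` tuples), and one deliberate deviation is made so the output is reproducible: where the slide says to select a residue at random, this sketch takes the alphabetically smallest one.

```python
def fitch_sets(node):
    """Annotate the tree bottom-up as (residue set, left annotation, right annotation)."""
    if isinstance(node, str):
        return ({node}, None, None)
    left = fitch_sets(node[0])
    right = fitch_sets(node[1])
    common = left[0] & right[0]
    return (common or (left[0] | right[0]), left, right)


def traceback(annotated, parent_residue=None):
    """Assign one residue per node, reusing the parent's residue whenever possible."""
    residues, left, right = annotated
    if parent_residue in residues:
        choice = parent_residue
    else:
        choice = min(residues)  # deterministic stand-in for a random choice
    if left is None:
        return choice
    return (choice, traceback(left, choice), traceback(right, choice))


tree = (("A", "C"), ("A", "A"))
print(traceback(fitch_sets(tree)))
```

The result mirrors the input tree, with each internal node written as `(residue, left, right)`.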
54
Traditional Parsimony
Let's perform the traceback on our example: