UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 17.2-3: Strings and.


1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 17.2-3: Strings and Evolutionary Trees Lecturer: Dr. Rose Slides by: Dr. Rose April 10, 2003

2 Next Homework: Due 4/22/03. #3. #8 (Note: there are three parts to this question. Only answer the first two parts, i.e., 1. Show that if D is ultrametric and D(i, i) = 0 for each i, then D is also additive. 2. Show that the converse is not true.) #14 (Grad students only.)

3 Additive Distance Trees Real data is rarely ultrametric. A weaker constraint is that data be additive. Recall: Additive distances are distances which: –can be fitted to an unrooted tree such that –pairwise taxa distances are equal to the sum of the branch lengths connecting them.

4 Additive Distance Trees Consider the relationship between additive and ultrametric trees: –Q: Are all ultrametric trees additive? –A: Yes. –Q: Are all additive trees ultrametric? –A: No.

5 Additive Distance Trees Assuming that: – D is a symmetric n by n distance matrix – D contains only zero values on the diagonal – D contains only positive off-diagonal values – T is an edge-weighted tree with at least n nodes, n of which are labeled with distinct rows of D, then: Defn. T is an additive tree for D if, for every pair of labeled nodes (i, j), the path from i to j has total weight exactly D(i, j).
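
The definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the function names, the edge-list representation, and the small example matrix are all illustrative assumptions. It sums edge weights along tree paths and compares them with D.

```python
from collections import defaultdict

def path_weights(edges, source):
    """Return the total path weight from `source` to every node of the tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:          # a tree has exactly one path to each node
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def is_additive_tree(edges, D):
    """Check that the labeled nodes 0..n-1 have path weights equal to D."""
    n = len(D)
    for i in range(n):
        dist = path_weights(edges, i)
        for j in range(n):
            if dist.get(j) != D[i][j]:
                return False
    return True

# Hypothetical example: a path 0 - 1 - 2 with edge weights 2 and 3.
edges = [(0, 1, 2), (1, 2, 3)]
D = [[0, 2, 5],
     [2, 0, 3],
     [5, 3, 0]]
print(is_additive_tree(edges, D))  # True
```

Integer weights are used so that the equality test is exact; with floating-point distances a tolerance would be needed.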

6 Additive Distance Trees Additive tree problem, given: 1. a symmetric matrix D with 2. zero entries on the diagonal and 3. positive off-diagonal values, find an additive tree T for D or determine that one does not exist. Imagine that you have a distance matrix D representing evolutionary distance between pairs of taxa. Q: Do you expect the additive tree for D to be unique?

7 Additive Distance Trees Q: Is there a unique additive tree for D? If you think the answer is yes, why? If you think the answer is no, why? Consider what we know about D and T: 1. Is T’s branching pattern consistent with D? (y/n) 2. Are the edge lengths in T consistent with D? (y/n) 3. Does D specify directed edges? (y/n) 4. Does D imply directed edges? (y/n)

8 Additive Distance Trees (Table and figure from http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)

9 Additive Distance Trees Concept: Additive tree problem Given: an n by n symmetrical matrix D with zero diagonal entries and positive off-diagonal values. Find: an additive tree for D.

10 Additive Distance Trees Concept: Compact additive tree problem Given: an n by n symmetrical matrix D with zero diagonal entries and positive off-diagonal values. Find: an additive tree with exactly n nodes. Q: What does this definition say about the topology of the tree? A: For every node, there must be a corresponding row in D.

11 Additive Distance Trees Consider the symmetrical matrix D above and the tree T. Q: Is T an additive tree for D? Q: Is T a compact additive tree for D?

12 Additive Distance Trees Consider the symmetrical matrix D above. Q: Does D have an additive tree? Q: Does D have a compact additive tree?

13 Additive Distance Trees Defn. Let G(D) be the n-node complete graph corresponding to D, where nodes are labeled 1 through n and edge (i, j) has weight D(i, j). Thm. If there is a compact additive tree T for D, then T must be the unique minimum spanning tree of G(D).

14 Additive Distance Trees Proof. Let: T be a compact additive tree for D. e = (x, y) be any edge of G(D) not in T. We know: The path from x to y in T has total weight D(x, y). The weight of edge e is also D(x, y). Since e is not in T, that path has at least two edges, each of positive weight, so the weight of e is strictly greater than the weight of any edge on the path from x to y in T.

15 Additive Distance Trees Proof. continued Assume that there is some other minimum spanning tree T´ containing e. Removing e splits T´ into two sets of nodes, S and S´. WLOG, S contains x and S´ contains y. The path from x to y in T must contain an edge e´ with one endpoint in S and the other in S´. Hence the weight of e´ is less than the weight of e.

16 Additive Distance Trees Proof. continued Create a new spanning tree T´´ by removing e from T´ and adding e´. The total edge weight of T´´ is less than that of T´. This contradicts the assumption that T´ is a minimum spanning tree. Hence T must itself be the unique minimum spanning tree of G(D).

17 Additive Distance Trees How can we use this theorem to solve the compact additive tree problem in O(n^2) time? Answer: 1. Construct G(D) from D. 2. Use an O(n^2) minimum spanning tree algorithm, such as Prim’s algorithm, that extends a single growing tree T. a. Each time an edge e = (x, y) is added to T, where x is already in T, b. compute d(i, y) = d(i, x) + D(x, y) for all i in T. This takes O(n) time per iteration and O(n^2) time for all of T. c. Verify that d(i, y) = D(i, y); if any check fails, D has no compact additive tree.
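
The steps above can be sketched in Python as follows. This is an illustrative reading of the slide, not the lecturer's code: the function name, the 0-based node labels, and the choice to return None when verification fails are all assumptions. It uses the array-based form of Prim's algorithm and keeps a table d of path lengths inside the growing tree.

```python
def compact_additive_tree(D):
    """Return the edge list of a compact additive tree for D, or None.

    Grows a minimum spanning tree of G(D) with the O(n^2) array form of
    Prim's algorithm and checks tree path lengths against D as it goes.
    """
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n        # cheapest edge weight linking each node to the tree
    parent = [None] * n              # tree endpoint of that cheapest edge
    d = [[0] * n for _ in range(n)]  # path lengths inside the growing tree
    order = []                       # nodes already in the tree
    edges = []
    if n > 0:
        best[0] = 0                  # start the tree at node 0
    for _ in range(n):
        # Pick the node outside the tree that is closest to the tree.
        y = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        x = parent[y]
        if x is not None:
            edges.append((x, y, D[x][y]))
            for i in order:
                # Step b: extend known path lengths through the new edge (x, y).
                d[i][y] = d[y][i] = d[i][x] + D[x][y]
                # Step c: the tree path must reproduce D exactly.
                if d[i][y] != D[i][y]:
                    return None
        in_tree[y] = True
        order.append(y)
        for v in range(n):
            if not in_tree[v] and D[y][v] < best[v]:
                best[v] = D[y][v]
                parent[v] = y
    return edges

print(compact_additive_tree([[0, 2, 5], [2, 0, 3], [5, 3, 0]]))  # [(0, 1, 2), (1, 2, 3)]
print(compact_additive_tree([[0, 3, 4], [3, 0, 5], [4, 5, 0]]))  # None
```

The second matrix is additive only with an extra unlabeled node, so it has no compact additive tree and the verification step rejects it.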

18 Parsimony Q: What is parsimony? A: Parsimony: extreme or excessive frugality. Q: So what does frugal mean? A: Frugal: thrifty, economical. In this chapter, parsimony is a character-based method for reconstructing evolutionary history. Characters are attributes or traits. In this section we will look at highly constrained trees that express evolutionary history.

19 Parsimony 1. Can be used to deduce evolutionary trees: specifies branching order; does not specify divergence times. 2. Can be used as the basis for a taxonomy. This section is a limited introduction to maximum parsimony problems: binary-character problems, with a focus on the perfect phylogeny problem.

20 Parsimony Defn. Let M be an n by m binary matrix representing n objects with m character traits. –Since M is binary, each character trait has two possible states, 1 or 0. –Cell (p, i) of M has value 1 iff object p has character i. M has a flavor similar to the old chestnut animal guessing program that uses a binary tree.

21 Parsimony Defn. A phylogenetic tree for M is a rooted tree T with exactly n leaves such that: 1. Each of the n leaves is labeled by exactly one object. 2. Each of the m character-traits labels exactly one edge of T. 3. For any object p, the character-traits labeling the edges along the path from the root to p are exactly those character-traits whose state is one.

22 Parsimony Consider the matrices below: 1. Does either M1 or M2 have a phylogenetic tree? 2. If so, what does the tree look like?

23 Parsimony Q: What is the interpretation of the phylogenetic tree? A: It is an estimate of the divergent evolutionary history of the objects (it does not give time). 1. The root represents an ancestor with none of the m character-traits. 2. Each character-trait transitions from 0 to 1 only once. 3. No character-trait ever transitions from 1 to 0.

24 Parsimony Q: In what sense are phylogenetic trees parsimonious? A: Each character-trait labels exactly 1 edge of the tree. The biological assumptions are: 1. The root represents an ancestor with none of the m character-traits. 2. Each character-trait transitions from 0 to 1 only once. 3. No character-trait ever transitions from 1 to 0.

25 Parsimony Q: What character-traits can be used? Morphological features: –(from: http://anthro.palomar.edu/hominid/australo_2.htm) –(Also see: http://www.cfsan.fda.gov/~frf/rfe3pc00.html)

26 Parsimony Q: What character-traits can be used? Morphological features: –Gross anatomical features –OTU-specific esoterica. DNA-based characters: –Specific substring patterns –Specific nucleotides in fixed positions. See pages 460 & 461 for more discussion.

27 Parsimony Defn. Perfect phylogeny problem: given the binary matrix M, determine whether there is a phylogenetic tree for M and, if there is one, build it. We will discuss an O(nm)-time algorithm. First we need to preprocess M: –Consider each column as a binary number, –with the most significant bit (msb) in row 1. –Sort the columns in decreasing order. –Let M´ denote the reordered matrix M.
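
As a small illustration of the preprocessing step, here is a Python sketch. The function name and the 5 by 5 example matrix are assumptions made for illustration, not taken from the slides. It uses Python's built-in comparison sort for brevity, so it shows the ordering but does not by itself achieve the O(nm) bound.

```python
def sort_columns(M):
    """Return (M_prime, order): columns of M sorted as binary numbers, largest first.

    Row 1 (index 0) supplies the most significant bit of each column.
    `order[k]` is the original index of the column placed at position k.
    """
    m = len(M[0])
    columns = [tuple(row[j] for row in M) for j in range(m)]
    # Equal-length 0/1 tuples compare lexicographically, which matches
    # the order of the binary numbers they spell out.
    order = sorted(range(m), key=lambda j: columns[j], reverse=True)
    M_prime = [[row[j] for j in order] for row in M]
    return M_prime, order

# Hypothetical matrix: 5 objects (rows) by 5 characters (columns).
M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0]]
M_prime, order = sort_columns(M)
print(order)  # [1, 0, 2, 4, 3]
```

Here the second original column reads 10101 from top to bottom, the largest value, so it becomes column 1 of M´.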

28 Parsimony Example.

29 Parsimony Defn. For any column k of M´, let O_k be the set of objects with a one in column k. Obs. If O_j is a proper subset of O_k, then column k must be to the left of column j in M´.

30 Parsimony Thm. Matrix M´ has a phylogenetic tree iff for every pair of columns i, j, either O_i and O_j are disjoint or one contains the other. Proof. (Sketch starting on next slide)

31 Parsimony Proof. (⇒) Let T be the phylogenetic tree for M´. Consider characters i, j. Let e_j be the edge on which character j transitions from 0 to 1. Let e_i be the edge on which character i transitions from 0 to 1. Objects with character i are below e_i in T. Objects with character j are below e_j in T.

32 Parsimony Proof. (⇒) There are 4 possible cases: 1. e_i = e_j. 2. e_i is on the path from the root to e_j. 3. e_j is on the path from the root to e_i. 4. The paths diverge before reaching e_i or e_j. In case 1, O_i = O_j. In case 2, O_j ⊆ O_i since all objects possessing j possess i. In case 3, O_i ⊆ O_j since all objects possessing i possess j. In case 4, O_i ∩ O_j = ∅.

33 Parsimony Proof. (⇐) Suppose that for all i, j, O_i and O_j are disjoint or one contains the other. Consider objects p and q. Let k be the largest character common to both. All characters i < k possessed by p are also possessed by q: O_i and O_k both contain p, so one contains the other, and since column i is to the left of column k, O_k ⊆ O_i, which puts q in O_i. By the same argument, all characters i < k possessed by q are also possessed by p. So p and q share exactly the same characters up to k, and none thereafter.

34 Parsimony Proof. (⇐) Given: for all i, j, O_i and O_j are disjoint or one contains the other. Label each object p with the string that is the concatenation of the column numbers for which it has nonzero entries. Likewise for q. Append $ to the string so that no string is a prefix of any other. p and q have a common prefix but diverge after k. The keyword tree (sans failure links) for the n objects in M´ specifies a perfect phylogeny for M´.
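
The condition in the theorem can also be tested directly. The sketch below is an illustration only: the function name and the two example matrices are assumptions. It compares every pair of columns, which costs O(nm^2) time, so it is a check of the theorem's condition rather than the O(nm) algorithm.

```python
from itertools import combinations

def has_phylogenetic_tree(M):
    """Return True iff every pair of columns is disjoint or nested."""
    n, m = len(M), len(M[0])
    # O[k] is the set of objects (row indices) with a one in column k.
    O = [{p for p in range(n) if M[p][k] == 1} for k in range(m)]
    for i, j in combinations(range(m), 2):
        common = O[i] & O[j]
        # Overlapping sets must be nested: the overlap equals one of them.
        if common and common != O[i] and common != O[j]:
            return False
    return True

# Hypothetical matrices: the first satisfies the condition, the second does not.
good = [[1, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 0],
        [0, 1, 0, 0, 0]]
bad = [[1, 0],
       [1, 1],
       [0, 1]]
print(has_phylogenetic_tree(good))  # True
print(has_phylogenetic_tree(bad))   # False
```

In the second matrix the two columns share only the middle object, so neither set contains the other.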

35 Parsimony O(nm) algorithm for the perfect phylogeny problem: 1. Reorder the columns of M in descending order using radix sort. Let M´ be the resulting matrix. Label each column by its column position in M´. Q: Why do you think we are using radix sort? A: Radix sort is O(nm). Also, it can be applied to a number with an arbitrary number of digits.

36 Parsimony 2. For each row p of M´, construct the string consisting of the characters, in sorted (increasing) order, that p possesses. Recall that in step 1 we labeled each character by its column position. The string for a given row will be the concatenation of the column labels for which the row has the value one.

37 Parsimony 3. Build the keyword tree T for the n strings from step 2. Recall that the keyword tree for a set P is a rooted directed tree K satisfying: 1. Each edge is labeled with one character. 2. Any two edges out of the same node have distinct labels. 3. Every pattern P_i in P maps to some node v of K s.t. the path from the root to v spells out P_i. 4. Every leaf in K is mapped by some pattern in P.

38 Keyword Trees Example: From textbook P = {potato, poetry, pottery, science, school}
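
For the textbook pattern set above, a keyword tree can be sketched with nested dictionaries. This is an illustrative representation chosen here, not one prescribed by the slides: each dictionary key labels one edge, and the helper names are assumptions.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree as nested dicts: each key labels one edge."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            # Reuse the edge if it exists, otherwise create a new child.
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    """Count the nodes of the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())

P = ["potato", "poetry", "pottery", "science", "school"]
tree = build_keyword_tree(P)
print(sorted(tree))       # ['p', 's']
print(count_nodes(tree))  # 26
```

Because dictionary keys are unique, two edges out of the same node automatically have distinct labels, and shared prefixes such as "po" and "sc" are stored once.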

39 Parsimony 4. Test whether T is a perfect phylogeny for M. Verify that T has exactly n leaves such that: a) Each of the n leaves is labeled by exactly one object. b) Each of the m character-traits labels exactly one edge of T. c) For any object p, the character-traits labeling the edges along the path from the root to p are exactly those character-traits whose state is one.
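
Steps 1 through 4 can be put together in a short Python sketch. Several choices here are assumptions made for illustration: the function name, the nested-dictionary tree, the "$" plus row index leaf edges, and the use of a comparison sort in place of radix sort (so the sketch does not meet the O(nm) bound). A character that no object possesses labels no edge in this sketch, so the test in step 4 is "at most one edge" per character.

```python
def perfect_phylogeny(M):
    """Return a keyword tree that is a perfect phylogeny for M, or None.

    Edge labels are 1-based column positions in the sorted matrix M';
    an edge "$" + row index marks the leaf for each object.
    """
    n, m = len(M), len(M[0])
    # Step 1: sort columns as binary numbers, largest first (row 0 is the msb).
    order = sorted(range(m), key=lambda j: [row[j] for row in M], reverse=True)
    # Step 2: one string (list of column labels, increasing) per row.
    strings = [[k + 1 for k, j in enumerate(order) if M[p][j] == 1]
               for p in range(n)]
    # Step 3: build the keyword tree, counting the edges that carry each label.
    root = {}
    edge_count = [0] * (m + 1)
    for p, labels in enumerate(strings):
        node = root
        for label in labels:
            if label not in node:
                node[label] = {}
                edge_count[label] += 1
            node = node[label]
        node["$%d" % p] = {}  # terminator edge: the leaf for object p
    # Step 4: a character that labels two or more edges rules out a phylogeny.
    if any(count > 1 for count in edge_count[1:]):
        return None
    return root

# Hypothetical matrices for illustration.
M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0]]
tree = perfect_phylogeny(M)
print(tree is not None)                              # True
print(perfect_phylogeny([[1, 0], [1, 1], [0, 1]]))   # None
```

By construction each root-to-leaf path spells exactly the characters its object possesses, so the only condition left to verify is that no character appears on more than one edge.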

40 Tree Compatibility Suppose you have two different phylogenetic trees. Note: even for the same set of taxa we can derive different trees by basing the comparison on different proteins. Q: How can we determine if they describe a consistent evolutionary history? Q: How can we combine them into a single tree? This section addresses these questions.

41 Tree Compatibility Defn. Phylogenetic tree refinement: A phylogenetic tree T´ is a refinement of T if T can be obtained by a series of contractions of edges of T´. Nutshell: T´ agrees with T, but expresses additional evolutionary history.

42 Tree Compatibility Tree refinement: T1 & T2? T1 & T3? T1 & T4? Etc.?

43 Tree Compatibility Defn. Phylogenetic tree compatibility: Trees T1 and T2 are compatible if there exists a phylogenetic tree T3 refining both T1 and T2.

44 Tree Compatibility Tree compatibility problem: Given two trees, T1 and T2, determine if they are compatible. If so, return the refinement tree T3. We will consider a matrix method for finding T3.

45 Tree Compatibility Consider a binary matrix representation of a phylogenetic tree: There is one row for each object (OTU). There is one column for each internal node. Entry (i, j) is one iff the leaf for object i is in the subtree rooted at j. Q: Would an example help? A: Ok, then suggest a simple phylogenetic tree.
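
One simple example, sketched in Python: the nested-tuple tree encoding, the function name, and the three-object tree are assumptions for illustration. Following the slide literally, the root counts as an internal node, so it contributes an all-ones column.

```python
def tree_to_matrix(tree, objects):
    """Return the binary matrix of `tree`, one row per object.

    `tree` is a nested tuple: a string is a leaf (an object), a tuple is an
    internal node. Columns appear in post-order, so the root's column is last.
    """
    columns = []  # one set of leaf names per internal node

    def leaves_below(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= leaves_below(child)
        columns.append(below)
        return below

    leaves_below(tree)
    return [[1 if obj in col else 0 for col in columns] for obj in objects]

# Hypothetical tree: A and B share an internal node; C joins them at the root.
tree = (("A", "B"), "C")
matrix = tree_to_matrix(tree, ["A", "B", "C"])
print(matrix)  # [[1, 1], [1, 1], [0, 1]]
```

The first column is the internal node above A and B; the second is the root, under which every object lies.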

46 Tree Compatibility Let M1 be the matrix representation of T1 and similarly M2 for T2. Let M3 be the matrix formed by taking the union of the columns of M1 and M2. Q: What is meant by taking the union of columns? A: M3 will contain: –all columns found only in M1 –all columns found only in M2 –one copy of all columns appearing in both M1 and M2. –Obviously, columns will have a different order.

47 Tree Compatibility Q: What should M3 look like? What about T3?

48 Tree Compatibility Q: Do you agree?

49 Tree Compatibility Note: In refining T3 to produce T4, there is no impact in M4 with respect to the preceding columns of M3.

50 Tree Compatibility Theorem: T1 and T2 are compatible iff there is a phylogenetic tree for M3. A phylogenetic tree T3 for M3 is a refinement of both T1 and T2.
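
The matrix method can be sketched in Python as follows. The function names and the four-object example matrices are assumptions for illustration. Both input matrices are assumed to list the same objects in the same row order, and the existence of a phylogenetic tree for M3 is decided with the pairwise disjoint-or-nested test on column sets.

```python
from itertools import combinations

def union_columns(M1, M2):
    """Return M3: every distinct column of M1 and M2, each kept once."""
    n = len(M1)
    seen, cols = set(), []
    for M in (M1, M2):
        for j in range(len(M[0])):
            col = tuple(M[i][j] for i in range(n))
            if col not in seen:
                seen.add(col)
                cols.append(col)
    return [[col[i] for col in cols] for i in range(n)]

def compatible(M1, M2):
    """Return True iff the union matrix M3 has a phylogenetic tree."""
    M3 = union_columns(M1, M2)
    sets = [{i for i, row in enumerate(M3) if row[k]} for k in range(len(M3[0]))]
    # Every pair of column sets must be disjoint or nested.
    return all(not (a & b) or a <= b or b <= a for a, b in combinations(sets, 2))

# Hypothetical trees over objects A, B, C, D (rows in that order).
M1 = [[1, 1], [1, 1], [0, 1], [0, 1]]  # groups {A, B}, plus the root
M2 = [[0, 1], [0, 1], [1, 1], [1, 1]]  # groups {C, D}, plus the root
M_clash = [[1, 1], [0, 1], [1, 1], [0, 1]]  # groups {A, C}, plus the root
print(union_columns(M1, M2))    # [[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 1]]
print(compatible(M1, M2))       # True
print(compatible(M1, M_clash))  # False
```

The shared root column is kept only once in M3. The third matrix groups A with C, which overlaps the group {A, B} without nesting, so those two trees are not compatible.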

51 Generalized Perfect Phylogeny Generalization of perfect phylogeny: –Allow multiple states (>2) for character-traits. –Label edges with a triple (c x y) where: c is the character trait, x is the value of the state before the edge, and y is the value of the state after the edge. –The starting state for each character is specified at the root.

52 Generalized Perfect Phylogeny Generalization of perfect phylogeny: continued –The path from the root to the leaf labeled p describes the character traits of the object p. The ending states along this path specify p’s traits. –A combination of trait and ending state can appear only once in the tree. Example: for character c and ending state y, there can only be one edge labeled (c ? y), where “?” matches any state of c.

53 Generalized Perfect Phylogeny Time complexity for generalized perfect phylogeny: polynomial in n and m for a fixed number of states; NP-complete otherwise.

