UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.6-8: Multiple Alignment.


1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.6-8: Multiple Alignment Lecturer: Dr. Rose Slides by: Dr. Rose March 5, 2007

2 Sum-of-Pairs
Defn. The sum-of-pairs (SP) score of a multiple alignment is the sum of the scores of all the pairwise global alignments it induces.
From the previous example:
1  A A T - G G T T T
2  A A - C G T T A T
3  T A T C G - A A T
SP = 4 + 5 + 4 = 13
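The SP arithmetic in the example can be checked with a minimal sketch. It assumes a unit-cost scheme that the slide does not state explicitly (match 0, mismatch 1, a character opposite a space 1, two opposing spaces 0); the names `pair_cost` and `induced_pair_scores` are hypothetical.

```python
from itertools import combinations

def pair_cost(x, y, mismatch=1, space=1):
    """Cost of one column of an induced pairwise alignment (assumed unit costs)."""
    if x == y:                    # match, or two opposing spaces
        return 0
    if x == "-" or y == "-":      # a character opposite a space
        return space
    return mismatch

def induced_pair_scores(rows, mismatch=1, space=1):
    """Scores of the induced pairwise alignments, in order (1,2), (1,3), (2,3), ..."""
    return [sum(pair_cost(x, y, mismatch, space) for x, y in zip(a, b))
            for a, b in combinations(rows, 2)]

rows = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
scores = induced_pair_scores(rows)
print(scores, sum(scores))  # [4, 5, 4] 13
```

Under these assumed costs the three induced pairs score 4, 5, and 4, which agrees with the total of 13 on the slide.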

3 Sum-of-Pairs
Q: What theoretical justification is there for adopting the SP score? (Wait for response.)
A: None, or rather none more than for any other multiple alignment scoring scheme. In practice it is a good heuristic and is popular.
Q: How can we compute a global alignment M with the minimum sum-of-pairs score?
A: Dynamic programming, of course!

4 Sum-of-Pairs
Assume that we want to align k strings of length n.
Q: What is the time complexity of the DP solution?
A: Θ(n^k). Exact SP alignment has been shown to be NP-complete.
Q: So what should we do?
A: Choose a small k. In practice, the NP-completeness of a problem often does not mean that the sky is falling.

5 Sum-of-Pairs
Q: How will k affect the recurrence relation?
The recurrence relation for k = 3 is:
D(i, j, k) = min[
  D(i-1, j-1, k-1) + ?,
  D(i-1, j-1, k) + ?,
  D(i-1, j, k-1) + ?,
  D(i, j-1, k-1) + ?,
  D(i-1, j, k) + ?,
  D(i, j-1, k) + ?,
  D(i, j, k-1) + ? ]

6 Sum-of-Pairs
Let's consider each term of the recurrence in turn:
1. D(i-1, j-1, k-1) is the diagonal cell in all three dimensions.
Q: What should be the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k)?
Recall that for two strings (k = 2), if S_1(i) = S_2(j) the cost is the match cost; otherwise S_1(i) ≠ S_2(j) and we incur the mismatch cost.
A: The sum of the pairwise match comparisons, i.e., ij, jk, ik.

7 Sum-of-Pairs
Let m(i, j) denote the pairwise character match function, defined as:
m(i, j) = matchCost if the characters match
m(i, j) = mismatchCost if the characters mismatch
Then the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k) is m(i, j) + m(j, k) + m(i, k).
Hence the term cost is:
1. D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k)

8 Sum-of-Pairs
The next term:
2. D(i-1, j-1, k) is the diagonal cell in the first two dimensions.
Q: What should be the SP transition cost for D(i-1, j-1, k) → D(i, j, k)?
We have two types of cases to consider:
1. The pairwise diagonal case: (i-1, j-1) → (i, j)
2. The two pairwise space insertion cases: (i-1, k) → (i, k) and (j-1, k) → (j, k)

9 Sum-of-Pairs
The cost will be the sum of the pairwise match and space insertion costs:
1. m(i, j) for (i-1, j-1) → (i, j), and
2. spacecost for (i-1, k) → (i, k) and spacecost for (j-1, k) → (j, k)
Then the SP transition cost for D(i-1, j-1, k) → D(i, j, k) is m(i, j) + 2 * spacecost.
Hence the term cost is:
2. D(i-1, j-1, k) + m(i, j) + 2 * spacecost

10 Sum-of-Pairs
Similarly, the third and fourth term costs are:
3. D(i-1, j, k-1) + m(i, k) + 2 * spacecost
4. D(i, j-1, k-1) + m(j, k) + 2 * spacecost
Note the similarity in the fifth, sixth, and seventh terms:
5. D(i-1, j, k) + ?
6. D(i, j-1, k) + ?
7. D(i, j, k-1) + ?
Q: What should be the cost for transitions from them?

11 Sum-of-Pairs
For D(i-1, j, k) we have two types of cases to consider:
1. The pairwise no-change case: (j, k) → (j, k)
2. The two pairwise space insertion cases: (i-1, j) → (i, j) and (i-1, k) → (i, k)
Then the SP transition cost for D(i-1, j, k) → D(i, j, k) is 0 + 2 * spacecost.
Hence the term cost is:
5. D(i-1, j, k) + 2 * spacecost

12 Sum-of-Pairs
Similarly, the sixth and seventh term costs are:
6. D(i, j-1, k) + 2 * spacecost
7. D(i, j, k-1) + 2 * spacecost
Hence
D(i, j, k) = min[
  D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k),
  D(i-1, j-1, k) + m(i, j) + 2 * spacecost,
  D(i-1, j, k-1) + m(i, k) + 2 * spacecost,
  D(i, j-1, k-1) + m(j, k) + 2 * spacecost,
  D(i-1, j, k) + 2 * spacecost,
  D(i, j-1, k) + 2 * spacecost,
  D(i, j, k-1) + 2 * spacecost ]

13 Sum-of-Pairs
Q: What about the boundary cells on the 3 faces of the table?
1. D(i, j, 0)
2. D(i, 0, k)
3. D(0, j, k)
Observation: Each case degenerates into the familiar two-string alignment distance plus space costs for the empty string argument.
Approach: represent these cases in terms of pairwise distance + space costs.

14 Sum-of-Pairs
Let D_{1,2}(i, j) denote the pairwise distance between S_1[1..i] and S_2[1..j]. D_{1,3}(i, k) and D_{2,3}(j, k) are defined analogously.
Consider D(i, j, 0):
D(i, j, 0) = D_{1,2}(i, j) + ? * spaceCost
Q: What is the space cost, i.e., how many spaces?
A: Each of the i characters of S_1 and each of the j characters of S_2 is opposite a space in the empty third string, hence:
D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost

15 Sum-of-Pairs
By this argument, the boundary cells are given by:
1. D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost
2. D(i, 0, k) = D_{1,3}(i, k) + (i + k) * spaceCost
3. D(0, j, k) = D_{2,3}(j, k) + (j + k) * spaceCost
4. D(0, 0, 0) = 0
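The seven-term recurrence and its boundary cases can be sketched as one loop over the three-dimensional table. This is only an illustration under assumed unit costs (matchCost 0, mismatchCost 1, spaceCost 1); the function name `sp_align3` is hypothetical. Instead of special-casing the faces of the table, the sketch skips any predecessor with a negative index, which yields the same boundary values as the formulas above.

```python
from itertools import product

# The seven predecessor offsets (di, dj, dk): every 0/1 vector except (0, 0, 0).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]
PAIRS = ((0, 1), (0, 2), (1, 2))

def sp_align3(s1, s2, s3, mismatch=1, space=1):
    """Fill the 3-D table D for the minimum SP distance of three strings."""
    seqs = (s1, s2, s3)
    n1, n2, n3 = (len(s) for s in seqs)
    inf = float("inf")
    D = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    for idx in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        i, j, k = idx
        for move in MOVES:
            pi, pj, pk = (x - d for x, d in zip(idx, move))
            if min(pi, pj, pk) < 0:
                continue  # predecessor lies outside the table
            cost = 0
            for a, b in PAIRS:
                if move[a] and move[b]:      # pairwise diagonal: match or mismatch
                    same = seqs[a][idx[a] - 1] == seqs[b][idx[b] - 1]
                    cost += 0 if same else mismatch
                elif move[a] or move[b]:     # pairwise space insertion
                    cost += space
                # neither string advances: the pairwise no-change case, cost 0
            D[i][j][k] = min(D[i][j][k], D[pi][pj][pk] + cost)
    return D

D = sp_align3("AATGGTTT", "AACGTTAT", "TATCGAAT")
print(D[8][8][8])  # minimum SP distance of the three example strings
```

The table has (n+1)^3 cells and each cell inspects seven predecessors, which matches the Θ(n^k) bound for k = 3.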

16 Sum-of-Pairs Speedup
Q: How can we speed up our DP approach?
A: Use forward dynamic programming.
Note: so far we have used backward dynamic programming, i.e., cell (i, j, k) looks back to the seven cells that can influence its value.
In contrast, forward DP sends the result of cell (i, j, k) forward to the seven cells whose values it could influence.

17 Sum-of-Pairs Speedup
Q: How does this speed things up?
A: It doesn't, if we always send cell (i, j, k)'s value forward.
The only significant way to improve on Θ(n^k) is to avoid computing all n^k cells in the DP table.
We will use forward DP to reduce the number of cells that we compute in the DP table.

18 Sum-of-Pairs Speedup
Let's rethink this problem:
- View the optimal alignment problem as a shortest path problem in the weighted edit distance graph.
- We are looking for the shortest path from (0,0,0) to (n,n,n).
- When node (i, j, k) is computed, we have the shortest path from (0,0,0) to (i, j, k).
- The value of node (i, j, k) is sent forward to the seven neighboring nodes that it can influence.

19 Sum-of-Pairs Speedup
Let w be a node reached by an outgoing edge from (i, j, k).
- The true shortest distance from (0,0,0) to w is the value computed after w has been updated by every node with an edge into it.
- A queue is used to order the nodes for processing.
- The final shortest distance for the node v at the head of the queue is set, and v is removed.
- Every neighbor w of v is then updated; w is placed in the queue if it is not already there.

20 Sum-of-Pairs Speedup
At this point we borrow an A*-like idea:
If (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n), then avoid passing its value forward.
More importantly, avoid putting its neighbors into the queue if they are not already there.
The trick is deciding that (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n).
Q: How do we pull this rabbit out of our hat?

21 Sum-of-Pairs Speedup
Define d_{1,2}(i, j) to be the edit distance between suffixes S_1[i..n] and S_2[j..n]. Define d_{1,3}(i, k) and d_{2,3}(j, k) analogously.
Note: these edit distances can be computed in O(n^2) time via DP on the reversed strings.
Observation: any shortest path from (i, j, k) to (n,n,n) must have distance at least
d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k)

22 Sum-of-Pairs Speedup
Suppose we have an alignment (from somewhere) with an SP distance score z.
Core idea: if D(i, j, k) + d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k) > z, then node (i, j, k) cannot be on any shortest path.
- Do not pass its value forward.
- Do not put its neighbors reached by outgoing edges onto the queue.

23 Sum-of-Pairs Speedup
Benefits of being able to prune cell (i, j, k):
- We automatically prune many of its descendants.
- We don't process all n^k cells in a k-string problem. Big win!
- The computation is still exact and will find the optimal alignment.
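The pruning rule can be sketched for three strings as follows. This is a simplified illustration and not the MSA program: it assumes unit costs (match 0, mismatch 1, space 1), it visits nodes in lexicographic order (a valid topological order of the edit graph) instead of maintaining an explicit queue, and the names `suffix_dist` and `sp_forward` are hypothetical. Indices are 0-based, so `d[i][j]` is the distance between the parts of the two strings that remain after prefixes of lengths i and j. The bound `z` must be the SP distance of some actual alignment, otherwise the target node may never be reached.

```python
from itertools import product

INF = float("inf")
PAIRS = ((0, 1), (0, 2), (1, 2))
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def suffix_dist(a, b, mismatch=1, space=1):
    """d[i][j] = edit distance between the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = INF
            if i < n and j < m:
                best = d[i + 1][j + 1] + (0 if a[i] == b[j] else mismatch)
            if i < n:
                best = min(best, d[i + 1][j] + space)
            if j < m:
                best = min(best, d[i][j + 1] + space)
            d[i][j] = best
    return d

def sp_forward(seqs, z=INF, mismatch=1, space=1):
    """Forward DP for three strings; returns (SP distance, nodes expanded)."""
    lens = tuple(len(s) for s in seqs)
    low = {p: suffix_dist(seqs[p[0]], seqs[p[1]], mismatch, space) for p in PAIRS}
    dist = {(0, 0, 0): 0}
    expanded = 0
    for v in product(*(range(n + 1) for n in lens)):
        if v not in dist:
            continue  # never reached: every predecessor was pruned
        bound = sum(low[a, b][v[a]][v[b]] for a, b in PAIRS)
        if dist[v] + bound > z:
            continue  # cannot lie on any path of total distance <= z
        expanded += 1
        for move in MOVES:
            w = tuple(x + d for x, d in zip(v, move))
            if any(x > n for x, n in zip(w, lens)):
                continue
            cost = 0
            for a, b in PAIRS:
                if move[a] and move[b]:
                    cost += 0 if seqs[a][v[a]] == seqs[b][v[b]] else mismatch
                elif move[a] or move[b]:
                    cost += space
            if dist[v] + cost < dist.get(w, INF):
                dist[w] = dist[v] + cost  # send the value forward
    return dist[lens], expanded

seqs = ("AATGGTTT", "AACGTTAT", "TATCGAAT")
print(sp_forward(seqs))        # no pruning
print(sp_forward(seqs, z=13))  # prune with the SP score of the slide-2 alignment
```

Both calls return the same distance; the second should expand no more nodes than the first, since z = 13 is the score of a known alignment of these strings.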

24 Sum-of-Pairs Speedup
The program called MSA implements the speedup we are discussing.
Cold shower:
- MSA can align 6 strings with n = ~200.
- It is unlikely to be able to align tens or hundreds of strings.
Still, filling all 200^6 cells (= 6.4 * 10^13 cells) would otherwise be impossible.

25 Bounded-Error Approximation for SP-Alignment
Q: Where do we get z from?
A: We will use a bounded-error approximation method.
Properties of the specific method we will discuss:
1. Polynomial worst-case time complexity.
2. The SP-score is less than twice the optimal value.

26 Bounded-Error Approximation for SP-Alignment
Idea: focus on alignments consistent with a tree.
Q: What do we mean by "consistent with a tree"?
Informal explanation: A graph edge denotes a relation between two nodes. Recall that D(S_i, S_j) is the optimal weighted distance between S_i and S_j. We could let D(S_i, S_j) be the edge relation.

27 Bounded-Error Approximation for SP-Alignment
Informal explanation: A graph edge denotes a relation between two nodes. Recall that D(S_i, S_j) is the optimal weighted edit distance between S_i and S_j. We could let D(S_i, S_j) be the edge relation between the node labeled S_i and the node labeled S_j.

28 Bounded-Error Approximation for SP-Alignment
Informal explanation continued:
Suppose we have a multiple alignment M.
Suppose we construct an unrooted tree from a subset of such edges.
- The edges join nodes labeled with strings from M.
We call the alignment of the strings represented in the tree consistent with the tree.
- Recall that D(S_i, S_j) is the edge relation.

29 Bounded-Error Approximation for SP-Alignment
Example from text:
3  A X X _ Z
1  A X _ _ Z
2  A _ X _ Z
4  A Y _ _ Z
5  A Y X X Z

30 Bounded-Error Approximation for SP-Alignment
Defn. More formally, let:
S be a set of distinct strings.
T be an unrooted tree whose nodes are labeled with strings from set S.
M be a multiple alignment of the strings in S.
M is consistent with T if the induced pairwise alignment of S_i and S_j has score D(S_i, S_j) for each pair of strings (S_i, S_j) that label adjacent nodes in T.

31 Bounded-Error Approximation for SP-Alignment
Thm. For any set of strings S and for any tree T whose nodes are labeled by distinct strings from set S, we can efficiently find a multiple alignment M(T) of S that is consistent with T.
Proof sketch: construct M(T) of S one string at a time.
Base case: Pick two strings S_i and S_j labeling nodes adjacent in T. Create M_2(T), a two-string alignment with distance D(S_i, S_j).

32 Bounded-Error Approximation for SP-Alignment
Inductive hypothesis: Assume the theorem holds for some k ≥ 2 strings, i.e., M_k(T) is consistent with T.
Inductive step: show that the theorem holds for k + 1 strings.
Pick a string S_j not in M_k(T) such that it labels a node adjacent to a node labeled S_i already in M_k(T).
Optimally align S_j with S_i (using S_i as it appears, with spaces, in M_k(T)).
Add S_j (with its spaces) to M_k(T), creating M_{k+1}(T).
Look at the detailed proof (pg. 348) to see how the issue of inserted spaces is handled.

33 Bounded-Error Approximation for SP-Alignment
By construction:
- S_j and S_i have distance D(S_i, S_j).
- M_{k+1}(T) is consistent with T.
By induction, M(T) of S is consistent with T and is efficiently computed.

34 Bounded-Error Approximation for SP-Alignment
We need some more definitions at this point:
Defn. The center string S_c ∈ S, where S is a set of k strings, is the string that minimizes M = Σ_{S_j ∈ S} D(S_c, S_j).
Defn. The center star is a star tree of k nodes, with the center node labeled S_c and each of the k-1 remaining nodes labeled by a distinct string in S − {S_c}.

35 Bounded-Error Approximation for SP-Alignment
Defn. The multiple alignment M_c of the strings in S is the multiple alignment consistent with the center star.
Defn. Let d(S_i, S_j) denote the score of the pairwise alignment of strings S_i and S_j induced by M_c.
Defn. Let d(M) denote the score of the alignment M.
Observations:
d(S_i, S_j) ≥ D(S_i, S_j)
d(M_c) = Σ_{i<j} d(S_i, S_j)

36 Bounded-Error Approximation for SP-Alignment
Defn. The triangle inequality with respect to a scoring scheme is the relation s(x, z) ≤ s(x, y) + s(y, z) for any three characters x, y, and z.
We can extend the triangle inequality from the scoring scheme for characters to string alignment.
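Whether a given character scoring scheme satisfies the triangle inequality can be checked by brute force over an alphabet. The sketch below is an illustration only; the scheme `score` (match 0, default mismatch 1 and space 1, with "-" standing for the space character) and the helper `satisfies_triangle` are assumptions, not part of the lecture.

```python
from itertools import product

def score(x, y, mismatch=1, space=1):
    """Assumed character scoring scheme; '-' stands for a space."""
    if x == y:
        return 0
    if x == "-" or y == "-":
        return space
    return mismatch

def satisfies_triangle(alphabet, s):
    """True if s(x, z) <= s(x, y) + s(y, z) for all characters x, y, z."""
    return all(s(x, z) <= s(x, y) + s(y, z)
               for x, y, z in product(alphabet, repeat=3))

print(satisfies_triangle("ACGT-", score))  # True for unit costs
# A mismatch that costs more than two spaces breaks the inequality:
print(satisfies_triangle("ACGT-", lambda x, y: score(x, y, mismatch=3)))  # False
```

The second call fails because s(A, C) = 3 exceeds s(A, -) + s(-, C) = 2.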

37 Bounded-Error Approximation for SP-Alignment
Lemma. If a 2-string scoring scheme that satisfies the triangle inequality is used, then for any S_i and S_j:
d(S_i, S_j) ≤ d(S_i, S_c) + d(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j)
Proof sketch: Notice that for each column we have s(x, z) ≤ s(x, y) + s(y, z). The inequality in the lemma follows immediately. The equality holds since all strings are optimally aligned with S_c.

38 Bounded-Error Approximation for SP-Alignment
We can now establish the bounded-error approximation:
Defn. Let M* denote the optimal alignment of the k strings of S.
Defn. Let d*(S_i, S_j) denote the pairwise alignment score of the strings S_i and S_j induced by M*.

39 Bounded-Error Approximation for SP-Alignment
Thm. d(M_c)/d(M*) ≤ 2(k − 1)/k < 2
See the proof on page 350 for details (it basically depends on the previous lemma).
Corollary: kM/2 ≤ Σ_{i<j} D(S_i, S_j) ≤ d(M*) ≤ d(M_c) ≤ [2(k − 1)/k] Σ_{i<j} D(S_i, S_j)
Recall that M = Σ_{S_j ∈ S} D(S_c, S_j). The first bound holds because summing D(S_i, S_j) over ordered pairs gives at least M for each of the k choices of S_i, and the sum over i < j is half of that.
The alignment score D(S_i, S_j) is not based on M_c or M*.
Observation: d(M_c) / Σ_{i<j} D(S_i, S_j) gives a measure of the goodness of M_c and is guaranteed to be less than 2.
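The center star method and the bound in the corollary can be illustrated with a short sketch. It assumes unit costs (match 0, mismatch 1, space 1, which satisfy the triangle inequality) and merges the pairwise alignments with the center string using the usual "once a gap, always a gap" rule; the names `align`, `center_star`, and `sp_score` and the four test strings are hypothetical.

```python
from itertools import combinations

def align(a, b, mismatch=1, space=1):
    """Optimal global alignment; returns (distance, gapped a, gapped b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * space
    for j in range(1, m + 1):
        D[0][j] = j * space
    sub = lambda i, j: 0 if a[i - 1] == b[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub(i, j),
                          D[i - 1][j] + space, D[i][j - 1] + space)
    ga, gb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub(i, j):
            ga.append(a[i - 1]); gb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + space:
            ga.append(a[i - 1]); gb.append("-"); i -= 1
        else:
            ga.append("-"); gb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ga)), "".join(reversed(gb))

def center_star(strings):
    """Returns (index of center string, rows of M_c with the center row first)."""
    k = len(strings)
    c = min(range(k), key=lambda p: sum(align(strings[p], s)[0] for s in strings))
    rows = [strings[c]]
    for q in range(k):
        if q == c:
            continue
        _, gc, gs = align(strings[c], strings[q])
        merged = [[] for _ in rows] + [[]]
        cur, p, t = rows[0], 0, 0
        while p < len(cur) or t < len(gc):
            if p < len(cur) and cur[p] == "-":    # gap already in the center row
                for r, row in enumerate(rows):
                    merged[r].append(row[p])
                merged[-1].append("-"); p += 1
            elif t < len(gc) and gc[t] == "-":    # new gap required in the center
                for r in range(len(rows)):
                    merged[r].append("-")
                merged[-1].append(gs[t]); t += 1
            else:                                 # same center character in both
                for r, row in enumerate(rows):
                    merged[r].append(row[p])
                merged[-1].append(gs[t]); p += 1; t += 1
        rows = ["".join(col) for col in merged]
    return c, rows

def sp_score(rows, mismatch=1, space=1):
    cost = lambda x, y: 0 if x == y else (space if "-" in (x, y) else mismatch)
    return sum(cost(x, y) for a, b in combinations(rows, 2) for x, y in zip(a, b))

strings = ["AXZ", "AXXZ", "AYZ", "AYXXZ"]
c, rows = center_star(strings)
total = sum(align(a, b)[0] for a, b in combinations(strings, 2))
print(rows, sp_score(rows), total)
```

For any input, the printed SP score should lie between the sum of the optimal pairwise distances and 2(k − 1)/k times that sum, as the corollary states.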

40 Consensus Objective Functions
First fact of consensus representations: There is no consensus as to how to define consensus. Consequently, we will look at several definitions.
Steiner consensus strings:
Defn. Given a set of strings S and a string S´, the consensus error of S´ relative to S is E(S´) = Σ_{S_j ∈ S} D(S´, S_j).
S´ is not required to be a member of S.

41 Consensus Objective Functions
Defn. Given a set of strings S, an optimal Steiner string S* for S minimizes the consensus error E(S*).
S* is not required to be a member of S.
Observations: In S* we are trying to capture the essential common features in S. Computing E(S*) appears to be a hard problem.

42 Consensus Objective Functions
No efficient method for finding S* is known.
- We will consider an approximate method.
Lemma: Assume that S contains k strings and that the scoring scheme satisfies the triangle inequality. There exists a string S´ ∈ S such that E(S´)/E(S*) ≤ 2.
Q: What does this lemma say? (Proof sketch on the next slide.)

43 Consensus Objective Functions
Proof sketch: For any i, D(S´, S_i) ≤ D(S´, S*) + D(S*, S_i), so
E(S´) = Σ_{S_j ∈ S} D(S´, S_j)
and Σ_{S_j ∈ S} D(S´, S_j) ≤ Σ_{S_j ≠ S´} [D(S´, S*) + D(S*, S_j)]
But Σ_{S_j ≠ S´} [D(S´, S*) + D(S*, S_j)] = (k−2) D(S´, S*) + E(S*)
Therefore E(S´) ≤ (k−2) D(S´, S*) + E(S*)
Choosing S´ to be the string in S closest to S* gives D(S´, S*) ≤ E(S*)/k, hence E(S´) ≤ (2 − 2/k) E(S*).

44 Consensus Objective Functions
Q: Where do we find a good candidate for S´?
A: S_c, the center string. Recall that S_c minimizes Σ_{S_j ∈ S} D(S_c, S_j).
Thm. E(S_c)/E(S*) ≤ 2 − 2/k, assuming the scoring scheme satisfies the triangle inequality.
Proof. Follows immediately from the previous lemma and the observation that E(S_c) ≤ E(S´).
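The consensus error and the choice of the center string as a Steiner string candidate can be sketched in a few lines. This is an illustration under assumed unit edit costs; `edit_distance`, `consensus_error`, `center_string`, and the sample strings are hypothetical names and data.

```python
def edit_distance(a, b, mismatch=1, space=1):
    """D(a, b): weighted edit distance, computed one DP row at a time."""
    prev = [j * space for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * space]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (0 if x == y else mismatch),
                           prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def consensus_error(candidate, strings):
    """E(S') = sum over S_j in S of D(S', S_j); S' need not be in S."""
    return sum(edit_distance(candidate, s) for s in strings)

def center_string(strings):
    """The member of S with the smallest consensus error (ties: first one)."""
    return min(strings, key=lambda s: consensus_error(s, strings))

S = ["AXZ", "AXXZ", "AYZ", "AYXXZ"]
for s in S:
    print(s, consensus_error(s, S))
print("center:", center_string(S))
```

The optimal Steiner string S* is not computed here; the theorem only guarantees that the error of the center string is within a factor of 2 − 2/k of E(S*).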

45 Consensus Objective Functions
Consensus strings from multiple alignment
Defn. Let M be a multiple alignment of strings S. The consensus character of column i of M is the character that minimizes the summed distance to all the characters in column i.
Note: the summed distance depends on the pairwise scoring scheme. The plurality character is the consensus character for some scoring schemes.

46 Consensus Objective Functions
Defn. Let d(i) denote the minimum sum in column i.
Defn. The consensus string S_M derived from alignment M is the concatenation of the consensus characters for each column of M.
Q: How can we evaluate the goodness of S_M?
A: One possibility is Goodness(S_M) = Σ_i D(S_M, S_i), i.e., see how good a Steiner string S_M is.
Consider a different approach.

47 Consensus Objective Functions
Defn. The alignment error of S_M, a consensus string containing q characters, is Σ_{i=1}^{q} d(i).
Defn. The alignment error of M is defined as the alignment error of S_M, its consensus string.
Example:
1  A A T - G - T T T
2  A A - C G T T A T
3  T A T C G - A A T
   A A T C G - T A T   Consensus (alignment error of ?)
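The column-wise definitions can be tried on the example with a small sketch. It assumes unit costs (match 0, mismatch 1, space 1, two spaces 0), under which the consensus character can be taken from the characters present in the column; other scoring schemes may need a search over the whole alphabet. The names `pair_cost` and `consensus` are hypothetical.

```python
def pair_cost(x, y, mismatch=1, space=1):
    """Assumed unit-cost character scoring; '-' stands for a space."""
    if x == y:
        return 0
    if x == "-" or y == "-":
        return space
    return mismatch

def consensus(rows, mismatch=1, space=1):
    """Returns (consensus string S_M, alignment error = sum of d(i))."""
    chars, error = [], 0
    for column in zip(*rows):
        # d(i): smallest summed distance from one character to the column
        summed = {c: sum(pair_cost(c, x, mismatch, space) for x in column)
                  for c in sorted(set(column))}
        best = min(summed, key=summed.get)
        chars.append(best)
        error += summed[best]
    return "".join(chars), error

rows = ["AAT-G-TTT", "AA-CGTTAT", "TATCG-AAT"]
print(consensus(rows))  # ('AATCG-TAT', 6)
```

With these assumed costs the sketch reproduces the consensus row shown in the example and reports an alignment error of 6; a different scoring scheme would give a different value.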

48 Consensus Objective Functions
Defn. The optimal consensus multiple alignment is a multiple alignment M whose consensus string S_M has the smallest alignment error over all possible multiple alignments of S.

49 Consensus Objective Functions
The 3 notions of consensus we have discussed are:
1. The Steiner string S* defined from S.
2. The consensus string S_M derived from M, with goodness related to its function as a Steiner string.
3. The consensus string S_M derived from M, with goodness related to its ability to reflect the column-wise properties of M.
Surprisingly (or not) they lead to the same multiple alignment.

50 Consensus Objective Functions
Let's investigate the assertion that these concepts result in the same multiple alignment.
Let S be a set of k strings.
Let T be the star tree with the Steiner string S* at the root and each of the k strings of S at a distinct leaf of T. Then:
Defn. The multiple alignment consistent with S* is the multiple alignment of S ∪ {S*} consistent with T.

51 Consensus Objective Functions
Thm. Let S denote the consensus string of the optimal consensus multiple alignment.
1. Removing the spaces from S results in the optimal Steiner string S*.
2. Removal of S* from the multiple alignment consistent with S* results in the optimal consensus multiple alignment of S.
Proof on page 353.

52 Consensus Objective Functions
Q: Why should we care about this theorem?
A: The theorem stating E(S_c)/E(S*) ≤ 2 − 2/k, plus this theorem, can be used to approximate the optimal consensus alignment:
1. Find the center string S_c. Recall that the center string S_c ∈ S, where S is a set of k strings, is the string that minimizes M = Σ_{S_j ∈ S} D(S_c, S_j).
2. Place S_c at the center of a k-node star.

53 Consensus Objective Functions
3. Label each leaf with a string from S.
4. Construct the multiple alignment M consistent with this tree T.
Recall: M is consistent with T if the induced pairwise alignment of S_i and S_j has score D(S_i, S_j) for each pair of strings (S_i, S_j) that label adjacent nodes in T.

54 Consensus Objective Functions
Revelation: The multiple alignment M is the same as M_c, used to approximate the SP objective function.
Thm. The multiple alignment M_c created by the center star method has:
1. An SP score ≤ (2 − 2/k) times the score of the optimal SP alignment.
2. A consensus alignment error ≤ (2 − 2/k) times the alignment error of the optimal consensus multiple alignment.

55 Phylogenetic trees: Multiple alignment
Phylogenetic tree: a depiction of the evolutionary history of a set of taxa.
The leaves of the tree are labeled by taxa names.
Convention: Each edge (u, v) denotes an ancestor-descendant relation. This relation may be on the basis of morphological attributes or sequence similarity.
The internal nodes represent extinct taxa. The leaves represent currently existing taxa.

56 Phylogenetic trees: Multiple alignment
Two related problems:
1. Find a multiple alignment for a tree:
   a) Given a phylogenetic tree, deduce sequences for the internal nodes to optimize some objective function.
   b) Find the multiple alignment consistent with the tree.
   c) Delete the deduced sequences (internal node labels).
2. Find a tree from a set of leaf sequences (Chapter 17).

57 Phylogenetic trees: Multiple alignment
Let T be a tree with leaf nodes labeled with distinct strings from a set S.
Defn. A phylogenetic alignment for T is an assignment of one string to each internal node.
Note: strings labeling internal nodes need not come from S.

58 Phylogenetic trees: Multiple alignment
Recall that D(S_1, S_2) denotes the edit distance between strings S_1 and S_2.
Defn. The edge distance of edge (i, j) is D(S_i, S_j), where S_i and S_j are the strings labeling nodes i and j, respectively.
Defn. Path distance is the sum of edge distances along the path.
Defn. Phylogenetic alignment distance is the sum of all edge distances in the tree.

59 Phylogenetic trees: Multiple alignment
Phylogenetic alignment problem for T: Find an assignment of strings to internal nodes of T that minimizes the distance of the alignment.

60 Phylogenetic trees: Multiple alignment
Phylogenetic alignment problem for T: The general problem is too hard (NP-complete).
We will consider a heuristic approximate solution.
- The solution is within twice the minimal distance.
- The approach has polynomial time complexity.

61 Phylogenetic trees: Multiple alignment
Defn. A lifted alignment is a phylogenetic alignment in which the string assigned to each internal node is also assigned to one of its children.
Example: [figure not reproduced in the transcript]

62 Phylogenetic trees: Multiple alignment
Lifted Alignment Observation: Each internal node v is labeled by a leaf label appearing in the subtree rooted at v.

63 Phylogenetic trees: Multiple alignment
Plan:
1. Construct a lifted alignment T_L.
2. Initial approach: conceptually transform the optimal phylogenetic alignment.
   Q: Why do we say "conceptually"?
   A: Because we don't have T*, the optimal phylogenetic alignment.
3. Demonstrate a property of T_L: total distance < twice the optimal phylogenetic alignment distance.
4. Next: show how to compute T_L efficiently using DP.

64 Phylogenetic trees: Multiple alignment
Creating T_L:
Start with the input tree T, with leaves labeled by distinct strings.
Let T* denote the optimal phylogenetic alignment for T. (This is the assignment of strings to internal nodes of T that minimizes the total of all edge distances.)
Successively lift each internal node. An internal node can only be lifted if all of its children have been lifted. Leaf nodes are defined to be lifted.

65 Phylogenetic trees: Multiple alignment
Q: How do we "lift" a node?
Let S*_v denote the label of node v in T*.
Assume that v's children have been lifted.
WLOG let the labels of v's children be S_1, S_2, ..., S_k from S.

66 Phylogenetic trees: Multiple alignment
Find the string S_j among the children that is closest to S*_v, i.e., the string S_j such that D(S*_v, S_j) ≤ D(S*_v, S_i) for all i from 1 to k.
Replace S*_v with S_j.

67 Phylogenetic trees: Multiple alignment
Claim: The lifted alignment T_L has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T.
Sketch of proof: Suppose e = (v, w), with v the parent of w, is a nonzero-length edge in T_L.
Suppose v is labeled S_j ∈ S, and w is labeled S_i ∈ S.
If S_j ≠ S_i then the distance of e in T_L is D(S_j, S_i) ≤ D(S_j, S*_v) + D(S*_v, S_i).
But D(S_j, S*_v) + D(S*_v, S_i) ≤ 2 * D(S*_v, S_i)
Q: Why is this true?
A: Because D(S_j, S*_v) ≤ D(S*_v, S_i): S_j was chosen as the child label closest to S*_v.

68 Phylogenetic trees: Multiple alignment
Sketch of proof (continued): What about paths?
Let P_e denote the path from v to the leaf labeled S_i in T*.
By the triangle inequality, D(S*_v, S_i) is at most the sum of the edge distances along P_e.
Hence, if e is a nonzero-length edge in T_L, its distance is at most twice the distance of P_e in T*.

69 Phylogenetic trees: Multiple alignment
The lifted alignment can be computed with DP.
Let T_v be the subtree of T rooted at node v.
Defn. d(v, S) denotes the distance of the best lifted alignment of T_v where v is labeled with S.
Obviously, S must be the label of a leaf in T_v.

70 Phylogenetic trees: Multiple alignment
d(v, S) is computed from the leaves up.
1. The leaves are already considered "lifted".
2. d(v, S) for a parent of leaves is computed by: d(v, S) = Σ_{S´} D(S, S´), where S´ is the label of a child of v.
3. The general recurrence for an internal node is: d(v, S) = Σ_{v´} min_{S´} [D(S, S´) + d(v´, S´)], where v´ is a child of v and S´ labels a leaf in T_{v´}.
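The recurrence for d(v, S) can be sketched with a recursive function over a small tree. The representation is an assumption made for illustration: a leaf is a string, an internal node is a tuple of children, and distances use unit edit costs. The names `edit_distance` and `lifted_table` and the sample tree are hypothetical. Pairwise distances are recomputed on demand here rather than precomputed, so the sketch does not reflect the running time discussed in the time analysis.

```python
def edit_distance(a, b, mismatch=1, space=1):
    """D(a, b): weighted edit distance, computed one DP row at a time."""
    prev = [j * space for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * space]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (0 if x == y else mismatch),
                           prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def lifted_table(tree):
    """Maps each leaf label S in the subtree T_v to d(v, S)."""
    if isinstance(tree, str):   # a leaf is already lifted: d(leaf, S) = 0
        return {tree: 0}
    child_tables = [lifted_table(child) for child in tree]
    labels = [s for table in child_tables for s in table]
    # d(v, S) = sum over children v' of min over S' of [D(S, S') + d(v', S')]
    return {S: sum(min(edit_distance(S, Sp) + d for Sp, d in table.items())
                   for table in child_tables)
            for S in labels}

tree = (("AXZ", "AXXZ"), ("AYZ", "AYXXZ"))
table = lifted_table(tree)
print(table, min(table.values()))
```

The minimum over the root's table is the distance of the best lifted alignment for the sample tree; for a parent of leaves the general recurrence reduces to step 2, since d(v´, S´) = 0 at a leaf.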

71 Phylogenetic trees: Multiple alignment
Time analysis:
Assume that T has k leaves.
Assume that all pairwise distances have been computed.
Q: How long does computing these distances take?
A: O(N^2), where N is the total length of all k strings.
Why is this true? How can we explain it?

72 Phylogenetic trees: Multiple alignment
Time analysis:
The processing at an internal node is O(k^2). Why is this true?
Then the total time is O(N^2 + k^3). Why O(N^2 + k^3) and not O(N^2 + k^2)?
Bottom line: we can compute the optimal lifted alignment in time that is polynomial in the length of the strings and the size of the tree.

