Maximum Parsimony, Probabilistic Models of Evolution, Distance Based Methods. Lecture 12 © Shlomo Moran, Ilan Gronau.


Maximum Parsimony
A character-based reconstruction method.
Input: h sequences (one per species), all of length k.
Goal: find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Parsimony score
The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.
[Figure: two trees over the leaf sequences AGA, GGA, AAA, AAG; one tree has parsimony score 3, the other has parsimony score 4.]

Parsimony Based Reconstruction
We have here both the small and the big problem:
1. The small problem: find the parsimony score for a given leaf-labeled tree.
2. The big problem: find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.
We will see efficient algorithms for (1); (2) is hard.

Fitch's Algorithm: Maximum Parsimony for a Given Tree
Input: a rooted binary leaf-labeled tree.
Output: a most parsimonious assignment of states to internal vertices.
Work on each position independently. Make one pass from the leaves to the root, and another pass from the root to the leaves.
[Figure: a small tree with leaf states A, A, C, T and intermediate state sets such as A/T and A/C.]

Fitch's Algorithm, More Detailed
- Traverse the tree from leaves to root, fixing a set of possible states (e.g. nucleotides) for each internal vertex.
- Traverse the tree from root to leaves, picking a unique state for each internal vertex.

Fitch's Algorithm – Phase 1
- Do a post-order (from leaves to root) traversal of the tree, assigning to each vertex a set of possible states. Each leaf has a unique possible state, given by the input.
- The set of possible states R_i of internal node i with children j and k is given by:
  R_i = R_j ∩ R_k  if R_j ∩ R_k ≠ ∅,  and  R_i = R_j ∪ R_k  otherwise.

Fitch's Algorithm – Phase 1
Claim (to be proved soon): the number of substitutions in an optimal solution equals the number of union operations.
[Figure: the Phase 1 sets on an example tree with leaves C, T, A, G, C, including internal sets such as CT, AGC and GC.]

Fitch's Algorithm – Phase 2
- Do a pre-order (from root to leaves) traversal of the tree.
- The state of the root is an arbitrary r_root ∈ R_root.
- The state r_j of internal node j with parent i is selected as follows:
  r_j = r_i  if r_i ∈ R_j,  and otherwise r_j is an arbitrary element of R_j.

Fitch's Algorithm – Phase 2
[Figure: the Phase 2 assignment on the example tree with leaves C, T, A, G, C.]
The algorithm could also select C as the assignment to the root. All other assignments cannot be changed.
Complexity: O(nk), where n is the number of leaves and k is the number of states. For m characters the complexity is O(nmk).
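The two phases above can be sketched in Python for a single character. The tuple-based tree encoding and the function names are illustrative assumptions, not part of the lecture; Phase 2 here recomputes the Phase 1 sets per call for simplicity.

```python
# A minimal sketch of Fitch's algorithm for one character. A tree is a
# nested tuple (left, right); a leaf is its one-letter state string.

def fitch_phase1(node):
    """Post-order pass: return (set R of possible states, #substitutions)."""
    if isinstance(node, str):                 # leaf: state fixed by the input
        return {node}, 0
    r_left, s_left = fitch_phase1(node[0])
    r_right, s_right = fitch_phase1(node[1])
    inter = r_left & r_right
    if inter:                                 # non-empty intersection: keep it
        return inter, s_left + s_right
    return r_left | r_right, s_left + s_right + 1   # a union costs one mutation

def fitch_phase2(node, parent_state=None):
    """Pre-order pass: pick a concrete state for every vertex."""
    states, _ = fitch_phase1(node)            # recomputed per call; fine for a sketch
    state = parent_state if parent_state in states else min(states)
    if isinstance(node, str):
        return node                           # leaves keep their input state
    return (state, fitch_phase2(node[0], state), fitch_phase2(node[1], state))

# Example tree with leaves C, T, A, G, C:
tree = (("C", "T"), ("A", ("G", "C")))
root_set, score = fitch_phase1(tree)          # score 3, via 3 union operations
```

Note how the score comes out of Phase 1 alone; Phase 2 only materializes one optimal labeling.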

Proof of Fitch's Algorithm
We'll show that Fitch minimizes the parsimony score of the leaf-labeled input tree.
Definitions:
- For a leaf-labeled tree T, let T* be an optimal assignment of labels to internal nodes of T, and let T*(v) be the assignment at internal node v.
- Let T_v be the subtree rooted at v.

Claim: Let R_i be the set of states kept in the 1st phase at vertex i. Then s ∈ R_i iff there exists an optimal assignment T_i* with T_i*(i) = s.
Proof: By induction on the tree height h.
Basis: h = 1.
I. If both children have the same state – zero changes.
II. Otherwise – exactly one change.
[Figure: a root with children A, A (no change) and a root with children A, B (one change).]

Induction step: Assume correctness for height h and prove for h+1. Let p_1 and p_2 be the optimal costs of the subtrees of i's children. If the intersection of i's children's lists is not empty, then the optimal score is p_1 + p_2, and it can be achieved by labeling i with any member of the intersection, and only in this way. Otherwise, the optimal score is p_1 + p_2 + 1, and it can be achieved by labeling i with any member of the union of the lists, and only in this way.
[Figure: a vertex with child lists A,B and C,D gets the union A,B,C,D; a vertex with child lists A,B and B,C gets the intersection B.]

Weighted Maximum Parsimony
Some mutations may be more probable than others. Hence, a natural generalization of the Maximum Parsimony problem is Weighted Parsimony. You'll see it in the tutorial.

Weighted Parsimony (Sankoff's Algorithm)
Weighted parsimony score:
- Input: a tree with characters at the leaves, and a weight function on the mutations: c(a,b) is the weight of the mutation a → b.
- Output: an assignment of characters to internal vertices which minimizes the total weight of the mutations.
- The weighted parsimony score reduces to the parsimony score when c(a,a) = 0 and c(a,b) = 1 for all b ≠ a.

Weighted Parsimony on a Given Tree
Each position is independent and computed by itself, using dynamic programming.
Let S(j,b) be the optimal score of the subtree rooted at j when j has the character b. If i is a node with children j and k, then
S(i,a) = min_b (S(j,b) + c(a,b)) + min_b' (S(k,b') + c(a,b'))

Evaluating Parsimony Scores
Dynamic programming on a given tree.
Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞.
Iteration: for each node i with children j and k:
S(i,a) = min_x (S(j,x) + c(a,x)) + min_y (S(k,y) + c(a,y))
Termination: the cost of the tree is min_x S(r,x), where r is the root.
Comment: To reconstruct an optimal assignment, we need to keep, at each node i and for each character a, the two characters x, y that minimize the cost when i has character a.

Cost of Evaluating Parsimony for Binary Trees
For a tree with n nodes and a single character with k values, the complexity is O(nk²). When there are m such characters, it is O(nmk²).
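The initialization / iteration / termination steps above can be sketched as follows. The tuple tree encoding and the unit-cost example are assumptions for illustration; with the unit cost the result reduces to the unweighted parsimony score, as the slides note.

```python
import math

# A sketch of the weighted-parsimony DP (Sankoff) for a single character.
# Leaves are one-letter strings; internal nodes are (left, right) tuples.

ALPHABET = "ACGT"

def sankoff(node, c):
    """Return {a: S(node, a)}, the optimal subtree cost for each root state a."""
    if isinstance(node, str):
        # Initialization: 0 for the leaf's own label, infinity otherwise
        return {a: (0.0 if a == node else math.inf) for a in ALPHABET}
    Sj, Sk = sankoff(node[0], c), sankoff(node[1], c)
    # Iteration: S(i,a) = min_x(S(j,x)+c(a,x)) + min_y(S(k,y)+c(a,y))
    return {a: min(Sj[x] + c(a, x) for x in ALPHABET)
             + min(Sk[y] + c(a, y) for y in ALPHABET)
            for a in ALPHABET}

def unit_cost(a, b):
    """With c(a,a)=0 and c(a,b)=1 this reduces to the unweighted score."""
    return 0 if a == b else 1

S_root = sankoff((("C", "T"), ("A", ("G", "C"))), unit_cost)
tree_cost = min(S_root.values())   # Termination: min_x S(r, x)
```

Each internal node does O(k²) work (a min over k entries for each of k states), matching the O(nk²) bound above.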

Is Maximum Parsimony a Reliable Criterion?
The motivation for the Perfect Phylogeny and Maximum Parsimony methods comes from models where the characters are "significant", and hence the number of observed mutations is likely to be as small as possible. When the characters are DNA sequences, common models of evolution assume that mutations are random events. A natural question is whether maximum parsimony is a good method for reconstructing phylogenies in such models. Next we formulate and discuss this question.

Probabilistic Models of Evolution
A simple (yet quite common) model of evolution, called Jukes-Cantor (JC), assumes:
1. Mutations at different "sites" are i.i.d. (independent and identically distributed).
2. On each edge, all mutations have the same probability.
Other models usually assume 1, but give different probabilities to different types of mutations.

The JC model: each edge (u,v) corresponds to a probabilistic mutation matrix P_uv (rows and columns indexed A, G, C, T):

         A      G      C      T
  A    1-3p     p      p      p
  G      p    1-3p     p      p
  C      p      p    1-3p     p
  T      p      p      p    1-3p

p depends on the "length" of the edge.

A "Model Tree"
A model tree in the JC model is an evolution tree which evolves according to the JC model. Formally, it consists of:
1. A directed tree T = (V,E).
2. A distribution of DNA letters at the root.
3. An assignment of JC transition matrices to the edges of T.
The JC model (and other common models) assumes that the distribution at the root is uniform: each letter occurs with probability 1/4. This distribution is preserved in all other vertices of the tree.
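A small numerical check of the edge matrix above and of the claim that the uniform distribution is preserved across an edge; the p value is an arbitrary illustration.

```python
# A sketch of the JC edge matrix P_uv and a check that the uniform root
# distribution (1/4 per letter) is preserved across an edge.

def jc_matrix(p):
    """4x4 JC transition matrix (rows/columns ordered A, G, C, T)."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]

P = jc_matrix(0.1)

# Each row is a probability distribution: (1 - 3p) + 3p = 1.
row_sums = [sum(row) for row in P]

# pi * P = pi for the uniform distribution pi = (1/4, 1/4, 1/4, 1/4).
pi = [0.25] * 4
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
```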

A "model quartet" in the JC model
[Figure: a rooted quartet with leaves A, B, C, D.]
Each edge may have a different mutation probability.

Consistency of Reconstruction Algorithms
A tree reconstruction method (like maximum parsimony) is said to be "consistent" for a probabilistic model of evolution if the following holds for any phylogenetic tree which fits the model: as the sequence length goes to ∞, the reconstructed tree is w.h.p. the true tree. For the maximum parsimony method, this is equivalent to: the true tree is w.h.p. a most parsimonious tree.

Of specific interest: reconstructing quartets
[Figure: an unrooted quartet on leaves A, B, C, D.]
Correct reconstruction of (undirected) quartets is equivalent to finding the split defined by the middle edge, (A,B;C,D).

Example: Checking Consistency of Maximum Parsimony on Quartet Reconstruction
Phase 1: Simulate evolution on the given quartet (sequences of 500 DNA bases).
[Figure: the quartet with leaves A, B, C, D.]
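Phase 1 can be sketched as follows. The specific edge mutation probabilities and the placement of leaves on the two internal vertices are illustrative assumptions; only the sequence length of 500 bases and the JC mechanics come from the slides.

```python
import random

# A sketch of Phase 1: simulating JC evolution of 500-base sequences
# down a quartet. Edge probabilities are illustrative assumptions.

def evolve(seq, p):
    """One JC edge: each site mutates with prob. 3p, uniformly to another base."""
    bases = "ACGT"
    return "".join(
        random.choice([b for b in bases if b != c]) if random.random() < 3 * p else c
        for c in seq)

random.seed(0)
k = 500
root = "".join(random.choice("ACGT") for _ in range(k))  # uniform root distribution
internal = evolve(root, 0.01)    # middle edge of the quartet
A = evolve(root, 0.05)           # A, B hang off the root-side vertex
B = evolve(root, 0.05)
C = evolve(internal, 0.05)       # C, D hang off the other internal vertex
D = evolve(internal, 0.05)
```

Phase 2 would then search for a most parsimonious tree over the four leaf sequences A, B, C, D.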

Phase 2: Find a most parsimonious tree for the sequences at the leaves A, B, C, D.

MP is consistent for the given model tree if w.h.p. the most parsimonious tree gives the correct split. As we will see next, Maximum Parsimony is not consistent for certain quartets.

Consistency Question for Maximum Parsimony
Assuming the JC model, the consistency question for the Maximum Parsimony method for a given model tree is the following: assume that the mutations along the edges occurred according to the JC model. Is the true tree likely to have a minimum parsimony score?

Inconsistency of Maximum Parsimony
Maximum Parsimony is not consistent for the JC and other similar probabilistic models of DNA evolution. In such models there are scenarios of evolution in which the most parsimonious tree is w.h.p. different from the true tree. We illustrate this on quartets. A quartet on 4 species has 3 possible topologies (splits): (1,2;3,4), (1,3;2,4), and (1,4;2,3).

A quartet which is unlikely to be reconstructed by maximum parsimony
Consider the following model quartet, where the probability of a substitution is proportional to edge length.
[Figure: a quartet with root character A; leaves 2 and 3 hang on short edges, leaves 1 and 4 on long edges.]
In this tree, the characters at leaves 2 and 3 are w.h.p. the same as at the origin, while those at leaves 1 and 4 are more likely to be different.

Parsimony may be useless/misleading for reconstructing the true tree
Assume the (likely) scenario where leaves 2 and 3 have the same character as the root, say A. There are 4 patterns of substitution for leaves 1 and 4:
I. Neither changes (e.g. 1 = A, 4 = A).
II. Exactly one changes (e.g. 1 = A, 4 = G).
III. Both change, to different characters (e.g. 1 = G, 4 = C).
IV. Both change, to the same character (e.g. 1 = G, 4 = G).

Case I: all topologies get the same parsimony score.
[Figure: all three topologies with every leaf labeled A; Score = 0.]

Case II: all topologies get the same score.
[Figure: all three topologies with one leaf labeled G and the rest A; Score = 1.]

Case III: again, all topologies get the same score.
[Figure: all three topologies with leaves 1, 4 labeled G and C among the A's; Score = 2.]

Case IV: the most parsimonious topology is wrong.
[Figure: the topology grouping leaves 1 and 4 (which carry the same new character) together has Score = 1, while the other topologies, including the true one, have Score = 2.]

Parsimony is useful only in the least likely cases
For the most parsimonious tree to be the correct tree, it is necessary that leaves 2 and 3 have different characters – which is less likely than all the other cases.

Another problem with Maximum Parsimony (and other character-based algorithms): efficiency
There are no efficient algorithms for solving the "big" problem for maximum parsimony / perfect phylogeny (both are known to be NP-hard). Mainly for this reason, the most widely used approaches for solving the big problem are distance-based methods.

Distance-based Methods for Constructing Phylogenies
This approach attempts to overcome the two weaknesses of maximum parsimony:
1. It starts by estimating inter-taxa distances from a well-defined statistical model of evolution (distances correspond to probabilities of change).
2. It provides efficient algorithms for the big problem.
Basic idea: the differences between species (usually represented by sequences of characters) are transformed into numerical distances, and an edge-weighted tree realizing these distances is constructed.

Distance-Based Reconstruction
1. Compute distances between all taxon pairs.
2. Find an edge-weighted tree best describing the distances.

Distance-based methods for constructing phylogenies
Common issues:
- Evolutionary model: molecular clocks vs. variable rates of evolution.
- Algorithms for exact distances: do not handle real data.
- Algorithms for noisy distances.

Data → Distances → Trees
1. Modeling question: given the data (e.g. DNA sequences of the taxa), how do we define distances between taxa?
2. Algorithmic question: decide if the distances define a tree (ultrametric or additive – to be defined later), and if so, construct that tree.
3. In reality, the computed distances are noisy, so we need the algorithm to return a tree which approximates the distances of the input data.
In the following we shall study items 2 and 1, and briefly discuss item 3.

Ultrametric and Tree Metric
A distance metric on a set M of L objects is a function d (represented by a symmetric matrix) satisfying:
- d(i,i) = 0, and d(i,j) > 0 for i ≠ j.
- d(i,j) = d(j,i).
- For all i, j, k: d(i,k) ≤ d(i,j) + d(j,k) (the triangle inequality).
A metric is ultrametric if it corresponds to distances between leaves of a tree which admits a molecular clock. It is a tree metric, or additive, if it corresponds to distances between nodes in a weighted tree.

1st model: Molecular Clock → Ultrametric Trees
A molecular clock assumes a constant rate of evolution: the distances from any extinct taxon (internal vertex) to all its current descendants are identical. A rooted tree satisfying this property is called ultrametric.

Ultrametric trees
Definition: An ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth.
Basic property: define the height of the leaves to be 0. Then edge weights can be represented by the heights of the internal vertices.
[Figure: an ultrametric tree on leaves A, E, D, C, B with internal-vertex heights 8, 5, 3, 3.]

Least Common Ancestors and distances in an Ultrametric Tree
Let LCA(i,j) denote the least common ancestor of leaves i and j. Let height(LCA(i,j)) be its distance from the leaves, and dist(i,j) the distance from i to j.
Observation: for any pair of leaves i, j in an ultrametric tree: height(LCA(i,j)) = 0.5 · dist(i,j).
The heights height(LCA(i,j)) for the tree above:

     A  B  C  D  E
  A  0  8  8  5  3
  B     0  3  8  8
  C        0  8  8
  D           0  5
  E              0

Ultrametric Matrices
Definition: A distance matrix* U of dimension L × L is ultrametric iff for every 3 indices i, j, k: U(i,j) ≤ max {U(i,k), U(j,k)}.
Theorem: The following conditions are equivalent for an L × L distance matrix U:
1. U is an ultrametric matrix.
2. There is an ultrametric tree with L leaves such that for each pair of leaves i, j: U(i,j) = height(LCA(i,j)) = ½ dist(i,j).
* Recall: a distance matrix is a symmetric matrix with positive non-diagonal entries and 0 diagonal entries which satisfies the triangle inequality.
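The three-point condition can be checked directly. For a symmetric matrix, requiring U(i,j) ≤ max{U(i,k), U(j,k)} for all orderings of a triple is equivalent to requiring that the two largest of the three pairwise entries are equal; the sketch below uses that form. The example matrix is the 5-leaf LCA-height matrix from the earlier slide.

```python
from itertools import combinations

# A sketch of checking the ultrametric condition
# U(i,j) <= max{U(i,k), U(j,k)} for all triples of a symmetric matrix U.
# Equivalently: among U(i,j), U(i,k), U(j,k), the two largest are equal.

def is_ultrametric(U):
    n = len(U)
    for i, j, k in combinations(range(n), 3):
        d = sorted([U[i][j], U[i][k], U[j][k]])
        if d[1] != d[2]:          # strict unique maximum: condition violated
            return False
    return True

# LCA-height matrix of the 5-leaf example (rows/cols A, B, C, D, E)
U = [[0, 8, 8, 5, 3],
     [8, 0, 3, 8, 8],
     [8, 3, 0, 8, 8],
     [5, 8, 8, 0, 5],
     [3, 8, 8, 5, 0]]
```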

Ultrametric tree ⇒ Ultrametric matrix
If there is an ultrametric tree s.t. U(i,j) = ½ dist(i,j), then U is an ultrametric matrix, by the properties of least common ancestors in trees.
[Figure: leaves i, j, k where LCA(k,j) lies below LCA(i,j) = LCA(k,i), hence U(k,i) = U(j,i) ≥ U(k,j).]

Ultrametric matrix ⇒ Ultrametric tree
The proof is based on the two observations below.
Definition: Let U be an L × L matrix, and let S ⊆ {1,...,L}. U[S] is the submatrix of U consisting of the rows and columns with indices from S.
Observation 1: U is ultrametric iff for every S ⊆ {1,...,L}, U[S] is ultrametric.
Observation 2: If U is ultrametric and max_{i,j} U(i,j) = M, then M appears in every row of U. (If U(j,k) = M, then for any i, one of U(i,j), U(i,k) must be M.)

Ultrametric matrix ⇒ Ultrametric tree: Proof by induction
U is an ultrametric matrix ⇒ U has an ultrametric tree. By induction on L, the size of U.
Basis:
L = 1: T is a single leaf.
L = 2: T is a tree with two leaves i, j.

Induction step
L > 2. Use the 1st row to split the set {1,…,L} into two subsets: S_1 = {i : U(1,i) = M}, S_2 = {1,…,L} − S_1 (note: 0 < |S_i| < L).
Example: S_1 = {2,4}, S_2 = {1,3,5}.

Induction step
By Observation 1, U_1 = U[S_1] and U_2 = U[S_2] are ultrametric. Let M_1 (resp. M_2) be the maximal entry of U_1 (resp. U_2). Note that M_1 ≤ M, and M_2 < M (M_2 is the 2nd largest element in row 1; if M_2 = 0 then T_2 is a leaf). By induction there are ultrametric trees T_1 and T_2 for U_1 and U_2. Join T_1 and T_2 into T with a new root of height M, as shown.
[Figure: a root of height M with subtrees T_1 (root height M_1) and T_2 (root height M_2).]

Proof (end)
We need to prove that T is an ultrametric tree for U, i.e., that U(i,j) is the label of the LCA of i and j in T. If i and j are in the same subtree, this holds by induction. Otherwise height(LCA(i,j)) = M, since they are in different subtrees; and indeed [U(1,i) = M and U(1,j) ≠ M] ⇒ U(i,j) = M, by the ultrametric condition on the triple 1, i, j.
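The inductive proof translates directly into a recursive construction: split the taxa by the maximal entries of the first row, recurse on S_1 and S_2, and join the two subtrees under a root of height M. The tuple encoding (height, left, right) and the helper names below are assumptions for illustration; the example matrix is the earlier 5-leaf height matrix.

```python
# A sketch of the recursive construction from the proof. An internal node
# is a tuple (height, left, right); a leaf is its taxon index.

def build_tree(U, taxa=None):
    if taxa is None:
        taxa = list(range(len(U)))
    if len(taxa) == 1:
        return taxa[0]                                  # basis: a single leaf
    first = taxa[0]
    M = max(U[first][j] for j in taxa[1:])              # max entry of the 1st row
    S1 = [j for j in taxa[1:] if U[first][j] == M]      # S_1 = {i : U(first,i) = M}
    S2 = [first] + [j for j in taxa[1:] if U[first][j] < M]
    return (M, build_tree(U, S2), build_tree(U, S1))    # join under root height M

def leaves(tree):
    return {tree} if isinstance(tree, int) else leaves(tree[1]) | leaves(tree[2])

def lca_height(tree, i, j):
    """Height of LCA(i,j); by the theorem it should equal U(i,j)."""
    h, left, right = tree
    in_left = {i, j} & leaves(left)
    if len(in_left) == 2 and not isinstance(left, int):
        return lca_height(left, i, j)
    if len(in_left) == 0 and not isinstance(right, int):
        return lca_height(right, i, j)
    return h

U = [[0, 8, 8, 5, 3],
     [8, 0, 3, 8, 8],
     [8, 3, 0, 8, 8],
     [5, 8, 8, 0, 5],
     [3, 8, 8, 5, 0]]
T = build_tree(U)
```

Checking that lca_height(T, i, j) == U[i][j] for every pair of leaves verifies the constructed tree against the input matrix.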