Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Most General Markov Substitution Model on an Unrooted Tree

Similar presentations


Presentation on theme: "The Most General Markov Substitution Model on an Unrooted Tree"— Presentation transcript:

1 The Most General Markov Substitution Model on an Unrooted Tree
Von Bing Yap Department of Statistics and Applied Probability National University of Singapore Acknowledgements: Rongli Zhang, Lior Pachter

2 Statistical Phylogenetics
Neyman (1971): phylogenetics can be formulated as an inference problem, where the unknown parameters are (1) tree topology (primary) (2) evolution process (“nuisance”)

3 Methods in framework Maximum likelihood Bayesian to a lesser extent
Parsimony Distance

4 Stochastic Models In almost all substitution models used, the transition probabilities are generated from (1) a reversible rate matrix (REV or special case) or (2) a family of reversible rate matrices (Yang and Roberts 1985, Galtier and Gouy 1998)

5 Special vs General The choice between simple and general model is a trade-off between bias and variance. Simple: smaller variance in estimates, larger bias. General: larger variance in estimates, smaller bias.

6 The case for general models
The tree parameter space, unlike the process parameters, is unchanged, so the bias-variance trade-off maybe less of an issue. It is plausible that general models may get the right tree more often.

7 Binary Unrooted Trees An (unrooted) tree where all internal nodes are of degree 3. Such a tree arises from a rooted binary tree where the root has degree 2.

8

9 General Markov Process
Pick any node in an unrooted tree as the “root”. All edges become directed. Parameters: base frequency at root transition probability matrix Pab for the directed edge going from node a to node b.

10

11 Alternative Views Picking another node as the root, and appropriate parameter values, can give the same joint distribution at all nodes. This is a Markov random field, which can be viewed as a Markov process in many ways.

12 Same as before, if diag(π0) P01 = t(P10) diag(π1)

13 Identifiability Chang (1996): if root base frequencies are all positive, all transition probabilities are invertible and diagonal dominant, then parameters are uniquely determined by the joint distribution at leaf nodes.

14 Inference Given an unrooted tree, find the most likely process. The loglikelihood is the support for the tree. Choose the tree with the highest support. Likelihood of data computed with Felsenstein’s algorithm.

15 Parameter Estimation on Fixed Tree
Case 1: all node states are observed Case 2: only leaf node states are observed

16 Case 1: all node states observed
Pick a root node. The base frequencies at root are estimated by the observed frequencies. For each directed edge, the transition probability is obtained by dividing the frequency table of changes by the row sums. These are MLEs.

17 π0 estimated from base composition of sequence 0. Three frequency tables: F01 for going from 0 to 1, F02, and F03. P01 estimated by dividing each row of F01 by sum; etc ….

18 Case 2: only leaf node states observed
Root sequence and all frequency tables are unknown, so are random variables under the model. Let θ0 be a set of parameter estimates. Find conditional expectation of root sequence and frequency tables, given leaf sequences, under θ0.

19 Put π0, P01, P02 and P03 as determined by θ0. Find conditional expectation of sequence 0 and frequency tables F01, F02 and F03 under these parameters, given the observed data.

20 Case 2 (continued) Problem reduced to Case 1, with unobserved root sequence and frequency tables replaced by conditional expectations, the imputed data. Applying Case 1 gives a new set of parameter estimates θ1.

21 EM algorithm Start with an initial parameter estimate θ0.
E-step: find conditional expectation of the root sequence and frequency tables, given leaf sequences, under model θ0.

22 EM algorithm M-step: find MLE θ1 based on the imputed data.
Iterate the process to get θ2, θ3 … The likelihoods are guaranteed to increase and converge to a local maximum (Baum 1973, Rubin et al 1977).

23 Applications Simulated sequences with Jukes-Cantor model
Simulated sequences with Felsenstein’s example Phylogeny of human, mouse, dog and chicken

24 Sidall (1998) In 4-taxon simulation studies using Jukes-Cantor model, parsimony outperforms ML analysis based on Jukes-Cantor and even general reversible (REV) models.

25 small t, large z t and z are equal and moderate

26 t = 0.03, z = 0.75 # sites JC REV GEN 100 41% 40% 94% 1,000 52% 50%
61% 10,000 83% 81% 80%

27 t = 0.15, z = 0.75 # sites JC REV GEN 100 54% 56% 84% 1,000 97% 91%
10,000 100%

28 t = 0.30, z = 0.75 # sites JC REV GEN 100 73% 69% 87% 1,000 100% 99%
10,000

29 t = 0.75, z = 0.75 # sites JC REV GEN 100 78% 71% 58% 1,000 100%
10,000

30 Felsenstein (1983)

31 π = (1–R R) P1 (row 1) = (1–Q Q) P1 (row 2) = ( ) P2 (row 1) = (1–P P) P2 (row 2) = ( )

32 Not a usual model The equilibrium distribution of the transition matrices is (0 1): eventually all states become 1. Still fits in the present framework.

33 Reconstruction Someone simulates sequences with some R, Q, P values hidden from you. You are asked to reconstruct the tree: 3 possibilities.

34 Using the usual models, how much information
should be put in? Should the branches e have the same length, and f have the same length? Seems unnatural, and may be unfair to some possibilities.

35 Fit General Model # sites ML Pars 50 80% 39% 100 95% 34% 200 100% 33%
Simulation results: percentage accuracy for R = 0.20, Q = 0.10, P = 0.35: # sites ML Pars 50 80% 39% 100 95% 34% 200 100% 33%

36 Tree 1: human and dog are sister taxa Tree 2: human and mouse are sister taxa

37 Comparison of Trees Data: ~40,000 4D sites from the CFTR region from Lior Pachter 6 analyses: all, first 1/5, next 1/5,… All analyses support Tree 1 over Tree 2.

38 Discussion The General Markov process is simple to imagine and to estimate. No explicit branch lengths. Approximate lengths possible. No natural way of incorporating site heterogeneity.


Download ppt "The Most General Markov Substitution Model on an Unrooted Tree"

Similar presentations


Ads by Google