The Most General Markov Substitution Model on an Unrooted Tree


The Most General Markov Substitution Model on an Unrooted Tree Von Bing Yap Department of Statistics and Applied Probability National University of Singapore Acknowledgements: Rongli Zhang, Lior Pachter

Statistical Phylogenetics Neyman (1971): phylogenetics can be formulated as an inference problem, where the unknown parameters are (1) tree topology (primary) (2) evolution process (“nuisance”)

Methods in framework: maximum likelihood, Bayesian; to a lesser extent, parsimony and distance.

Stochastic Models In almost all substitution models used, the transition probabilities are generated from (1) a reversible rate matrix (REV or a special case) or (2) a family of reversible rate matrices (Yang and Roberts 1995, Galtier and Gouy 1998).

Special vs General The choice between simple and general model is a trade-off between bias and variance. Simple: smaller variance in estimates, larger bias. General: larger variance in estimates, smaller bias.

The case for general models The tree parameter space, unlike the process parameter space, is unchanged when the model becomes more general, so the bias-variance trade-off may be less of an issue. It is plausible that general models may get the right tree more often.

Binary Unrooted Trees An (unrooted) tree where all internal nodes are of degree 3. Such a tree arises from a rooted binary tree where the root has degree 2.

General Markov Process Pick any node in an unrooted tree as the “root”. All edges become directed. Parameters: (1) base frequencies at the root; (2) a transition probability matrix Pab for each directed edge going from node a to node b.

Alternative Views Picking another node as the root, and appropriate parameter values, can give the same joint distribution at all nodes. This is a Markov random field, which can be viewed as a Markov process in many ways.

Same as before, if diag(π0) P01 = t(P10) diag(π1)
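
The identity above says that the joint distribution of the states at the two ends of an edge can be written starting from either end. The following is a minimal sketch in plain Python, assuming a two-state alphabet and made-up values for π0 and P01 (the function name `reroot` is hypothetical, not from the talk). It derives π1 and P10 from π0 and P01 and checks the identity entry by entry.

```python
def reroot(pi0, P01):
    """Move the root from node 0 to node 1 along one edge.

    Returns (pi1, P10) with pi0[i] * P01[i][j] == pi1[j] * P10[j][i].
    Assumes every entry of pi1 is positive.
    """
    n = len(pi0)
    # Marginal distribution at node 1: pi1 = pi0 * P01.
    pi1 = [sum(pi0[i] * P01[i][j] for i in range(n)) for j in range(n)]
    # Reverse transition matrix by Bayes' rule; row j is P(state at 0 | state j at 1).
    P10 = [[pi0[i] * P01[i][j] / pi1[j] for i in range(n)] for j in range(n)]
    return pi1, P10


# Illustrative values only (not from the talk).
pi0 = [0.3, 0.7]
P01 = [[0.9, 0.1],
       [0.2, 0.8]]
pi1, P10 = reroot(pi0, P01)

# Check diag(pi0) P01 == t(P10) diag(pi1), entry by entry.
for i in range(2):
    for j in range(2):
        assert abs(pi0[i] * P01[i][j] - P10[j][i] * pi1[j]) < 1e-12
```

Repeating this edge by edge moves the root to any other node without changing the joint distribution.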

Identifiability Chang (1996): if the root base frequencies are all positive, and all transition matrices are invertible and diagonally dominant, then the parameters are uniquely determined by the joint distribution at the leaf nodes.

Inference Given an unrooted tree, find the most likely process. The maximised log-likelihood is the support for the tree. Choose the tree with the highest support. The likelihood of the data is computed with Felsenstein’s algorithm.
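
As an illustration of the likelihood computation, here is a minimal sketch of Felsenstein's pruning recursion in plain Python. The tree encoding (a leaf is a string name, an internal node is a list of `(child, P)` pairs), the function names, and all numerical values are assumptions made for this example. Each edge may carry its own transition matrix, as in the general Markov process.

```python
import math


def partial_likelihood(node, site, k):
    """L[a] = P(observed leaf states below node | state a at node)."""
    if isinstance(node, str):                 # leaf: look up its observed state
        return [1.0 if a == site[node] else 0.0 for a in range(k)]
    like = [1.0] * k
    for child, P in node:                     # internal node: (child, P) pairs
        below = partial_likelihood(child, site, k)
        for a in range(k):
            like[a] *= sum(P[a][b] * below[b] for b in range(k))
    return like


def log_likelihood(root, pi, sites):
    """Sum over independent sites of log P(leaf pattern)."""
    k = len(pi)
    total = 0.0
    for site in sites:
        like = partial_likelihood(root, site, k)
        total += math.log(sum(pi[a] * like[a] for a in range(k)))
    return total


# Toy 4-leaf tree ((A,B),(C,D)), rooted at the internal node next to A and B.
P = [[0.9, 0.1], [0.2, 0.8]]                  # illustrative two-state matrix
inner = [("C", P), ("D", P)]
root = [("A", P), ("B", P), (inner, P)]
pi = [0.3, 0.7]
sites = [{"A": 0, "B": 0, "C": 1, "D": 1},
         {"A": 1, "B": 1, "C": 1, "D": 1}]
ll = log_likelihood(root, pi, sites)
```

The cost is linear in the number of nodes per site, because each node's vector is computed once from its children.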

Parameter Estimation on Fixed Tree Case 1: all node states are observed. Case 2: only leaf node states are observed.

Case 1: all node states observed Pick a root node. The base frequencies at root are estimated by the observed frequencies. For each directed edge, the transition probability is obtained by dividing the frequency table of changes by the row sums. These are MLEs.

π0 estimated from base composition of sequence 0. Three frequency tables: F01 for going from 0 to 1, F02, and F03. P01 estimated by dividing each row of F01 by its sum; etc. …
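
The Case 1 estimators amount to counting. A minimal sketch in plain Python, assuming states are coded as integers 0..k-1 and using two short made-up sequences (the function names are hypothetical):

```python
def estimate_root(seq, k):
    """Observed base frequencies of the root sequence."""
    return [seq.count(a) / len(seq) for a in range(k)]


def estimate_edge(parent, child, k):
    """Frequency table F and row-normalised estimate P for one directed edge.

    Assumes every state occurs at least once in the parent sequence,
    otherwise a row sum is zero and the division fails.
    """
    F = [[0] * k for _ in range(k)]
    for a, b in zip(parent, child):
        F[a][b] += 1
    P = [[F[a][b] / sum(F[a]) for b in range(k)] for a in range(k)]
    return F, P


# Made-up two-state sequences at node 0 and node 1.
seq0 = [0, 0, 0, 1, 1, 1, 1, 0]
seq1 = [0, 0, 1, 1, 1, 1, 0, 0]
pi0 = estimate_root(seq0, 2)
F01, P01 = estimate_edge(seq0, seq1, 2)
```

The same call with sequence 2 or sequence 3 as the child gives F02, P02 and F03, P03.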

Case 2: only leaf node states observed Root sequence and all frequency tables are unknown, so are random variables under the model. Let θ0 be a set of parameter estimates. Find conditional expectation of root sequence and frequency tables, given leaf sequences, under θ0.

Put π0, P01, P02 and P03 as determined by θ0. Find conditional expectation of sequence 0 and frequency tables F01, F02 and F03 under these parameters, given the observed data.

Case 2 (continued) Problem reduced to Case 1, with unobserved root sequence and frequency tables replaced by conditional expectations, the imputed data. Applying Case 1 gives a new set of parameter estimates θ1.

EM algorithm Start with an initial parameter estimate θ0. E-step: find conditional expectation of the root sequence and frequency tables, given leaf sequences, under model θ0.

EM algorithm M-step: find MLE θ1 based on the imputed data. Iterate the process to get θ2, θ3 … The likelihoods are guaranteed not to decrease, and generally converge to a local maximum (Baum 1973, Dempster et al. 1977).
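
The two steps can be sketched for the smallest case used in these slides: a hidden node 0 joined to three observed leaves. This is a minimal illustration in plain Python, not the author's implementation. The function name `em_step`, the toy data and the starting values are all assumptions.

```python
import math


def em_step(pi, Ps, sites):
    """One EM iteration on a star tree: hidden node 0, observed leaves 1..3.

    pi: root frequencies; Ps: [P01, P02, P03]; sites: list of (x1, x2, x3).
    Returns (new_pi, new_Ps, loglik), where loglik is evaluated at the inputs.
    """
    k = len(pi)
    exp_root = [0.0] * k                                  # expected root counts
    exp_F = [[[0.0] * k for _ in range(k)] for _ in Ps]   # expected F01, F02, F03
    loglik = 0.0
    for site in sites:
        # E-step: posterior of the root state given the three leaf states.
        joint = [pi[a] * math.prod(P[a][x] for P, x in zip(Ps, site))
                 for a in range(k)]
        p_site = sum(joint)
        loglik += math.log(p_site)
        for a in range(k):
            w = joint[a] / p_site
            exp_root[a] += w
            for e, x in enumerate(site):
                exp_F[e][a][x] += w
    # M-step: apply the Case 1 estimators to the imputed data.
    new_pi = [c / len(sites) for c in exp_root]
    new_Ps = [[[F[a][b] / sum(F[a]) for b in range(k)] for a in range(k)]
              for F in exp_F]
    return new_pi, new_Ps, loglik


# Made-up two-state data: 15 sites observed at the three leaves.
sites = [(0, 0, 0)] * 6 + [(1, 1, 1)] * 5 + [(0, 0, 1)] * 2 + [(1, 0, 1), (0, 1, 0)]
pi = [0.6, 0.4]                                           # arbitrary start
Ps = [[[0.8, 0.2], [0.3, 0.7]] for _ in range(3)]
history = []
for _ in range(30):
    pi, Ps, ll = em_step(pi, Ps, sites)
    history.append(ll)
```

The values in `history` should never decrease from one iteration to the next. On a larger tree the E-step would need expected counts at every internal node, which takes an upward and a downward pass rather than the single sum used here.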

Applications (1) Simulated sequences with the Jukes-Cantor model (2) Simulated sequences with Felsenstein’s example (3) Phylogeny of human, mouse, dog and chicken

Siddall (1998) In 4-taxon simulation studies using the Jukes-Cantor model, parsimony outperforms ML analysis based on Jukes-Cantor and even general reversible (REV) models.

Two cases: small t with large z; t and z equal and moderate.

t = 0.03, z = 0.75
# sites  JC  REV  GEN
100  41%  40%  94%
1,000  52%  50%  61%
10,000  83%  81%  80%

t = 0.15, z = 0.75
# sites  JC  REV  GEN
100  54%  56%  84%
1,000  97%  91%
10,000  100%

t = 0.30, z = 0.75
# sites  JC  REV  GEN
100  73%  69%  87%
1,000  100%  99%
10,000

t = 0.75, z = 0.75
# sites  JC  REV  GEN
100  78%  71%  58%
1,000  100%
10,000

Felsenstein (1983)

π = (1 - R, R)
P1 (row 1) = (1 - Q, Q)
P1 (row 2) = (0, 1)
P2 (row 1) = (1 - P, P)
P2 (row 2) = (0, 1)

Not a usual model The equilibrium distribution of the transition matrices is (0 1): eventually all states become 1. Still fits in the present framework.
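
The equilibrium claim can be checked numerically. The short sketch below, in plain Python, builds the two matrices from the R, Q, P values quoted with the simulation results in these slides. The helper `vec_mat` is hypothetical.

```python
R, Q, P = 0.20, 0.10, 0.35          # values quoted with the simulation results
pi = [1 - R, R]
P1 = [[1 - Q, Q], [0.0, 1.0]]
P2 = [[1 - P, P], [0.0, 1.0]]


def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


# (0, 1) is left unchanged by both matrices, so it is their equilibrium.
assert vec_mat([0.0, 1.0], P1) == [0.0, 1.0]
assert vec_mat([0.0, 1.0], P2) == [0.0, 1.0]

# Repeated application of P1 drives the root distribution towards (0, 1).
dist = pi
for _ in range(200):
    dist = vec_mat(dist, P1)
```

State 1 is absorbing, so neither matrix can be produced by a reversible rate matrix with positive equilibrium frequencies.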

Reconstruction Someone simulates sequences with some R, Q, P values hidden from you. You are asked to reconstruct the tree: 3 possibilities.
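
The "3 possibilities" are the three unrooted binary topologies on four labelled leaves. In general the count for n leaves is (2n - 5)!!, which a few lines of Python can tabulate (the function name is hypothetical):

```python
def num_unrooted_binary_trees(n_leaves):
    """Number of labelled unrooted binary topologies, (2n - 5)!!, for n >= 3."""
    count = 1
    for odd in range(3, 2 * n_leaves - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= odd
    return count


counts = {n: num_unrooted_binary_trees(n) for n in range(3, 8)}
```

The count grows quickly with n, which is why larger problems usually rely on heuristic tree search rather than on fitting every topology.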

Using the usual models, how much information should be put in? Should the branches e have the same length, and f have the same length? Seems unnatural, and may be unfair to some possibilities.

Fit General Model Simulation results: percentage accuracy for R = 0.20, Q = 0.10, P = 0.35:
# sites  ML  Pars
50  80%  39%
100  95%  34%
200  100%  33%

Tree 1: human and dog are sister taxa Tree 2: human and mouse are sister taxa

Comparison of Trees Data: ~40,000 4D (fourfold degenerate) sites from the CFTR region, from Lior Pachter. 6 analyses: all, first 1/5, next 1/5, … All analyses support Tree 1 over Tree 2.

Discussion The General Markov process is simple to imagine and to estimate. No explicit branch lengths. Approximate lengths possible. No natural way of incorporating site heterogeneity.
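
The slides do not say which approximate length is meant. One common candidate, assumed here purely for illustration, is the log-determinant of the estimated transition matrix: -log|det P| / k for k states. For a Jukes-Cantor matrix this recovers the branch length exactly, as the sketch below checks. For other processes it is only an approximation.

```python
import math


def det(M):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))


def logdet_length(P):
    """Approximate edge length from a transition matrix (assumed definition)."""
    return -math.log(abs(det(P))) / len(P)


# Jukes-Cantor transition matrix for a branch of d = 0.3 substitutions per site.
d = 0.3
e = math.exp(-4.0 * d / 3.0)
same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
P_jc = [[same if i == j else diff for j in range(4)] for i in range(4)]
length = logdet_length(P_jc)
```

A singular matrix has determinant zero and no finite length under this definition. That includes any estimate with two identical rows.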