DIVERSIFYING SELECTION AND FUNCTIONAL CONSTRAINT: ESTIMATING

Slides:



Advertisements
Similar presentations
Introduction to Monte Carlo Markov chain (MCMC) methods
Advertisements

Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Background The demographic events experienced by populations influence their genealogical history and therefore the pattern of neutral polymorphism observable.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Hidden Markov Model 11/28/07. Bayes Rule The posterior distribution Select k with the largest posterior distribution. Minimizes the average misclassification.
Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA
Phylogenetic Trees Presenter: Michael Tung
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
Course overview Tuesday lecture –Those not presenting turn in short review of a paper using the method being discussed Thursday computer lab –Turn in short.
Monte Carlo methods for estimating population genetic parameters Rasmus Nielsen University of Copenhagen.
Using ranking and DCE data to value health states on the QALY scale using conventional and Bayesian methods Theresa Cain.
Probabilistic methods for phylogenetic trees (Part 2)
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
Gil McVean Department of Statistics, Oxford Approximate genealogical inference.
TOWARDS TESTING THE EPIDEMIC CLONE MODEL OF BACTERIAL PATHOGENS Daniel J. Wilson, Gilean A.T. McVean and Martin C.J. Maiden Peter Medawar Building for.
ABC The method: practical overview. 1. Applications of ABC in population genetics 2. Motivation for the application of ABC 3. ABC approach 1. Characteristics.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
Getting Parameters from data Comp 790– Coalescence with Mutations1.
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Latent Class Regression Model Graphical Diagnostics Using an MCMC Estimation Procedure Elizabeth S. Garrett Scott L. Zeger Johns Hopkins University
The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)
FINE SCALE MAPPING ANDREW MORRIS Wellcome Trust Centre for Human Genetics March 7, 2003.
California Pacific Medical Center
Bayesian Travel Time Reliability
Copyright © 2009 Pearson Education, Inc. 8.1 Sampling Distributions LEARNING GOAL Understand the fundamental ideas of sampling distributions and how the.
Bayesian Modelling Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
Modelling evolution Gil McVean Department of Statistics TC A G.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Hierarchical Models. Conceptual: What are we talking about? – What makes a statistical model hierarchical? – How does that fit into population analysis?
Monkey Business Bioinformatics Research Center University of Aarhus Thomas Mailund Joint work with Asger Hobolth, Ole F. Christiansen and Mikkel H. Schierup.
Markov Chain Monte Carlo in R
Lecture 15 - Hypothesis Testing
DIVERSIFYING SELECTION AND FUNCTIONAL CONSTRAINT
Chapter 8: Estimating with Confidence
From: Phylogenetic Inference via Sequential Monte Carlo
Chapter 8: Estimating with Confidence
Chapter 4 Basic Estimation Techniques
IMa2(Isolation with Migration)
Constrained Hidden Markov Models for Population-based Haplotyping
Model Inference and Averaging
Maximum likelihood (ML) method
Jun Liu Department of Statistics Stanford University
Michael Epstein, Ben Calderhead, Mark A. Girolami, Lucia G. Sivilotti 
Models of Sequence Evolution
Bayesian inference Presented by Amir Hadadi
Hidden Markov Models Part 2: Algorithms
Fiona E. Müllner, Sheyum Syed, Paul R. Selvin, Fred J. Sigworth 
Summary and Recommendations
Ch13 Empirical Methods.
Chapter 8: Estimating with Confidence
John Wakeley, Rasmus Nielsen, Shau Neen Liu-Cordero, Kristin Ardlie 
The Most General Markov Substitution Model on an Unrooted Tree
Chapter 8: Estimating with Confidence
Volume 5, Issue 4, Pages e4 (October 2017)
Chapter 8: Estimating with Confidence
Parametric Methods Berlin Chen, 2005 References:
Learning From Observed Data
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Chapter 8: Estimating with Confidence
Summary and Recommendations
Chapter 8: Estimating with Confidence
Applied Statistics and Probability for Engineers
Presentation transcript:

DIVERSIFYING SELECTION AND FUNCTIONAL CONSTRAINT: ESTIMATING THE dN/dS RATIO FOR GENE SEQUENCES IN THE PRESENCE OF RECOMBINATION Daniel J. Wilson and Gil McVean Peter Medawar Building for Pathogen Research and Departments of Statistics, Oxford University Overview A popular way to model natural selection in gene sequences is by way of the dN/dS ratio, which is the relative rate of non-synonymous to synonymous mutations in the evolutionary history of a sample of sequences. Traditionally the dN/dS ratio has been used as a summary statistic in phylogenetics, but the mutation models of Goldman and Yang (1994) and Nielsen and Yang (1998) promoted dN/dS to the status of a parameter, which they called ω. Treating dN/dS as a mutation parameter is a useful way to model natural selection as a form of mutational bias. Values of ω less than one are interpreted as functional constraint, values greater than one can be interpreted as diversifying selection, and ω equal to one corresponds to selective neutrality. Estimating the value of ω for a sample of gene sequences allows inference to be performed on the mode of selection acting on a region, or particular sites within that region. Maximum likelihood methods that estimate ω are in widespread use, and have been applied to many organisms. However, their common assumption of complete linkage along the sequence has been shown to lead to false positives when that assumption is violated by the presence of recombination (Schierup and Hein, 2000; Anisimova et al, 2003; Shriner et al, 2003). We present work-in-progress on a new method that aims, not only to estimate ω in the presence of recombination, but to co-estimate the recombination rate itself. We demonstrate its application in estimating a constant ω along a sequence, and in estimating site-wise ω’s, and discuss briefly the direction we hope to take this research. The model We adopted Nielsen and Yang’s (1998) model of mutation at the codon level, which defines the rate of mutation from codon i to codon j as where πj is the equilibrium frequency of codon j, λ is the rate of synonymous transversion per unit time (we use time units of PNe generations; P is the ploidy and Ne the effective population size), κ is the transition to transversion ratio, and ω is the dN/dS ratio. Following Nielsen and Yang, we treat natural selection as a form of mutational bias which does not perturb the distribution of topology or branch lengths of the genealogical tree from the neutral probability model (which is owing to Kingman (1982) and, with recombination, Hudson (1983), Griffiths and Marjoram (1996)). Parameter estimation depends on evaluation of the likelihood function Pr(D|Θ), where D is the data, Θ = {λ, κ, ω, ρ}, and ρ is the recombination rate. Computational techniques have been developed that calculate the likelihood function by evaluating the likelihood given a tree and numerically integrate over trees, by Importance Sampling or Monte Carlo Markov Chain (MCMC). Such approaches are computationally intensive, extremely time consuming and are currently feasible only for estimation of a small number of parameters. Instead we make use of an approximation to the likelihood function (Li and Stephens, 2003) which we modify to incorporate the Nielsen and Yang mutation model. The algorithm for Li and Stephens’ “PAC” likelihood is extremely fast, and is based on reducing the dependency structure between trees along a sequence down to a Hidden Markov Model (HMM), where mutation is modelled in the emission probability. In our modification, we make the emission probability explicitly dependent on our mutation parameters λ, κ, and ω, so that we can evaluate the likelihood with respect to them. Figure 2 Sampling distributions for ML estimates of λ, κ, ω, ρ for 100 datasets simulated with ω = 0.25 (red histograms) and 100 datasets simulated with ω = 10 (yellow histograms). Each dataset consists of 20 sequences, length 300bp, under fixed λ = 1.0, κ = 6.0, ρ = 2.0 per codon. For λ, κ and ρ, the sampling distributions are largely overlapping, but the ML estimates correctly distinguish between ω = 0.25 and ω = 10. However, for ω = 10 the ML estimates are considerably downwards biased. Estimates of λ appear to be upwardly biased. The means of the sampling distributions for λ and κ are somewhat elevated for ω = 10, suggesting that there is overlap in the information used to estimate λ, κ and ω; this may explain the downwards bias in the ML estimate of ω. Constant ω We have investigated two models for estimating ω from a sample of sequences, the first of which treats ω as constant along the sequence. We used an implementation of Powell’s multi-dimensional optimization algorithm (Press et al, 2002) to obtain maximum likelihood estimates of all four parameters in the constant ω model. To study the properties of the ML estimates we simulated 20 sequences of length 300bp under fixed λ = 1.0, κ = 6.0, ρ = 2.0. One hundred simulations were performed with ω = 0.25 and another 100 with ω = 10. For each simulated data set, the ML estimates were obtained and the resulting sampling distributions for these estimates are illustrated in figure 2. The ML estimates exhibit bias (see figure 2 legend), and the estimates of mutation parameters appear to be sensitive to one another’s values, suggesting that there is some overlap in the information ML utilises. That the ML estimates are biased is perhaps unsurprising, however the parameters are of the correct order of magnitude, and the method appears to be capable of distinguishing between functional constraint and diversifying selection. As well as being biased, the ML estimates do not communicate the gradient of the likelihood surface around those point estimates, and asymptotic results for confidence intervals cannot be reliably applied. For these reasons we have adopted a Bayesian approach to parameter estimation, and we apply this to estimation in the case of variable ω. Figure 1 Example of a log likelihood surface with respect to ω and ρ, using our modification to the PAC likelihood. The data consisted of 50 simulated sequences of length 1000 codons generated under the following parameters: λ = 1.0, κ = 6.0, ω = 0.25, ρ = 2.0. The maximum likelihood estimates of ω and ρ are 0.17 and 1.5 respectively, or 0.25 and 1.5 when the true values of λ and κ are provided. The log likelihood surface is plotted for the true values of λ and κ. Figure 3 Posterior distributions for ω at each of ten sites in a simulated dataset. Green histograms: ω = 0.25. Red histograms: ω = 10. Variable ω To estimate ω in a site-wise manner, we employed an MCMC sampler to generate the posterior distribution of ω at each codon in a simulated dataset. We simulated a 10-codon sequence for n = 50, λ = 1.0, κ = 6.0, ρ = 2.0 and two values of ω. Sites 0-3, 8 and 9 were simulated under functional constraint (ω = 0.25, green histograms in figure 3) while sites 4-7 were simulated under diversifying selection (ω = 10, red histograms). To generate the site-wise posterior distributions of ω, all other parameters were held constant at their true values, and the MCMC sampler was run for 500,000 iterations. We chose an exponential prior on ω with mean 10 (shallow lines in figure 3). On first glance, there is no evidence that the posterior distributions of ω for sites under diversifying selection (red histograms) are higher than for sites under functional constraint (green histograms). However, when generating the data, we recorded the posterior distribution of ω given the true tree at each site, using the pruning algorithm (Felsenstein, 1981). This is illustrated by the other black lines in figure 3. The histograms for the posterior distributions, which use the approximate likelihood to integrate over trees, are fit the by the true posteriors remarkably well. This indicates two things. Firstly, the MCMC sampler and the likelihood are working very well, despite the approximate nature of the likelihood model. Secondly, even knowing the true genealogy cannot help us to distinguish between sites with high and low ω’s. For this simulated dataset, for example, it would be simply impossible to distinguish the codons under diversifying selection from the constrained sites, at least in a site-wise manner. Nevertheless, if contiguous codons share a common ω, we could apply a model in which the sequence is split into several sections, within which codons have the same ω. In a Bayesian framework, we can specify a prior on the number of splits in a given sequence, and using reversible-jump MCMC, we hope to be able to sample from the posterior number and positions of these splits. In this way we hope to improve on our site-wise approach by combining information from contiguous sites, enabling us to estimate ω along sequences that have experienced recombination. The PAC likelihood is based on expanding the likelihood function Pr(D|Θ) into a product of conditional likelihoods Pr(D1|Θ)Pr(D2|D1,Θ)…Pr(Dn|D1,…Dn-1,Θ). PAC approximates Pr(Dk+1|D1,…,Dk,Θ) using an HMM, in which the (k+1)th sequence is a mosaic of the first k sequences, which may be imperfect owing to mutation. At a given site, the HMM latent variable specifies which of the k other sequences the (k+1)th is a copy of. Along the sequence, recombination is modelled by switching the latent variable between the k other sequences. Thus, the (k+1)th sequence can be a copy of different sequences at different sites, but the structure of the HMM makes it likely that contiguous sites are copies from the same sequence. We use the same HMM for recombination along the sequence, but we modify the way that mutation is treated. In particular, we make explicit the idea of a time t to the common ancestor of the (k+1)th sequence and the sequence it copies from. We approximate the distribution of t as an exponential random variate with rate k. At each site, we take the probability (under the Nielsen and Yang model) of observing the codon at the (k+1)th sequence and the codon at the sequence that is copied, given the time t to their common ancestor, and integrate over t. This, the emission probability, is a function of λ, κ, and ω, so we can evaluate the likelihood with respect to these parameters. Figure 1 shows an example of the log-likelihood evaluated using this approximation, plotted against ω and ρ, with λ and κ fixed at their maximum likelihood (ML) point estimates. The data consisted of 20 simulated sequences, of length 300bp, generated with λ = 1.0, κ = 6.0, ω = 0.25, ρ = 2.0 per codon. In the expansion of the likelihood function Pr(D|Θ) into a product of conditional likelihoods, the order in which the sequences are labelled 1,…,n is irrelevant; they are exchangeable. But it the PAC likelihood, because the conditional likelihoods are only approximate, the ordering is relevant. Therefore, when we evaluate the likelihood function, we average over several orderings of the sequences, chosen at random. In the examples presented here, we average over ten orderings. The ten chosen orderings are preserved when evaluating the likelihood at different parameter values. Acknowledgments Thanks go to Stephen Leslie and Jonathan Marchini for help and advice, and to the BBSRC for funding. Cited References Anisimova et al (2003) Genetics 164: 1229-1236 Felsenstein (1981) J Mol Evol 17: 368-376 Goldman and Yang (1994) Mol Biol Evol 11: 725-736 Griffiths and Marjoram (1996) J Comput Biol 3: 479-502 Hudson (1983) Theor Popul Biol 23: 183-201 Kingman (1982) J Appl Probab 19A: 27-43 Li and Stephens (2003) Genetics 165: 2213-2233 Nielsen and Yang (1998) Genetics 148: 929-936 Press et al (2002) Numerical Recipes in C++. CUP Schierup and Hein (2000) Genetics 156: 879-891 Shriner et al (2003) Genet Res 81: 115-121 www.stats.ox.ac.uk/mathgen www.neisseria.org www.medawar.ox.ac.uk