Approximate Bayesian Methods in Genetic Data Analysis Mark A. Beaumont, University of Reading,

1 Approximate Bayesian Methods in Genetic Data Analysis Mark A. Beaumont, University of Reading,

2 Acknowledgements Wenyang Zhang, University of Kent; David Balding, Imperial College, London; Dave Tallmon, Juneau, Alaska; Arnaud Estoup, Montpellier; BBSRC; NERC

3 General Problem In population genetics the data we observe have many possible unobservable ‘causes’, which generally follow a hierarchical structure. For example, genetic data depends on some unknown genealogical history, which in turn depends on the mutation model, demographic history, and the effects of selection. These, in turn, depend on the ecology of the organism. Therefore we have many competing explanations for the data and we wish to choose among them. How to do this?

4 Be pragmatic – take a Bayesian approach Bayesian analysis offers a flexible framework for modelling uncertainty. MCMC has made this possible for population genetic problems.

5 Problems with MCMC-based methods of genealogical inference Slow – problems of convergence. Difficult to code up. Difficult to modify flexibly for different scenarios. Difficult to address the questions that biologists want answered. (Hence the rise of cladistic, network-based methods like NCA.) MCMC is useful, but…

6 Method for Sampling from the Posterior Distribution Consider parameters θ and data D. Simulate samples (θ_i, D_i) from the joint density P(θ, D): first simulate from the prior, θ_i ~ P(θ), then simulate from the likelihood, D_i ~ P(D | θ_i). The posterior distribution for any given D can then be estimated as the proportion of simulated points matching that particular D and θ, divided by the proportion of points matching D (ignoring θ).
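A minimal sketch of this rejection idea, using a toy binomial model with a uniform prior (the model, parameter values, and sample sizes are illustrative, not from the talk):

import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy model: D ~ Binomial(n_trials, theta), uniform prior on theta.
n_trials = 20
observed_D = 7
n_sims = 200_000

theta = rng.uniform(0.0, 1.0, n_sims)   # theta_i ~ P(theta), the prior
D = rng.binomial(n_trials, theta)       # D_i ~ P(D | theta_i), the likelihood

# Points whose simulated data equal the observed data are draws from P(theta | D);
# their relative frequency in any region of theta estimates the posterior probability.
accepted = theta[D == observed_D]
print(accepted.mean(), np.quantile(accepted, [0.025, 0.975]))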

7 Data D; parameter θ. Prior – p(θ); likelihood – p(D | θ); marginal likelihood – p(D); posterior distribution – p(θ | D). Bayes' theorem: p(θ | D) = p(D | θ) p(θ) / p(D).

8 Replace the data with summary statistics Key Points: For most problems, we can’t hit the data exactly. But similar data may have similar posterior distributions. If we replace the data with summary statistics, then it is easier to decide how ‘similar’ data sets are to each other.
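A sketch of the summary-statistic version of the rejection step: accept a simulated parameter whenever its summary statistics fall within a tolerance of the observed ones. The distance function and tolerance value are illustrative choices, not prescribed in the talk.

import numpy as np

def abc_rejection(observed_stats, simulate, prior_draw,
                  n_sims=100_000, tolerance=0.1, seed=None):
    """Basic rejection ABC: keep parameter draws whose simulated summary
    statistics lie within `tolerance` of the observed statistics."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(observed_stats, dtype=float)
    thetas = np.empty(n_sims)
    stats = np.empty((n_sims, obs.size))
    for i in range(n_sims):
        thetas[i] = prior_draw(rng)          # theta_i ~ P(theta)
        stats[i] = simulate(thetas[i], rng)  # S_i from data simulated under theta_i
    # Scale each statistic to unit variance before measuring distance.
    sd = stats.std(axis=0)
    dist = np.sqrt((((stats - obs) / sd) ** 2).sum(axis=1))
    return thetas[dist < tolerance]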

9

10 History Tavaré et al. (1997, Genetics) – specify P(S | θ), [use rejection to] estimate P(θ | S). Fu and Li (1997, MBE) – use S and rejection to estimate the posterior distribution of coalescence times (i.e. P(G | S)). Weiss and von Haeseler (1998, Genetics) – use rejection to estimate the likelihood P(S | θ). Pritchard et al. (1999, MBE) – use rejection to estimate P(θ, G | S). Wall (2000, MBE) – uses rejection to estimate P(S | θ). Beaumont et al. (2002, Genetics) – uses regression/rejection to estimate P(θ | S). Marjoram et al. (2003, PNAS) – uses MCMC and rejection to estimate P(θ | S). θ – demographic/mutational parameters; S – summary statistics; G – genealogy.

11

12 Beaumont, Zhang, and Balding (2002) Approximate Bayesian Computation in Population Genetics. Genetics 162: 2025-2035. This is a problem of density estimation. We want to use information about the relationship between the summary statistics and the parameters in the vicinity of the observed summary statistics. Keep the idea of accepting points close to those observed in the data. Use multiple regression to 'correct' for the relationship between summary statistics and parameter values. Downweight points further away from the observed values. The idea is that we should be able to accept many more points.

13 Local Linear Regression Assume we have observed a d-dimensional vector of summary statistics s, and we have n random draws of a (scalar) parameter, θ_1, …, θ_n, with corresponding summary statistics S_1, …, S_n. We scale s and S_1, …, S_n so that the S_i have unit variance.

14 We want to minimize Σ_{i=1..n} (θ_i − α − (S_i − s)^T β)² K_δ(‖S_i − s‖), where K_δ(t) is the Epanechnikov kernel: K_δ(t) = c δ⁻¹ (1 − (t/δ)²) for t ≤ δ, and K_δ(t) = 0 for t > δ.

15 The solution is (α̂, β̂) = (X^T W X)⁻¹ X^T W θ, where X is the n × (d+1) design matrix whose i-th row is (1, (S_i − s)^T), W is the diagonal matrix with entries W_ii = K_δ(‖S_i − s‖), and θ = (θ_1, …, θ_n)^T.

16 Our best estimate of the posterior mean is then E(θ | s) ≈ α̂ = e_1^T (X^T W X)⁻¹ X^T W θ, where e_1 is a vector of length d+1, (1, 0, …, 0).
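A sketch of this weighted regression in NumPy, following the notation above; the bandwidth delta and the unnormalised kernel constant are illustrative choices.

import numpy as np

def epanechnikov(t, delta):
    """Epanechnikov kernel with bandwidth delta (up to a normalising constant)."""
    u = np.asarray(t) / delta
    return np.where(u <= 1.0, 1.0 - u ** 2, 0.0)

def local_linear_fit(thetas, stats, obs, delta):
    """Weighted least squares of theta on (S - s), weights K_delta(||S_i - s||).
    Returns (alpha_hat, beta_hat, weights); alpha_hat estimates E(theta | s)."""
    X = np.column_stack([np.ones(len(thetas)), stats - obs])    # rows (1, (S_i - s)^T)
    w = epanechnikov(np.linalg.norm(stats - obs, axis=1), delta)
    Xw = X * w[:, None]                                         # apply kernel weights
    coef = np.linalg.solve(X.T @ Xw, X.T @ (w * thetas))        # (X^T W X)^{-1} X^T W theta
    return coef[0], coef[1:], w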

17 Figure: parameter plotted against summary statistic, with point weights ranging from 0 to 1.

18 Obtaining posterior densities and other summaries using the regression approach. We assume that the errors are constant in the interval and adjust the parameter values as θ_i* = θ_i − (S_i − s)^T β̂. The posterior density for θ can then be approximated as p̂(θ | s) = Σ_i K_Δ(θ_i* − θ) K_δ(‖S_i − s‖) / Σ_i K_δ(‖S_i − s‖), where K_Δ(t) is another Epanechnikov kernel with bandwidth Δ. Alternatively, some other density estimation method can be used.
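Continuing the sketch above (reusing local_linear_fit and epanechnikov), the adjustment and the weighted density estimate might look like this; the evaluation grid and bandwidth are left to the user.

def adjusted_posterior_sample(thetas, stats, obs, delta):
    """Adjust accepted parameter values: theta_i* = theta_i - (S_i - s)^T beta_hat."""
    alpha_hat, beta_hat, w = local_linear_fit(thetas, stats, obs, delta)
    theta_star = thetas - (stats - obs) @ beta_hat
    return theta_star, w

def posterior_density(theta_grid, theta_star, w, bandwidth):
    """Weighted kernel density estimate of p(theta | s) on a grid
    (up to the kernel's normalising constant)."""
    dens = np.empty(len(theta_grid))
    for j, t in enumerate(theta_grid):
        k = epanechnikov(np.abs(theta_star - t), bandwidth)
        dens[j] = (w * k).sum() / w.sum()
    return dens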

19 Model Comparison As noted in Pritchard et al. (1999), we can compare two models, M_1 and M_2, by evaluating the marginal distribution of the summary statistics at s, i.e. P(M_1 | s) / P(M_2 | s) = [P(s | M_1) / P(s | M_2)] × [P(M_1) / P(M_2)]. Could use the original Pritchard method (proportion of points within the tolerance window). Alternatively, use multivariate kernel methods to estimate the density.
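A sketch of the simplest version, estimating each P(s | M) by the proportion of simulations falling within the tolerance window; equal model priors and a Euclidean distance on pre-scaled statistics are assumptions for illustration.

def approx_bayes_factor(obs, simulate_m1, simulate_m2, prior_draw_m1, prior_draw_m2,
                        n_sims=100_000, tolerance=0.1, seed=None):
    """Estimate P(s | M1) / P(s | M2) by the acceptance rate under each model
    (assumes both rates are nonzero)."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs, dtype=float)
    def acceptance_rate(simulate, prior_draw):
        hits = 0
        for _ in range(n_sims):
            stats = np.asarray(simulate(prior_draw(rng), rng), dtype=float)
            if np.linalg.norm(stats - obs) < tolerance:
                hits += 1
        return hits / n_sims
    return acceptance_rate(simulate_m1, prior_draw_m1) / acceptance_rate(simulate_m2, prior_draw_m2)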

20

21 Example – estimation of θ (= 2Nμ) in a population of constant size. Simulate 100 data sets of 445 chromosomes, 8 linked microsatellite loci (SMM), θ = 10. Summary statistics: mean heterozygosity, mean variance in allele length, number of distinct haplotypes. Rectangular prior (0, 50). Point estimate – posterior mean. Also use MCMC (Batwing) to estimate the posterior mean (flat prior). Compare the mean square error of the different methods.
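For concreteness, a sketch of how these three summary statistics might be computed from a matrix of microsatellite repeat lengths (rows = chromosomes, columns = loci); the data layout and the particular heterozygosity estimator are assumptions for the example.

def microsat_summaries(alleles):
    """alleles: integer array of repeat lengths, shape (n_chromosomes, n_loci).
    Returns (mean heterozygosity, mean variance in allele length,
    number of distinct multilocus haplotypes)."""
    n, n_loci = alleles.shape
    hets = []
    for locus in range(n_loci):
        counts = np.bincount(alleles[:, locus] - alleles[:, locus].min())
        freqs = counts / n
        # one common (unbiased) estimator of expected heterozygosity
        hets.append((n / (n - 1)) * (1.0 - np.sum(freqs ** 2)))
    mean_het = float(np.mean(hets))
    mean_var_len = float(alleles.var(axis=0, ddof=1).mean())
    n_haplotypes = len({tuple(row) for row in alleles})
    return np.array([mean_het, mean_var_len, n_haplotypes])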

22 Figure: accuracy in the estimation of the scaled mutation rate θ = 2Nμ – relative mean square error plotted against tolerance for MCMC, standard rejection, and the regression method. Summary statistics: mean variance in length, mean heterozygosity, number of haplotypes. Data: linked microsatellite loci.

23

24

25

26 Main Conclusion The regression method allows a much larger proportion of points to be used than the rejection method. This means that more summary statistics can be used in the regression method without compromising accuracy.

27 Generalisations You want to investigate a system which gives rise to genetical and/or ecological data. Construct a (complicated) model (individual-based, stage- structured, genealogical…) that gives rise to the same type of data. Put priors on all the parameters. Decide on the parameters you want to make inferences about. Choose summary statistics. Measure these from your data. Perform simulations. Construct posterior distributions for the parameters of interest, using e.g. the regression methods here.
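Tying the earlier sketches together, the generic workflow on this slide might look roughly as follows. abc_regression reuses adjusted_posterior_sample from above; the simulator, prior, observed statistics, and acceptance fraction are user-supplied placeholders.

def abc_regression(observed_stats, simulate, prior_draw,
                   n_sims=100_000, accept_frac=0.05, seed=None):
    """Generic ABC workflow: simulate, accept the closest fraction of points,
    then regression-adjust the accepted parameter values."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(observed_stats, dtype=float)
    thetas = np.empty(n_sims)
    stats = np.empty((n_sims, obs.size))
    for i in range(n_sims):
        thetas[i] = prior_draw(rng)
        stats[i] = simulate(thetas[i], rng)
    sd = stats.std(axis=0)                        # scale statistics to unit variance
    scaled, scaled_obs = stats / sd, obs / sd
    dist = np.linalg.norm(scaled - scaled_obs, axis=1)
    delta = np.quantile(dist, accept_frac)        # tolerance set by the acceptance fraction
    keep = dist <= delta
    theta_star, w = adjusted_posterior_sample(thetas[keep], scaled[keep], scaled_obs, delta)
    return theta_star, w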

28 Some Papers Using Approximate Bayesian Approaches Pritchard method: Estoup et al. (2002, Genetics) – demographic history of the invasion of islands by cane toads; 10 microsatellite loci, 22 allozyme loci; 4/3 summary statistics, 6 demographic parameters. Estoup and Clegg (2003, Molecular Ecology) – demographic history of the colonisation of islands by silvereyes. Regression method: Tallmon et al. (2004, Genetics) – estimating effective population size by the temporal method; one main parameter of interest (Ne), 4 summary statistics, tested on up to … Estoup et al. (2004, Evolution, in press) – demographic history of the invasion of Australia by cane toads; 75/63 summary statistics, model comparison, up to 5 demographic parameters.

29 Figure: Coalescent. From Tallmon, Luikart, and Beaumont (Genetics, 2004).

30 Future Work How to choose suitable summary statistics? Need for 'data mining' techniques: projection pursuit, orthogonalisation, stepwise regression. Because the method is quick, we can use e.g. MSE, integrated squared error, coverage, etc. as an ultimate criterion. Improve conditional density estimation: improve the choice of bandwidth in the kernel; use of transformations (e.g. log-linear modelling); quantile regression.
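One way this criterion-based choice of summary statistics might be sketched, assuming pseudo-observed data sets with known parameter values can be generated (reuses abc_regression from the earlier sketch; all names here are illustrative):

def score_statistic_subset(stat_indices, true_thetas, pseudo_stats, simulate, prior_draw,
                           n_sims=50_000, accept_frac=0.05):
    """Score a candidate subset of summary statistics by the MSE of the ABC
    posterior-mean estimate over pseudo-observed data sets with known parameters."""
    errors = []
    for true_theta, obs in zip(true_thetas, pseudo_stats):
        theta_star, w = abc_regression(obs[stat_indices],
                                       lambda th, rng: simulate(th, rng)[stat_indices],
                                       prior_draw, n_sims=n_sims, accept_frac=accept_frac)
        post_mean = np.average(theta_star, weights=w)
        errors.append((post_mean - true_theta) ** 2)
    return float(np.mean(errors))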

