Approximate Bayesian Methods in Genetic Data Analysis
Mark A. Beaumont, University of Reading

Acknowledgements

- Wenyang Zhang, University of Kent
- David Balding, Imperial College, London
- Dave Tallmon, Juneau, Alaska
- Arnaud Estoup, Montpellier
- BBSRC, NERC

General Problem

In population genetics the data we observe have many possible unobservable 'causes', which generally follow a hierarchical structure. For example, genetic data depend on some unknown genealogical history, which in turn depends on the mutation model, demographic history, and the effects of selection. These, in turn, depend on the ecology of the organism. Therefore we have many competing explanations for the data and we wish to choose among them. How to do this?

Be pragmatic – take a Bayesian approach

Bayesian analysis offers a flexible framework for modelling uncertainty. MCMC has made this possible for population genetic problems.

Problems with MCMC-based methods of genealogical inference

- Slow – problems of convergence.
- Difficult to code up.
- Difficult to modify flexibly for different scenarios.
- Difficult to address the questions that biologists want answered. (Hence the rise of cladistic, network-based methods like NCA.)

MCMC is useful, but…

Method for Sampling from the Posterior Distribution

Consider parameters $\theta$ and data $D$. Simulate samples $(\theta_i, D_i)$ from the joint density $P(\theta, D)$:

- First simulate from the prior, $\theta_i \sim P(\theta)$.
- Then simulate from the likelihood, $D_i \sim P(D \mid \theta_i)$.

The posterior distribution for any given $D$ can be estimated by the proportion of all simulated points that correspond to that particular $(\theta, D)$, divided by the proportion of points corresponding to $D$ (ignoring $\theta$).
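
As an illustration, here is a minimal sketch of this rejection scheme for a toy model with discrete data; the binomial model and all settings are illustrative assumptions, not part of the original slides:

```python
# Rejection sampling from the joint density P(theta, D): draws from the
# prior are kept only when the simulated data match the observed data
# exactly, so the kept draws are samples from P(theta | D).
import numpy as np

rng = np.random.default_rng(0)

D_obs = 7          # observed data: 7 successes out of n_trials (assumed toy example)
n_trials = 10
n_sims = 100_000

theta = rng.uniform(0, 1, n_sims)       # theta_i ~ P(theta), the prior
D_sim = rng.binomial(n_trials, theta)   # D_i ~ P(D | theta_i), the likelihood

accepted = theta[D_sim == D_obs]        # keep points that hit the data exactly
print(f"acceptance rate: {accepted.size / n_sims:.3f}")
print(f"posterior mean estimate: {accepted.mean():.3f}")
```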

Data $D$; parameter $\theta$. Prior $p(\theta)$; likelihood $p(D \mid \theta)$; marginal likelihood $p(D)$; posterior distribution

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}.$$

Replace the data with summary statistics

Key points:
- For most problems, we can't hit the data exactly.
- But similar data may have similar posterior distributions.
- If we replace the data with summary statistics, it is easier to decide how 'similar' data sets are to each other (see the sketch below).
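
A sketch of this tolerance-based acceptance step, assuming Euclidean distance on summary statistics scaled to unit variance; the distance choice and all names here are assumptions, not from the slides:

```python
# Standard ABC rejection on summary statistics: keep the parameter draws
# whose simulated summary statistics fall within distance `tol` of the
# observed statistics.
import numpy as np

def abc_reject(s_obs, theta_sims, S_sims, tol):
    """theta_sims : (n,) parameter draws from the prior
    S_sims     : (n, d) summary statistics of the corresponding simulations
    s_obs      : (d,) observed summary statistics
    """
    scale = S_sims.std(axis=0)                            # unit-variance scaling
    dist = np.linalg.norm((S_sims - s_obs) / scale, axis=1)
    return theta_sims[dist <= tol]
```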

History

- Tavaré et al. (1997, Genetics) – specify $P(S \mid \theta)$, [use rejection to] estimate $P(\theta \mid S)$.
- Fu and Li (1997, MBE) – use $S$ and rejection to estimate the posterior distribution of coalescence times (i.e. $P(G \mid S)$).
- Weiss and von Haeseler (1998, Genetics) – use rejection to estimate the likelihood $P(S \mid \theta)$.
- Pritchard et al. (1999, MBE) – use rejection to estimate $P(\theta, G \mid S)$.
- Wall (2000, MBE) – uses rejection to estimate $P(S \mid \theta)$.
- Beaumont et al. (2002, Genetics) – uses regression/rejection to estimate $P(\theta \mid S)$.
- Marjoram et al. (2003, PNAS) – uses MCMC and rejection to estimate $P(\theta \mid S)$.

$\theta$ – demographic/mutational parameters; $S$ – summary statistics; $G$ – genealogy.

Beaumont, Zhang, and Balding (2002) Approximate Bayesian Computation in Population Genetics. Genetics 162.

This is a problem of density estimation. We want to use information about the relationship between the summary statistics and the parameters in the vicinity of the observed summary statistics.

- Keep the idea of accepting points close to those observed in the data.
- Use multiple regression to 'correct' for the relationship between summary statistics and parameter values.
- Downweight points further away from the observed values.

The idea is that we should be able to accept many more points.

Local Linear Regression

Assume we have observed a $d$-dimensional vector of summary statistics $s$, and we have $n$ random draws of a (scalar) parameter $\theta_1, \ldots, \theta_n$ with corresponding summary statistics $S_1, \ldots, S_n$. We scale $s$ and $S_1, \ldots, S_n$ so that $S_1, \ldots, S_n$ have unit variance.

We want to minimize

$$\sum_{i=1}^{n} \left\{ \theta_i - \alpha - (S_i - s)^T \beta \right\}^2 K_\delta(\|S_i - s\|),$$

where $K_\delta(t)$ is the Epanechnikov kernel,

$$K_\delta(t) = c\,\delta^{-1}\left(1 - (t/\delta)^2\right) \ \text{ for } t \le \delta, \qquad K_\delta(t) = 0 \ \text{ for } t > \delta.$$

The solution is

$$(\hat{\alpha}, \hat{\beta}^T)^T = (X^T W X)^{-1} X^T W \theta,$$

where

$$X = \begin{pmatrix} 1 & (S_1 - s)^T \\ \vdots & \vdots \\ 1 & (S_n - s)^T \end{pmatrix}, \qquad W = \mathrm{diag}\bigl(K_\delta(\|S_1 - s\|), \ldots, K_\delta(\|S_n - s\|)\bigr).$$

Our best estimate of the posterior mean is then

$$\hat{\alpha} = e_1^T (X^T W X)^{-1} X^T W \theta,$$

where $e_1$ is the $(d+1)$-length vector $(1, 0, \ldots, 0)^T$.
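
A sketch of this estimator in Python/NumPy under the assumptions above (scalar $\theta$, statistics scaled to unit variance); function and variable names are mine, not from the slides:

```python
# Local linear regression estimate of the posterior mean: Epanechnikov
# kernel weights on the distance in summary-statistic space, then a
# weighted least-squares fit of theta on (S_i - s).
import numpy as np

def epanechnikov(t, delta):
    """Epanechnikov kernel with bandwidth delta. The normalising constant
    is omitted: scaling W by a constant does not change the WLS solution."""
    u = t / delta
    return np.where(u <= 1.0, 1.0 - u**2, 0.0)

def local_linear_posterior_mean(theta, S, s_obs, delta):
    """theta: (n,) prior draws; S: (n, d) simulated statistics; s_obs: (d,)."""
    scale = S.std(axis=0)                 # scale statistics to unit variance
    S, s_obs = S / scale, s_obs / scale

    dist = np.linalg.norm(S - s_obs, axis=1)
    w = epanechnikov(dist, delta)         # kernel weights K_delta(||S_i - s||)
    X = np.hstack([np.ones((len(theta), 1)), S - s_obs])  # rows (1, (S_i - s)^T)

    # weighted least squares: (alpha_hat, beta_hat) = (X'WX)^{-1} X'W theta
    XtW = X.T * w
    coef = np.linalg.solve(XtW @ X, XtW @ theta)
    return coef[0], coef[1:]              # alpha_hat = e_1' coef, and beta_hat
```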

[Figure: kernel weight (from 0 to 1) plotted against the summary statistic, with the corresponding parameter values.]

Obtaining posterior densities and other summaries using the regression approach

We make the assumption that the errors are constant in the interval and adjust the parameter values as

$$\theta_i^* = \theta_i - (S_i - s)^T \hat{\beta}.$$

The posterior density for $\theta$ can then be approximated as

$$\hat{p}(\theta \mid s) = \frac{\sum_i K_\Delta(\theta_i^* - \theta)\, K_\delta(\|S_i - s\|)}{\sum_i K_\delta(\|S_i - s\|)},$$

where $K_\Delta(t)$ is another Epanechnikov kernel with bandwidth $\Delta$. Alternatively, some other density estimation method can be used.
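
A sketch of the adjustment step: `beta_hat` and the weights `w` are assumed to come from a fit like the one above (with `S` and `s_obs` on the same scaled units), and SciPy's weighted `gaussian_kde` is used here as a stand-in for the Epanechnikov density estimate on the slide:

```python
# Regression adjustment: shift each draw theta_i along the fitted plane,
# theta*_i = theta_i - (S_i - s)' beta_hat, then estimate the posterior
# density from the adjusted, kernel-weighted points.
import numpy as np
from scipy.stats import gaussian_kde

def adjust_and_estimate(theta, S, s_obs, beta_hat, w):
    keep = w > 0                                    # points inside the kernel support
    theta_star = theta[keep] - (S[keep] - s_obs) @ beta_hat   # adjusted draws
    kde = gaussian_kde(theta_star, weights=w[keep]) # weighted density estimate
    return theta_star, kde                          # kde(x) approximates p(theta = x | s)
```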

Model Comparison

As noted in Pritchard et al. (1999), we can compare two models, $M_1$ and $M_2$, by evaluating the marginal distribution of the summary statistics at $s$, i.e.

$$\frac{P(M_1 \mid s)}{P(M_2 \mid s)} = \frac{p(s \mid M_1)}{p(s \mid M_2)} \times \frac{P(M_1)}{P(M_2)}.$$

One could use the original Pritchard method (the proportion of points within the tolerance window), or, alternatively, use multivariate kernel methods to estimate the density.
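
A sketch of the acceptance-proportion estimate of the Bayes factor, assuming a common tolerance for both models; the names and interface are mine:

```python
# Pritchard-style model comparison: with a common tolerance, the ratio of
# acceptance proportions under the two models estimates p(s|M1)/p(s|M2)
# (the common tolerance-window volume cancels in the ratio).
import numpy as np

def bayes_factor(dist_m1, dist_m2, tol):
    """dist_m1, dist_m2: distances ||S_i - s|| from simulations under M1, M2."""
    p1 = np.mean(dist_m1 <= tol)   # proportion accepted under M1
    p2 = np.mean(dist_m2 <= tol)   # proportion accepted under M2
    return p1 / p2
```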


Example – estimation of $\theta$ (= $2N\mu$) in a population of constant size

- Simulate 100 data sets of 445 chromosomes, 8 linked microsatellite loci (SMM), $\theta = 10$.
- Summary statistics: mean heterozygosity, mean variance in allele length, number of distinct haplotypes (see the sketch below).
- Rectangular prior: (0, 50).
- Point estimate – posterior mean.
- Also use MCMC (Batwing) to estimate the posterior mean (flat prior).
- Compare the mean square error of the different methods.
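
For concreteness, here is a sketch of how the three summary statistics might be computed from a chromosomes-by-loci array of allele lengths; the array layout and the use of the unbiased gene-diversity formula are my assumptions:

```python
# Compute the three summary statistics used in this example from
# microsatellite data: mean heterozygosity, mean variance in allele
# length, and the number of distinct multilocus haplotypes.
import numpy as np

def summary_stats(alleles):
    """alleles: integer array, shape (n_chrom, n_loci), of repeat lengths."""
    n, L = alleles.shape
    het = []
    for locus in range(L):
        _, counts = np.unique(alleles[:, locus], return_counts=True)
        p = counts / n
        het.append((n / (n - 1)) * (1.0 - np.sum(p**2)))   # unbiased gene diversity
    mean_het = np.mean(het)
    mean_var_len = alleles.var(axis=0).mean()              # mean variance in allele length
    n_haplotypes = len(np.unique(alleles, axis=0))         # distinct multilocus haplotypes
    return np.array([mean_het, mean_var_len, n_haplotypes])
```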

Accuracy in the estimation of the scaled mutation rate $\theta = 2N\mu$

[Figure: relative mean square error plotted against tolerance for MCMC, standard rejection, and the regression method. Summary statistics: mean variance in length, mean heterozygosity, number of haplotypes. Data: linked microsatellite loci.]

Main Conclusion

The regression method allows a much larger proportion of points to be used than the rejection method. This means that more summary statistics can be used in the regression method without compromising accuracy.

Generalisations

- You want to investigate a system which gives rise to genetic and/or ecological data.
- Construct a (complicated) model (individual-based, stage-structured, genealogical…) that gives rise to the same type of data.
- Put priors on all the parameters.
- Decide on the parameters you want to make inferences about.
- Choose summary statistics. Measure these from your data.
- Perform simulations.
- Construct posterior distributions for the parameters of interest, using e.g. the regression methods described here.

Some Papers using Approximate Bayesian approaches

Pritchard method:
- Estoup et al. (2002, Genetics) – demographic history of the invasion of islands by cane toads. 10 microsatellite loci, 22 allozyme loci. 4/3 summary statistics, 6 demographic parameters.
- Estoup and Clegg (2003, Molecular Ecology) – demographic history of the colonisation of islands by silvereyes.

Regression method:
- Tallmon et al. (2004, Genetics) – estimating effective population size by the temporal method. One main parameter of interest (Ne), 4 summary statistics, tested on up to …
- Estoup et al. (2004, Evolution, in press) – demographic history of the invasion of Australia by cane toads. 75/63 summary statistics, model comparison, up to 5 demographic parameters.

[Figure: coalescent schematic, from Tallmon, Luikart, and Beaumont (Genetics, 2004).]

Future Work

- How to choose suitable summary statistics? Need for 'data mining' techniques: projection pursuit, orthogonalisation, stepwise regression. Because the method is quick, one can use e.g. MSE, integrated squared error, coverage etc. as an ultimate criterion.
- Improve conditional density estimation: improve the choice of bandwidth in the kernel; use transformations (e.g. log-linear modelling); quantile regression.