Some Developments of ABC
David Balding, John Molitor, David Welch
Imperial College London

Canonical ABC (rejection) algorithm
- Simulate a parameter value θ* from the prior.
- Simulate a dataset X* under the model with θ = θ*.
- If X* is sufficiently close to the observed data x, retain θ*; otherwise discard it.
- Iterate as required.
The retained θ* values form an approximate random sample from the posterior P(θ | X = x).
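
A minimal sketch of the rejection algorithm above. The toy problem (inferring the mean of a Normal with known unit variance from 50 observations, the sample mean as summary statistic, a Normal(0, 10) prior, tolerance 0.05) is my own choice for illustration, not something from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy set-up (my own choice): observed data from Normal(2, 1), summary = sample mean.
x_obs = rng.normal(2.0, 1.0, size=50)
s_obs = x_obs.mean()

def rejection_abc(n_iter=100_000, tol=0.05):
    accepted = []
    for _ in range(n_iter):
        theta = rng.normal(0.0, 10.0)              # 1. simulate theta* from the prior
        x_sim = rng.normal(theta, 1.0, size=50)    # 2. simulate a dataset under theta*
        if abs(x_sim.mean() - s_obs) < tol:        # 3. retain theta* if X* is close to x
            accepted.append(theta)
    return np.array(accepted)                      # approx. sample from P(theta | x)

post = rejection_abc()
print(post.mean(), post.std(), len(post))
```

The accepted values can then be summarised directly, e.g. by their mean and standard deviation, as an approximation to the posterior.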

(Figure: the θ values of the accepted points are treated as a random sample from the posterior.)

Features of canonical ABC
- We get approximate likelihood-based inferences without calculating the likelihood.
- Closeness of X* to x is usually assessed via summary statistics, but this isn't essential.
- Simulation from the prior also isn't essential.
- Different definitions of "close" can be tried for the same set of simulations.
- Simulations can be reused for different observed datasets.

When to use ABC?
When the likelihood is hard to compute because it requires integration over many nuisance parameters, BUT the model is easy to simulate from:
- Population genetics: nuisance parameters are the branching times and topology of the genealogical tree underlying the observed DNA sequences/genes.
- Epidemic models: nuisance parameters are infection times and infectious periods.
ABC should be regarded as a method of last resort, but it can give reasonable answers when nothing else works.

Population genetics example
Parameters: N = effective population size; μ = mutation rate per generation; G = genealogical tree (topology + branch lengths).
Summary statistics: S1 = number of distinct alleles/sequences; S2 = number of polymorphic/segregating sites.
Algorithm:
1. Simulate N and μ from the joint prior.
2. Simulate G from the standard coalescent model.
3. Simulate mutations on G, generate a dataset X* and calculate S*.
4. Accept (N, μ, G) if S* ≈ S.
This generates a sample from the approximate joint posterior of (N, μ, G).
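
A sketch of what the simulation step of this algorithm might look like for a single locus: a basic coalescent with infinite-sites mutation, returning the two summary statistics, wrapped in a rejection step. The haploid rate parameterisation, the priors on N and μ, the "observed" summaries and the exact-match acceptance rule are all my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_summaries(N, mu, n=30):
    """One coalescent simulation with infinite-sites mutation.
    Returns (S1, S2): number of distinct haplotypes and number of segregating sites."""
    lineages = [frozenset([i]) for i in range(n)]   # sample indices below each lineage
    muts = [set() for _ in range(n)]                # mutations carried by each sample
    next_site = 0
    k = n
    while k > 1:
        rate = k * (k - 1) / (2.0 * N)              # haploid coalescent rate (per generation)
        t = rng.exponential(1.0 / rate)             # time to the next coalescence
        for _ in range(rng.poisson(mu * k * t)):    # mutations on total branch length k*t
            branch = lineages[rng.integers(k)]
            for leaf in branch:
                muts[leaf].add(next_site)           # infinite sites: each mutation hits a new site
            next_site += 1
        i, j = sorted(rng.choice(k, size=2, replace=False))
        merged = lineages[i] | lineages[j]
        del lineages[j], lineages[i]                # j > i, so delete j first
        lineages.append(merged)
        k -= 1
    return len({frozenset(m) for m in muts}), next_site

# Rejection step for the algorithm on this slide (priors and observed values are assumptions).
s_obs = (12, 20)                                    # hypothetical observed (S1, S2)
accepted = []
for _ in range(5_000):
    N = rng.uniform(100, 10_000)                    # prior on effective population size
    mu = rng.uniform(1e-4, 1e-2)                    # prior on mutation rate per generation
    if simulate_summaries(N, mu) == s_obs:          # exact match on the two discrete summaries
        accepted.append((N, mu))
print(len(accepted))
```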

Model comparison via ABC
- ABC can be used for model comparison as well as for parameter estimation within models.
- The ratio of acceptances approximates the Bayes factor.
- Better: fit a (weighted) multinomial regression to predict the model from the observed data.
- Beaumont (2006) used this to infer the topology of a tree representing the history of 3 Californian fox populations.
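
A sketch of model choice by acceptance counts, on a toy problem of my own (Poisson versus shifted-geometric data, with sample mean and variance as summaries); with equal prior model probabilities, the ratio of acceptance counts approximates the Bayes factor.

```python
import numpy as np

rng = np.random.default_rng(1)

# M0: data are Poisson(lam);  M1: data are Geometric(p) on {0, 1, ...}.
x_obs = rng.poisson(3.0, size=100)
s_obs = np.array([x_obs.mean(), x_obs.var()])

def simulate(model):
    if model == 0:
        lam = rng.exponential(5.0)                  # prior on lam
        x = rng.poisson(lam, size=100)
    else:
        p = rng.uniform(0.05, 0.95)                 # prior on p
        x = rng.geometric(p, size=100) - 1          # shift support to {0, 1, ...}
    return np.array([x.mean(), x.var()])

counts = np.zeros(2)
for _ in range(200_000):
    m = rng.integers(2)                             # equal prior model probabilities
    if np.linalg.norm(simulate(m) - s_obs) < 0.5:   # crude closeness criterion
        counts[m] += 1

# Ratio of acceptance counts approximates the Bayes factor P(x | M0) / P(x | M1).
print(counts, counts[0] / max(counts[1], 1))
```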

Problems/limitations
- Rejection-ABC is inefficient: most X* are far from x and must be rejected. There is no learning.
- How to find/assess good summary statistics? Adding summary statistics can make matters worse (see later).
- How to choose the metric for (high-dimensional) S?

ABC as a regression problem
- The accepted (θ*, X*) pairs can be used to approximate the joint density of parameter and data, (θ, X).
- The goal is to obtain a "slice" of this joint density at the observed data. Few simulated points will lie on this slice, so we need to assume smoothness: the required posterior is approximately the same for datasets "close" to that observed.
- High-dimensional densities are hard to estimate; instead, obtain a regression point estimate of θ given X = x.

(Figure: joint density of the summary statistic S and parameter θ, annotated with the prior p(θ), the likelihood p(S | θ), the marginal likelihood p(S) and the posterior density p(θ | S).)

Beaumont, Zhang and DJB, Approximate Bayesian Computation in Population Genetics, Genetics 162: 2025-2035, 2002.
- Use local-linear regression to adjust for the distance between observed and simulated datasets.
- Use a smooth (Epanechnikov) weighting according to distance.
- We can then weaken the "close" criterion (i.e. increase the tolerance) and utilise many more points.
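
A sketch of the local-linear, Epanechnikov-weighted adjustment described on this slide, for a scalar parameter; the function name, interface and the use of the weights only up to a constant are mine.

```python
import numpy as np

def regression_adjust(theta, S, s_obs, delta):
    """Local-linear, Epanechnikov-weighted adjustment of accepted ABC draws.
    theta: (n,) accepted parameter values
    S:     (n, k) their simulated summary statistics
    s_obs: (k,) observed summary statistics
    delta: tolerance used for acceptance"""
    d = np.linalg.norm(S - s_obs, axis=1)
    keep = d <= delta
    theta, S, d = theta[keep], S[keep], d[keep]
    w = 1.0 - (d / delta) ** 2                     # Epanechnikov-type weights (up to a constant)
    X = np.column_stack([np.ones(theta.size), S - s_obs])
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * theta))  # weighted least squares
    # Shift each accepted theta to its predicted value at the observed summaries.
    theta_adj = theta - (S - s_obs) @ beta[1:]
    return theta_adj, w
```

The returned weights would then also be used when computing posterior summaries from the adjusted values.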

(Figure: summary statistic against parameter for the accepted points, with the weight decreasing from 1 to 0 with distance from the observed summary statistic.)

Estimation of the scaled mutation rate θ = 2Nμ
(Figure: relative mean square error against tolerance for MCMC, standard rejection ABC and ABC with regression adjustment.)
Summary statistics: mean variance in length, mean heterozygosity, number of haplotypes, i.e. 3 numbers.
Full data: 445 Y chromosomes, each typed at 8 microsatellite loci, i.e. 3,560 numbers.

Population growth
- The population has constant size N_A until t generations ago, then grows exponentially at rate r per generation to its current size N_C.
- There are 4 model parameters, but only 3 are identifiable. We choose:
- Data as above, except a smaller sample size n = 200 (because of the time taken for MCMC to converge).

ABC applications in population genetics
Standard rejection method:
- Estoup et al. (2002, Genetics): demographic history of the invasion of islands by cane toads; 10 microsatellite loci, 22 allozyme loci; 4/3 summary statistics, 6 demographic parameters.
- Estoup and Clegg (2003, Molecular Ecology): demographic history of the colonisation of islands by silvereyes.
With regression adjustment:
- Tallmon et al. (2004, Genetics): estimating effective population size by the temporal method; one main parameter of interest (Ne), 4 summary statistics.
- Estoup et al. (2004, Evolution): demographic history of the invasion of Australia by cane toads; 75/63 summary statistics, model comparison, up to 5 demographic parameters.

More sophisticated regressions?
- Global linear regression usually gives a poor fit to the joint θ/S density. Calabrese (USC, unpublished) instead uses projection pursuit regression to fit a large "feature set" of summary statistics, iterating to improve the fit within the vicinity of S. Application: estimating human recombination hotspots.
- Could also consider quantile regression to adapt the adjustment to different parts of the distribution.

Do ABC within MCMC
Marjoram et al. (2003). Two accept/reject steps:
1. Simulate a dataset at the proposed parameter values; if it isn't close to the observed data, start again.
2. If it is close, accept or reject according to the prior ratio times the Hastings ratio (no likelihood ratio).
Note: "close" must now be defined in advance; also, simulations cannot be reused for different observed datasets. A regression adjustment can be applied to the MCMC output.
Problems:
1. proposals in tree space;
2. few acceptances in the tail of the target distribution, leading to stickiness.
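
A minimal sketch of the two-step accept/reject scheme, reusing the toy Normal-mean problem from the rejection sketch earlier; the symmetric random-walk proposal (so the Hastings ratio is 1) and all tuning constants are my choices.

```python
import numpy as np

rng = np.random.default_rng(2)

x_obs = rng.normal(2.0, 1.0, size=50)   # same toy data set-up as before
s_obs = x_obs.mean()

def log_prior(theta):                    # Normal(0, 10) prior, up to a constant
    return -0.5 * (theta / 10.0) ** 2

def abc_mcmc(n_iter=50_000, tol=0.05, step=0.5):
    theta, chain = 0.0, []
    for _ in range(n_iter):
        theta_new = theta + rng.normal(0.0, step)        # symmetric proposal
        s_sim = rng.normal(theta_new, 1.0, size=50).mean()
        # Step 1: the simulated data must be close to the observed data...
        if abs(s_sim - s_obs) < tol:
            # Step 2: ...then accept with prior ratio x Hastings ratio (here 1),
            # with no likelihood ratio appearing anywhere.
            if np.log(rng.uniform()) < log_prior(theta_new) - log_prior(theta):
                theta = theta_new
        chain.append(theta)
    return np.array(chain)

chain = abc_mcmc()
print(chain[5_000:].mean(), chain[5_000:].std())
```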

Importance sampling within MCMC
- The Marjoram et al. MCMC approach can in fact be viewed as a special case of a more general approach developed by Beaumont (2003).
- Instead of simulating a new dataset forward in time, Beaumont used a backward-in-time importance sampling (IS) approach to approximate the likelihood.
- His proof of the validity of the algorithm is readily extended to forward-in-time approaches based on one or multiple datasets (cf. O'Neill et al. 2000).
- A regression adjustment could also be used.

ABC within sequential MC
Sisson et al. (UNSW, Sydney):
- Sample the initial generation of θ "particles" from the prior.
- Sample θ from the previous generation, propose a (small) update and generate a dataset X*; calculate S*.
- Repeat until S* ≈ S, BUT the tolerance is reduced each generation.
- Calculate the prior ratio times the Hastings ratio and use it as a weight W for sampling the next generation.
- If the variance of the W is large, resample with replacement according to W and set all W = 1/N.
Application: estimating parameters of TB infection.
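
A simplified sketch of the sequential idea: a particle population, a shrinking tolerance schedule and importance weights. The weights below use the common population-Monte-Carlo form (prior density over the kernel-mixture proposal density) rather than the exact prior-ratio-times-Hastings-ratio scheme described on the slide, and the toy model, tolerance schedule and kernel width are my own.

```python
import numpy as np

rng = np.random.default_rng(3)

x_obs = rng.normal(2.0, 1.0, size=50)   # same toy Normal-mean problem as before
s_obs = x_obs.mean()

def prior_logpdf(theta):                 # Normal(0, 10) prior
    return -0.5 * (theta / 10.0) ** 2 - np.log(10.0 * np.sqrt(2 * np.pi))

def abc_smc(n_particles=500, tolerances=(1.0, 0.5, 0.2, 0.1, 0.05), step=0.5):
    # Generation 0: particles from the prior, accepted at the loosest tolerance.
    particles, weights = [], []
    while len(particles) < n_particles:
        theta = rng.normal(0.0, 10.0)
        if abs(rng.normal(theta, 1.0, size=50).mean() - s_obs) < tolerances[0]:
            particles.append(theta)
            weights.append(1.0)
    particles, weights = np.array(particles), np.array(weights)

    for tol in tolerances[1:]:                                # tolerance shrinks each generation
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            idx = rng.choice(n_particles, p=weights / weights.sum())
            theta = particles[idx] + rng.normal(0.0, step)    # perturb a sampled particle
            if abs(rng.normal(theta, 1.0, size=50).mean() - s_obs) < tol:
                new_particles.append(theta)
                # Importance weight: prior over the kernel-mixture proposal density.
                kernel = np.exp(-0.5 * ((theta - particles) / step) ** 2)
                q = np.sum(weights * kernel) / weights.sum()
                new_weights.append(np.exp(prior_logpdf(theta)) / q)
        particles, weights = np.array(new_particles), np.array(new_weights)
    return particles, weights / weights.sum()

post, w = abc_smc()
print("posterior mean approx:", np.sum(post * w))
```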

Adaptive ABC algorithm (Jean-Marie Cornuet, INRA Montpellier, visiting IC)
- Standard ABC always simulates from the prior: there is no learning.
- Any simulation density can be used instead of the prior, provided the simulated θ* values are weighted by the ratio of the prior density to the actual simulation density.
- Adaptive approach: divide the simulation run into batches; in each batch, simulate θ* values from the posterior density obtained from the previous batch.
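
A sketch of the batch-wise adaptive idea on the same toy problem: later batches simulate θ* from a density fitted to the earlier accepted values and are importance-weighted by prior density over simulation density. Fitting a slightly inflated Gaussian to the weighted sample is my stand-in for "the posterior density obtained from the previous batch".

```python
import numpy as np

rng = np.random.default_rng(4)

x_obs = rng.normal(2.0, 1.0, size=50)   # toy Normal-mean problem, as before
s_obs = x_obs.mean()

PRIOR_MEAN, PRIOR_SD = 0.0, 10.0

def norm_pdf(x, m, sd):
    return np.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def adaptive_abc(n_batches=5, batch_size=5_000, tol=0.1):
    sim_mean, sim_sd = PRIOR_MEAN, PRIOR_SD            # batch 1 simulates from the prior
    accepted, weights = np.array([]), np.array([])
    for _ in range(n_batches):
        theta = rng.normal(sim_mean, sim_sd, size=batch_size)
        s = rng.normal(theta, 1.0, size=(50, batch_size)).mean(axis=0)
        th = theta[np.abs(s - s_obs) < tol]
        # Weight accepted values by prior density / actual simulation density.
        w = norm_pdf(th, PRIOR_MEAN, PRIOR_SD) / norm_pdf(th, sim_mean, sim_sd)
        accepted = np.concatenate([accepted, th])
        weights = np.concatenate([weights, w])
        # Next batch: a Gaussian fitted to the current weighted sample,
        # slightly inflated so the tails stay covered.
        sim_mean = np.average(accepted, weights=weights)
        sim_sd = 1.5 * np.sqrt(np.average((accepted - sim_mean) ** 2, weights=weights))
    return accepted, weights / weights.sum()

theta, w = adaptive_abc()
print("posterior mean approx:", np.sum(theta * w))
```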

Wiz-Bang adaptive algorithm (John Molitor)
1. Generate a batch of B pairs (θ*, X*), simulating θ* from the prior. Evaluate a "distance" of each X* from x, say d(X*, x).
2. Obtain a "pseudo-posterior" density estimate using the θ* values weighted by the corresponding d(X*, x) values.
For a multivariate parameter vector, update one component at a time, then calculate d(X*, x).

Wiz-Bang algorithm (continued)
3. Continue to simulate θ*, but now from the density obtained at step 2. Weight each θ* by the ratio of the prior to the simulation density.
4. Continue to update the simulation density by retaining the "best" B particles, i.e. those minimising d(X*, x).
5. Continue until convergence is reached, indicated by few updates of the "best" particles.

Example – Gaussian mixture
- Known number of components K; common variance.
- n = 200, K = 3, p = (1/3, 1/3, 1/3), μ = (-1/2, 0, 1/2), σ = 0.2.

Data distance
- The data x consist of real-valued variables.
- Define d(X*, x) to be the sum of distances between the order statistics of X* and x, e.g.
  d(X*, x) = Σ_i | X*_(i) − x_(i) |
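
A sketch of this distance for the mixture example above, generating data from the K = 3 mixture with the parameters given two slides back; taking absolute differences between order statistics is my assumption, since the slide only says "distances".

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_mixture(n=200, p=(1/3, 1/3, 1/3), mu=(-0.5, 0.0, 0.5), sigma=0.2):
    """Draw n points from the K = 3 Gaussian mixture used in the example."""
    comp = rng.choice(len(p), size=n, p=p)
    return rng.normal(np.asarray(mu)[comp], sigma)

def order_stat_distance(x_sim, x_obs):
    """d(X*, x): sum of absolute differences between order statistics."""
    return np.sum(np.abs(np.sort(x_sim) - np.sort(x_obs)))

x_obs = sample_mixture()
x_sim = sample_mixture()
print(order_stat_distance(x_sim, x_obs))
```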

Results - SD

Variable-dimension mixture model
- Now K is unknown: it is updated from its previous values.
- Examine the posterior output (means, SDs, etc.) when K = 1, K = 2, etc.
- In the example below, the true value of K is again 3, but this time it is not specified in the inference.

Variable Dimension Mixture Results

Comparison of methods
- Which is better, adaptive or sequential? For which problems might we prefer one over the other?
- Are both always better than canonical rejection ABC, or rejection ABC with regression adjustment?
- David Welch has written some R code to compare the methods on a simple mixtures problem:

Sequential MC
Tolerance for d(X*, x) is updated in 7 stages:
"effective sample size for population 2 is 25.15"
"effective sample size for population 3 is 23.14"
"effective sample size for population 4 is 24.07"
"effective sample size for population 5 is 22.69"
"effective sample size for population 6 is 25.63"
"effective sample size for population 7 is 23.29"
Total number of data generation steps: 103,132

Adaptive algorithm
"iteration 2: keeping 0.1 of old samples"
"effective sample size is "
"largest distance is "
"iteration 5: keeping  of old samples"
"effective sample size is 33.30"
"largest distance is "
"iteration 7: keeping  of old samples"
"effective sample size is 15.96"
"largest distance is 98.97"
"total number of data generation steps 70,000"

Generation steps required:
- rejection ABC: 306K, final sample size 15
- adaptive IS ABC: 70K, final ESS = 23
- sequential ABC: 103K, final ESS = 15
Overall comparisons: the adaptive and sequential approaches perform similarly for this problem, and both are superior to standard ABC. Both are easy to code, but the sequential method requires more setting up (update proposals, schedule for decreasing the tolerance).

ABC to "rescue" poor estimators (inspired by DJ Wilson, Lancaster)
- Evaluate an estimator based on a simplistic model at many datasets simulated under a more sophisticated model.
- For the observed dataset, use as the estimator the regression prediction of the parameter at the observed value of the simplistic estimator.
- For example, many population genetics estimators assume no recombination and an infinite-sites mutation model; use this estimator and simulations to correct for recombination and finite-sites mutation.
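
A toy illustration of the idea, entirely my own example: the "simplistic" estimator is the exponential-rate MLE that ignores censoring, the "sophisticated" model censors observations at a fixed point, and a local-linear regression of the true parameter on the simplistic estimate gives the corrected estimator at the observed value.

```python
import numpy as np

rng = np.random.default_rng(5)

C = 2.0                                        # censoring point in the richer model

def simplistic_estimator(x):
    return 1.0 / x.mean()                      # exponential-rate MLE, ignores censoring

# 1. Evaluate the simplistic estimator on many datasets simulated under the
#    sophisticated (censored) model.
lam = rng.uniform(0.2, 3.0, size=50_000)                           # "true" rates
x_sim = np.minimum(rng.exponential(1.0, size=(lam.size, 100)) / lam[:, None], C)
T = 1.0 / x_sim.mean(axis=1)

# 2. For the observed dataset, regress the true parameter on the simplistic
#    estimate locally around the observed value and predict there.
x_obs = np.minimum(rng.exponential(1.0, size=100), C)              # true lambda = 1
T_obs = simplistic_estimator(x_obs)
near = np.abs(T - T_obs) < 0.1
A = np.column_stack([np.ones(near.sum()), T[near] - T_obs])
coef, *_ = np.linalg.lstsq(A, lam[near], rcond=None)
print("naive estimate:", T_obs, "regression-corrected estimate:", coef[0])
```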

Acknowledgments
David Welch is funded by an EPSRC grant to further develop ABC ideas and to apply them, particularly in population genomics, in collaboration with Mark Beaumont and Joao Lopes at Reading.