MCMC for multilevel logistic regressions in MLwiN

Presentation transcript:

Lecture 14: MCMC for multilevel logistic regressions in MLwiN

Lecture Contents
- Recap of MCMC from day 3
- Recap of logistic regression from days 1 & 4
- Metropolis Hastings sampling
- Reunion Island dataset (2 and 3 level)
- VPC for binomial models
- Method comparison on the Rodriguez/Goldman examples

MCMC Methods (recap)
Goal: to sample from the joint posterior distribution.
Problem: for complex models this involves multidimensional integration.
Solution: it may be possible to sample from the conditional posterior distributions instead. It can be shown that, after convergence, such a sampling approach generates dependent samples from the joint posterior distribution.

Gibbs Sampling (recap)
When we can sample directly from the conditional posterior distributions, the algorithm is known as Gibbs sampling. For the variance components example this proceeds as follows: first, give all unknown parameters starting values; next, loop through the following steps:

Gibbs Sampling for the VC model
Sample each parameter in turn from its conditional posterior distribution (the standard forms are sketched below). These steps are then repeated, with the values generated on each loop replacing the starting values. The chain of values produced by this procedure is known as a Markov chain. Note that β is generated as a block while each u_j is updated individually.
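The conditional distributions shown on the original slide are not reproduced in the transcript. A sketch of the standard forms for a 2-level variance components model y_ij = X_ij β + u_j + e_ij, assuming a uniform prior on β and Gamma⁻¹(ε, ε) priors on both variances, is:

\beta \mid y, u, \sigma_e^2 \;\sim\; N\big((X^T X)^{-1} X^T (y - u),\; \sigma_e^2 (X^T X)^{-1}\big) \quad \text{(with } u \text{ expanded to observation length)}

u_j \mid y, \beta, \sigma_u^2, \sigma_e^2 \;\sim\; N(\hat{u}_j, D_j), \qquad D_j = \Big(\tfrac{n_j}{\sigma_e^2} + \tfrac{1}{\sigma_u^2}\Big)^{-1}, \qquad \hat{u}_j = \frac{D_j}{\sigma_e^2} \sum_i (y_{ij} - X_{ij}\beta)

1/\sigma_u^2 \mid u \;\sim\; \text{Gamma}\Big(\tfrac{J}{2} + \varepsilon,\; \tfrac{1}{2}\sum_j u_j^2 + \varepsilon\Big)

1/\sigma_e^2 \mid y, \beta, u \;\sim\; \text{Gamma}\Big(\tfrac{N}{2} + \varepsilon,\; \tfrac{1}{2}\sum_{ij} (y_{ij} - X_{ij}\beta - u_j)^2 + \varepsilon\Big)

where J is the number of level 2 units and N the total number of observations.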

Algorithm Summary
Repeat the following four steps:
1. Generate β from its (multivariate) Normal conditional distribution.
2. Generate each u_j from its Normal conditional distribution.
3. Generate 1/σ²u from its Gamma conditional distribution.
4. Generate 1/σ²e from its Gamma conditional distribution.

Logistic regression model
A standard Bayesian logistic regression model (e.g. for the rat tumour example) can be written as sketched below. Both MLwiN and WinBUGS can fit this model, but can we write out the conditional posterior distributions and use Gibbs sampling?
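The model equation from the original slide is not in the transcript; a minimal sketch, assuming a Binomial response with a single intercept β0 (the parameter discussed on the next slide) and a diffuse Normal prior, is:

y_i \sim \text{Binomial}(n_i, \pi_i), \qquad \text{logit}(\pi_i) = \beta_0, \qquad \beta_0 \sim N(0, 10^6)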

Conditional distribution for β0
This is not a standard distribution, so we cannot simply simulate from it with a standard random number generator. However, both WinBUGS and MLwiN can fit this model using MCMC. In this lecture we describe how MLwiN does this, before considering WinBUGS in the next lecture.

Metropolis Hastings (MH) sampling
An alternative, and more general, way to construct an MCMC sampler. A form of generalised rejection sampling (see later) in which values are drawn from approximate distributions and "corrected" so that, asymptotically, they behave as random observations from the target distribution. MH sampling algorithms sequentially draw candidate observations from a 'proposal' distribution, conditional on the current observations, thus inducing a Markov chain.

General MH algorithm step
Let us focus on a single parameter θ and its posterior distribution p(θ|Y). At iteration t, θ takes value θt and we generate a new value θ* from a proposal distribution q. We then accept this new value and set θt+1 = θ* with acceptance probability α(θ*, θt); otherwise we set θt+1 = θt. The acceptance probability is given below.
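The acceptance probability takes the standard Metropolis-Hastings form:

\alpha(\theta^*, \theta_t) = \min\left(1,\; \frac{p(\theta^* \mid Y)\, q(\theta_t \mid \theta^*)}{p(\theta_t \mid Y)\, q(\theta^* \mid \theta_t)}\right)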

Choosing a proposal distribution
Remarkably, the proposal distribution can have almost any form. There are some (silly) exceptions, e.g. a proposal that has point mass at one value, but provided the proposal allows the chain to explore the whole posterior (an irreducible chain) we are OK. Three special cases of the Metropolis-Hastings algorithm are:
- Random walk Metropolis sampling.
- Independence sampling.
- Gibbs sampling!

Pure Metropolis Sampling
The general MH sampling algorithm is due to Hastings (1970); however, this is a generalisation of pure Metropolis sampling (Metropolis et al., 1953). This special case arises when q(θ*|θt) = q(θt|θ*), i.e. the proposal distribution is symmetric. This then reduces the acceptance probability to the expression below.
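With a symmetric proposal the proposal terms cancel, leaving the Metropolis ratio:

\alpha(\theta^*, \theta_t) = \min\left(1,\; \frac{p(\theta^* \mid Y)}{p(\theta_t \mid Y)}\right)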

Random Walk Metropolis
This is an example of pure Metropolis sampling. Here q(θ1|θ2) = q(|θ1 − θ2|). Typical random walk proposals are Normal distributions centred around the current value of the parameter, i.e. q(θ1|θ2) = N(θ2, s²), where s² is the (fixed) proposal variance, which can be tuned to give a particular acceptance rate. This is the method used within MLwiN.
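A minimal random-walk Metropolis sketch in Python, illustrating the general method rather than MLwiN's implementation; the target log-posterior log_post and the proposal SD s are placeholder assumptions:

```python
import numpy as np

def rw_metropolis(log_post, theta0, s, n_iter, seed=0):
    """Random-walk Metropolis with a Normal(theta_t, s^2) proposal."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    chain = np.empty(n_iter)
    n_accept = 0
    for t in range(n_iter):
        theta_star = rng.normal(theta, s)        # symmetric proposal
        lp_star = log_post(theta_star)
        # accept with probability min(1, p(theta*|Y) / p(theta_t|Y))
        if np.log(rng.uniform()) < lp_star - lp:
            theta, lp, n_accept = theta_star, lp_star, n_accept + 1
        chain[t] = theta
    return chain, n_accept / n_iter

# Example: sample from a N(2, 1) "posterior"; s controls the acceptance rate
chain, rate = rw_metropolis(lambda th: -0.5 * (th - 2.0) ** 2, 0.0, s=2.4, n_iter=5000)
print(round(chain[1000:].mean(), 2), round(rate, 2))
```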

Independence Sampler
The independence sampler is so called because each proposal is independent of the current parameter value, i.e. q(θ1|θ2) = q(θ1). This leads to the acceptance probability given below. An example proposal distribution is a Normal centred on the ML estimate with inflated variance. The independence sampler can sometimes work very well but equally can work very badly!
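With q(θ*|θt) = q(θ*) the acceptance probability becomes:

\alpha(\theta^*, \theta_t) = \min\left(1,\; \frac{p(\theta^* \mid Y)\, q(\theta_t)}{p(\theta_t \mid Y)\, q(\theta^*)}\right)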

Gibbs Sampling
The Gibbs sampler that we have already studied is a special case of the MH algorithm. The proposal distribution is the full conditional distribution, which leads to an acceptance rate of 1, as shown below:
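Writing φ for the remaining parameters and taking the proposal to be the full conditional, q(θ*|θt) = p(θ*|Y, φ), the acceptance probability is:

\alpha(\theta^*, \theta_t) = \min\left(1,\; \frac{p(\theta^* \mid Y, \phi)\, p(\theta_t \mid Y, \phi)}{p(\theta_t \mid Y, \phi)\, p(\theta^* \mid Y, \phi)}\right) = 1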

MH Sampling in MLwiN
MLwiN actually uses a hybrid method:
- Gibbs sampling steps are used for variance parameters.
- MH steps are used for fixed effects and residuals.
- Univariate Normal proposal distributions are used.
- For the proposal standard deviation a scaled IGLS standard error is initially used (multiplied by 5.8 on the variance scale).
- However, an adaptive method is used to tune these proposal distributions prior to the burn-in period.

MH Sampling for the Normal model
For the education dataset we can illustrate MH sampling on the VC model by modifying steps 1 and 2. Repeat the following four steps:
1. Generate βi by univariate Normal MH sampling.
2. Generate each u_j by univariate Normal MH sampling.
3. Generate 1/σ²u from its Gamma conditional distribution.
4. Generate 1/σ²e from its Gamma conditional distribution.

MH Sampling in MLwiN
Here we see how to change the estimation method. Note that for Binomial responses the method changes automatically and Gibbs sampling will not be available.

Trajectories plot Here we see MH sampling for β.

Adaptive Method (ad hoc)
One way of finding a 'good' proposal distribution is to choose one that gives a particular acceptance rate. It has been shown that a 'good' acceptance rate is often around 50%. MLwiN incorporates an adaptive method that uses this fact to construct univariate Normal proposals with an acceptance rate of approximately 50%.
Method: before the burn-in we have an adaptation period in which the sampler improves the proposal distribution. The adaptive method requires a desired acceptance rate, e.g. 50%, and a tolerance, e.g. 10%, resulting in an acceptable range of (40%, 60%).

Adaptive method algorithm
- Run the sampler for consecutive batches of 100 iterations.
- Compare the number accepted, N, with the desired acceptance rate, R, and adjust the proposal SD accordingly.
- Repeat this procedure until 3 consecutive values of N lie within the acceptable range, then mark this parameter.
- When all the parameters are marked, the adaptation period is over.
N.B. Proposal SDs are still modified after being marked, until the adaptation period is over.
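An illustrative Python sketch of this kind of batch-wise tuning; the multiplicative scaling rule below is an assumption for illustration only, not MLwiN's exact algorithm:

```python
import numpy as np

def adapt_proposal_sd(log_post, theta0, sd0, desired=0.5, tol=0.1,
                      batch=100, needed_in_row=3, seed=0):
    """Illustrative adaptive tuning of a random-walk proposal SD.

    Mirrors the idea on the slides (batches of 100 iterations, aim for ~50%
    acceptance, mark the parameter after 3 consecutive batches in range).
    """
    rng = np.random.default_rng(seed)
    theta, lp, sd = theta0, log_post(theta0), sd0
    lo, hi = desired - tol, desired + tol
    in_row = 0
    while in_row < needed_in_row:
        n_accept = 0
        for _ in range(batch):
            theta_star = rng.normal(theta, sd)
            lp_star = log_post(theta_star)
            if np.log(rng.uniform()) < lp_star - lp:
                theta, lp = theta_star, lp_star
                n_accept += 1
        rate = n_accept / batch
        in_row = in_row + 1 if lo <= rate <= hi else 0
        # shrink the SD when accepting too little, grow it when accepting
        # too much; the clamp keeps the step size from collapsing to zero
        sd *= min(10.0, max(0.1, rate / desired))
    return sd

# Example: tune the proposal SD for a N(0, 4) target, starting far too large
print(round(adapt_proposal_sd(lambda th: -th**2 / 8.0, 0.0, sd0=50.0), 2))
```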

Example of Adaptive period
Adaptive method used on parameter β0:

N      SD      Accepted   N in row
-      1.0     -
100    0.505   1
200    0.263   4
300    0.138   5
400    0.074   7
500    0.046   19
600    0.032   29
700    0.031   48
800    0.026   40         2
900    0.024   46         3*
1000   0.021   51
1500   0.022

Comparison of Gibbs vs MH
MLwiN also has an MVN MH algorithm. A comparison of the effective sample sizes (ESS) for a run of 5,000 iterations of the VC model follows:

        Gibbs   MH     MH MV
β0      216     33     59
β1      4413    973    303
σ²u     2821    2140   2919
σ²e     4712    4895   4728

2-level Reunion Island dataset
We have already studied this dataset with a continuous response. Here we consider a subset with only the first lactation for each cow, resulting in 2 levels: cows nested within herds. The (binary) response of interest is fscr – whether the first service results in a conception. There are two predictors – ai (whether insemination was natural or artificial) and heifer (the age of the cow).

MCMC algorithm
Our MLwiN algorithm has 3 steps:
1. Generate βi by univariate Normal MH sampling.
2. Generate each u_j by univariate Normal MH sampling.
3. Generate 1/σ²u from its Gamma conditional distribution.
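To make the three steps concrete, here is an illustrative Python sketch for a 2-level random-intercept logistic model. The data arrays y, X and the herd index are placeholders, and the flat prior on β, the Gamma(ε, ε) prior on 1/σ²u and the fixed proposal SD are assumptions; MLwiN would additionally tune the proposal SDs adaptively:

```python
import numpy as np

def mh_within_gibbs(y, X, herd, n_iter=5000, prop_sd=0.1, eps=0.001, seed=0):
    """Illustrative MH-within-Gibbs sampler for logit(p_ij) = X_ij beta + u_j."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    J = herd.max() + 1
    beta, u, sigma2_u = np.zeros(p), np.zeros(J), 1.0

    def loglik(b, uu):
        eta = X @ b + uu[herd]
        return np.sum(y * eta - np.logaddexp(0.0, eta))   # Bernoulli log-likelihood

    out = {"beta": np.empty((n_iter, p)), "sigma2_u": np.empty(n_iter)}
    for t in range(n_iter):
        # Step 1: univariate Normal RW-MH update for each fixed effect (flat prior)
        for k in range(p):
            prop = beta.copy()
            prop[k] += rng.normal(0.0, prop_sd)
            if np.log(rng.uniform()) < loglik(prop, u) - loglik(beta, u):
                beta = prop
        # Step 2: univariate Normal RW-MH update for each herd residual u_j
        for j in range(J):
            prop = u.copy()
            prop[j] += rng.normal(0.0, prop_sd)
            log_ratio = (loglik(beta, prop) - loglik(beta, u)
                         + (u[j] ** 2 - prop[j] ** 2) / (2.0 * sigma2_u))
            if np.log(rng.uniform()) < log_ratio:
                u = prop
        # Step 3: Gibbs draw for 1/sigma2_u from its Gamma full conditional
        sigma2_u = 1.0 / rng.gamma(J / 2.0 + eps, 1.0 / (0.5 * np.sum(u ** 2) + eps))
        out["beta"][t], out["sigma2_u"][t] = beta, sigma2_u
    return out
```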

MLwiN Demo
The 2-level model is set up and run in IGLS, giving the following starting values:

Trajectories for 5k iterations
Here we see some poor mixing, particularly for the variance:

DIC for Binomial models
We can use the DIC to check whether we need random effects and whether to include heifer in the model. Note we only ran each model for 5,000 iterations.

Model              pD      DIC
VC + AI            18.37   2087.15 (2086.48 after 50k)
LR + AI             2.16   2095.70
VC + AI + Heifer   22.16   2087.43

VC Model after 50k iterations
Here is a trace for the herd-level variance after 50k iterations, which suggests we need to run even longer!

VPC for Binomial models
The VPC is harder to calculate for Binomial models because the level 1 variance is part of the Binomial distribution, and hence is related to the mean and on a different scale to the higher-level variances. Goldstein et al. (2002) propose 4 methods:
1. Use a Taylor series approximation.
2. Use a simulation-based approach.
3. Switch to a Normal response model.
4. Use the latent variable approach in Snijders and Bosker.

VPC for Binomial models
Snijders and Bosker (1999) suggest the following: the variance of a standard logistic distribution is π²/3, and so the level 1 variance should be replaced by this value. In the Reunion Island example this gives a VPC of about 0.0263 (see the formula below), or in other words 2.63% of the variation is at the herd level. The fact that there isn't a huge level 2 variance may in part explain the poor mixing of the MCMC algorithm.
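The latent-variable calculation implied here, with the level 1 variance replaced by π²/3, is:

\text{VPC} = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}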

3-level Reunion Island dataset
We will now fit the same models to the 3-level dataset. After 5,000 iterations we have the following rather worrying results:

Running for longer
We ran the chains for 100k iterations after a burn-in of 5k, thinning the chains by a factor of 10. The results are still a little worrying:

Potential solutions
In the last 2 slides we have seen some bad mixing behaviour for some parameters. Running for longer seems to work, but we need to run for a very long time! In the next lecture we will look at WinBUGS methods for this model and also at a reparameterisation of the model known as hierarchical centering. For this model it looks as though MCMC is extremely computationally intensive if we want reliable estimates. In what follows we look at an example where MCMC is clearly useful.

The Guatemalan Child Health dataset
This consists of a subsample of 2,449 respondents from the 1987 National Survey of Maternal and Child Health, with a 3-level structure of births within mothers within communities. The subsample consists of all women from the chosen communities who had some form of prenatal care during pregnancy. The response variable is whether this prenatal care was modern (physician or trained nurse) or not.
Rodriguez and Goldman (1995) use the structure of this dataset to consider how well quasi-likelihood methods compare with ignoring the multilevel structure and fitting a standard logistic regression. They do this by constructing simulated datasets based on the original structure but with known true values for the fixed effects and variance parameters. They consider the MQL method and show that the estimates of the fixed effects produced by MQL are worse than the estimates produced by a standard logistic regression that disregards the multilevel structure!

The Guatemalan Child Health dataset
Goldstein and Rasbash (1996) consider the same problem but use the PQL method. They show that the results produced by PQL 2nd-order estimation are far better than for MQL but still biased. The model in this situation is sketched below. In this formulation i, j and k index the level 1, 2 and 3 units respectively. The variables x1, x2 and x3 are composite scales at each level, because the original model contained many covariates at each level. Browne (1998) considered the hybrid Metropolis-Gibbs method in MLwiN and two possible variance priors (Gamma⁻¹(ε, ε) and Uniform).
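The model equation from the original slide is not reproduced in the transcript; a sketch consistent with the surrounding description (a binary response, composite covariates x1, x2 and x3 at levels 1, 2 and 3, and random effects at levels 2 and 3) is:

y_{ijk} \sim \text{Bernoulli}(\pi_{ijk}), \qquad \text{logit}(\pi_{ijk}) = \beta_0 + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k

u_{jk} \sim N(0, \sigma_u^2), \qquad v_k \sim N(0, \sigma_v^2)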

Simulation Results
The following gives point estimates (MCSE) for the 4 methods over 500 simulated datasets.

Parameter (True)   MQL1           PQL2           Gamma          Uniform
β0 (0.65)          0.474 (0.01)   0.612 (0.01)   0.638 (0.01)   0.655 (0.01)
β1 (1.00)          0.741 (0.01)   0.945 (0.01)   0.991 (0.01)   1.015 (0.01)
β2 (1.00)          0.753 (0.01)   0.958 (0.01)   1.006 (0.01)   1.031 (0.01)
β3 (1.00)          0.727 (0.01)   0.942 (0.01)   0.982 (0.01)   1.007 (0.01)
σ²v (1.00)         0.550 (0.01)   0.888 (0.01)   1.023 (0.01)   1.108 (0.01)
σ²u (1.00)         0.026 (0.01)   0.568 (0.01)   0.964 (0.02)   1.130 (0.02)

Simulation Results
The following gives interval coverage probabilities (90%/95%) for the 4 methods over 500 simulated datasets.

Parameter (True)   MQL1        PQL2        Gamma       Uniform
β0 (0.65)          67.6/76.8   86.2/92.0   86.8/93.2   88.6/93.6
β1 (1.00)          56.2/68.6   90.4/96.2   92.8/96.4   92.2/96.4
β2 (1.00)          13.2/17.6   84.6/90.8   88.4/92.6   88.6/92.8
β3 (1.00)          59.0/69.6   85.2/89.8   86.2/92.2   –
σ²v (1.00)         0.6/2.4     70.2/77.6   89.4/94.4   87.8/92.2
σ²u (1.00)         0.0/0.0     21.2/26.8   84.2/88.6   88.0/93.0

Summary of simulations
The Bayesian approach yields excellent bias and coverage results. For the fixed effects, MQL performs badly but the other 3 methods all do well. For the random effects, MQL and PQL both perform badly but MCMC with both priors does much better. Note that this is an extreme scenario, with few level 1 units per level 2 unit yet a high level 2 variance; in other examples MQL/PQL will not be so bad.

Introduction to Practical
In the practical you will be let loose on MLwiN with two datasets:
- A dataset on contraceptive use in Bangladesh.
- A veterinary epidemiology dataset on pneumonia in pigs.