Presentation is loading. Please wait.

Presentation is loading. Please wait.

MCMC efficiency in Multilevel Models

Similar presentations


Presentation on theme: "MCMC efficiency in Multilevel Models"— Presentation transcript:

1 MCMC efficiency in Multilevel Models
William Browne* with Mousa Golalizadeh*, Martin Green and Fiona Steele* Universities of Bristol and Nottingham *Thanks to ESRC for supporting this work

2 Summary Introduction. Application 1 – Clutch size in great tits.
Method 1: Hierarchical centering. Method 2: Parameter expansion. Application 2 – Mastitis incidence in dairy cattle. Method 3: Orthogonal predictors. Application 3 – Contraceptive discontinuation in Indonesia. Conclusions. Further Work.

3 Introduction/Synopsis
MCMC methods allow easy fitting of complex random effects models The simplest (default) MCMC algorithms can produce poorly mixing chains. By reparameterising the model one can greatly improve mixing. These reparameterisations are easy to implement in WinBUGS (or MLwiN) The choice of reparameterisation depends in part on the model/dataset.

4 Application 1: Great tit nesting behaviour (crossed random effects)
Original work was collaborative research with Richard Pettifor (Institute of Zoology, London), and Robin McCleery and Ben Sheldon (University of Oxford).

5 Application 1: Great tit nesting behaviour (crossed random effects)
A longitudinal study of great tits nesting in Wytham Woods, Oxfordshire. 6 responses : 3 continuous & 3 binary. Clutch size, lay date and mean nestling mass. Nest success, male and female survival. Data: 4165 nesting attempts over a period of 34 years. There are 4 higher-level classifications of the data: female parent, male parent, nestbox and year. We only consider Clutch size here

6 Data background The data structure can be summarised as follows:
Source Number of IDs Median #obs Mean Year 34 104 122.5 Nestbox 968 4 4.30 Male parent 2986 1 1.39 Female parent 2944 1.41 Note there is very little information on each individual male and female bird but we can get some estimates of variability via a random effects model.

7 MCMC efficiency for clutch size response
The MCMC algorithm used in the univariate analysis of clutch size was a simple 10-step Gibbs sampling algorithm. . To compare methods for each parameter we can look at the effective sample sizes (ESS) which give an estimate of how many ‘independent samples we have’ for each parameter as opposed to 50,000 dependent samples. ESS = # of iterations/,

8 Effective Sample sizes
Parameter MLwiN WinBUGS Fixed Effect 671 602 Year 30632 29604 Nestbox 833 788 Male 36 33 Female 3098 3685 Observation 110 135 Time 519s 2601s We will now consider methods that will improve the ESS values for particular parameters. We will firstly consider the fixed effect parameter.

9 Trace and autocorrelation plots for fixed effect using standard Gibbs sampling algorithm

10 Hierarchical Centering
This method was devised by Gelfand et al. (1995) for use in nested models. Basically (where feasible) parameters are moved up the hierarchy in a model reformulation. For example: is equivalent to The motivation here is we remove the strong negative correlation between the fixed and random effects by reformulation.

11 Hierarchical Centering
In our cross-classified model we have 4 possible hierarchies up which we can move parameters. We have chosen to move the fixed effect up the year hierarchy as it’s variance had biggest ESS although this choice is rather arbitrary. The ESS for the fixed effect increases 50-fold from 602 to 35,063 while for the year level variance we have a smaller improvement from 29,604 to 34,626. Note this formulation also runs faster 1864s vs 2601s (in WinBUGS).

12 Trace and autocorrelation plots for fixed effect using hierarchical centering formulation

13 Parameter Expansion We next consider the variances and in particular the between-male bird variance. When the posterior distribution of a variance parameter has some mass near zero this can hamper the mixing of the chains for both the variance parameter and the associated random effects. The pictures over the page illustrate such poor mixing. One solution is parameter expansion (Liu et al. 1998). In this method we add an extra parameter to the model to improve mixing.

14 Trace plots for between males variance and a sample male effect using standard Gibbs sampling algorithm

15 Parameter Expansion In our example we use parameter expansion for all 4 hierarchies. Note the  parameters have an impact on both the random effects and their variance. The original parameters can be found by: Note the models are not identical as we now have different prior distributions for the variances.

16 Parameter Expansion For the between males variance we have a 20-fold increase in ESS from 33 to 600. The parameter expanded model has different prior distributions for the variances although these priors are still ‘diffuse’. It should be noted that the point and interval estimate of the level 2 variance has changed from 0.034 (0.002,0.126) to (0.000,0.172). Parameter expansion is computationally slower 3662s vs 2601s for our example.

17 Trace plots for between males variance and a sample male effect using parameter expansion.

18 Combining the two methods
Hierarchical centering and parameter expansion can easily be combined in the same model. Here we perform centering on the year classification and parameter expansion on the other 3 hierarchies.

19 Effective Sample sizes
As we can see below the effective sample sizes for all parameters are improved for this formulation while running time remains approximately the same. Parameter WinBUGS originally WinBUGS combined Fixed Effect 602 34296 Year 29604 34817 Nestbox 788 5170 Male 33 557 Female 3685 8580 Observation 135 1431 Time 2601s 2526s

20 Application 2: Contraceptive discontinuation in Indonesia
Steele et al. (2004) use multilevel multistate models to study transitions in and out of contraceptive use in Indonesia. Here we consider a simplification of their model which considers only the transition from use to non-use, commonly referred to as contraceptive discontinuation. The data come from the 1997 Indonesia Demographic and Health Survey. Contraceptive use histories were collected retrospectively for the six-year period prior to the survey, and include information on the month and year of starting and ceasing use, the method used, and the reason for discontinuation. The analysis is based on 17,833 episodes of contraceptive use for 12,594 women, where an episode is defined as a continuous period of using the same method of contraception. Restructuring the data to discrete-time format with monthly time intervals leads to 365,205 records. To reduce the size, monthly intervals are grouped into six-month intervals and a binomial response is defined with denominator equal to the number of months of use within a six-month interval. Aggregation of intervals leads to a dataset with 68,515 records.

21 Model We here have intervals nested within episodes nested within women. Here we include discrete categories for the duration terms zt – a piecewise constant hazard with categories representing 6-11 months, months, months and >35 months with a base category of 0-5 months.

22 Predictors At the episode level we have:
Age categorized as <25,25-34 and Contraceptive method categorized as pill/injectable, Norplant/IUD, other modern and traditional. At the woman level we have: Education (3 categories). Type of region of residence (urban/rural). Socioeconomic status (low, medium or high).

23 Results The table below gives the effective sample sizes based on runs of 250,000 iterations. Param N Centred Centred α0 1665 28 β4 20517 18820 α1 13405 91 β5 21118 19313 α2 11553 81 β6 2036 50 α3 11900 126 β7 2093 44 α4 10463 161 β8 16965 79 β1 13855 163 β9 6488 33 β2 15933 11840 β10 6876 55 β3 19911 20562 σ2u 14 Here the hierarchical centered formulation does really badly. This is because the cluster variance σ2u is very small: estimates of and for the two methods

24 A closer look at the residuals
It is well known that hierarchical centering works best if the cluster level variance is substantial. Here we see that both the variance is small and the distribution of the residuals is not very normal. This is due to a few women who discontinue usage very quickly and often, whilst many women never discontinue! Std residuals Normal scores

25 Simple logistic regression
We will consider first removing the random effects from the model (due to their small variance) which will result in a simple logistic regression model. It will then be impossible to perform hierarchical centering however we will consider the use of orthogonalisation. Note that Hills and Smith (1992) talk about using orthogonal parameterisations and Roberts and Gilks give it one sentence in ‘MCMC in Practice’. Here we consider it in combination with the simple (single site) random walk Metropolis sampler where reduction of correlation in the posterior is perhaps most important.

26 Method 3:Orthogonal parameterisation
For simplicity assume we have all predictors in one matrix P and that we can write ztα+xtijβ as ptijθ where θ=(α,β). Step 1: Number the predictors in some ordering 1,…,N. Step 2: Take each predictor in turn and replace it with a predictor that is orthogonal to all the already considered predictors. For predictor pk. Note this requires solving k-1 equations in k-1 unknown w parameters. A different orthogonal set of predictors results from each ordering.

27 Orthogonal parameterisation
The second step of the algorithm produces both a set of orthogonal predictors that span the same space as the original predictors and a group of w coefficients that can be combined to form a lower diagonal matrix W. We can fit this model and recover the coefficients for the original predictors by pre-multiplication by WT. It is worth noting here that we use improper Uniform priors for the coefficients and if we used proper priors we would need to also calculate the Jacobian for the reparameterisation to ensure the same priors are used. We ordered the predictors in what follows so that the level 2 predictors were last before performing reparameterisation.

28 Results The following is based on 50,000 iterations: Param Original
Orthogonal α0 403 11578 β4 11306 12010 α1 4249 12049 β5 9877 12023 α2 3617 11878 β6 500 10945 α3 4643 11920 β7 514 11610 α4 5260 11864 β8 5646 10591 β1 5114 12908 β9 1466 11214 β2 7787 10188 β10 1686 10249 β3 10291 8518 Here we see almost universal benefit of the orthogonal parameterisation with virtually zero time costs and very little programming!

29 Combining orthogonalisation with parameter expansion
Combining orthogonalisation and parameter expansion we have: We ran this model using WinBUGS and only 25,000 iterations following a burnin of 500 iterations which took 34 hours compared to 23½ for 250k in MLwiN without parameter expansion. The results are given overleaf.

30 Results for full model Here we compare simply using the orthogonal approach in MLwiN for 250k with both orthogonal predictors and parameter expansion in WinBUGS for 25k. Note this takes ~1.5 times as long: Param Orthog. MLwiN Expan α0 22714 14009 β4 23533 23488 α1 23931 22017 β5 24498 23792 α2 24136 15024 β6 23816 22546 α3 23303 4859 β7 23428 24422 α4 22457 1881 β8 22860 22995 β1 23347 22609 β9 23697 23960 β2 22779 20883 β10 23624 23383 β3 22105 14032 σ2u 20 318

31 Trace plots of variance chain
Here we see the far greater mixing of the variance chain after parameter expansion. It is worth noting that parameter expansion uses a different prior for σ2u and results in an estimate of (0.048) as opposed to (0.006) without and earlier estimates of 0.041(0.026) and (0.018) before orthogonalisation. Note however that all estimates bar parameter expansion are based on very low ESS! Before Parameter Expansion After Parameter Expansion

32 Conclusions Hierarchical centering – as is well known - works well when the cluster variance is big but is no good for small variances Other research building on hierarchical centering includes Papaspiliopoulus et al. (2003,2007) showing how to construct partially non-centred parameterisations. Parameter expansion works well to improve mixing when the cluster variance is small but results in a different prior for the variance. Orthogonalisation of predictors appears to be a good idea generally but is slightly more involved than the other reparameterisations i.e. the predictors need orthogonalising outside WinBUGS and the chains need transforming back. An interesting area of further research is choosing the order for orthogonalisation i.e. which set of orthogonal predictors to use.

33 Current Work E-STAT node in the ESRC funded NCeSS program began in September 2009 including researchers from Bristol, Southampton, Manchester, IOE and Stirling. Software to cater for all types of users including novice practitioners, advanced practitioners and software developers. Based around a new algebraic processing system and MCMC engine written in Python but also includes interoperability with other packages. A series of model templates for specific model families are being developed with the idea being that advanced users develop their own (domain specific) templates. A browser based user interface is being developed. Website: or see me for a demonstration of current version.

34 References Browne, W.J. (2003). MCMC Estimation in MLwiN. London: Institute of Education, University of London Browne, W.J. (2004). An illustration of the use of reparameterisation methods for improving MCMC efficiency in crossed random effect models Multilevel Modelling Newsletter 16 (1): 13-25 Browne, W.J., Steele F., Golalizadeh, M., and Green M.J. (2009) The use of simple reparameterizations to improve the efficiency of Markov chain Monte Carlo estimation for multilevel models with applications to discrete time survival models Journal of Royal Statistical Society, Series A. 172: Gamerman D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing. 7, Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995) Efficient parameterisations for normal linear mixed models. Biometrika 83, Gelman, A., Huang, Z., van Dyk, D., and Boscardin, W.J. (2007). Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics (to appear). Hills, S.E. and Smith, A.F.M. (1992) Parameterization Issues in Bayesian Inference. In Bayesian Statistics 4, (J M Bernardo, J O Berger, A P Dawid, and A F M Smith, eds), Oxford University Press, UK, pp

35 References cont. Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85 (4): Liu, J.S., Wu, Y.N. (1999) Parameter Expansion for Data Augmentation. Journal Of The American Statistical Association 94: Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2003) Non-centred Parameterisations for Hierarchical Models and Data Augmentation. In Bayesian Statistics 7, (J M Bernardo, M J Bayarri, J O Berger, A P Dawid, D Heckerman, A F M Smith and M West, eds), Oxford University Press, UK, pp Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2007) A General Framework for the Parametrization of Hierarchical Models. Statistical Science 22, Rasbash, J., Browne, W.J., Healy, M, Cameron, B and Charlton, C. (2000). The MLwiN software package version London: Institute of Education, University of London. Steele, F., Goldstein, H. and Browne, W.J. (2004). A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics. Statistical Modelling 4: Van Dyk, D.A., and Meng, X-L. (2001) The Art of Data Augmentation. Journal of Computational and Graphical Statistics. 10,


Download ppt "MCMC efficiency in Multilevel Models"

Similar presentations


Ads by Google