A Comparison of Two MCMC Algorithms for Hierarchical Mixture Models
Russell Almond
Florida State University, College of Education, Educational Psychology and Learning Systems
BMAW 2014

Cognitive Basis
Multiple cognitive processes are involved in writing.
Different processes take different amounts of time:
– Transcription should be fast
– Planning should be long
Certain event types should be more common than others.

Multi-Process Model of Writing
[Figure: Deane (2009, AERA) model of writing.]

Mixture Models

Random Data

Within-Word Pauses

Mixture of Lognormals
Log pause time: Y_ij = log(X_ij)
– Student (Level-2 unit): i = 1, …, I
– Pause (Level-1 unit): j = 1, …, J_i
Z_ij ~ cat(π_i1, …, π_iK) is an indicator for which of the K components the j-th pause for the i-th student is in.
Y_ij | Z_ij = k ~ N(μ_ik, σ_ik)
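As a concrete illustration (not from the original slides), here is a minimal R sketch that simulates pause times for a single student under this model; the values of π, μ, and σ are made up.

    # Minimal R sketch: pause times for one student from a K = 2 mixture of
    # lognormals.  All parameter values here are illustrative, not estimates.
    set.seed(42)
    J   <- 200                      # number of pauses for this student
    pi_ <- c(0.7, 0.3)              # mixing probabilities pi_i1, pi_i2
    mu  <- c(-1.5, 0.5)             # component means on the log scale
    sig <- c(0.4, 0.8)              # component s.d.'s on the log scale

    Z <- sample(1:2, J, replace = TRUE, prob = pi_)  # latent indicators Z_ij
    Y <- rnorm(J, mean = mu[Z], sd = sig[Z])         # log pause times Y_ij
    X <- exp(Y)                                      # pause times X_ij (seconds)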

Mixture Model
[Plate diagram: mixing probabilities π_i; component means and standard deviations μ_ik, σ_ik; augmented data Y*_ijk; component indicators Z_ij; observed log pause times y_ij; plates for Essay i, Event j, and Component k.]

Mixture Model Problems
If π_ik is small for some k, then that category disappears.
If σ_ik is small for some k, then that category becomes degenerate.
If μ_ik = μ_ik' and σ_ik = σ_ik', then there are really only K − 1 categories.

Labeling Components
If we swap the labels on components k and k', the likelihood is identical, so the likelihood is multimodal.
Often a restriction is put on the components: μ_i1 < μ_i2 < … < μ_iK.
Frühwirth-Schnatter (2001) notes that when doing MCMC it is better to let the chains run freely across the modes and sort out the labels post hoc.
Sorting needs to be done before the usual MCMC convergence tests or parameter estimation.
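A sketch (mine, not the talk's code) of the post-hoc sorting step: within each MCMC draw, sort the components by their means and apply the same permutation to the other component parameters.

    # Post-hoc relabeling sketch.  `mu`, `sigma`, and `p` are assumed to be
    # (number of draws) x K matrices of posterior draws for one student.
    relabel <- function(mu, sigma, p) {
      for (s in seq_len(nrow(mu))) {
        ord        <- order(mu[s, ])    # permutation giving mu_1 < ... < mu_K
        mu[s, ]    <- mu[s, ord]
        sigma[s, ] <- sigma[s, ord]
        p[s, ]     <- p[s, ord]
      }
      list(mu = mu, sigma = sigma, p = p)
    }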

Filtering Short Series
The pilot study was not taken seriously by all students.
Need 3–5 observations per mixture component to estimate the precision.
Some logs were incomplete (only recorded the last part of essay writing).
Filtering may remove the most disfluent writers.
Having very few events of a certain type may itself be diagnostic:
– No between-paragraph events
– No editing events
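A sketch of the filtering rule in R; the data below are illustrative, and the 5-events-per-component cutoff is my reading of the 3–5 rule above.

    # Keep only students with enough pauses to support K mixture components.
    # Illustrative data: log pause times `pause` with student labels `student`.
    student <- rep(c("s1", "s2", "s3"), times = c(40, 8, 25))
    pause   <- rnorm(length(student))

    K <- 2
    min_per_component <- 5
    n_events <- table(student)                              # events per student
    keep_ids <- names(n_events)[n_events >= K * min_per_component]
    keep     <- student %in% keep_ids
    pause    <- pause[keep]
    student  <- student[keep]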

Key Question: How Many Components?
Theory: each component corresponds to a different combination of cognitive processes.
Rare components might not be identifiable from the data.
Hierarchical models, which allow partial pooling across Level-2 units (students), might help answer these questions.

Hierarchical Mixture Model
[Plate diagram: as before, but the student-level parameters π_i, μ_ik, σ_ik now have cross-student hyperparameters (α_0 and α_N for the mixing probabilities; μ_0k and σ_0k for the means; τ_0k and γ_0k for the precisions); augmented data Y*_ijk, indicators Z_ij, observed y_ij; plates for Essay i, Event j, and Component k.]

Problems with Hierarchical Models
If σ_0k, γ_0k → 0 we get complete pooling.
If σ_0k, γ_0k → ∞ we get no pooling.
Something similar happens with α_N.
Need prior distributions that bound us away from those values:
log(σ_0k), log(γ_0k), log(α_N) ~ N(0, 1)
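As a quick check (my addition) of what the log-scale N(0, 1) prior implies, its central 95% interval keeps these scale parameters comfortably away from both 0 and ∞:

    # Central 95% prior interval for a parameter whose log is N(0, 1):
    qlnorm(c(0.025, 0.975), meanlog = 0, sdlog = 1)
    # approximately 0.14 and 7.10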

Two MCMC Packages
JAGS
– Random-walk Metropolis or Gibbs sampling
– Has a special proposal for normal mixtures
– Can extend a run if the initial length is insufficient
– Can select which parameters to monitor
Stan
– Hamiltonian Monte Carlo: cycles take longer, but less autocorrelation
– Cannot extend runs
– Must monitor all parameters
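For orientation, here is roughly how the two samplers are driven from R; the model file names, data list, inits, and monitored parameters below are placeholders, not the talk's actual files.

    # Sketch of driving both samplers from R (file and object names are placeholders).
    library(rjags)
    library(rstan)

    jm <- jags.model("hier_mixture.jags", data = data_list,
                     inits = init_list, n.chains = 3)
    update(jm, 2000)                                  # burn-in; run can be extended later
    jags_samps <- coda.samples(jm, variable.names = c("mu0", "sigma0", "alpha0"),
                               n.iter = 5000)         # monitor selected parameters

    stan_fit <- stan(file = "hier_mixture.stan", data = data_list,
                     chains = 3, iter = 7000, warmup = 2000)  # monitors everything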

Add Redundant Parameters to Make MCMC Faster
μ_ik = μ_0k + θ_ik σ_0k, with θ_ik ~ N(0, 1)
log(τ_ik) = log(τ_0k) + η_ik γ_0k, with η_ik ~ N(0, 1)
α_k = α_0k α_N, with α_0 ~ Dirichlet(α_0m) and α_N ~ χ²(2I)
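A small R illustration (mine) of the first expansion: drawing θ_ik ~ N(0, 1) and transforming gives the same distribution as drawing μ_ik ~ N(μ_0k, σ_0k) directly, but the expanded (non-centered) form typically mixes better in the samplers. The hyperparameter values are illustrative.

    # Non-centered draw of student-level means (illustrative hyperparameters).
    I      <- 10                                  # number of students
    mu0    <- c(-1.5, 0.5)                        # mu_0k for K = 2 components
    sigma0 <- c(0.3, 0.3)                         # sigma_0k
    theta  <- matrix(rnorm(I * 2), I, 2)          # theta_ik ~ N(0, 1)
    mu_ik  <- sweep(sweep(theta, 2, sigma0, "*"), 2, mu0, "+")
    # mu_ik[i, k] = mu_0k + theta_ik * sigma_0k, i.e. N(mu_0k, sigma_0k)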

Initial Values
1. Run EM on each student's data set to get student-level (Level 1) parameters.
   – If EM does not converge, set the parameters to NA.
2. Calculate cross-student (Level 2) parameters as summary statistics of the Level 1 parameters.
3. Impute means for the missing Level 1 parameters.
Repeat with subsets of the data for variety in the starting values for multiple chains (a sketch of these steps follows).
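A sketch of steps 1–3 in R; `normalmixEM` from the mixtools package stands in for whichever EM routine was actually used, and the data objects are illustrative.

    # Per-student EM fits for starting values (mixtools used as a stand-in).
    library(mixtools)
    set.seed(1)
    K <- 2
    # Illustrative data: log pause times `Y` with student labels `student`.
    student <- rep(1:5, each = 80)
    comp    <- sample(1:2, length(student), replace = TRUE, prob = c(0.6, 0.4))
    Y       <- rnorm(length(student), mean = c(-1.5, 0.5)[comp], sd = c(0.4, 0.8)[comp])

    fit_one <- function(y) {
      out <- tryCatch(normalmixEM(y, k = K, maxit = 500, verb = FALSE),
                      error = function(e) NULL)
      if (is.null(out)) return(rep(NA_real_, 3 * K))  # Step 1: NA if EM fails
      c(out$lambda, out$mu, out$sigma)                # pi_ik, mu_ik, sigma_ik
    }
    level1 <- t(sapply(split(Y, student), fit_one))   # one row per student
    level2 <- colMeans(level1, na.rm = TRUE)          # Step 2: cross-student summaries
    for (j in seq_len(ncol(level1)))                  # Step 3: impute means for NAs
      level1[is.na(level1[, j]), j] <- level2[j]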

Simulated Data Experiment
Run the initial-value routine on the real data for K = 2, 3, 4.
Generate data from the model using these parameters.
Fit models with K′ = 2, 3, 4 to the data from the true K = 2, 3, 4 distributions, in both JAGS (RWM) and Stan (HMC).
Results shown for K = 2, K′ = 2.
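The design can be written as a double loop; `simulate_mixture`, `fit_jags`, and `fit_stan` below are hypothetical wrappers (around calls like those sketched earlier), not functions from the talk's code.

    # Sketch of the simulation design (hypothetical helper functions).
    results <- list()
    for (K_true in 2:4) {
      sim_data <- simulate_mixture(start_values[[K_true]])   # data from true K
      for (K_fit in 2:4) {
        key <- paste0("K", K_true, "_Kprime", K_fit)
        results[[key]] <- list(
          jags = fit_jags(sim_data, K = K_fit),   # RWM / Gibbs run
          stan = fit_stan(sim_data, K = K_fit))   # HMC run
      }
    }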

Results (Mostly K = 2, K′ = 2)
All results: http://pluto.coe.fsu.edu/mcmc-hierMM
Deviance / log posterior: good mixing.
μ_01 (average mean of the first component): slow mixing in JAGS.
α_01 (average probability of the first component): poor convergence, label switching in Stan.
γ_01 (s.d. of log precisions for the first component): poor convergence, multiple modes?
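The convergence judgments above come from standard diagnostics; a sketch of how they can be computed with the coda package, using the placeholder object `jags_samps` from the earlier sketch:

    # Convergence checks on the JAGS output (an mcmc.list with 3 chains).
    library(coda)
    gelman.diag(jags_samps)          # potential scale reduction factors
    traceplot(jags_samps)            # visual check for mixing / label switching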

Other Results
Convergence is still an issue: MCMC finds multiple modes, where EM (without restarts) typically finds only one.
JAGS (RWM) was about 3 times faster than Stan (HMC), but the Monte Carlo s.e. in Stan was about 5 times smaller (about 25 times larger effective sample size).
JAGS is still easier to use than Stan.
Could not use the WAIC statistic to recover K: it was about the same for K′ = 2, 3, 4.
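The efficiency comparison rests on effective sample size and Monte Carlo standard error per parameter; with coda and rstan these can be read off as follows (again using the placeholder objects from the earlier sketch):

    # Efficiency measures from the sampler output (placeholder object names).
    library(coda)
    effectiveSize(jags_samps)                            # effective sample size
    summary(jags_samps)$statistics[, "Time-series SE"]   # Monte Carlo s.e.
    # For the Stan fit:
    # summary(stan_fit)$summary[, c("n_eff", "se_mean")]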

Implications for Student Essays
The model does not seem to recover "rare" components.
It does not offer a big advantage over the simpler non-hierarchical (no pooling) model.
It ignores serial dependence in the data: a hidden Markov model might be better than a straight mixture.

Try It Yourself: http://pluto.coe.fsu.edu/mcmc-hierMM
– Complete source code (R, Stan, JAGS), including data generation
– Sample data sets
– Output from all of my test runs, including trace plots for all parameters
– Slides from this talk