PSY 626: Bayesian Statistics for Psychological Science


What is STAN doing? Greg Francis, PSY 626: Bayesian Statistics for Psychological Science, Fall 2018, Purdue University. 11/20/2018

Bayes formula We provide the prior and the data, and we want to compute the posterior distribution: P(a | x) = P(x | a) P(a) / P(x). Here a is some parameter (e.g., intercept, slope, sigma) and x is the data (a set of scores). In practice we work with probability densities (or likelihoods) rather than probabilities.

Bayes formula For any given value of a we can compute most of what we need. P(a) is the prior distribution, e.g., N(30, 1); easy to calculate. P(x | a) is the likelihood of the data given a parameter value, e.g., prod(dnorm(x, mean=a, sd=1)) in R. P(x) is not so easy to compute, but we can often ignore it because it does not depend on a.

Brute force It might seem that to compute the posterior distribution, we could just run through all possible values of a and compute P(a | x) for each. Indeed, for simple problems you can sometimes do exactly that!
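A minimal sketch of this brute-force grid approximation (in Python, with made-up scores; the N(30, 1) prior matches the example mentioned earlier, and normalizing over the grid stands in for dividing by P(x)):

```python
import math

def normal_pdf(v, mean, sd=1.0):
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

x = [28.1, 30.4, 29.2, 31.0, 29.7]             # hypothetical scores
a_grid = [25 + 0.01 * i for i in range(1001)]  # candidate values of a

post = []
for a in a_grid:
    prior = normal_pdf(a, 30.0)                          # prior N(30, 1)
    like = math.prod(normal_pdf(xi, a) for xi in x)      # P(x | a)
    post.append(prior * like)                            # unnormalized posterior

total = sum(post)                # normalizing over the grid stands in for P(x)
post = [p / total for p in post]
mode = a_grid[max(range(len(post)), key=post.__getitem__)]
print(round(mode, 2))            # near 29.7: between the data mean and the prior mean
```

With one parameter and a modest grid this is trivial; the challenge described next is why it does not scale.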

Challenges In practice you cannot try all values of a: computers cannot represent irrational numbers, cannot run from negative infinity to positive infinity, and it would take infinite time even to consider all the rational numbers. So in practice we consider a subset of possible values of a: a grid of possible values, or an algorithm with an adaptive grid size that focuses on important/useful values (e.g., near the maximum likelihood, or above some threshold). How to do this well is complicated.

Monte Carlo methods A class of algorithms that use pseudorandom numbers to help estimate something. What is the area of a circle with a radius of r = 5 inches? We know the answer: A = pi * r^2, which is about 78.5 square inches.

Monte Carlo methods A class of algorithms that use pseudorandom numbers to help estimate something. What is the area of a circle with a radius of r = 5 inches? Suppose we randomly assign 20 points to different positions in the surrounding 10 x 10 square. We know the square has an area of 100 square inches. Here, 15 points land in the circle. That suggests that the area of the circle is close to (15/20) x 100 = 75 square inches, not far from the true value of about 78.5.
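The estimate sharpens as the number of points grows. A minimal sketch of the same idea in Python (an illustrative sketch using 100,000 points rather than 20):

```python
import random

random.seed(1)

r = 5.0
n = 100_000
inside = 0
for _ in range(n):
    # random point in the 10 x 10 square that bounds the circle
    px = random.uniform(-r, r)
    py = random.uniform(-r, r)
    if px * px + py * py <= r * r:
        inside += 1

# the square's area is 100, so the hit fraction scales it to the circle's area
area = 100.0 * inside / n
print(area)  # should approach pi * r^2, about 78.54
```

The error of such an estimate shrinks like 1/sqrt(n), which is why more samples always help.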

Monte Carlo methods This method works for essentially any shape Provided you can determine whether a point is inside or outside the shape (not always easy to do)

Estimating a distribution Suppose there exists some probability density function (distribution) that you want to estimate One approach is to take random draws from the pdf and make a histogram of the resulting values

Estimating a distribution Just taking random draws from the pdf gets complicated, mainly because it is not clear how to define "random" draws. Oftentimes, we can calculate the pdf for any given parameter value, but we do not know what parameter values to consider. Guessing or brute force is not likely to work because of the curse of dimensionality: the neighborhood around a given volume is larger than the volume itself, and it gets bigger as the dimensionality gets larger (a central region takes up 1/3 of a line segment, 1/9 of a square, 1/27 of a cube, and so on).

It takes a process We want a method of sampling from a distribution that reproduces the shape of the distribution (the posterior) We need to sample cleverly, so that the rate at which we sample a given value reflects the probability density of that value Markov Chain: what you sample for your next draw depends (only) on the current state of the system

Markov Chain Here's a (not so useful) Markov Chain. We have a starting value for the mean of a distribution, and we set the "current" value to be the starting value. 1) Draw a "new" sample from the distribution with the current mean value. 2) Set the current value to be the "new" value. 3) Repeat. This is a Markov chain because each draw depends only on the current state, but it is not useful for us because it does not produce an estimate of the underlying distribution: it is just a random walk that drifts forever.
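The three steps above can be sketched in a few lines of Python (an illustrative sketch; the starting value and step distribution are arbitrary choices):

```python
import random

random.seed(0)

current = 10.0      # starting value for the mean
samples = []
for _ in range(1000):
    new = random.gauss(current, 1.0)  # 1) draw from N(current, 1)
    current = new                     # 2) the draw becomes the new "current" mean
    samples.append(current)           # 3) repeat

# The chain is Markov (each draw depends only on the current state),
# but its spread grows without bound: a random walk, not a distribution estimate.
print(samples[:3])
```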

Metropolis algorithm Pick some starting value and set it to be the "current" value; compute the posterior for the current value. Propose a new value by adding to the current value a deviation drawn from a normal distribution, and compute the posterior for the new value. Replace the current value with the new value if a random number between 0 and 1 is less than the ratio P(new | x) / P(current | x). Note that if the new value has a higher posterior than the current value, this ratio is > 1, so you will always replace the current value with the new value.
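These steps can be sketched directly (an illustrative Python sketch: an unnormalized N(0, 1) density stands in for the posterior, since the normalizing constant P(x) cancels in the ratio):

```python
import math
import random

random.seed(2)

def target(v):
    # stand-in for the (unnormalized) posterior density
    return math.exp(-0.5 * v * v)

current = 5.0                 # starting value
samples = []
for _ in range(20_000):
    proposal = current + random.gauss(0.0, 1.0)   # jitter the current value
    ratio = target(proposal) / target(current)    # posterior ratio
    if random.random() < ratio:                   # accept with prob min(1, ratio)
        current = proposal
    samples.append(current)                       # stay put otherwise

kept = samples[5000:]         # discard warm-up
mean = sum(kept) / len(kept)
print(round(mean, 2))         # near 0, the mean of the target
```

Note that rejected proposals still contribute a (repeated) sample of the current value; that repetition is what makes the sample frequencies track the posterior.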

Metropolis algorithm What happens as the process unfolds? If you are in a parameter region with low posterior values, you tend to transition to a parameter region with higher posterior values. If you are in a parameter region with high posterior values, you tend to stay put, but you will move to regions with lower posterior values some of the time. In general, the proportion of samples from different regions reflects the posterior values at those regions. Thus, your samples end up describing the posterior distribution.

Metropolis algorithm Here, a green dot means a new sample is "accepted": you move to the new value. A red dot indicates a new sample is "rejected": you stay put.

Sampling It should be clear that some posterior distributions will be easier to estimate than others. How different should the "new" sample be from the "current" sample? Too small, and you will miss regions separated by a "gap" of low posterior probability; too big, and you waste time sampling from regions with low probability. (Figure panels: Easy, Harder, Harderer.)

Sampling As dimensionality increases, sampling becomes progressively harder; it is hard even to visualize when the dimension is greater than 2. You are always sampling a vector of parameter values, e.g., mu1, mu2, sigma, b1, b2, and so on. With a dependent-subjects model, you have independent parameters for each subject, so it is easy to have 100s or 1000s of dimensions. (Figure panels: Easy, Harder.)

Metropolis algorithm The posterior density function defines an (inverted) energy surface. The Metropolis (or Metropolis-Hastings) algorithm moves around the energy surface in such a way that it spends more time in low-energy (high probability) regions than in high-energy (low probability) regions. There is a nice simulation at http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html showing the effect of step size; rejected samples are wastes of time.

Methods There are a variety of methods for guiding the search process. Clever ways of adapting the size of the differences between current and new choices (many variations; Stan automates their selection). Gibbs sampling: focus on one dimension at a time. Stan's approach (Hamiltonian Monte Carlo): treat the parameter vector as a physical item that moves around the space of possible parameter vectors, with a velocity and momentum that are used to pick a "new" choice for consideration. This is often faster and more efficient.

Hamiltonian MC Each step is more complicated to calculate (it generates a trajectory for the "item"): leapfrog steps compute position and momentum. But the choices tend to be better overall (fewer rejected samples). See the simulation at http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html for the effect of slide time; rejected samples are a waste of time (and rare)! NUTS stands for No U-Turn Sampler; it is how the HMC plans (stops) the trajectory. At the start of each chain we see something like:

Gradient evaluation took 0.00093 seconds
1000 transitions using 10 leapfrog steps per transition would take 9.3 seconds.
Adjust your expectations accordingly!

At the end of each model summary, we see:

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
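The leapfrog trajectory at the heart of each HMC step can be sketched as follows (an illustrative Python sketch for a one-dimensional N(0, 1) target, whose potential energy is U(q) = q^2 / 2; real HMC wraps a momentum draw and an accept/reject step around this):

```python
def leapfrog(q, p, step, n_steps):
    """One trajectory for U(q) = q^2 / 2, so the gradient dU/dq is just q."""
    p -= 0.5 * step * q          # half step for momentum
    for _ in range(n_steps - 1):
        q += step * p            # full step for position
        p -= step * q            # full step for momentum
    q += step * p                # last full step for position
    p -= 0.5 * step * q          # final half step for momentum
    return q, p

# Total energy H = U(q) + p^2 / 2 is nearly conserved along the trajectory,
# which is why HMC proposals are rarely rejected.
q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, step=0.1, n_steps=10)
h0 = 0.5 * q0 ** 2 + 0.5 * p0 ** 2
h1 = 0.5 * q1 ** 2 + 0.5 * p1 ** 2
print(abs(h1 - h0))  # small: the energy error scales with step size squared
```

When the step size is too large for the geometry of the posterior, this near-conservation breaks down, which is exactly the "divergent transitions" problem discussed next.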

Hamiltonian MC Step size is still an issue. Stan tries to adjust the step size to fit the situation, but it can fail to do so in some cases (often when the step size is too big and misses some features of the posterior). The failure gets noticed at the very end (when it is too late to adjust):

model4 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3)
…
Elapsed Time: 1.03518 seconds (Warm-up) 1.88005 seconds (Sampling) 2.91523 seconds (Total)

Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
(warnings 2-5 repeat the same message)
6: There were 11969 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: Examine the pairs() plot to diagnose sampling problems

Hamiltonian MC Step size is still an issue. We can follow the warning's advice and raise adapt_delta: we get a smaller number of divergent transitions, but it takes quite a bit longer:

model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.95))
…
Elapsed Time: 6.89872 seconds (Warm-up) 3.51857 seconds (Sampling) 10.4173 seconds (Total)

Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
(warnings 2-5 repeat the same message)
6: There were 8570 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: Examine the pairs() plot to diagnose sampling problems

Hamiltonian MC We can follow the warning's advice again and use an even larger adapt_delta. It takes quite a bit longer, and here it did not actually reduce the number of divergent transitions. There is also a new warning related to tree depth (you can set this variable too, but it often does not matter much):

model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.99))
…
Elapsed Time: 8.90083 seconds (Warm-up) 33.6789 seconds (Sampling) 42.5797 seconds (Total)

Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
(warnings 2-5 repeat the same message)
6: There were 11180 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
8: Examine the pairs() plot to diagnose sampling problems

Divergence Divergent transitions are inconsistent estimates that arise because the step sizes are too big. We can inspect them with ShinyStan:

model3 <- brm(RT_ms ~ Target*NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3)
launch_shinystan(model3)
Launching ShinyStan interface... for large models this may take some time.
Loading required package: shiny
Listening on http://127.0.0.1:3422

Divergence Model3 (based on a normal distribution) is good. Red dots indicate divergence (none here).

Divergence Model5 (based on an exgaussian distribution) looks bad. Red dots indicate divergence (lots here).

Divergence Oftentimes, divergence is a warning that your model specification is not a good one. E.g., the exgaussian may be a poor choice for modeling these data; your priors may be making a good fit nearly impossible; or you may be including independent variables that are correlated, which makes it difficult to converge on one solution. Parameterizing the model in a different way sometimes helps, for example centering predictors; setting priors can also help. In this particular case, fiddling with adapt_delta did not seem to help at all, and I was unable to get a good result for the exgaussian. Perhaps that is because we have such a small data set (one subject).

Sample size The problem goes away if we use the full data set:

VSdata2 <- subset(VSdata, VSdata$DistractorType=="Conjunction")
model7 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3)
…
Elapsed Time: 78.118 seconds (Warm-up) 91.1596 seconds (Sampling) 169.278 seconds (Total)

Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
(warnings 2-5 repeat the same message)

> model7
Family: exgaussian
Links: mu = identity; sigma = identity; beta = identity
Formula: RT_ms ~ Target * NumberDistractors
Data: VSdata2 (Number of observations: 2720)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000

Population-Level Effects:
                                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                        1171.58     24.67  1123.30  1219.96      11870 1.00
TargetPresent                     -59.42     31.33  -121.29     1.84      12077 1.00
NumberDistractors                  16.99      0.71    15.59    18.37      11178 1.00
TargetPresent:NumberDistractors   -10.53      0.90   -12.31    -8.76      10916 1.00

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma   350.19     12.22   326.73   374.00      13159 1.00
beta    544.63     17.63   510.65   579.21      12275 1.00

Priors help Suppose you use a broad (non-informative) prior, such as a uniform distribution, and your search process starts in a region with flat posterior values. You always move to a new set of parameter values (the acceptance ratio is 1), but you move aimlessly, so it takes a long time to reach and sample from regions of higher probability.

Priors help Even a slightly non-uniform prior helps get the search process moving. If the prior moves you in the general direction of the posterior, it can help a lot. It could hurt the search in some cases, but then you probably have other problems! Once you get close to where the data likelihood dominates, the prior will hardly matter.

Using brms A standard brm call that we have used many times:

model1 = brm(Leniency ~ SmileType, data = SLdata, iter = 2000, warmup = 200, chains = 3, thin = 2 )
print(summary(model1))
plot(model1)

Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.36      0.29     4.80     5.93       2620 1.00
SmileTypeFelt         -0.46      0.41    -1.25     0.34       2556 1.00
SmileTypeMiserable    -0.45      0.40    -1.24     0.32       2672 1.00
SmileTypeNeutral      -1.25      0.40    -2.02    -0.44       2433 1.00

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.46     1.87       2650 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).

Number of chains/iterations Let's break it down to smaller sets so we can understand what is happening:

model2 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 1, thin = 2 )
print(summary(model2))
plot(model2)

Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 1 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 90

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.35      0.31     4.78     5.92         59 0.99
SmileTypeFelt         -0.43      0.43    -1.24     0.35         66 0.99
SmileTypeMiserable    -0.44      0.42    -1.29     0.39         61 0.99
SmileTypeNeutral      -1.27      0.41    -2.18    -0.56         90 0.99

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.62      0.10     1.43     1.81         90 0.99

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).

Chains Independent sampling "chains" are combined at the end to make a better estimate. Why bother? Parallel processing: if your computer has multiple processors, you can split the chains across processors and get faster results. Convergence checking: independent chains should converge to the same posterior distribution; if not, then one or more chains may not be sampling well (and you should not trust the estimated posterior to be accurate). Rhat is a check on whether the chains converge.

Number of chains/iterations Let's break it down to smaller sets so we can understand what is happening:

model3 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2 )
print(summary(model3))
plot(model3)

Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.35      0.43     4.70     5.93         87 1.05
SmileTypeFelt         -0.46      0.39    -1.23     0.26        123 1.02
SmileTypeMiserable    -0.47      0.43    -1.35     0.31        102 1.03
SmileTypeNeutral      -1.24      0.43    -2.18    -0.51        102 1.01

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.67      0.21     1.46     1.90        123 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).

Rhat Similar to an F value in an ANOVA: B measures the variability between chains, and W measures the variability within chains (pooled across all chains); Rhat is essentially the square root of a weighted combination of B and W divided by W. When Rhat is noticeably greater than 1, that is evidence that the chains have not converged. Rhat = 1 does not necessarily mean the chains have converged. In practice, we have confidence in convergence if Rhat is close to 1 (a common rule of thumb is Rhat < 1.1). Otherwise, increase the number of iterations (or change the model, e.g., use more informative priors).
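The B-versus-W comparison can be sketched as follows (a Python sketch of the classic Gelman-Rubin formula with made-up chains; brms/Stan report a refined version computed on split chains):

```python
def rhat(chains):
    """Potential scale reduction factor for equal-length chains (classic form)."""
    m = len(chains)               # number of chains
    n = len(chains[0])            # draws per chain
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # between-chain variance component
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # within-chain variance, pooled across chains
    w = sum(sum((v - mu) ** 2 for v in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n   # pooled estimate of the posterior variance
    return (var_hat / w) ** 0.5

# two chains sampling the same region: Rhat near 1
good = [[0.1, -0.2, 0.05, 0.3, -0.1, 0.0], [0.2, -0.1, 0.0, -0.3, 0.15, 0.05]]
# two chains stuck in different regions: Rhat far above 1
bad = [[5.1, 4.9, 5.0, 5.2, 4.8, 5.0], [-5.0, -5.1, -4.9, -5.2, -4.8, -5.0]]
print(rhat(good), rhat(bad))
```

When chains disagree, B dwarfs W and Rhat explodes, which is exactly the signal that the combined samples should not be trusted.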

Iterations Think of it as similar to sample size for a confidence interval: a bigger sample size leads to a smaller standard error of the mean, and so a tighter confidence interval. How much you need depends on what questions you want to answer; the tails of a distribution are harder to estimate than the center. Rule of thumb: 4000 iterations for reasonable accuracy at the 2.5th and 97.5th percentiles. You might do better with Stan, though.

Autocorrelation The nature of MCMC methods is that each new sample is taken "close" to the current sample. Over the short run, this can lead to correlations between samples, which makes estimating the posterior imprecise. It sorts itself out over the long run. There are lots of tools to look for autocorrelation:

> launch_shinystan(model3)
Launching ShinyStan interface... for large models this may take some time.
Loading required package: shiny
Listening on http://127.0.0.1:3422

Autocorrelation An autocorrelation plot shows the correlation between samples as a function of "lag" (the number of iterations apart). The plot should drop dramatically as lag increases; if not, then the sampler might be caught in a "loop". The autocorrelation plots look fine for model3.

Autocorrelation You want the bars to drop quickly as lag increases; this indicates low autocorrelation. (Figure panels: Bad; Better, but still bad; Good.)

Autocorrelation A lot of your intuitions about sampling come into play here. Ideally, you want independent samples. Suppose you were estimating the mean: the standard error of the estimate from the posterior distribution would be sigma_p / sqrt(T), where sigma_p is the standard deviation of the posterior distribution and T is the sample size (number of iterations times number of chains). Correlated samples provide less information about your estimate than this formula suggests.

Effective Sample Size If there is correlation in the samples, the formula sigma_p / sqrt(T) understates the true standard error: we act as if we have more information than we really do. The loss of information from autocorrelation is expressed as an "effective sample size" T* = T / (1 + 2 * sum of the autocorrelations across lags), which is smaller than T when the autocorrelations are positive. You want to be sure your T* is big enough to properly estimate your parameters. It is a crude measure, so plan accordingly.
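The idea can be sketched directly (an illustrative Python sketch with made-up chains; real implementations such as Stan's use a smarter truncation of the autocorrelation sum):

```python
import random

def ess(samples):
    """Effective sample size: T* = T / (1 + 2 * sum of lag autocorrelations).

    A crude sketch: the sum is truncated at the first negative
    autocorrelation, one common rule of thumb.
    """
    t = len(samples)
    mean = sum(samples) / t
    var = sum((s - mean) ** 2 for s in samples) / t
    acsum = 0.0
    for lag in range(1, t):
        ac = sum((samples[i] - mean) * (samples[i + lag] - mean)
                 for i in range(t - lag)) / ((t - lag) * var)
        if ac < 0:
            break
        acsum += ac
    return t / (1 + 2 * acsum)

random.seed(3)
# independent draws keep most of their nominal sample size...
iid = [random.gauss(0, 1) for _ in range(500)]
# ...while a sticky chain (an AR(1) walk) loses most of it
walk = [0.0]
for _ in range(499):
    walk.append(0.95 * walk[-1] + random.gauss(0, 1))
print(ess(iid) > ess(walk))  # True: correlated draws carry less information
```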

Thinning Should you have an autocorrelation problem, you can try to mitigate it by "thinning" the samples: just keep every nth sample. You can also just increase the number of iterations.

model5 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2 )
print(summary(model5))

Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.30      0.33     4.66     5.81         58 1.01
SmileTypeFelt         -0.38      0.45    -1.19     0.49         30 1.06
SmileTypeMiserable    -0.38      0.43    -1.39     0.48         84 0.99
SmileTypeNeutral      -1.20      0.44    -2.10    -0.44         74 1.00

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.50     1.85        145 1.01

model4 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 1 )
print(summary(model4))

Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 1; total post-warmup samples = 360

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.40      0.29     4.78     5.92        129 1.00
SmileTypeFelt         -0.51      0.42    -1.34     0.29        144 1.00
SmileTypeMiserable    -0.49      0.41    -1.28     0.31        133 1.00
SmileTypeNeutral      -1.30      0.41    -2.10    -0.54        128 1.01

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.45     1.84        360 1.00

Conclusions We covered what Stan is doing: Markov chain Monte Carlo, Hamiltonian MC, and convergence diagnostics. Most of this happens in the background, and you do not need to be an expert.