PSY 626: Bayesian Statistics for Psychological Science 11/20/2018 What is STAN doing? Greg Francis PSY 626: Bayesian Statistics for Psychological Science Fall 2018 Purdue University PSY200 Cognitive Psychology
Bayes formula We provide the prior and the data, and we want to compute the posterior distribution a is some parameter (e.g., intercept, slope, sigma) x is data (a set of scores) In practice we work with probability density (or likelihoods) rather than probabilities
Bayes formula For any given value of a we can compute most of what we need P(a) is the prior distribution E.g., N(30, 1) Easy to calculate P(x |a) is the likelihood of the data, given a parameter value E.g., prod(dnorm(x, mean=a,sd=1)) P(x) is not so easy to compute, but we can often ignore it
Brute force It might seem that to compute the posterior distribution, we just run through all possible values of a, and compute P(a|x) Indeed, you can sometimes do exactly that!
Challenges In practice you cannot try all values of a Computers cannot represent irrational numbers Cannot go from negative infinity to positive infinity It would take infinite time to consider all the rational numbers In practice, we consider a subset of possible values of a A grid of possible values Use an algorithm that has an adaptive grid size to focus on important/useful values (e.g., maximum likelihood, threshold) How to do this is complicated
Monte Carlo methods A class of algorithms that use pseudorandom numbers to help estimate something What is the area of a circle with a radius of r=5”? We know the answer:
Monte Carlo methods A class of algorithms that use pseudorandom numbers to help estimate something What is the area of a circle with a radius of r=5”? Suppose we randomly assign 20 points to different positions in the surrounding square We know the square has an area 100 Here, 15 points are in the circle That suggests that the area of the circle is close to
Monte Carlo methods This method works for essentially any shape Provided you can determine whether a point is inside or outside the shape (not always easy to do)
Estimating a distribution Suppose there exists some probability density function (distribution) that you want to estimate One approach is to take random draws from the pdf and make a histogram of the resulting values
Estimating a distribution Just taking random draws from the pdf gets complicated, mainly because it is not clear how to define “random” Oftentimes, we can calculate the pdf for any given parameter value, but we do not know what parameter values to consider Guessing or brute force is not likely to work because of the curse of dimensionality The neighborhood around a given volume is larger than the volume itself It gets bigger as the dimensionality gets larger 1/3 1/9 1/27
It takes a process We want a method of sampling from a distribution that reproduces the shape of the distribution (the posterior) We need to sample cleverly, so that the rate at which we sample a given value reflects the probability density of that value Markov Chain: what you sample for your next draw depends (only) on the current state of the system
Markov Chain Here’s a (not so useful) Markov Chain We have a starting value for the mean of a distribution. We set the “current” value to be the starting value. 1) We draw a “new” sample from the distribution with the current mean value 2) We set the current value to be the “new” value 3) Repeat Not useful for us because it does not produce an estimate of the underlying distribution
Metropolis algorithm Pick some starting value, and set it to be the “current” value. Compute the posterior for the current value Adjust the starting value by adding some deviation that is drawn from a normal distribution, check the posterior for the new value Replace the current value with the new value, if a random number between 0 and 1 is less than Note, if the new value has a higher posterior than the posterior of the current value, this ratio is >1, so you will always replace the current value with the new value
Metropolis algorithm What happens as the process unfolds? If you are in a parameter region with low posterior values, you tend to transition to a parameter region with higher posterior values If you are in a parameter region with high posterior values, you tend to stay put But you will move to regions with lower posterior values some of the time In general, the proportion of samples from different regions reflect the posterior values at those regions Thus, your samples end up describing the posterior distribution
Metropolis algorithm Here, a green dot means a new sample is “accepted” You move to a new value A red dot indicates a new new sample is “rejected” You stay put
Sampling It should be clear that some posterior distributions will be easier to estimate than others How different should the “new” sample be from the “current” sample? Too small, and you will miss regions separated by a “gap” of posterior probability Too big, and you waste time sampling from regions with low probability Easy Harder Harderer
Sampling As dimensionality increases, sampling becomes progressively harder Hard to even visualize when dimension is greater than 2 You are always sampling a vector of parameter values E.g., mu1, mu2, sigma, b1, b2, … With a dependent subjects model, you have independent parameters for each subject Easy to have 100’s or 1000’s of dimensions Easy Harder
Metropolis algorithm The posterior density function defines an (inverted) energy surface function The Metropolis (or Metropolis-Hastings) algorithm moves around the energy surface in such a way that it spends more time at low-energy (high probability) regions than at high-energy (low probability) regions Nice simulation at http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html Effect of step size Rejected samples are wastes of time
Methods There are a variety of methods of guiding the search process Clever ways of adapting the size of the differences between current and new choices (many variations, Stan automates their selection) Gibbs sampling: focus on one dimension at a time Stan (Hamiltonian Monte Carlo): treat the parameter vector as a physical item that moves around the space of possible parameter vectors Velocity Momentum Use it to pick a “new” choice for consideration Often faster and more efficient
Hamiltonian MC Each step is more complicated to calculate (generate a trajectory of the “item”): leapfrog steps computes position and momentum But choices tend to be better overall (fewer rejected samples) http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html Effect of slide time Rejected samples are a waste of time (and rare)! NUTS stands for No U-Turn Sampler It is how the HMC plans (stops) the trajectory At the start of each chain we see something like: Gradient evaluation took 0.00093 seconds 1000 transitions using 10 leapfrog steps per transition would take 9.3 seconds. Adjust your expectations accordingly! At the end of each model summary, we see: Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
Hamiltonian MC Step size is still an issue Stan tries to adjust the step size to fit to the situation But it can fail to do so in some cases (often when the step size is too big and misses some features of the posterior) Gets noticed at the very end (when it is too late to adjust) model4 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3) … Elapsed Time: 1.03518 seconds (Warm-up) 1.88005 seconds (Sampling) 2.91523 seconds (Total) Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 6: There were 11969 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 7: Examine the pairs() plot to diagnose sampling problems
Hamiltonian MC Step size is still an issue We can follow the warning’s advice Smaller number of divergent transitions Takes quite a bit longer model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.95)) … Elapsed Time: 6.89872 seconds (Warm-up) 3.51857 seconds (Sampling) 10.4173 seconds (Total) Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 6: There were 8570 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 7: Examine the pairs() plot to diagnose sampling problems
Hamiltonian MC We can follow the warning’s advice and use an even smaller adapt_delta Smaller number of divergent transitions Takes quite a bit longer New warning related to tree depth (you can set this variable too, but it often does not matter much) model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.99)) … Elapsed Time: 8.90083 seconds (Warm-up) 33.6789 seconds (Sampling) 42.5797 seconds (Total) Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 6: There were 11180 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 7: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 8: Examine the pairs() plot to diagnose sampling problems
Divergence Inconsistent estimates because the steps sizes are too big model3 <- brm(RT_ms ~ Target*NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3) launch_shinystan(model3) Launching ShinyStan interface... for large models this may take some time. Loading required package: shiny Listening on http://127.0.0.1:3422
Divergence Model3 (based on a normal distribution) is good Red dots indicate divergence (none here)
Divergence Model5 (based on an exgaussian distribution) looks bad Red dots indicate divergence (lots here)
Divergence Parameterizing the model in a different way sometimes helps In this particular case, fiddling with adapt_delta did not seem to help at all Oftentimes, this is a warning that your model specification is not a good one E.g., the exgaussian may be a poor choice to model this data E.g., your priors are making a good fit nearly impossible E.g., you are including independent variables that are correlated, which makes it difficult to converge on one solution Parameterizing the model in a different way sometimes helps Centering predictors (for example) Setting priors can help I was unable to get a good result for the exgaussian Perhaps because we have such a small data set (one subject)
Sample size Problem goes away if we use the full data set VSdata2<-subset(VSdata, VSdata$DistractorType=="Conjunction") model7 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3) … Elapsed Time: 78.118 seconds (Warm-up) 91.1596 seconds (Sampling) 169.278 seconds (Total) Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' > model7 Family: exgaussian Links: mu = identity; sigma = identity; beta = identity Formula: RT_ms ~ Target * NumberDistractors Data: VSdata2 (Number of observations: 2720) Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 1171.58 24.67 1123.30 1219.96 11870 1.00 TargetPresent -59.42 31.33 -121.29 1.84 12077 1.00 NumberDistractors 16.99 0.71 15.59 18.37 11178 1.00 TargetPresent:NumberDistractors -10.53 0.90 -12.31 -8.76 10916 1.00 Family Specific Parameters: sigma 350.19 12.22 326.73 374.00 13159 1.00 beta 544.63 17.63 510.65 579.21 12275 1.00 Problem goes away if we use the full data set
Priors help Suppose you use a broad (non-informative) prior Uniform distribution Suppose your search process starts in a region with flat posterior values So you always move to a new set of parameter values But you move aimlessly It takes a long time to sample from regions of higher probability
Priors help Even a slightly non-uniform prior helps get the search process moving If the prior moves you in the general direction of the posterior, it can help a lot It could hurt the search in some cases, but then you probably have other problems! Once you get close to where the data likelihood dominates, the prior will hardly matter
Using brms Standard brm call that we have used many times model1 = brm(Leniency ~ SmileType, data = SLdata, iter = 2000, warmup = 200, chains = 3, thin = 2 ) print(summary(model1)) plot(model1) Family: gaussian Links: mu = identity; sigma = identity Formula: Leniency ~ SmileType Data: SLdata (Number of observations: 136) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 5.36 0.29 4.80 5.93 2620 1.00 SmileTypeFelt -0.46 0.41 -1.25 0.34 2556 1.00 SmileTypeMiserable -0.45 0.40 -1.24 0.32 2672 1.00 SmileTypeNeutral -1.25 0.40 -2.02 -0.44 2433 1.00 Family Specific Parameters: sigma 1.65 0.10 1.46 1.87 2650 1.00 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). Standard brm call that we have used many times
Number of chains/iterations model2 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 1, thin = 2 ) print(summary(model2)) plot(model2) Family: gaussian Links: mu = identity; sigma = identity Formula: Leniency ~ SmileType Data: SLdata (Number of observations: 136) Samples: 1 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 90 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 5.35 0.31 4.78 5.92 59 0.99 SmileTypeFelt -0.43 0.43 -1.24 0.35 66 0.99 SmileTypeMiserable -0.44 0.42 -1.29 0.39 61 0.99 SmileTypeNeutral -1.27 0.41 -2.18 -0.56 90 0.99 Family Specific Parameters: sigma 1.62 0.10 1.43 1.81 90 0.99 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). Let’s break it down to smaller sets so we can understand what is happening
Chains Independent sampling “chains” are combined at the end to make a better estimate Why bother? Parallel processing If your computer has multiple processors, you can split up the chains across processors and get faster results Convergence checking Independent chains should converge to the same posterior distribution, if not then one or more chains may not be sampling well (and you should not trust the estimated posterior to be accurate) Rhat is a check on whether the chains converge
Number of chains/iterations model3 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2 ) print(summary(model3)) plot(model3) Family: gaussian Links: mu = identity; sigma = identity Formula: Leniency ~ SmileType Data: SLdata (Number of observations: 136) Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 5.35 0.43 4.70 5.93 87 1.05 SmileTypeFelt -0.46 0.39 -1.23 0.26 123 1.02 SmileTypeMiserable -0.47 0.43 -1.35 0.31 102 1.03 SmileTypeNeutral -1.24 0.43 -2.18 -0.51 102 1.01 Family Specific Parameters: sigma 1.67 0.21 1.46 1.90 123 1.00 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). Let’s break it down to smaller sets so we can understand what is happening
Rhat Similar to F value in an ANOVA B is the standard deviation between chains W is the standard deviation within chains (pooled across all chains) When This is evidence that the chains have not converged (R=1) does not necessarily mean the chains have converged In practice, we have confidence in convergence if Otherwise, increase number of iterations (or change the model, more informative priors)
Iterations Think of it as similar to sample size for a confidence interval A bigger sample size leads to smaller standard error of the mean, and so a tighter confidence interval How much you need depends on what questions you want to answer. Tails of a distribution are harder to estimate than the center of a distribution Rule of thumb: 4000 iterations to have reasonable accuracy for 2.5th and 97.5th percentiles Might do better with Stan, though
Autocorrelation The nature of MCMC methods is that one takes a new sample “close” to the current sample Over the short-run, this can lead to correlations between samples Makes estimating the posterior imprecise It sorts itself out over the long-run Lots of tools to look for autocorrelation > launch_shinystan(model3) Launching ShinyStan interface... for large models this may take some time. Loading required package: shiny Listening on http://127.0.0.1:3422
Autocorrelation An autocorrelation plot compares correlation to “lag” (number of iterations away) Plot should drop dramatically as lag increases If not, then sampler might be caught in a “loop” Autocorrelation plots look fine for model3
Autocorrelation You want the bars to drop quickly as lag increases This would indicate low autocorrelation Bad Better, but still bad Good
Autocorrelation A lot of your intuitions about sampling come in to play here Ideally, you want independent samples Suppose you were estimating the mean, the standard error of the estimate from the posterior distribution would be Where sigma_p is the standard deviation of the posterior distribution and T is the sample size (number of iterations times number of chains) Correlated samples provide less information about your estimate
Effective Sample Size If there is correlation in the samples, we overestimate the standard error The loss of information from autocorrelation is expressed as an “effective sample size” You want to be sure your T* is big enough to properly estimate your parameters This is a crude measure, so plan accordingly
Thinning Should you have an autocorrelation problem, you can try to mitigate it by “thinning” the samples Just keep every nth sample You can also just increase the number of iterations model5 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2 ) print(summary(model5)) Family: gaussian Links: mu = identity; sigma = identity Formula: Leniency ~ SmileType Data: SLdata (Number of observations: 136) Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 5.30 0.33 4.66 5.81 58 1.01 SmileTypeFelt -0.38 0.45 -1.19 0.49 30 1.06 SmileTypeMiserable -0.38 0.43 -1.39 0.48 84 0.99 SmileTypeNeutral -1.20 0.44 -2.10 -0.44 74 1.00 Family Specific Parameters: sigma 1.65 0.10 1.50 1.85 145 1.01 model4 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 1 ) print(summary(model4)) Family: gaussian Links: mu = identity; sigma = identity Formula: Leniency ~ SmileType Data: SLdata (Number of observations: 136) Samples: 2 chains, each with iter = 200; warmup = 20; thin = 1; total post-warmup samples = 360 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 5.40 0.29 4.78 5.92 129 1.00 SmileTypeFelt -0.51 0.42 -1.34 0.29 144 1.00 SmileTypeMiserable -0.49 0.41 -1.28 0.31 133 1.00 SmileTypeNeutral -1.30 0.41 -2.10 -0.54 128 1.01 Family Specific Parameters: sigma 1.65 0.10 1.45 1.84 360 1.00
Conclusions STAN Monte Carlo Markov Chains Hamiltonian MC Convergence Most of this happens in the background and you do not need to be an expert