PSY 626: Bayesian Statistics for Psychological Science
Bayes Factors
Greg Francis
Fall 2018, Purdue University
2/17/2019
Hypothesis testing Suppose the null is true and check to see if a rare event has occurred e.g., does our random sample produce a t value that is in the tails of the null sampling distribution? If a rare event occurred, reject the null hypothesis
Hypothesis testing But what is the alternative? Typically: “anything goes” But that seems kind of unreasonable Maybe the “rare event” would be even less common if the null were not true!
Bayes Theorem
Conditional probabilities:
P(H | D) = P(D | H) P(H) / P(D)
Ratio
The ratio of posteriors conveniently cancels out P(D):
P(H1 | D) / P(H0 | D) = [P(D | H1) / P(D | H0)] x [P(H1) / P(H0)]
Posterior odds = Bayes factor x prior odds
Bayesian Model Selection
It's not really about hypotheses, but hypotheses suggest models
The Bayes Factor is often presented as BF12 = P(D | M1) / P(D | M2)
You could also compute BF21 = 1 / BF12
Posterior odds = Bayes factor x prior odds
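The "posterior odds = Bayes factor x prior odds" identity is just arithmetic; a minimal sketch with hypothetical numbers:

```python
# Posterior odds = Bayes factor x prior odds (all numbers hypothetical)
prior_odds = 0.5        # we initially favor H0 two-to-one: P(H1)/P(H0) = 0.5
bayes_factor_10 = 6.0   # data are 6 times more likely under H1 than under H0
posterior_odds = bayes_factor_10 * prior_odds          # = 3.0, now favoring H1
posterior_p_h1 = posterior_odds / (1 + posterior_odds) # convert odds to probability
print(posterior_odds, posterior_p_h1)  # 3.0 0.75
```

Note that even with a Bayes factor of 6 for the alternative, a skeptical enough prior could leave the posterior odds favoring the null.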
Bayes Factor
Evidence for the alternative hypothesis (or the null) is computed with the Bayes Factor (BF)
BF10 > 1 indicates that the data are evidence for the alternative, compared to the null
BF10 < 1 indicates that the data are evidence for the null, compared to the alternative
Bayes Factor
When BF10 = 2, the data are twice as likely under H1 as under H0
When BF01 = 2, the data are twice as likely under H0 as under H1
These interpretations do not require you to believe that one model is better than the other
You can still have priors that favor one model, regardless of the Bayes Factor
You would want to make important decisions based on the posterior
Still, if you consider both models to be plausible, then the priors should not be too different from each other
Rules of thumb
Evidence for the alternative hypothesis (or the null) is computed with the Bayes Factor (BF)

BF10         Interpretation
< 0.01       Decisive evidence for null
0.01 to 0.1  Strong evidence for null
0.1 to 0.3   Substantial evidence for null
0.3 to 1     Anecdotal evidence for null
1 to 3       Anecdotal evidence for alternative
3 to 10      Substantial evidence for alternative
10 to 100    Strong evidence for alternative
> 100        Decisive evidence for alternative
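These cutoffs are easy to encode; a small helper function (the name and structure are mine, the cutoffs are the table's):

```python
def bf_label(bf10):
    """Map a Bayes factor BF10 onto the rule-of-thumb labels from the table."""
    cuts = [(0.01,  "Decisive evidence for null"),
            (0.1,   "Strong evidence for null"),
            (0.3,   "Substantial evidence for null"),
            (1.0,   "Anecdotal evidence for null"),
            (3.0,   "Anecdotal evidence for alternative"),
            (10.0,  "Substantial evidence for alternative"),
            (100.0, "Strong evidence for alternative")]
    for cut, label in cuts:
        if bf10 < cut:
            return label
    return "Decisive evidence for alternative"

print(bf_label(5.0))   # Substantial evidence for alternative
print(bf_label(0.05))  # Strong evidence for null
```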
Similar to AIC
For a two-sample t-test, the null hypothesis (reduced model) is that score i from group s (1 or 2) is defined as
X_is = mu + e_is
with the same mean mu for each group s
AIC
For a two-sample t-test, the alternative hypothesis (full model) is that score i from group s (1 or 2) is defined as
X_is = mu_s + e_is
with a different mean mu_s for each group s
AIC AIC and its variants are a way of comparing model structures One mean or two means? Always uses maximum likelihood estimates of the parameters Bayesian approaches identify a posterior distribution of parameter values We should use that information!
Models of what?
We have been building models of trial-level scores

# Model without intercept (more natural)
model2 = brm(Leniency ~ 0 + SmileType, data = SLdata,
             iter = 2000, warmup = 200, chains = 3)
print(summary(model2))

GrandSE = 10
stanvars <- stanvar(GrandMean, name = 'GrandMean') +
            stanvar(GrandSE, name = 'GrandSE')
prs <- c(prior(normal(GrandMean, GrandSE), class = "b"))
model6 = brm(CorrectResponses ~ 0 + Dosage + (1 | SubjectID),
             data = ATdata, iter = 2000, warmup = 200, chains = 3,
             thin = 2, prior = prs, stanvars = stanvars)
print(summary(model6))
Models of what?
We have been building models of trial-level scores
That is not the only option
In traditional hypothesis testing, we care more about effect sizes than about individual scores: a signal-to-noise ratio
Of course, the effect size is derived from the individual scores
In many cases, it is enough to model a "sufficient" statistic rather than the individual scores: Cohen's d, the t-statistic, the p-value, the correlation r
Models of means It’s not really going to be practical, but let’s consider a case where we assume that the population variance is known (and equals 1) and we want to compare null and alternative hypotheses of fixed values
Models of means The likelihood of any given observed mean value is derived from the sampling distribution Suppose n=100 (one sample)
Models of means
The likelihood of any given observed mean value is derived from the sampling distribution
Suppose n=100 (one sample)
Suppose we observe a sample mean for which the data are more likely under the null than under the alternative
Models of means
The likelihood of any given observed mean value is derived from the sampling distribution
Suppose n=100 (one sample)
Suppose we observe a sample mean for which the data are more likely under the alternative than under the null
Bayes Factor
The ratio of the likelihood for the data under the null compared to the alternative (or the other way around)
Suppose we observe a sample mean for which the data are more likely under the alternative than under the null
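With known sigma = 1 and n = 100, the sampling distribution of the mean is normal with SE = 0.1, so a point-null vs. point-alternative Bayes factor is just a ratio of two normal densities. A sketch with hypothetical values (null mu = 0, alternative mu = 0.5, observed M = 0.1):

```python
import math

def norm_pdf(x, mu, sd):
    """Normal density: the likelihood of an observed mean under a sampling distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

n, sigma = 100, 1.0
se = sigma / math.sqrt(n)   # standard error of the mean = 0.1
m_obs = 0.1                 # hypothetical observed sample mean

lik_null = norm_pdf(m_obs, 0.0, se)  # likelihood under H0: mu = 0
lik_alt = norm_pdf(m_obs, 0.5, se)   # likelihood under H1: mu = 0.5
bf01 = lik_null / lik_alt            # evidence for the null over this alternative
print(bf01)  # about 1808: M is 1 SE from the null but 4 SEs from this alternative
```

With these numbers the observed mean is far more probable under the null, even though it is not exactly zero.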
Decision depends on alternative
The likelihood of any given observed mean value is derived from the sampling distribution
Suppose n=100 (one sample)
Suppose we observe a sample mean for which the data are more likely under the null than under the alternative
Decision depends on alternative The likelihood of any given observed mean value is derived from the sampling distribution Suppose n=100 (one sample)
Decision depends on alternative
For a fixed sample mean, evidence for the alternative only occurs for alternative population mean values within a certain range
For big alternative values, the observed sample mean is less likely than for a null population value
The sample mean may be unlikely under both models
[Figure from Rouder et al. (2009): evidence for null vs. evidence for alternative as a function of the mean of the alternative]
Models of means Typically, we do not hypothesize a specific value for the alternative, but a range of plausible values
Likelihoods For the null, we compute likelihood in the same way Suppose n=100 (one sample)
Likelihoods For the alternative, we have to consider each possible value of mu, compute the likelihood of the sample mean for that value, and then average across all possible values Suppose n=100 (one sample)
Average Likelihood
For the alternative, we have to consider each possible value of mu, compute the likelihood of the sample mean for that value, and then average across all possible values, weighting by the prior:
P(D | H1) = Integral of P(D | mu) p(mu) dmu
p(mu) is the prior for the value of mu; P(D | mu) is the likelihood for a given value of mu (from the sampling distribution)
Suppose n=100 (one sample)
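That average can be approximated by numerical integration. A sketch with hypothetical numbers (normal prior on mu with SD 0.3, SE = 0.1, observed M = 0.1); for a normal prior and a normal likelihood the answer also has a closed form, which serves as a check:

```python
import math

def norm_pdf(x, mu, sd):
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

se, m_obs = 0.1, 0.1   # standard error and hypothetical observed mean
prior_sd = 0.3         # hypothetical prior on mu: Normal(0, 0.3)

# Average the likelihood over the prior: P(D | H1) = integral of P(D | mu) p(mu) dmu
lo, hi, steps = -2.0, 2.0, 4000
dx = (hi - lo) / steps
avg_lik = sum(norm_pdf(m_obs, lo + (i + 0.5) * dx, se) *     # likelihood at this mu
              norm_pdf(lo + (i + 0.5) * dx, 0.0, prior_sd) * # prior weight for this mu
              dx
              for i in range(steps))

# Closed form: the marginal of M is Normal(0, sqrt(prior_sd^2 + se^2))
exact = norm_pdf(m_obs, 0.0, math.sqrt(prior_sd ** 2 + se ** 2))
print(avg_lik, exact)  # the two agree closely
```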
Bayes Factor
Ratio of the likelihood for the null compared to the (average) likelihood for the alternative:
BF01 = P(D | H0) / P(D | H1)
Uncertainty
The prior standard deviation for mu establishes a range of plausible values for mu
A narrow prior is less flexible; a broad prior is more flexible
Uncertainty
With a very narrow prior, you may not fit the data
Uncertainty
With a very broad prior, you will fit well for some values of mu and poorly for other values of mu
Uncertainty
Uncertainty in the prior functions similarly to the penalty for parameters in AIC
Penalty
Averaging acts like a penalty for extra parameters
[Figure from Rouder et al. (2009): evidence for null vs. evidence for alternative as a function of the width of the alternative prior]
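The penalty is easy to see numerically: when the observed mean sits right at the null value, a broad prior on mu wastes likelihood on values the data rule out, so the Bayes factor shifts toward the null. A sketch with hypothetical numbers (M = 0, SE = 0.1), using the closed form for a normal prior and normal likelihood:

```python
import math

def norm_pdf(x, mu, sd):
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def bf01(m_obs, se, prior_sd):
    """BF01 for a point null (mu = 0) vs. a Normal(0, prior_sd) alternative.
    The average likelihood under the alternative has a closed form here."""
    lik_null = norm_pdf(m_obs, 0.0, se)
    avg_lik_alt = norm_pdf(m_obs, 0.0, math.sqrt(prior_sd ** 2 + se ** 2))
    return lik_null / avg_lik_alt

narrow = bf01(0.0, 0.1, 0.1)  # narrow alternative prior
broad = bf01(0.0, 0.1, 2.0)   # broad alternative prior
print(narrow, broad)          # the broad prior yields much more evidence for the null
```

Same data, same null model: only the width of the alternative prior changed, and the evidence for the null grew with it.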
Models of effect size
Consider the case of a two-sample t-test
We often care about the standardized effect size
delta = (mu_1 - mu_2) / sigma
which we can estimate from data as
d = (M_1 - M_2) / s_pooled
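As a quick sketch of that estimate (the scores below are hypothetical):

```python
import math

def cohens_d(x, y):
    """Cohen's d: difference in sample means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance, group 1
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance, group 2
    s_pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / s_pooled

# Hypothetical scores from two groups
d = cohens_d([5.2, 4.8, 6.1, 5.5], [4.1, 4.6, 3.9, 4.4])
print(d)
```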
Models of effect size
If we were doing traditional hypothesis testing, we would compare a null model
H0: mu_1 = mu_2
against an alternative
H1: mu_1 != mu_2
Equivalent statements can be made using the standardized effect size (H0: delta = 0 vs. H1: delta != 0), as long as the standard deviation is not zero
Priors on effect size For the null, the prior is (again) a spike at zero
JZS Priors on effect size
For the alternative, a good choice is a Cauchy distribution (a t-distribution with df = 1)
JZS: Jeffreys, Zellner, Siow
Rouder et al. (2009)
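The Cauchy has much heavier tails than a normal distribution with comparable spread, so it stays open to large effect sizes. A quick density comparison (pure Python, scale r = 1):

```python
import math

def cauchy_pdf(x, r=1.0):
    """Cauchy density with scale r: a t-distribution with 1 degree of freedom."""
    return 1.0 / (math.pi * r * (1.0 + (x / r) ** 2))

def norm_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

print(cauchy_pdf(0.0), norm_pdf(0.0))  # the Cauchy is lower at the peak...
print(cauchy_pdf(4.0), norm_pdf(4.0))  # ...but far higher for large effect sizes
```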
JZS Priors on effect size
It is a good choice because the integration for the alternative hypothesis can be done numerically (the form given by Rouder et al., 2009, for scale r = 1):
BF01 = (1 + t^2/v)^[-(v+1)/2] / Integral over g from 0 to infinity of (1 + N g)^(-1/2) [1 + t^2/((1 + N g) v)]^[-(v+1)/2] (2 pi)^(-1/2) g^(-3/2) exp(-1/(2g)) dg
t is the t-value you use in a hypothesis test (from the data)
v is the "degrees of freedom" (from the data), and N is the (effective) sample size
This might not look easy, but it is simple to calculate with a computer
Variations of JZS Priors Scale parameter “r” Bigger values make for a broader prior More flexibility! More penalty!
Variations of JZS Priors
Medium: r = sqrt(2)/2
Wide: r = 1
Ultrawide: r = sqrt(2)
How do we use it?
Super easy
Rouder's web site: http://pcl.missouri.edu/bayesfactor
In R: library(BayesFactor)
How do we use it?

library(BayesFactor)
ttest.tstat(t = 2.2, n1 = 15, n2 = 15, simple = TRUE)
##      B10
## 1.993006
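The JZS integral behind this number can also be evaluated directly. A pure-Python sketch for the two-sample case (effective sample size N = n1 n2 / (n1 + n2), degrees of freedom v = n1 + n2 - 2), assuming the BayesFactor package's default scale r = sqrt(2)/2; it should land close to the B10 reported above:

```python
import math

def jzs_bf10(t, n1, n2, r=math.sqrt(2) / 2, steps=4000):
    """JZS Bayes factor for a two-sample t-test, following Rouder et al. (2009).
    Averages the likelihood over g (the variance of the effect-size prior)."""
    N = n1 * n2 / (n1 + n2)  # effective sample size
    v = n1 + n2 - 2          # degrees of freedom

    def integrand(g):
        a = (1 + N * g) ** -0.5
        b = (1 + t ** 2 / ((1 + N * g) * v)) ** (-(v + 1) / 2)
        # Inverse-gamma(1/2, r^2/2) prior on g (i.e., a Cauchy(0, r) prior on delta)
        prior = r / math.sqrt(2 * math.pi) * g ** -1.5 * math.exp(-r ** 2 / (2 * g))
        return a * b * prior

    # Substitute g = exp(u) so the peak of the integrand is well resolved
    lo, hi = -12.0, 8.0
    du = (hi - lo) / steps
    numer = sum(integrand(math.exp(lo + (i + 0.5) * du)) * math.exp(lo + (i + 0.5) * du) * du
                for i in range(steps))
    denom = (1 + t ** 2 / v) ** (-(v + 1) / 2)  # likelihood under the null
    return numer / denom

print(jzs_bf10(2.2, 15, 15))  # close to the B10 of 1.993006 from ttest.tstat
```

The function names and integration grid are mine; for real analyses, the BayesFactor package handles all of this for you.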
What does it mean?
Guidelines:

BF        Evidence
1 to 3    Anecdotal
3 to 10   Substantial
10 to 30  Strong
30 to 100 Very strong
> 100     Decisive
Conclusions
JZS Bayes Factors are easy to calculate, and the results are pretty easy to understand
The setup is a bit arbitrary: why not other priors? How do you pick the scale factor?
The criteria for interpretation are also arbitrary
Still, a fairly painless introduction to Bayesian methods