Four reasons to prefer Bayesian over orthodox statistics

Presentation transcript:

Four reasons to prefer Bayesian over orthodox statistics
Zoltán Dienes
Harold Jeffreys, 1891-1989

Three possible states of evidence: no evidence to speak of, evidence for H1, evidence for H0.
P-values make only a two-way distinction: evidence for H1 versus everything else. "No evidence to speak of" and "evidence for H0" fall in the same box.
NO MATTER WHAT THE P-VALUE, NO DISTINCTION IS MADE WITHIN THIS BOX

No inferential conclusion follows from a non-significant result in itself. But it is now easy to use Bayes to distinguish evidence for the null hypothesis from insensitive data.

The Bayes Factor: Strength of evidence for one theory versus another (e.g. H1 versus H0): The data are B times more likely on H1 than H0

From the axioms of probability:

$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_0)} \times \frac{P(H_1)}{P(H_0)}$$

Posterior confidence in H1 rather than H0 = Bayes factor × prior confidence in H1 rather than H0.
Defining strength of evidence by the amount one's belief ought to change, the Bayes factor is a measure of strength of evidence.
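The updating rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the example numbers (even prior odds, B = 3) are assumptions chosen for the sketch, not values from the talk.

```python
def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Posterior odds of H1 over H0 = Bayes factor * prior odds."""
    return bayes_factor * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds of H1 over H0 into the probability of H1."""
    return odds / (1.0 + odds)


# Hypothetical example: even prior odds (1:1) and a Bayes factor of 3.
odds = posterior_odds(bayes_factor=3.0, prior_odds=1.0)
print(odds, odds_to_probability(odds))
```

The Bayes factor itself does not depend on the prior odds; two readers with different priors still agree on how far the data should move them.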

If B is about 1, the experiment was not sensitive.
If B > 1, the data supported your theory over the null.
If B < 1, the data supported the null over your theory.
Jeffreys, 1939: Bayes factors of more than 3 are worth taking note of.
B > 3: noticeable support for theory
B < 1/3: noticeable support for null

Bayes factors make the three-way distinction:
B from 0 to 1/3: evidence for H0
B from 1/3 to 3: no evidence to speak of
B above 3: evidence for H1
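As a minimal sketch, the three-way distinction can be written as a small function. The function name is hypothetical, and the thresholds are treated here as hard cut-offs purely for illustration, although B is a continuous measure.

```python
def classify_bayes_factor(b: float) -> str:
    """Map a Bayes factor onto the three-way distinction (thresholds 1/3 and 3)."""
    if b <= 0:
        raise ValueError("A Bayes factor must be positive")
    if b > 3:
        return "evidence for H1"
    if b < 1 / 3:
        return "evidence for H0"
    return "no evidence to speak of"


# Bayes factors of the kind reported in replication examples.
for b in (4.50, 0.97, 0.04):
    print(b, "->", classify_bayes_factor(b))
```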

The symmetry of B (and not p) means:
Can get evidence for H0 just as much as for H1
- helps against publication bias
- people claim they have evidence against H1 only if they have such evidence
Can run until evidence is strong enough (optional stopping is no longer a QRP, i.e. a questionable research practice)
Less pressure to B-hack, and when it occurs it can go in either direction

A model of H0
A model of the data
A model of H1

How do we model the predictions of H1? How do we derive predictions from a theory?
Theory + assumptions => predictions
Want assumptions that are a) informed; and b) simple
[Figure: model of predictions; x-axis: magnitude of effect, y-axis: plausibility]

Example
Initial study: flashing the word “steep” makes people walk 5 seconds more slowly down a fixed length of corridor (20 versus 25 seconds).
Follow-up study: flashes the word “elderly.” What size of effect could be expected?

Some points to consider:
- Reproducibility Project (OSF, 2015): published studies tend to have larger effect sizes than unbiased direct replications.
- Many studies publicise effect sizes of around a Cohen’s d of 0.5 (Kühberger et al., 2014), but getting effect sizes above a d of 1 is very difficult (Simmons et al., 2013).
- Examples from Simmons et al.:
  IV gender, DV "How many pairs of shoes do you own?": d = 1.07
  IV "Do you like spicy food?", DV "How much do you like Indian food?": d = 0.80
  IV gender, DV height in inches: d = 1.85
- Average ES in social psych: r = .21, or d = 0.43 (Richard, Bond, & Stokes-Zoota, 2003)
[Figure: replication effect size against original effect size; panels: Psychology, Behavioural economics]
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363. doi:10.1037/1089-2680.7.4.331

Assume the measured effect size is roughly the right scale of effect.
Assume the rough maximum is about twice that size.
Assume smaller effects are more likely than bigger ones.
=> Rule of thumb: if the initial raw effect is E, then model H1 as a half-normal with SD = E
[Figure: half-normal model of H1; x-axis: possible population mean differences, y-axis: plausibility]
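A minimal sketch of a Bayes factor under this rule of thumb follows. It assumes the data are summarised by an observed mean difference and its standard error with a normal likelihood, H0 is a point null at 0, and H1 is a half-normal with mode 0 and SD equal to the original effect E. The function names are hypothetical, and the closed form used is the standard result for a normal likelihood combined with a normal prior truncated at zero; it is not taken from the slides.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bf_half_normal(obs: float, se: float, scale: float) -> float:
    """Bayes factor for H1 (half-normal, mode 0, SD = scale) over H0 (effect = 0).

    obs is the observed raw effect, se its standard error.
    """
    total_var = scale ** 2 + se ** 2
    # Integrating the normal likelihood against the half-normal prior gives
    # 2 * N(obs; 0, total_var) * P(effect > 0 | data, untruncated normal prior).
    post_mean = obs * scale ** 2 / total_var
    post_sd = math.sqrt(scale ** 2 * se ** 2 / total_var)
    like_h1 = (2.0 * normal_pdf(obs, 0.0, math.sqrt(total_var))
               * normal_cdf(post_mean / post_sd))
    like_h0 = normal_pdf(obs, 0.0, se)
    return like_h1 / like_h0


# Corridor example: original raw effect E = 5 seconds, so scale = 5.
# The follow-up numbers below are hypothetical, chosen only to show the call.
print(bf_half_normal(obs=4.0, se=2.0, scale=5.0))
# Very noisy data cannot discriminate the models, so B comes out close to 1.
print(bf_half_normal(obs=0.0, se=1000.0, scale=5.0))
```

Only the observed effect, its standard error and the scale E are needed, which is why the rule of thumb makes the calculation easy to apply to published summary statistics.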

0. Often significance testing will provide adequate answers

Shih, Pittinsky, and Ambady (1999): Asian American women primed with an Asian identity will perform better on a maths test than those primed with a female identity. M = 11%, t(29) = 2.02, p = .053
Gibson, Losee, and Vitiello (2014): M = 12%, t(81) = 2.40, p = .02
BH(0, 11) = 4.50 (Bayes factor with H1 modelled as a half-normal with mode 0 and SD 11, the original effect)

Williams and Bargh (2008; study 2) asked 53 people to feel a hot or a cold therapeutic pack and then choose between a treat for themselves or for a friend.

          Selfish treat   Prosocial
Cold      75%             25%
Warmth    46%             54%

Ln OR = 1.26
Lynott, Corker, Wortman, Connell et al. (2014): N = 861 people, ln OR = -0.26, p = .062
BH(0, 1.26) = 0.04

Often Bayes and orthodoxy agree

1. A high-powered non-significant result is not necessarily evidence for H0

Banerjee, Chatterjee, & Sinha, 2012, Study 2: recall unethical deeds 74
Mean difference = 13.30, t(72) = 2.70, p = .01; Banerjee SE = 4.93
Brandt et al. (2014, lab replication): N = 121, power > 0.9; t(119) = 0.17, p = 0.87
[Figure: effect-size axis marking the effect size for H0, the sample mean (5.47), and the estimated effect size for H1 (13.30)]
BH(0, 13.3) = 0.97

A high-powered non-significant result is not in itself evidence for the null hypothesis. To know how much evidence you have for a point null hypothesis, you must use a Bayes factor.

2. A low-powered non-significant result is not necessarily insensitive

Shih, Pittinsky, and Ambady (1999): Asian American women primed with an Asian identity will perform better on a maths test than unprimed women. Mean diff = 5%
Moon and Roeder (2014): ≈50 subjects in each group; power = 24%. M = -4%, t(99) = 1.15, p = 0.25
BH(0, 5) = 0.31
NB: A mean difference in the wrong direction does not necessarily count against a theory. If the SE were twice as large, then t(99) = 0.58, p = .57, BH(0, 5) = 0.63
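The effect of doubling the standard error can be checked with a short sketch. It assumes a normal likelihood, takes the replication's standard error as |M| / t = 4 / 1.15 (a value derived here from the reported statistics, not stated in the talk), and models H1 as a half-normal with SD 5. The function names are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bf_half_normal(obs: float, se: float, scale: float) -> float:
    """Bayes factor for H1 (half-normal, mode 0, SD = scale) over H0 (effect = 0)."""
    total_var = scale ** 2 + se ** 2
    post_mean = obs * scale ** 2 / total_var
    post_sd = math.sqrt(scale ** 2 * se ** 2 / total_var)
    like_h1 = (2.0 * normal_pdf(obs, 0.0, math.sqrt(total_var))
               * normal_cdf(post_mean / post_sd))
    return like_h1 / normal_pdf(obs, 0.0, se)


mean_diff = -4.0               # replication effect, in the wrong direction
se = abs(mean_diff) / 1.15     # assumption: SE recovered from t = 1.15

b_reported = bf_half_normal(mean_diff, se, scale=5.0)
b_doubled_se = bf_half_normal(mean_diff, 2.0 * se, scale=5.0)
print(round(b_reported, 2), round(b_doubled_se, 2))
```

Under these assumptions the two values should come out close to the 0.31 and 0.63 quoted above: the same wrong-direction difference counts for much less once the estimate is noisier.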

The strength of evidence should depend on whether the difference goes in the predicted direction or not. Yet a difference in the wrong direction cannot automatically count as strong evidence.

3. A high-powered significant result is not necessarily evidence for a theory

[Figure: all conceivable outcomes; outcomes allowed by theory 1; outcomes allowed by theory 2]
It should be harder to obtain evidence for a vague theory than for a precise theory, even when predictions are confirmed. A theory should be punished for being vague.
A just-significant result cannot provide a constant amount of evidence for an H1 over H0; the relative strength of evidence must depend on the H1.

Williams and Bargh (2008; study 2) asked 53 people to feel a hot or a cold therapeutic pack and then choose between a treat for themselves or for a friend.

          Selfish treat   Prosocial
Cold      75%             25%
Warmth    46%             54%

Ln OR = 1.26
Lynott, Corker, Wortman, Connell et al. (2014): N = 861 people, ln OR = -0.26, p = .062

Counterfactually, ln OR = +0.28, p < .05:

          Selfish treat   Prosocial
Cold      53.5%           46.5%
Warmth    46.5%           53.5%

Williams and Bargh (2008; study 2): ln OR = 1.26
Replication (counterfactual), N = 861: ln OR = +0.28, p < .05
[Figure: effect-size axis marking the effect size for H0 and the estimated effect size for H1 (1.26)]
BH(0, 1.26) = 1.56
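A rough sketch of why a just-significant result gives different amounts of evidence under different models of H1. It assumes a normal likelihood and a standard error of 0.14 for the counterfactual ln OR of 0.28; that SE is an assumption chosen so that p is just below .05, and it is not given in the talk. The "precise" H1 with scale 0.28 is likewise hypothetical, included only for contrast with the scale of 1.26 taken from the original study.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bf_half_normal(obs: float, se: float, scale: float) -> float:
    """Bayes factor for H1 (half-normal, mode 0, SD = scale) over H0 (effect = 0)."""
    total_var = scale ** 2 + se ** 2
    post_mean = obs * scale ** 2 / total_var
    post_sd = math.sqrt(scale ** 2 * se ** 2 / total_var)
    like_h1 = (2.0 * normal_pdf(obs, 0.0, math.sqrt(total_var))
               * normal_cdf(post_mean / post_sd))
    return like_h1 / normal_pdf(obs, 0.0, se)


obs, se = 0.28, 0.14   # se is an assumed value, see the note above

# H1 scaled by the original effect (ln OR = 1.26): allows many large effects.
b_vague = bf_half_normal(obs, se, scale=1.26)
# Hypothetical H1 that predicts only small effects (scale 0.28).
b_precise = bf_half_normal(obs, se, scale=0.28)
print(round(b_vague, 2), round(b_precise, 2))
```

The same data and the same p-value yield a Bayes factor near 1.56 for the H1 that spreads its predictions widely and a clearly larger one for the H1 that concentrates on small effects.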

Vague theories should get less evidence from the same data than precise theories. Yet p-values cannot reflect this.

4. The answer to the question should depend on the question

Schnall, Benton, and Harvey (2008): People make less severe judgments on a 1 (perfectly OK) to 7 (extremely wrong) scale when they wash their hands after experiencing disgust (Exp. 2).
Mean difference for trolley problem = 1.11, SE = 0.43, t(41) = 2.57, p = .014
Brandt et al. (2014): N = 132, power > 0.99; M = 0.15, SE = 0.24, t(130) = 0.63, p = 0.53
[Figure: uniform model of H1 from 0 to 6]
BU[0,6] = 0.09

Schnall, Benton, and Harvey (2008), Exp. 2: mean difference = 1.11, SE = 0.43, t(41) = 2.57, p = .014
Brandt et al. (2014): N = 126, power > 0.99; M = 0.15, SE = 0.24, t(124) = 0.63, p = 0.53
[Figure: effect-size axis marking the effect size for H0 and the estimated effect size for H1 (1.11)]
BH(0, 1.11) = 0.37
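The two models of H1 for this replication can be compared directly in a short sketch. It assumes a normal likelihood with the reported M = 0.15 and SE = 0.24, a uniform H1 over the full range of the scale (0 to 6), and a half-normal H1 with SD equal to the original effect (1.11). The function names are hypothetical, and the closed forms are the standard ones for a normal likelihood.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bf_uniform(obs: float, se: float, lower: float, upper: float) -> float:
    """Bayes factor for H1 (effect uniform on [lower, upper]) over H0 (effect = 0)."""
    # Average of the likelihood over the uniform range of possible effects.
    like_h1 = (normal_cdf((upper - obs) / se)
               - normal_cdf((lower - obs) / se)) / (upper - lower)
    return like_h1 / normal_pdf(obs, 0.0, se)


def bf_half_normal(obs: float, se: float, scale: float) -> float:
    """Bayes factor for H1 (half-normal, mode 0, SD = scale) over H0 (effect = 0)."""
    total_var = scale ** 2 + se ** 2
    post_mean = obs * scale ** 2 / total_var
    post_sd = math.sqrt(scale ** 2 * se ** 2 / total_var)
    like_h1 = (2.0 * normal_pdf(obs, 0.0, math.sqrt(total_var))
               * normal_cdf(post_mean / post_sd))
    return like_h1 / normal_pdf(obs, 0.0, se)


obs, se = 0.15, 0.24   # replication mean difference and standard error

b_uniform = bf_uniform(obs, se, lower=0.0, upper=6.0)
b_half_normal = bf_half_normal(obs, se, scale=1.11)
print(round(b_uniform, 2), round(b_half_normal, 2))
```

Under these assumptions the results should land close to the 0.09 and 0.37 reported for the two models: the uniform spreads plausibility over effects as large as 6 scale points, so the small observed difference counts more heavily against it.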

Different models of H1 give different answers! What were they thinking of when they told us to use Bayes factors?

The half-normal answers the question of whether the replication can find an effect of the same order of size as the original, and that is the question of interest.

Main criticism of Bayes: different models of H1 give different answers.
Compare: different theories, or different assumptions connecting theory to predictions, make different predictions.
“It is sometimes considered a paradox that the answer depends not only on the observations but also on the question; it should be a platitude” (Jeffreys, 1939)

There is no algorithm for making predictions from theory. Just so, there is no algorithm for modelling theories. Modelling H1 means getting to know your literature and your theory. Doing Bayes just is doing science.

In sum:
P-values do not indicate evidence for H0
- not when power is high
- not when power is low
P-values do not provide evidence for H1 in ways sensitive to the properties of H1.
By contrast, Bayes factors provide a continuous measure of evidence motivated from first principles.