Bayes factors as a measure of strength of evidence in replication studies Zoltán Dienes.

No evidence to speak of | Evidence for H0 | Evidence for H1. P-values make a two-way distinction: NO MATTER WHAT THE P-VALUE, NO DISTINCTION IS MADE WITHIN THIS BOX (i.e. between “no evidence to speak of” and “evidence for H0”).

No inferential conclusion follows from a non-significant result in itself. But it is now easy to use Bayes and distinguish evidence for the null hypothesis from insensitive data.

The Bayes factor: the strength of evidence for one theory versus another (e.g. H1 versus H0). The data are B times more likely on H1 than on H0.

From the axioms of probability:

P(H1 | D) / P(H0 | D) = [P(D | H1) / P(D | H0)] × [P(H1) / P(H0)]

Posterior confidence in H1 rather than H0 = Bayes factor × prior confidence in H1 rather than H0. Defining strength of evidence by the amount one's belief ought to change, the Bayes factor is a measure of strength of evidence.
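In code, the update rule above is just a multiplication (a minimal sketch; the function name is my own):

```python
def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Posterior odds of H1 over H0 = Bayes factor x prior odds."""
    return bayes_factor * prior_odds

# Evidence of B = 4 applied to even prior odds (1:1) yields 4:1 on H1;
# the same evidence applied to sceptical prior odds of 1:10 yields 0.4:1.
print(posterior_odds(4, 1.0))   # 4.0
print(posterior_odds(4, 0.1))   # 0.4
```

Note that the same evidence (the same B) moves different priors by the same multiplicative amount, which is exactly why B, not the posterior, measures the strength of evidence.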

If B is about 1, the experiment was not sensitive. If B > 1, the data supported your theory over the null; if B < 1, the data supported the null over your theory. Jeffreys (1939): Bayes factors more than 3 are worth taking note of. B > 3: noticeable support for the theory; B < 1/3: noticeable support for the null.

Bayes factors make the three-way distinction: evidence for H0 (0 … 1/3), no evidence to speak of (1/3 … 3), evidence for H1 (3 … ∞).
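The three-way verdict can be captured in a few lines (a sketch using Jeffreys' conventional 3 and 1/3 thresholds; the function name is my own):

```python
def interpret_bf(b: float) -> str:
    """Classify a Bayes factor B (evidence for H1 over H0) by Jeffreys' thresholds."""
    if b > 3:
        return "noticeable support for H1"
    if b < 1 / 3:
        return "noticeable support for H0"
    return "no evidence to speak of"

print(interpret_bf(4.50))   # noticeable support for H1
print(interpret_bf(0.04))   # noticeable support for H0
print(interpret_bf(0.97))   # no evidence to speak of
```

The thresholds are conventions, not cliffs: B is a continuous measure, and 3 merely marks where the evidence becomes worth taking note of.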

[Figure: a model of H0, a model of the data, and a model of H1]

How do we model the predictions of H1? How do we derive predictions from a theory? Theory plus assumptions yield a model of predictions: a plausibility for each possible magnitude of effect. We want assumptions that are (a) informed and (b) simple.

Some points to consider:
1. Reproducibility Project (OSF, 2015): published studies tend to have larger effect sizes than unbiased direct replications.
2. Many studies publicise effect sizes of around a Cohen's d of 0.5 (Kühberger et al., 2014), but getting effect sizes above a d of 1 is very difficult (Simmons et al., 2013).
[Figure: original versus replication effect sizes, psychology and behavioural economics]

1. Assume a measured effect size is roughly the right scale of effect.
2. Assume the rough maximum is about twice that size.
3. Assume smaller effects are more likely than bigger ones.
=> Rule of thumb: if the initial raw effect is E, then model H1 as a half-normal distribution with SD = E.
[Figure: plausibility over possible population mean differences]
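The rule of thumb can be sketched numerically. The code below is my own illustration, not Dienes' calculator: it approximates the likelihood by a normal with the observed mean and standard error, models H1 as a half-normal with SD = E, and integrates on a grid. Published B values computed with Dienes' calculator may differ somewhat because of differing likelihood assumptions.

```python
import numpy as np

def bf_halfnormal(mean_obs, se_obs, h1_sd, n=200_000):
    """B_H(0, h1_sd): evidence for H1 (half-normal, SD = h1_sd) over a point H0.

    Assumes the data are summarised by a normal likelihood with mean
    `mean_obs` and standard deviation `se_obs` (a common approximation).
    """
    # Grid over non-negative population effects (half-normal support).
    theta = np.linspace(0.0, 10 * max(h1_sd, se_obs), n)
    dtheta = theta[1] - theta[0]

    # Half-normal model of H1, normalised on the grid.
    prior = np.exp(-theta**2 / (2 * h1_sd**2))
    prior /= prior.sum() * dtheta

    # Likelihood of the observed mean given each candidate effect
    # (normalising constants cancel in the ratio, so they are omitted).
    like = np.exp(-(mean_obs - theta)**2 / (2 * se_obs**2))

    marginal_h1 = (like * prior).sum() * dtheta        # P(D | H1)
    like_h0 = np.exp(-mean_obs**2 / (2 * se_obs**2))   # P(D | H0), effect = 0
    return marginal_h1 / like_h0

# A clearly non-null observation relative to its SE supports H1 ...
print(bf_halfnormal(12, 5, 11))    # ~8.9 under these assumptions
# ... a precise estimate near zero supports H0 ...
print(bf_halfnormal(0, 1, 10))     # ~0.10
# ... and a noisy estimate near zero is simply insensitive.
print(bf_halfnormal(0, 20, 10))    # ~0.89
```

With these inputs, B lands above 3 in the first case, below 1/3 in the second, and in the "no evidence to speak of" band in the third, reproducing the three-way distinction.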

0. Often significance testing will provide adequate answers

Shih, Pittinsky, and Ambady (1999): Asian American women primed with an Asian identity will perform better on a maths test than those primed with a female identity. M = 11%, t(29) = 2.02, p = .053. Gibson, Losee, and Vitiello (2014): M = 12%, t(81) = 2.40, p = .02; B_H(0, 11) = 4.50.

Williams and Bargh (2008, study 2) asked 53 people to feel a hot or a cold therapeutic pack and then choose between a treat for themselves or for a friend.

          selfish treat   prosocial treat
Cold      75%             25%
Warmth    46%             54%

Ln OR = 1.26. Lynott, Corker, Wortman, Connell et al. (2014): N = 861 people, ln OR = −0.26, p = .062; B_H(0, 1.26) = 0.04.
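The reported effect size can be checked from the table above (a quick sketch; the helper name is my own):

```python
import math

def ln_odds_ratio(p1, p2):
    """Log odds ratio for two proportions (e.g. selfish choices per condition)."""
    return math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))

# Cold: 75% selfish; Warmth: 46% selfish (Williams & Bargh, study 2).
print(round(ln_odds_ratio(0.75, 0.46), 2))   # 1.26
```

The half-normal rule of thumb then takes this 1.26 as the SD of the model of H1 for the replication.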

Often Bayes and orthodoxy agree

1. A high-powered non-significant result is not necessarily evidence for H0

Banerjee, Chatterjee, & Sinha (2012, study 2): recall unethical deeds, 74; recall ethical deeds, 88; mean difference = 13.30, t(72) = 2.70, p = .01. Brandt et al. (2014, lab replication): N = 121, power > 0.9; t(119) = 0.17, p = 0.87, B_H(0, 13.3) = 0.97. [Figure: effect size 0 for H0, estimated effect size for H1, and the sample mean]

A high-powered non-significant result is not in itself evidence for the null hypothesis. To know how much evidence you have for a point null hypothesis you must use a Bayes factor.

2. A low-powered non-significant result is not necessarily insensitive

Shih, Pittinsky, and Ambady (1999): Asian American women primed with an Asian identity will perform better on a maths test than unprimed women. Mean diff = 5%. Moon and Roeder (2014): ≈50 subjects in each group; power = 24%; M = −4%, t(99) = 1.15, p = 0.25; B_H(0, 5) = 0.31. NB: a mean difference in the wrong direction does not necessarily count against a theory. If the SE were twice as large, then t(99) = 0.58, p = .57, and B_H(0, 5) = 0.63.

The strength of evidence should depend on whether the difference goes in the predicted direction or not. YET a difference in the wrong direction cannot automatically count as strong evidence against the theory.

3. A high-powered significant result is not necessarily evidence for a theory

[Figure: all conceivable outcomes; outcomes allowed by theory 1; outcomes allowed by theory 2] It should be harder to obtain evidence for a vague theory than a precise theory, even when predictions are confirmed: a theory should be punished for being vague. A just-significant result cannot provide a constant amount of evidence for an H1 over H0; the relative strength of evidence must depend on the H1.
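The punishment for vagueness falls out of the arithmetic. The sketch below (my own illustration, using the same normal-likelihood approximation as before; the numbers are hypothetical) computes B for the same just-significant data twice: under a precise H1 (half-normal SD equal to the observed effect) and under a vague H1 (SD ten times larger):

```python
import numpy as np

def bf_halfnormal(mean_obs, se_obs, h1_sd, n=200_000):
    """B for a half-normal H1 (SD = h1_sd) vs a point H0; normal likelihood."""
    theta = np.linspace(0.0, 10 * max(h1_sd, se_obs), n)
    dtheta = theta[1] - theta[0]
    prior = np.exp(-theta**2 / (2 * h1_sd**2))       # half-normal model of H1
    prior /= prior.sum() * dtheta                    # normalise on the grid
    like = np.exp(-(mean_obs - theta)**2 / (2 * se_obs**2))
    return (like * prior).sum() * dtheta / np.exp(-mean_obs**2 / (2 * se_obs**2))

# Same just-significant data each time: mean = 2, SE = 1 (z = 2, p ~ .046).
print(bf_halfnormal(2, 1, 2))    # precise H1: ~4.3, noticeable support
print(bf_halfnormal(2, 1, 20))   # vague H1:   ~0.7, no evidence to speak of
```

The vague H1 spreads its predictions over many large effects that did not occur, so the same just-significant observation is relatively less likely under it; the p-value, identical in both cases, cannot register this.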

Williams and Bargh (2008, study 2): N = 53, ln OR = 1.26. Replication (Lynott et al.): N = 861. Counterfactually, a just-significant result, ln OR = 0.28, p < .05, would correspond to:

          selfish treat   prosocial treat
Cold      53.5%           46.5%
Warmth    46.5%           53.5%

Yet B_H(0, 1.26) = 1.56: no evidence to speak of. [Figure: effect size 0 for H0 and estimated effect size for H1]

Vague theories should get less evidence from the same data than precise theories. Yet p-values cannot reflect this.

Main criticism of Bayes: different models of H1 give different answers. Compare: different theories, or different assumptions connecting theory to predictions, make different predictions. “It is sometimes considered a paradox that the answer depends not only on the observations but also on the question; it should be a platitude” (Jeffreys, 1939).

There is no algorithm for making predictions from theory; just so, there is no algorithm for modelling theories. Modelling H1 means getting to know your literature and your theory. Doing Bayes just is doing science.

In sum: p-values do not indicate evidence for H0 (not when power is high, not when power is low), and p-values do not provide evidence for H1 in ways sensitive to the properties of H1. By contrast, Bayes factors provide a continuous measure of evidence motivated from first principles.

“Falsifying hypothesis” (e.g. washing hands affects a particular DV): a probability model; specifies the conditions under which a direct replication should succeed. More general theory (e.g. social and physical disgust are two variants of the same thing): specifies the conditions for obtaining conceptual replications.