How to get the most out of null results using Bayes Zoltán Dienes.

The problem: Does a non-significant result count as evidence for the null hypothesis or as no evidence either way?

Geoff Cumming:

The solutions:
1. Power
2. Interval estimates
3. Bayes Factors

Problems with power:
I) Power depends on specifying the minimal effect of interest (which may be poorly specified by the theory)
II) Power cannot make use of your actual data to determine the sensitivity of those data

Confidence intervals solve the second problem. Bayes factors can solve both problems: by making use of the full range of predictions of the theory, a Bayes factor makes maximal use of the data in assessing how well the data distinguish your theory from the null. A Bayes factor can show strong evidence for the null hypothesis over your theory even when it is impossible to say anything using power or confidence intervals.

The four principles of inference by intervals (given a null region around 0, bounded by the minimal interesting difference between means):
1. If the 95% confidence/credibility/likelihood interval is completely contained in the null region, conclude there is good evidence that the population value lies in the null region – accept the null region hypothesis.
2. If the interval is completely outside this region, conclude there is good evidence that the population value lies outside the null region – reject the null region hypothesis.
3. If the upper limit of the interval is below the minimal interesting value, conclude there is evidence against a theory postulating a positive difference.
4. If the interval includes both null and theoretically interesting values, the data are insensitive.
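These four interval rules can be written as a small decision function. A minimal sketch; the function name, signature, and return strings are mine, not part of the slides:

```python
def interval_inference(lower, upper, null_lo, null_hi, min_interesting):
    """Classify a 95% interval (lower, upper) against a null region
    [null_lo, null_hi] and a minimal theoretically interesting value."""
    # 1. Interval completely inside the null region: accept it
    if null_lo <= lower and upper <= null_hi:
        return "accept null region hypothesis"
    # 2. Interval completely outside the null region: reject it
    if upper < null_lo or lower > null_hi:
        return "reject null region hypothesis"
    # 3. Upper limit below the minimal interesting value
    if upper < min_interesting:
        return "evidence against a theory postulating a positive difference"
    # 4. Interval straddles null and interesting values
    return "data insensitive"
```

For example, with a null region of [-1, 1] and a minimal interesting value of 2, an interval of (-0.5, 3) spans both null and interesting values and so counts as insensitive.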

The Bayes Factor

Cat hypothesis Devil hypothesis

If a cat, you lose your finger only 1/10 of the time.
If a devil, you will lose your finger 9/10 of the time.

Evidence supports the theory that most strongly predicted it

John puts his hand in the box and loses a finger. Which hypothesis is most strongly supported, the cat hypothesis or the devil hypothesis?

Evidence supports a theory that most strongly predicted it. John puts his hand in the box and loses a finger. Which hypothesis is most strongly supported, the cat hypothesis or the devil hypothesis?
Cat hypothesis predicts this result with probability = 1/10
Devil hypothesis predicts this result with probability = 9/10

Evidence supports a theory that most strongly predicted it. John puts his hand in the box and loses a finger. Which hypothesis is most strongly supported, the cat hypothesis or the devil hypothesis?
Cat hypothesis predicts this result with probability = 1/10
Devil hypothesis predicts this result with probability = 9/10
Strength of evidence for devil over cat hypothesis = (9/10) / (1/10) = 9

The evidence is nine times as strong for the devil over the cat hypothesis OR Bayes Factor (B) = 9
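The arithmetic of the cat/devil example can be written out in a few lines. A minimal sketch; the variable names are mine:

```python
# The cat/devil box as arithmetic: the Bayes factor is the ratio of
# how strongly each hypothesis predicted the observed outcome.
p_lose_given_cat = 1 / 10    # a cat takes your finger 1/10 of the time
p_lose_given_devil = 9 / 10  # a devil takes it 9/10 of the time

# John loses a finger: evidence for devil over cat
bf_devil_over_cat = p_lose_given_devil / p_lose_given_cat    # = 9

# John keeps his finger: the same ratio now favours the cat
bf_cat_over_devil = (1 - p_lose_given_cat) / (1 - p_lose_given_devil)  # = 9
```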

Consider: John does not lose a finger

Consider: John does not lose a finger. Now the evidence strongly supports the cat over the devil hypothesis (BF = 9 for cat over devil, or 1/9 for devil over cat).

Probability of losing finger given cat = 4/10
Probability of losing finger given devil = 6/10
Now if John loses a finger, the strength of evidence for devil over cat = (6/10) / (4/10) = 1.5
Not very strong

We can distinguish: Evidence for cat hypothesis over devil Evidence for devil hypothesis over cat Not much evidence either way.

The Bayes factor tells you how strongly the data are predicted by the different theories (e.g. your pet theory versus the null hypothesis):

B = probability of your data given your pet theory, divided by the probability of your data given the null hypothesis

If B is greater than 1, the data support your theory over the null.
If B is less than 1, the data support the null over your theory.
If B is about 1, the experiment was not sensitive.
(You automatically get a notion of sensitivity; contrast just relying on p values in significance testing.)
Jeffreys, 1961: Bayes factors more than 3 or less than 1/3 are substantial.
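Jeffreys' reading of B can be expressed as a small helper. A sketch; the thresholds of 3 and 1/3 come from the slide, but the function name and wording of the verdicts are mine:

```python
def interpret_bayes_factor(b, threshold=3):
    """Jeffreys-style reading of a Bayes factor B for theory over null."""
    if b >= threshold:
        return "substantial evidence for the theory over the null"
    if b <= 1 / threshold:
        return "substantial evidence for the null over the theory"
    return "data insensitive: not much evidence either way"
```

So B = 9 (the devil case above) counts as substantial, while B = 1.5 does not.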

To know which theory the data support, you need to know what the theories predict. The null is normally the prediction of e.g. no difference. [Figure: plausibility as a function of the population difference between conditions; on the null hypothesis only the value 0 is plausible.]

To know which theory the data support, you need to know what the theories predict. The null is normally the prediction of e.g. no difference. You need to decide what difference or range of differences is consistent with your theory. Difficult – but it forces one to think clearly about one's theory.

To calculate a Bayes factor you must decide what range of differences is predicted by the theory:
1) Uniform distribution
2) Half-normal
3) Normal

Example: the theory predicts a difference in one direction, and subjects give 0-8 ratings in two conditions, so there is a maximum difference allowed. [Figure: uniform plausibility over the population difference in means between conditions, from 0 up to the maximum allowed difference.]

It seems more plausible to think the larger effects are less likely than the smaller ones. But how to scale the rate of drop? [Figure: plausibility declining from 0 with the population difference in means between conditions.]

[Figure: half-normal plausibility over the population difference in means between conditions, with SD = 4.] Implies: smaller effects are more likely than bigger ones; effects bigger than 8 very unlikely.

Similar sorts of effects to those predicted in the past have been on the order of a 5% difference between conditions. [Figure: half-normal plausibility over the population difference in means between conditions, with SD = 5%.] Implies: smaller effects are more likely than bigger ones; effects bigger than 10% very unlikely.

[Figure: plausibility as a function of the difference between conditions.]

To calculate a Bayes factor in a t-test situation you need the same information from the data as for a t-test:
Mean difference, Mdiff
SE of difference, SEdiff

Note: t = Mdiff / SEdiff  =>  SEdiff = Mdiff / t
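The note above means SEdiff can be recovered from a reported t value even when it is not reported directly. A one-line sketch with hypothetical numbers (the values below are for illustration only):

```python
# Recovering SEdiff from a reported t value and mean difference,
# using t = Mdiff / SEdiff  =>  SEdiff = Mdiff / t.
# (Hypothetical numbers, not from any study.)
m_diff = 6.0     # reported mean difference
t_value = 0.857  # reported t statistic
se_diff = m_diff / t_value  # ≈ 7.0
```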

To calculate a Bayes factor:
1) Google "Zoltan Dienes"
2) The first site to come up is the right one
3) Click on the link to the book
4) Click on the link to Chapter Four
5) Scroll down and click on "Click here to calculate your Bayes factor!"

Bayes: the tai chi of the Bayes factors. p: the dance of the p values.

A Bayes Factor requires establishing predicted effect sizes. How? Do digit-colour synesthetes show a Stroop effect on digits? You display: 3 … 4 … 5 … 6 What they see: 3 … 4 … 5 … 6 You get a null effect (incongruent minus congruent RTs)... What size effect would be predicted if there were one?

A Bayes Factor requires establishing predicted effect sizes. Do digit-colour synesthetes show a Stroop effect on digits? You display: 3 … 4 … 5 … 6. What they see: 3 … 4 … 5 … 6. You get a null effect (incongruent minus congruent RTs)... What size effect would be predicted if there were one? Run normals on a condition in which the digits are coloured in the way synesthetes say they are. That Stroop effect is presumably the maximum one could expect synesthetes to show. Use a uniform from 0 up to the effect for normals with real colours. [Figure: uniform plausibility over possible population Stroop effects.]

Another condition in your experiment might help settle expectations: Jiang et al 2012 Obtained significant amount of unconscious knowledge (5%) Conscious knowledge was 6% with a SE of 7%

Another condition in your experiment might help settle expectations: Jiang et al 2012 Obtained significant amount of unconscious knowledge (5%) Conscious knowledge was 6% with a SE of 7% (non-significant) To assess meaning of non-significant result, used a half-normal with SD = 5% BF = 1.25
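The Jiang et al. calculation can be sketched numerically: summarise the data by a normal likelihood centred on the observed difference, model the theory's predictions as a half-normal over positive effects, and average the likelihood over those predictions. The function name and the midpoint-rule integration are mine, so the result may differ slightly from the online calculator's B = 1.25:

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf_half_normal(m_diff, se_diff, h1_sd, steps=20000):
    """Bayes factor for a theory modelled as a half-normal over positive
    effects (SD = h1_sd) versus a point null, with the data summarised
    by a normal likelihood centred on m_diff with spread se_diff."""
    # Likelihood of the observed difference under the point null (effect = 0)
    likelihood_null = normal_pdf(m_diff, 0.0, se_diff)

    # Likelihood under the theory: average the likelihood over the
    # half-normal distribution of predicted effects (midpoint rule)
    upper = 5 * h1_sd  # the half-normal is negligible beyond ~5 SDs
    dx = upper / steps
    likelihood_theory = 0.0
    for i in range(steps):
        d = (i + 0.5) * dx
        weight = 2 * normal_pdf(d, 0.0, h1_sd)  # half-normal density
        likelihood_theory += weight * normal_pdf(m_diff, d, se_diff) * dx

    return likelihood_theory / likelihood_null

# The slide's numbers: conscious knowledge M = 6%, SE = 7%,
# theory modelled as a half-normal with SD = 5%
b = bf_half_normal(6.0, 7.0, 5.0)
```

With these numbers the integral comes out close to the slide's B = 1.25: well between 1/3 and 3, so the non-significant result is simply insensitive.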

Another group in your experiment might help settle expectations: Jiang et al 2012 Obtained significant amount of unconscious knowledge (5%) Conscious knowledge was 6% with a SE of 7% Used a half-normal with SD = 5% BF = 1.25 Nothing follows about whether subjects had conscious knowledge or not

If you have a manipulation meant to reduce an effect, effect of manipulation unlikely to be larger than the basic effect e.g. Dienes, Baddeley & Jansari (2012) Predicted sad mood would reduce learning compared to neutral mood So e.g. if on 2-alternative forced choice test, in neutral condition people get 70% correct

If you have a manipulation meant to reduce an effect, the effect of the manipulation is unlikely to be larger than the basic effect. E.g. Dienes, Baddeley & Jansari (2012) predicted sad mood would reduce learning compared to neutral mood. So e.g. if on a 2-alternative forced choice test people in the neutral condition get 70% correct, the sad condition is expected to be somewhere between 50 and 70%. So the effect of mood must be somewhere between 0 and 20 percentage points.
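Since chance on a 2-alternative test is 50% and the neutral condition is at 70%, the mood effect is bounded between 0 and 20 percentage points, which suggests a uniform model of the predictions. The sketch below uses entirely made-up data values for the observed drop and its SE (the study's actual numbers are not given on the slide), and the function name is mine:

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf_uniform(m_diff, se_diff, lower, upper, steps=20000):
    """Bayes factor for a theory that predicts the effect is uniformly
    distributed between lower and upper, versus a point null
    (normal likelihood assumed, midpoint-rule integration)."""
    likelihood_null = normal_pdf(m_diff, 0.0, se_diff)
    dx = (upper - lower) / steps
    likelihood_theory = 0.0
    for i in range(steps):
        d = lower + (i + 0.5) * dx
        likelihood_theory += (1.0 / (upper - lower)) * normal_pdf(m_diff, d, se_diff) * dx
    return likelihood_theory / likelihood_null

# Hypothetical numbers (not from the study): an observed 2-point drop
# with SE 6, against a theory allowing a 0-20 point reduction
b = bf_uniform(2.0, 6.0, 0.0, 20.0)
```

With these made-up numbers B comes out near 0.5: leaning towards the null, but not decisive either way.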

My typical practice:
If I can think of a way of determining an approximate expected size of the effect => use a half-normal with SD equal to that typical size.
If I can think of a way of determining an approximate upper limit of the effect => use a uniform from 0 to that limit.

Moral and inferential paradoxes of orthodoxy:
1. On the orthodox approach, standardly you should plan in advance how many subjects you will run. If you just miss out on a significant result, you are not allowed to just run 10 more subjects and test again. You are not allowed to run until you get a significant result.
Bayes: It does not matter when you decide to stop running subjects. You can always run more subjects if you think it will help.

Moral paradox: if p = .07 after running the planned number of subjects,
i) if you run more and report significance at 5%, you have cheated;
ii) if you don't run more and bin the results, you have wasted taxpayers' money and your time, and wasted relevant data.
You are morally damned either way.
Inferential paradox: two people with the same data and theories could draw opposite conclusions.

Moral and inferential paradoxes of orthodoxy:
2. On the orthodox approach, it matters whether you formulated your hypothesis before or after looking at the data (post hoc vs planned comparisons). Predictions made in advance of, rather than after, looking at the data are treated differently.
Bayesian inference: it does not matter what day of the week you thought of your theory. The evidence for your theory is just as strong regardless of its timing.

Moral and inferential paradoxes of orthodoxy:
3. On the orthodox approach, you must correct for how many tests you conduct in total. For example, if you ran 100 correlations and 4 were just significant, researchers would not try to interpret those significant results.
On Bayes, it does not matter how many other statistical hypotheses you investigated (or your RA investigated without telling you). All that matters is the data relevant to each hypothesis under investigation.

For orthodoxy but not Bayes: different people with the same data and theories can come to different conclusions. You can thus be tempted to make false (albeit inferentially irrelevant) claims, such as about when you thought of your theory.

What is the aim of statistics?
1) To control the proportion of errors you make in the long run in accepting and rejecting hypotheses (conventional statistics)
2) To indicate how strong the evidence is for one hypothesis rather than another / how much you should change your confidence in one hypothesis rather than another (Bayesian statistics)

Dienes 2011 Perspectives on Psychological Science