Estimating the reproducibility of psychological science: accounting for the statistical significance of the original study. Robbie C. M. van Aert & Marcel A. L. M. van Assen.

Presentation transcript:

Estimating the reproducibility of psychological science: accounting for the statistical significance of the original study
Robbie C. M. van Aert & Marcel A. L. M. van Assen
Tilburg University & Utrecht University

Social Sciences Meta-Research Group

The Problem
Example (Maxwell et al., 2015, American Psychologist): independent-samples t-test
Original: d = 0.5, t(78) = 2.24, p =
Replication (power = .8 at d = 0.5): d = 0.23, t(170) = 1.50, p =
Conclusion?!?
An omnipresent and relevant problem: 61% in the Reproducibility Project: Psychology (RPP)
Questions considered relevant:
1) Does the effect exist? (0 or not)
2) What is the effect? (best guess)
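
The two p-values of this example follow from the reported t statistics and degrees of freedom. The sketch below is not part of the slides: it integrates the t density with Simpson's rule (a choice made here to stay within the standard library) and converts t to Cohen's d under the assumption of two equally sized groups. The function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3          # P(0 < T < |t|)
    return 1 - 2 * central_half


def d_from_t(t, df):
    """Cohen's d from an independent-samples t, ASSUMING equal group sizes."""
    n_per_group = (df + 2) / 2
    return t * math.sqrt(2 / n_per_group)


p_original = two_tailed_p(2.24, 78)
p_replication = two_tailed_p(1.50, 170)
d_original = d_from_t(2.24, 78)
d_replication = d_from_t(1.50, 170)

print(f"original:    d = {d_original:.2f}, p = {p_original:.3f}")
print(f"replication: d = {d_replication:.2f}, p = {p_replication:.3f}")
```

The recovered d values can be compared with the d = 0.5 and d = 0.23 reported on the slide as a check on the equal-groups assumption.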

Problem and Solution
Problem: How to evaluate the results of an original and a replication study?
Solution: Accurate estimation of effect size, taking the statistical significance of the original study into account

The Message
(1) Methods should take the statistical significance of the original study into account
(2) We developed such methods (frequentist and Bayesian)
(3) Huge sample sizes (N ~ 1,000) are needed to distinguish 0 from a small effect
→ With current sample sizes in psychology, one or two studies are not sufficient to accurately estimate effect size
(4) Apply the methods to the Reproducibility Project (2015)
→ Best guess for only a few nonsignificant replications is zero effect
Easy, natural, insightful

Overview
1. Publication bias and Reproducibility
2. Why we should take significance of original study into account
3. Bayesian method
4. Analytical results Bayesian method
5. Application: Reproducibility Project Psychology
6. Conclusion and discussion

1. Publication bias and Reproducibility
Publication bias is ‘the selective publication of studies with a statistically significant outcome’
Evidence of publication bias is HUMONGOUS
97% of published original studies in psychology are significant (97% in RPP), but average power is much lower (8%, about 20%, 35%, 50%)
→ So… convinced?

1. Publication bias and Reproducibility
Publication bias is the 800-lb gorilla in psychology’s living room (Ferguson and Heene, 2012)

1. Publication bias and Reproducibility
But... psychologists do not see the gorilla ?!?
‘Shock’ after RPP (97% → 36%)

2. Why we should take significance of original study into account
Assume the researcher’s goal: replicate a significant original
i. Selection of a high score
ii. Score subject to (sampling) error
→ Regression to the mean: the original overestimates, the replication is accurate!
Holds irrespective of publication bias!
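
The regression-to-the-mean argument can be illustrated with a small simulation. This is a sketch under stated assumptions, none of which come from the slides: a true effect of d = 0.2, 40 participants per group, a normal approximation to the sampling distribution of d, and selection of originals that are significant in the predicted direction.

```python
import random
import statistics

random.seed(1)

TRUE_D = 0.2                       # assumed true effect (illustration only)
N_PER_GROUP = 40                   # assumed group size
SE = (2 / N_PER_GROUP) ** 0.5      # large-sample standard error of d
CRIT = 1.96 * SE                   # smallest d that reaches p < .05, two-tailed

originals, replications = [], []
for _ in range(100_000):
    d_orig = random.gauss(TRUE_D, SE)
    if d_orig > CRIT:              # keep only significant originals
        originals.append(d_orig)
        # the replication is run regardless of its own outcome
        replications.append(random.gauss(TRUE_D, SE))

mean_orig = statistics.fmean(originals)
mean_rep = statistics.fmean(replications)

print(f"true d = {TRUE_D}")
print(f"mean significant original d = {mean_orig:.2f}")
print(f"mean replication d          = {mean_rep:.2f}")
```

No publication bias is simulated here: the selection happens only because a significant original was chosen for replication, which is the point of the slide.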

3. Bayesian method [Snapshot Bayesian Hybrid Method]
Assumptions
– Original study is statistically significant
– Both studies estimate the same effect (fixed-effect)
– No questionable research practices
Basic idea
1) Assume 4 effect sizes (0, small, medium, large [Cohen]) = snapshots
2) Compute posterior probability of the four effects = Bayesian
3) Take statistical significance of original study into account = hybrid

3. Bayesian method: basic idea
[Figure: likelihoods of the replication study]

3. Bayesian method: basic idea
[Figure: likelihoods of the original study]
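
One way to write down the two likelihoods and the resulting posterior is sketched below. The notation is an assumption made here and does not appear on the slides: effects are expressed as Fisher-z transformed correlations, the replication likelihood is an ordinary normal density, and the original likelihood is a normal density truncated at the critical value, which is how the significance of the original study enters.

```latex
% Assumed notation (not from the slides):
%   y_O, y_R    Fisher-z transformed observed correlations (original, replication)
%   \theta_k    = \operatorname{artanh}(\rho_k), with \rho_k \in \{0, 0.1, 0.3, 0.5\}
%   \sigma_O, \sigma_R  = 1/\sqrt{N-3} for each study
%   y_{cv}      critical value of the original study on the Fisher-z scale
%   \pi_k       prior probability of snapshot k (uniform: 1/4)
\begin{align*}
L_R(\rho_k) &= \frac{1}{\sigma_R}\,
  \phi\!\left(\frac{y_R-\theta_k}{\sigma_R}\right) \\[4pt]
L_O(\rho_k) &= \frac{\dfrac{1}{\sigma_O}\,
  \phi\!\left(\dfrac{y_O-\theta_k}{\sigma_O}\right)}
  {1-\Phi\!\left(\dfrac{y_{cv}-\theta_k}{\sigma_O}\right)} \\[4pt]
P(\rho_k \mid y_O, y_R) &=
  \frac{\pi_k\,L_O(\rho_k)\,L_R(\rho_k)}
       {\sum_{j}\pi_j\,L_O(\rho_j)\,L_R(\rho_j)}
\end{align*}
```

The denominator of the original-study likelihood is the probability of a significant result under snapshot k, so a barely significant original carries less weight for large effects than an untruncated density would suggest.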

3. Bayesian method: basic idea
Applied to the example of Maxwell et al. (2015)
Evidence of 0 and small effect increased; best guess = small effect
Advantages of the method
– Easy, natural, insightful
– Easy (re)computation of the posterior for priors other than uniform
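
A minimal sketch of how such a posterior could be computed for this example is given below. It is an illustration under assumptions, not the authors' implementation: the t statistics are converted to correlations with r = t / sqrt(t² + df), effects are put on the Fisher-z scale, the original study is treated as significant at two-tailed α = .05 in the positive direction, and the names `r_from_t` and `snapshot_hybrid` are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()
# Cohen's benchmarks for correlations, used as the four snapshots
SNAPSHOTS = {"zero": 0.0, "small": 0.1, "medium": 0.3, "large": 0.5}


def r_from_t(t, df):
    """Convert an independent-samples t statistic to a correlation."""
    return t / math.sqrt(t * t + df)


def snapshot_hybrid(r_orig, n_orig, r_rep, n_rep, alpha=0.05, prior=None):
    """Posterior probability of each snapshot given both studies."""
    y_o, y_r = math.atanh(r_orig), math.atanh(r_rep)
    se_o = 1 / math.sqrt(n_orig - 3)
    se_r = 1 / math.sqrt(n_rep - 3)
    crit = STD.inv_cdf(1 - alpha / 2) * se_o      # critical Fisher-z value
    if y_o <= crit:
        raise ValueError("original study must be statistically significant")
    if prior is None:
        prior = {name: 1 / len(SNAPSHOTS) for name in SNAPSHOTS}

    weights = {}
    for name, rho in SNAPSHOTS.items():
        theta = math.atanh(rho)
        lik_rep = NormalDist(theta, se_r).pdf(y_r)
        dist_o = NormalDist(theta, se_o)
        # density truncated at crit: accounts for significance of the original
        lik_orig = dist_o.pdf(y_o) / (1 - dist_o.cdf(crit))
        weights[name] = prior[name] * lik_orig * lik_rep

    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


posterior = snapshot_hybrid(
    r_from_t(2.24, 78), 80,      # original: t(78) = 2.24, N = 80
    r_from_t(1.50, 170), 172,    # replication: t(170) = 1.50, N = 172
)
best_guess = max(posterior, key=posterior.get)

for name, prob in posterior.items():
    print(f"{name:>6}: {prob:.3f}")
print("best guess:", best_guess)
```

Passing a different `prior` dictionary (for example more weight on "zero") recomputes the posterior without touching the likelihoods, which is one reading of the "easy recomputation" advantage named on the slide.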

4. Analytical results Bayesian method
Independent variables:
– ρ = 0, 0.1, 0.3, 0.5 [0, small, medium, large]
– N (both original and replication): 31, 55, 96, 300, 1000
Dependent variables:
– Expected posterior probability
– Probability of strong evidence (posterior > .75 or Bayes factor > 3)

4. Analytical results Bayesian method
Expected posterior probability (hybrid)
Huge sample sizes (N ~ 1,000) are needed to distinguish 0 from a small effect
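
The dependence on N can be explored with a Monte Carlo sketch. The slides report analytical results; the simulation below is a stand-in written for illustration, and it assumes a true effect of zero, equal N in both studies, Fisher-z normal approximations, a uniform prior, and an original study that is significant at two-tailed α = .05 in the positive direction. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

STD = NormalDist()
THETAS = [math.atanh(rho) for rho in (0.0, 0.1, 0.3, 0.5)]


def posterior_zero(y_o, y_r, se, crit):
    """Posterior probability of the zero snapshot under a uniform prior."""
    weights = []
    for theta in THETAS:
        dist = NormalDist(theta, se)
        # replication density times truncated original density
        weights.append(dist.pdf(y_r) * dist.pdf(y_o) / (1 - dist.cdf(crit)))
    return weights[0] / sum(weights)


def expected_posterior_zero(n, reps=2000, seed=2024):
    """Average posterior of 'zero' when the true effect is zero."""
    rng = random.Random(seed)
    se = 1 / math.sqrt(n - 3)
    crit = STD.inv_cdf(0.975) * se
    lo = 0.975                     # P(not significant) under a zero effect
    values = []
    for _ in range(reps):
        # inverse-CDF draw of a significant original (truncated normal)
        u = min(lo + (1 - lo) * rng.random(), 1 - 1e-12)
        y_o = se * STD.inv_cdf(u)
        y_r = rng.gauss(0.0, se)   # replication is not selected on outcome
        values.append(posterior_zero(y_o, y_r, se, crit))
    return fmean(values)


expected = {n: expected_posterior_zero(n) for n in (31, 55, 96, 300, 1000)}
for n, value in expected.items():
    print(f"N = {n:>4}: expected posterior of zero effect = {value:.2f}")
```

Comparing the printed values across N gives a rough feel for how much data is needed before the zero and small snapshots separate; it should not be read as a reproduction of the figures in the talk.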

4. Analytical results Bayesian method
Expected posterior probability (WRONG method)
Uncorrected for publication bias → overestimation

4. Analytical results Bayesian method
Expected posterior probability (hybrid)
Medium and large effect sizes are easier to distinguish

4. Analytical results Bayesian method
Probability of strong evidence (hybrid)
Large sample size needed for 0 and small effect

5. Application: Reproducibility Project Psychology
100 studies from JPSP, Psych Science, JEP → 67 could be included
Evidence according to posterior probabilities > .25
0 = zero, S(mall), M(edium), L(arge)
Strong evidence (posterior probability > .75)
Only a few studies have strong evidence for zero effect (13.4%)

6. Conclusion and discussion
Messages
(1) Methods should take the statistical significance of the original study into account
(2) We developed such methods (frequentist and Bayesian)
(3) Huge sample sizes (N ~ 1,000) are needed to distinguish 0 from a small effect
→ With current sample sizes in psychology, one or two studies are not sufficient to accurately estimate effect size
(4) Apply the methods to the Reproducibility Project (2015)
→ Best guess for few nonsignificant replications is zero effect

6. Conclusion and discussion
Other
– Apply the methods to the reproducibility project in economics
– Power analysis (how large should N of the replication be for an 80% chance of strong evidence?)
– Unequal sample sizes of original and replication
– Discarding original studies, i.e. using only the replication, is optimal for estimation in some conditions
→ Start all over again in some fields?!?
– App / user-friendly program

Thank you for your attention

1. Publication bias and Reproducibility
But... psychologists do not see the gorilla ?!?
‘Shock’ after RPP (97% → 36%)
Denial of results by some psychologists and methodologists:
– Bad replication
– Not generalizable to other settings/people/time
– Statistical evaluation of results not right (e.g. Maxwell et al.)
ALL criticisms are true to some extent!
→ NEED accurate methods!

2. Why we should take significance of original study into account
Assume the researcher’s goal: replicate a significant original
i. Selection of a high score
ii. Score subject to (sampling) error
→ Regression to the mean: the expected value of the replication is smaller than that of the original!
Holds irrespective of publication bias!
Assume the researcher’s goal: replicate an original (which was significant)
No selection of a high score by the researcher, but…
Selection of a high score through publication bias
→ Regression to the mean still holds, and we should still take the significance of the original study into account

Other very important messages we would like to convey, but really have no time for
1. Analysis shows that discarding original studies, i.e. using only the replication, is optimal for estimation if
(i) the true effect size is zero to small, and
(ii) N replication > N original
→ Start all over again in some fields?!?
2. Using Bayesian analysis, or confidence intervals, rather than frequentist statistics is not the solution
→ Using larger sample sizes is part of the solution