The P-hacking Phenomenon

The P-hacking Phenomenon Megan Head, Luke Holman, Andrew Kahn, Rob Lanfear, Michael Jennions. OK, so I'm going to talk about a study that my colleagues here and I recently published in PLoS Biology about publication bias and p-hacking. Before I get started, though, I first want to ask you a few questions.

Have you ever… not reported variables that you've measured? Decided whether or not to collect more data after looking at your results? Back-engineered your predictions? Don't worry, I'm not going to make you raise your hands; I just want you to answer these questions in your head. Have you ever not reported variables that you've measured, perhaps because after analysing the data it was clear that some of them weren't very important? Or have you ever decided whether to collect more data after analysing your results, perhaps because it became clear a greater sample size would give you more power to detect a marginal effect? Or have you ever back-engineered your predictions to tell a good story, either changing the focus of your paper to match the most exciting findings or reporting unexpected findings as having been predicted from the start?

If you said yes to any of these questions you aren't alone. In this study, John and colleagues found that over 60% of psychologists surveyed have failed to report all dependent variables, over 55% have decided whether to collect more data after looking at the results, and around 30% have back-engineered their predictions. They also show that these practices are generally deemed acceptable by researchers, and I must admit that prior to doing this research I would probably have been amongst them. But what I want to do today is demonstrate that taking part in seemingly innocuous practices like these can actually be quite detrimental to scientific progress. (Slide footnote: a score of 2 means all researchers thought the practice was defensible; John et al. 2012.)

The Replication Crisis So… I don't know if you've heard, but some fields of science are undergoing what is called a "replication crisis": many published results cannot be reproduced. This inability to replicate findings is bad, because reproducibility is an essential part of the scientific method. But I don't think it's necessarily being caused by intentional misconduct; rather, it comes from the way we do our science and the unintentional biases that we introduce into our work.

Publication Bias Types of publication bias: the file drawer effect, and p-hacking. One such bias is publication bias, and there are two types. First, there's selection bias, also known as the "file drawer effect". This is a bias in which research gets published and which doesn't, and it arises because journal editors and reviewers place a higher value on significant findings. Because of this, and because researchers are judged on the number of papers they have and the prestige of the journals they publish in, they often don't publish studies that yield non-significant results. When the literature is later reviewed, these missing non-significant results can lead to an overestimation of the true effect of a treatment or relationship. P-hacking is a little bit different, but probably arises at least in part for the same reasons. P-hacking is the misreporting of true effect sizes in published studies. It occurs when researchers analyse their data multiple times, or in multiple ways, and then selectively report the analyses that produce significant results. Like the file drawer effect, p-hacking can lead to an overestimation of the true effect size, but in this case it's not because effect sizes aren't being published; it's because the individual effect sizes that are published are inflated.

How to p-hack Recording many response variables and deciding which to report post-analysis. [Figure: effect-size estimates from Studies 1-9.] So let me give an example of how this works, using a practice that seems common and is deemed acceptable by most: recording many response variables and deciding which to report post-analysis. Let's say researchers are very interested in understanding how body size is influenced by temperature. Each study estimates body size in a few different ways, simply because they can. When they relate each body-size measurement to temperature they get a bunch of different effect sizes; those are these blue data points. If these studies reported all of the effects they recorded, you would see a mean effect size across studies that is here, around 0.5. If they all reported one of these variables at random, or decided before the experiment to measure just one, you would get a similar mean effect size. However, if all these studies report only the variable that gives them the greatest effect, that is, these points over here, then in the published research it will actually look like the true effect size is here, around 0.7.
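To make that mechanism concrete, here is a minimal simulation sketch (my own illustration, not code from the study): each simulated study measures several body-size variables that share the same underlying effect, and we compare the average reported effect under the three reporting strategies just described. The specific numbers (true effect 0.5, four measures, noise level 0.2) are assumptions chosen only to mirror the slide.

```python
# Hypothetical simulation of selective reporting inflating the mean effect size.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.5       # assumed true effect shared by every measure
n_studies = 1_000
n_measures = 4          # body-size variables each study happens to record
noise_sd = 0.2          # sampling noise around the true effect

# Each row holds one study's estimated effect sizes for its measures.
estimates = true_effect + rng.normal(0.0, noise_sd, size=(n_studies, n_measures))

report_all = estimates.mean()               # every measured effect is reported
report_one = estimates[:, 0].mean()         # one measure chosen before the study
report_best = estimates.max(axis=1).mean()  # only the largest effect is reported

print(f"report everything:   {report_all:.2f}")   # ~0.5
print(f"report one a priori: {report_one:.2f}")   # ~0.5
print(f"report the biggest:  {report_best:.2f}")  # noticeably above 0.5
```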

What is the problem with p-hacking? This is a big problem for people coming along later who want to weigh up all the evidence and come to some kind of general conclusion about an effect. If the p-values and effect sizes presented in the literature are a biased sample of what is out there, reviews and meta-analyses will overestimate the strength of a relationship, and this could influence policy and decision making.

How to detect p-hacking So clearly it is important to prevent p-hacking, but before investing too much time into this it's probably a good idea to check whether p-hacking is really happening. So how can we detect it? We'll never be able to know whether a particular p-value has been hacked, but we can look at the distribution of many p-values and look for anomalies that suggest p-hacking is going on. Just to demonstrate the kind of anomalies I'm talking about: on the left is what we expect the distribution of p-values to look like if the true effect size is zero; every p-value is equally likely to be observed, and the expected distribution is uniform. On the other hand, when the true effect size is non-zero, the expected distribution of p-values is right-skewed, roughly exponential in shape, because researchers are more likely to obtain low p-values when studying strong effects. If researchers p-hack and turn a truly non-significant result into a significant one, then we get an overabundance of p-values just below 0.05: when there is no real effect, the distribution shifts from being flat to left-skewed, and when there is a true effect, the distribution is still right-skewed but now has a little hump just below 0.05.
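A quick way to see the two baseline shapes is to simulate them. The sketch below is my own (not the study's code) and assumes a two-sample design with n = 30 per group and an effect size of 0.5 for the "true effect" case; it tabulates p-values from repeated t-tests with and without a real difference.

```python
# Simulated p-value distributions: uniform under the null, right-skewed under a true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 10_000, 30

def pvals(effect):
    """Two-sample t-test p-values for a given standardized mean difference."""
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(effect, 1.0, size=(n_sims, n))
    return stats.ttest_ind(a, b, axis=1).pvalue

p_null = pvals(0.0)   # true effect size is zero -> roughly flat histogram
p_true = pvals(0.5)   # true effect exists       -> pile-up of small p-values

for name, p in [("null", p_null), ("true effect", p_true)]:
    counts, _ = np.histogram(p, bins=np.arange(0, 1.05, 0.05))
    print(name, counts[:4], "...")   # first few 0.05-wide bins of the distribution
```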

How to detect p-hacking Using this knowledge we can test for two things. First, to test whether there is evidence for non-zero effect sizes, we can use a binomial test to see whether there are more p-values in the part of the significant range well below 0.05 than in the part just below it. Second, to test for p-hacking, we can use a binomial test to see whether there are more p-values in the bin immediately below 0.05 (the upper bin) than in the neighbouring bin just below that (the lower bin).
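As a hedged sketch of what such tests could look like in code, here is one possible implementation; the bin boundaries of 0.025, 0.04, and 0.045 are illustrative assumptions, not necessarily the exact cut-offs used in the study.

```python
# Two one-sided binomial tests on binned p-values (illustrative bin boundaries).
from scipy import stats

def evidential_value_test(pvalues):
    """More significant p-values in (0, 0.025] than (0.025, 0.05) suggests a real effect."""
    lower = sum(p <= 0.025 for p in pvalues if p < 0.05)
    upper = sum(p > 0.025 for p in pvalues if p < 0.05)
    return stats.binomtest(lower, lower + upper, p=0.5, alternative="greater").pvalue

def p_hacking_test(pvalues):
    """More p-values in (0.045, 0.05) than (0.04, 0.045] suggests p-hacking."""
    hump = sum(0.045 < p < 0.05 for p in pvalues)
    below = sum(0.04 < p <= 0.045 for p in pvalues)
    return stats.binomtest(hump, hump + below, p=0.5, alternative="greater").pvalue

# Tiny made-up demo (needs at least one p-value in each pair of bins):
demo = [0.001, 0.012, 0.02, 0.031, 0.043, 0.046, 0.047, 0.049]
print(evidential_value_test(demo), p_hacking_test(demo))
```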

How widespread is p-hacking? Text-mining of open-access papers in PubMed; analysed one p-value per results section; classified according to FOR code. Using these tests, we looked at how widespread p-hacking is. To do this we used text mining to extract p-values from the results sections of all open-access papers available in the PubMed database. We then randomly selected one p-value per paper to ensure they were all independent, and we assigned each p-value to a scientific discipline.
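For readers curious about what that extraction step might look like, here is a rough, simplified sketch; it is my own illustration, and the regular expression and sampling are stand-ins, not the study's actual text-mining pipeline.

```python
# Simplified sketch: pull reported p-values out of plain-text results sections
# and keep one randomly chosen p-value per paper so values are independent.
import re
import random

P_VALUE = re.compile(r"[pP]\s*[=<]\s*(0?\.\d+)")

def extract_p_values(results_text):
    """Return all explicitly reported p-values (e.g. 'p = 0.032', 'P < 0.05')."""
    return [float(m) for m in P_VALUE.findall(results_text)]

def one_p_value_per_paper(papers):
    """Sample a single p-value from each paper's results section, if any."""
    sampled = []
    for text in papers:
        ps = extract_p_values(text)
        if ps:
            sampled.append(random.choice(ps))
    return sampled

print(extract_p_values("Body size increased with temperature (p = 0.032)."))
```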

How widespread is p-hacking? What we found is that there is evidence that researchers are predominantly studying questions with non-zero effect sizes. This is reassuring, particularly given recent concerns about the lack of reproducibility of findings. Across all disciplines, however, there was also strong evidence for p-hacking. And when we look at the disciplines individually, in most there were more p-values in the upper than the lower bin, and in every discipline where we had good statistical power this difference was significant. (Slide: overall binomial tests, both p < 0.001.)

How widespread is p-hacking? So our text-mining suggests that p-hacking is widespread, but as you can see from the effect sizes presented here for each discipline, the proportion of p-values in the right-hand bin often isn't that much greater than 50%. So although p-hacking is occurring, its effects on general conclusions may be expected to be negligible. (Figure axis: proportion of p-values in the upper bin ± 95% CI.)
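The quantity on that figure axis, a binomial proportion with its 95% confidence interval, can be illustrated with a small hypothetical calculation. The counts below are invented for illustration only, and a two-sided test against 0.5 is used for simplicity.

```python
# Made-up counts showing how one point in such a figure could be computed.
from scipy import stats

upper, lower = 540, 460                          # hypothetical bin counts for one discipline
result = stats.binomtest(upper, upper + lower)   # two-sided test against 0.5
ci = result.proportion_ci(confidence_level=0.95)
print(f"proportion in upper bin = {upper / (upper + lower):.3f}, "
      f"95% CI = [{ci.low:.3f}, {ci.high:.3f}], p = {result.pvalue:.4f}")
```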

Does p-hacking affect meta-analyses? Evolutionary biology as a case study; re-extracted p-values from original papers. So, in addition to looking at how widespread p-hacking is, we also wanted to take a closer look at how p-hacking influences meta-analyses. To do this, as a kind of case study, we re-extracted p-values from papers that had been included in previous meta-analyses.

Does p-hacking affect meta-analyses? Overall, we again found strong evidence that these meta-analyses are focusing on questions for which there appear to be real effects. Looking at individual studies, there were three that didn't show this, but these all had very low sample sizes, so it is likely that this is a power issue. When we look at p-hacking, we again found a significant effect across all the studies combined, but for individual studies this effect was only significant for the study with the largest sample size. (Slide: overall tests, p < 0.001 and p = 0.033.)

How can we stop p-hacking? What can researchers do? Educate themselves and others; clearly label research as pre-specified or exploratory; adhere to common analysis standards; perform blind analyses whenever possible; place greater emphasis on quality of methods than novelty of findings. So in summary, our results suggest that p-hacking is widespread and is having noticeable effects on the distribution of p-values, although small sample sizes can make p-hacking difficult to detect in individual meta-analyses. Now that we know p-hacking is occurring, what can we do to prevent it?
• One key thing is simply better education of researchers. As is clear from the research by John et al. that I presented earlier, many researchers don't realise that the things they do are problematic.
• We can clearly label research as pre-specified or exploratory so that readers can treat results with appropriate caution. Exploratory research can be useful for identifying fruitful research directions, but pre-specified studies offer far more convincing evidence for a specific effect.
• We can adhere to common analysis standards; for instance, measuring only response variables that are known (or predicted) to be important, and concentrating research efforts on increasing sample sizes rather than measuring more variables.
• We can perform data analyses blind whenever possible, since not knowing which treatment is which makes it difficult to p-hack.
• Finally, and perhaps most importantly, we can place greater emphasis on the quality of research methods and data collection, rather than the significance or novelty of the subsequent findings, when reviewing or assessing research.

Thank you! Thank you