Improving Openness and Reproducibility of Scientific Research. Tim Errington, Center for Open Science, http://cos.io/ We have an opportunity in front of us to make some real changes in how science is done. I'm going to talk about a slightly different approach to supporting data sharing requirements, one made possible through collaborations and partnerships with others, perhaps most prominently those who manage and preserve scientific knowledge: publishers and librarians. COS is a non-profit technology company providing free and open services to increase the inclusivity and transparency of research. COS supports shifting incentives and practices to align more closely with scientific values. I'll also talk about the challenges I face when working to advance scientific knowledge and my career at the same time, and how my scientific practices can be adapted to meet my scientific values.
Mission: Improve openness, integrity, and reproducibility of scientific research. First, let me give you a bit of background on the Center for Open Science. COS is a non-profit technology start-up in Charlottesville, VA, with the mission of improving openness, integrity, and reproducibility of scientific research. Founded in March of 2013, we've grown to over 30 full-time employees (and 25+ interns), working in three main areas: infrastructure, community, and metascience. Most of what I'm going to talk about today falls under the community umbrella: bringing together communities of researchers, funders, editors, and other stakeholders interested in improving the reproducibility of research.
Norms vs. Counternorms
Communality (open sharing with colleagues) vs. Secrecy (keeping work closed)
Universalism (evaluate research on its own merit) vs. Particularism (evaluate research by reputation and past productivity)
Disinterestedness (motivated by knowledge and discovery) vs. Self-interestedness (treat science as a competition with other scientists)
Organized skepticism (consider all new evidence, theory, and data, even against one's prior work) vs. Organized dogmatism (invest career in promoting one's own theories, findings, and innovations)
Quality (seek quality contributions) vs. Quantity (seek high volume)
Communality refers to the shared ownership of scientific methods and findings. Universalism is the principle that a scientist's findings and body of work should be judged on the basis of their merit, without reference to the scientist's personal or other irrelevant characteristics. Disinterestedness represents the understanding that scientists' work should be free of self-interested motivation and pursuit of wealth. Organized skepticism requires that scientific findings be made available for scrutiny and that such scrutiny be performed in accordance with accepted scientific standards. Merton, 1942
Anderson, Martinson, & De Vries, 2007: a survey of 3,247 mid- and early-career scientists who had research funding from NIH, comparing the ideal to which most scientists subscribe, scientists' perceptions of their own behavior, and scientists' perceptions of their peers' behavior. "Self-regulation, substantial autonomy, the complexity of scientific projects, professional expertise, innovative work on cutting-edge problems, and a system of largely voluntary compliance with regulation and codes of ethics all point to the futility and inadvisability of direct administrative control over scientists' behavior." Anderson, Martinson, & De Vries, 2007
Central Features of Science. Scientific ideals: innovative ideas, reproducible results, accumulation of knowledge. Transparency. Reproducibility. Science operates under three basic ideals. Science aims to make new discoveries. The findings of science should be reproducible: I should be able to find an effect multiple times, and other researchers should also be able to find the same effects. The findings of individual studies should be able to build off one another, with each study being a reliable piece of evidence that builds toward some broader understanding of a true phenomenon. Two central features of science are transparency and reproducibility (Bacon, 1267/1859; Jasny et al., 2011; Kuhn, 1962; Merton, 1942; Popper, 1934/1992). Transparency requires scientists to publish their methodology and data so that the merit of a claim can be assessed on the basis of the evidence rather than the reputation of those making the claim. Reproducibility can refer to both the ability of others to reproduce the findings, given the original data, and to the generation of new data that supports the same conclusions. If all published results were true and their effect sizes estimated precisely, then a singular focus on innovation over verification might be inconsequential, because the effect size would be reliable. In such a context, the most efficient means of knowledge accumulation would be to spend all resources on discovery and trust that each published result provided an accurate estimate of effects on which to build or extend. However, if not all published results are true and if effect sizes are misestimated, then an absence of replication and verification will lead to a published literature that misrepresents reality. The consequences of that scenario would depend on the magnitude of the mis-estimation.
Unfortunately, it has become apparent over the last few years that perhaps the answer to that question is "not all that much." It is important to note that what I am NOT talking about here is fraud. There have been some very prominent cases in the past few years of outright fraud, where people have completely fabricated their data, but I'm not talking about those cases. What I'm talking about is the general sense that many scientific findings in a wide variety of fields don't replicate, and that the published literature has a very high rate of false positives in it. So if a large proportion of our published results aren't replicable, and are potential false positives, are we actually accumulating knowledge about real phenomena? I would suggest that the answer to this question is "no", or at least that we haven't accumulated as much knowledge as we would like to believe. These three ideals should be exemplified within the published literature; however, there is growing evidence that much of the published literature may be less reliable than we would wish, leading to an inefficient accumulation of knowledge. This unreliability, the "reproducibility crisis" in science, is evidenced by almost daily articles and news stories describing research that has failed to replicate. One factor contributing to the low reliability of scientific findings is the lack of openness in science. Lack of transparency around the research process and the research product (data!), at best, leaves other researchers and replicators in the dark about research design and data analysis decisions. Providing access to data and materials allows others to reproduce your work and build on it. Without replication, false positives can persist in the literature, inhibiting scientific progress. The Center for Open Science is dedicated to improving the reproducibility of scientific research -- but first let's examine the challenges associated with that. The small amount of direct evidence about reproducibility converges with the conclusions of these systematic reviews. A survey of faculty and trainees at the MD Anderson Cancer Center found half of those researchers reported an inability to reproduce data on at least one occasion (Mobley et al., 2013). More dramatically, two industrial laboratories, Bayer and Amgen, reported reproducibility rates of 11% and 25% in two independent efforts to reproduce findings from dozens of groundbreaking basic science studies in oncology and related areas (Begley and Ellis, 2011; Prinz et al., 2011). The available evidence suggests that published research is less reproducible than assumed and desired, perhaps because of an inflation of false positives and a culture of incentives that values publication over accuracy (Nosek et al., 2012), but the evidence is incomplete.
The Bayer and Amgen reports of failing to reproduce a high proportion of results provide the most direct evidence. However, neither report made available the effects investigated, the sampling process, the methodology, or the data that comprised the replication efforts (Nature, 2012). Some evidence from biomedical research suggests that this is occurring. Two different industrial laboratories attempted to replicate 40 or 50 basic science studies that showed positive evidence for markers for new cancer treatments or other issues in medicine. They did not select at random; instead, they picked studies considered landmark findings. The success rates for replication were about 25% in one study and about 10% in the other. Further, some of the findings they could not replicate had spurred large literatures of hundreds of articles following up on the finding and its implications, without ever having tested whether the evidence for the original finding was solid. This is a massive waste of resources. Across the sciences, evidence like this has spurred lots of discussion and proposed actions to improve research efficiency, avoid the massive waste of resources linked to erroneous results getting into and staying in the literature, and change a culture of scientific practices that rewards publishing, perhaps at the expense of knowledge building. There have been a variety of suggestions for what to do. For example, the Nature article on the right suggests that publishing standards should be increased for basic science research. [It is not in my interest to replicate – myself or others – to evaluate validity and improve precision in effect estimates (redundant). Replication is worth next to zero (Makel data on published replications); researchers are motivated not to call it replication; novelty is supreme – zero "error checking". It is not in my interest to check my work, and not in your interest to check my work (let's just each do our own thing and get rewarded for that). Irreproducible results will get into and stay in the literature (examples from bio-med). Prinz and Begley articles (make sure to summarize accurately). The Nature article by folks in bio-medicine is great. The solution they offer is a popular one among commentators from the other sciences: raise publishing standards.]
What is reproducibility? Computational reproducibility: if we took your data and code/analysis scripts and reran them, we could reproduce the numbers and graphs in your paper. Empirical reproducibility: we have enough information to rerun the experiment or survey the way it was originally conducted. Replicability: we use your exact methods and analyses, but collect new data, and we get the same statistical results.
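To make computational reproducibility concrete, here is a minimal sketch of what a rerunnable analysis script might look like. The file names, column names, and test are hypothetical, not taken from any study discussed in this talk; the point is only that every number and figure in a paper can be regenerated by running one script against the untouched raw data.

```python
# reproduce_figure1.py -- hypothetical sketch of a computationally reproducible analysis.
# Rerunning this script end-to-end against the raw data should regenerate the exact
# statistics and figure reported in the (hypothetical) paper.
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv("data/raw_results.csv")          # raw data, never edited by hand
treatment = df.loc[df["group"] == "treatment", "outcome"]
control = df.loc[df["group"] == "control", "outcome"]

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, n = {len(df)}")   # numbers quoted in the text

df.boxplot(column="outcome", by="group")
plt.savefig("figures/figure1.png", dpi=300)                    # the figure as published
```

Sharing a script like this alongside the data is what lets a reader check computational reproducibility without ever contacting the authors.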
Why should you care? To increase the efficiency of your own work: it is hard to build off our own work, or the work of others in our lab, when it is not reproducible. We may not have the knowledge we think we have, and it is hard to even check this if reproducibility is low.
Problems: Flexibility in analysis; Selective reporting; Ignoring nulls; Lack of replication. Examples from: Button et al. (neuroscience); Ioannidis, why most published results are false (medicine); GWAS; biology. Two possibilities are that the percentage of positive results is inflated because negative results are much less likely to be published, and that we are pursuing our analysis freedoms to produce positive results that are not really there. These would lead to an inflation of false-positive results in the published literature. Sterling, 1959; Cohen, 1962; Lykken, 1968; Tukey, 1969; Greenwald, 1975; Meehl, 1978; Rosenthal, 1979
Researcher Degrees of Freedom: all data processing and analytical choices made after seeing and interacting with your data. Should I collect more data? Which observations should I exclude? Which conditions should I compare? What should my main DV (dependent variable) be? A small simulation of how this flexibility inflates false positives follows.
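Here is a minimal simulation sketch of that problem, under assumptions chosen purely for illustration (two correlated dependent variables, one round of optional stopping, and a true null effect): even though nothing real is there, reporting whichever analysis "works" pushes the false positive rate well above the nominal 5%.

```python
# Sketch: researcher degrees of freedom inflate false positives even when the null is true.
# Illustrative assumptions: two correlated DVs, plus "collect 10 more per group" whenever
# the first pass is not significant. A hit is counted if ANY test reaches p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 5000, 0.05
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=20)                       # two groups with NO true difference
    b = rng.normal(size=20)
    a2 = a + rng.normal(scale=0.5, size=20)       # a second, correlated outcome measure
    b2 = b + rng.normal(scale=0.5, size=20)
    p_values = [stats.ttest_ind(a, b).pvalue,
                stats.ttest_ind(a2, b2).pvalue]
    if min(p_values) >= alpha:                    # "should I collect more data?" -- only if not yet significant
        a = np.concatenate([a, rng.normal(size=10)])
        b = np.concatenate([b, rng.normal(size=10)])
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_sims:.1%} (nominal alpha = 5%)")
```

This is the same logic Simmons, Nelson, and Simonsohn (2011) use to show how undisclosed flexibility inflates false positives far beyond 5%.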
http://compare-trials.org
Button et al., 2013, Nature Reviews Neuroscience Figure 3 | Median power of studies included in neuroscience meta-analyses. The figure shows a histogram of median study power calculated for each of the n = 49 meta-analyses included in our analysis, with the number of meta-analyses (N) on the left axis and percent of meta-analyses (%) on the right axis. There is a clear bimodal distribution; n = 15 (31%) of the meta-analyses comprised studies with median power of less than 11%, whereas n = 7 (14%) comprised studies with high average power in excess of 90%. Despite this bimodality, most meta-analyses comprised studies with low statistical power: n = 28 (57%) had median study power of less than 31%. The meta-analyses (n = 7) that comprised studies with high average power in excess of 90% had their broadly neurological subject matter in common. Simultaneously, across disciplines, the average power of studies to detect positive results is quite low (Button et al., 2013; Cohen, 1962; Ioannidis, 2005). In neuroscience, for example, Button et al. observed the median power of studies to be 21% (Button et al., 2013), which means that assuming the finding being investigated is true and accurately estimated, then only 21 of every 100 studies investigating that effect would detect statistically significant evidence for the effect. Most studies would miss detecting the true effect. The implication of very low power is that the research literature would be filled with lots of negative results, regardless of whether the effects actually exist or not. In the case of neuroscience, assuming all investigated effects in the published literature are true, only 21% of the studies should have obtained a significant, positive result detecting that effect. However, Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). This discrepancy between observed power and observed positive results is not statistically possible. Instead, it suggests systematic exclusion of negative results (Greenwald, 1975) and possibly the exaggeration of positive results by employing flexibility in analytic and reporting practices that inflate the likelihood of false positives (Simmons et al., 2011). Button et al., 2013, Nature Reviews Neuroscience
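To make the power figure concrete, here is a small simulation sketch with illustrative numbers: a true effect of Cohen's d = 0.4 and 20 subjects per group, values I picked to land near the ~21% median power Button et al. report rather than values taken from their paper. Even though the effect is real in every simulated study, only roughly one in five studies detects it.

```python
# Sketch: what ~21% power means in practice, via simulation with illustrative parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, true_d, alpha = 10_000, 20, 0.4, 0.05

detections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, size=n_per_group)
    treatment = rng.normal(true_d, 1.0, size=n_per_group)   # the effect exists by construction
    if stats.ttest_ind(treatment, control).pvalue < alpha:
        detections += 1

print(f"Estimated power: {detections / n_sims:.0%}")   # roughly 20-25%: most studies miss the true effect
```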
There is evidence that our published literature is too good to be true. Daniele Fanelli did an analysis of what gets published across scientific disciplines and found that all disciplines had positive result rates of 70% or higher. From physics through psychology, the rates were 85-92%. Consider our field’s 92% positive result rate in comparison to the average power of published studies. Estimates suggest that the average psychology study has a power of somewhere around .5 to .6 to detect its effects. So, if all published results were true, we’d expect somewhere between 50-60% of the critical tests to reject the null hypothesis. But we get 92%. That does not compute. Something is askew in the accumulating evidence. [It is not in my interest to write up negative results, even if they are true, because they are less likely to be published (negative) – file-drawer effect] The accumulating evidence suggests an alarming degree of mis-estimation. Across disciplines, most published studies demonstrate positive results – results that indicate an expected association between variables or a difference between experimental conditions (Fanelli, 2010, 2012; Sterling, 1959). Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). Fanelli,2010,
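A back-of-the-envelope version of that "does not compute" argument, using the talk's own numbers (average power around .55 and the most generous case, in which every tested hypothesis is true):

```python
# Expected vs. observed positive result rates, using illustrative numbers from the talk.
power = 0.55        # talk's estimate of average power in psychology (~.5 to .6)
alpha = 0.05        # false positive rate when a tested hypothesis is false
p_true = 1.0        # most generous assumption: every tested hypothesis is true

expected = p_true * power + (1 - p_true) * alpha
print(f"Expected share of significant results: {expected:.0%}")   # ~55%
print("Observed positive result rate in psychology (Fanelli, 2010): ~92%")
# Even under the most generous assumption, observation far exceeds expectation,
# pointing to selective reporting of positives and/or analytic flexibility inflating them.
```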
Problems: Flexibility in analysis; Selective reporting; Ignoring nulls; Lack of replication; Disorganization; Loss of materials and data; Infrequent sharing.
Challenges: Reproducing prior results is challenging because of insufficient, incomplete, or inaccurate reporting of methodologies (Hess, 2011; Prinz et al., 2011; Steward et al., 2012; Hackam and Redelmeier, 2006; Landis et al., 2011). Further, a lack of information about research resources makes it difficult or impossible to determine what was used in a published study (Vasilevsky et al., 2013). These challenges are compounded by the lack of funding support available from agencies and foundations to support replication research. Finally, reproducing analyses with prior data is difficult because researchers are often reluctant to share data, even when required by funding bodies or scientific societies (Wicherts et al., 2006), and because data loss increases rapidly with time after publication (Vines et al., 2014). Manufacturing beauty: flexibility in analysis, selective reporting, presenting exploratory as confirmatory. Selective reporting: the file drawer phenomenon. In a study published in August, a team at Stanford traced the publication outcomes of 221 survey-based experiments funded by the NSF. Nearly two-thirds of the social science experiments that produced null results, those that did not support a hypothesis, were simply filed away. In contrast, researchers wrote up 96% of the studies with statistically strong results. Vasilevsky, 2013; Vines, 2014
The larger projects are the Reproducibility Projects, which empirically examine the rate and predictors of reproducibility in the published literature. These are our flagship projects, with one investigating psychological science ("RP:P") and the other cancer biology ("RP:CB"). The projects share the same overall design: take a sample of the published literature and perform direct replications, working with the original authors when possible to obtain all original materials and methods, conducting high-powered replications, and evaluating the feasibility of direct replication and the covariates of replication success. Even though the overall design and aims are the same, the projects have adopted different executions because of their disciplines: RP:P utilizes a large collaborative network of hundreds of independent researchers to design and conduct the replications (a true community effort), while RP:CB has volunteers draft replication protocols from the original paper but uses paid biomedical laboratories in the Science Exchange network to conduct the replications. Both projects are ongoing, although at different stages.
97% of original studies reported positive (statistically significant) results; 37% of replications did. Open Science Collaboration, 2015, Science
Open Science Collaboration, 2015, Science
https://osf.io/e81xl/
http://sitn.hms.harvard.edu/flash/2015/reproduce-or-bust-bringing-reproducibility-back-to-center-stage/
Incentives for individual success are focused on getting it published, not getting it right. These perceived norms are a challenge to openness in scientific research. So, too, is motivation: what am I incentivized to do? Publish or perish, less so to get it right. What you know about my research process is what is published in the report, nothing else. Lack of transparency makes me less accountable. And finally: I'm busy, and changing my workflow to adopt better practices takes time, so I'll just do the best I can and hope it's okay. Despite being a defining feature of science, reproducibility is more an assumption than a practice in the present scientific ecosystem (Collins, 1985; Schmidt, 2009). Incentives for scientific achievement prioritize innovation over replication (Alberts et al., 2014; Nosek et al., 2012). Peer review tends to favor manuscripts that contain new findings over those that improve our understanding of a previously published finding. Moreover, careers are made by producing exciting new results at the frontiers of knowledge, not by verifying prior discoveries. Nosek, Spies, & Motyl, 2012
Improving the scientific ecosystem: technology to enable change; training to enact change; incentives to embrace change.
538 Journals 58 Organizations http://cos.io/top
TOP Guidelines (eight standards): Data citation; Design transparency; Research materials transparency; Data transparency; Analytic methods (code) transparency; Preregistration of studies; Preregistration of analysis plans; Replication.
TOP Design Principles: Low barrier to entry; Modular; Agnostic to discipline.
Data sharing standard, three levels:
Level 1: Article states whether data are available and, if so, where to access them.
Level 2: Data must be posted to a trusted repository. Exceptions must be identified at article submission.
Level 3: Data must be posted to a trusted repository, and reported analyses will be reproduced independently prior to publication.
Why you might want to share: journal/funder mandates; increased impact of your work; recognition of good research practices.
Signals: making behaviors visible promotes adoption. Badges: Open Data, Open Materials, Preregistration. Adopted by Psychological Science (January 2014).
[Figure: percentage of articles reporting that data was available, 0-40% axis.]
[Figure: percentage of articles reporting that data was available (0-100% axis), broken down by: reportedly available, available, correct data, usable data, complete data.]
Solution: Pre-registration
Before data are collected, specify the "what" of the study: research question, population, primary outcome, general design.
Pre-analysis plan: information on the exact analysis that will be conducted, including sample size, data processing and cleaning procedures, exclusion criteria, and statistical analyses.
Registered in a read-only format and time-stamped.
Decreases researcher degrees of freedom, so there is a smaller chance of obtaining a false positive through data-based decisions.
Combats publication bias and selective reporting.
Registration holds you accountable to yourself and to others.
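As a toy illustration of the "read-only and time-stamped" idea (this is not how OSF registrations are implemented; OSF creates a frozen, time-stamped snapshot of your project for you), a pre-analysis plan can be made tamper-evident by publishing a cryptographic fingerprint of it before data collection. The file name below is hypothetical.

```python
# Toy sketch: freeze a pre-analysis plan so later deviations are detectable.
# 'preanalysis_plan.md' is a hypothetical file containing the items listed above
# (sample size, exclusion criteria, planned statistical analyses, ...).
import hashlib
from datetime import datetime, timezone

with open("preanalysis_plan.md", "rb") as f:
    plan_bytes = f.read()

fingerprint = hashlib.sha256(plan_bytes).hexdigest()
timestamp = datetime.now(timezone.utc).isoformat()

print(f"Plan registered at {timestamp}")
print(f"SHA-256 fingerprint: {fingerprint}")
# Post the fingerprint (or, better, the plan itself) somewhere public before collecting data;
# any later edit to the plan changes the hash, so it cannot be quietly revised after the fact.
```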
Positive result rate dropped from 57% to 8% after preregistration was required.
The Pre-registration Challenge. Educational content and the OSF workflow can be seen at https://cos.io/preregprivate Promotional video: https://youtu.be/SWkqdNppL-s One thousand scientists will win $1,000 each for publishing the results of their preregistered research. https://cos.io/prereg
Registered Reports workflow: Design → Peer Review → Collect & Analyze → Report → Publish. Review of the introduction and methods occurs prior to data collection; the paper is published regardless of outcome.
Who Publishes Registered Reports? So, who publishes these things? Here's a partial (and growing!) list. You can view the complete list on the Registered Reports project page on the OSF; there's even a table comparing features of RRs across journals. Review of intro and methods prior to data collection; published regardless of outcome. Registered Reports address (just to name a few): beauty vs. accuracy of reporting, publishing negative results, conducting replications, and peer review that focuses on the quality of methods. See the full list and compare features: osf.io/8mpji
Problems: Flexibility in analysis; Selective reporting; Ignoring nulls; Lack of replication; Disorganization; Loss of materials and data; Infrequent sharing.
http://osf.io/ (free, open source). Share data, share materials, and show the research process. If a result is confirmatory, make that clear; if it is an exploratory discovery, make that clear. Demonstrate the ingenuity, perspiration, and learning across false starts, errant procedures, and early hints. It doesn't have to be written in painstaking detail in the final report, just make it available.
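For readers who want to script against OSF rather than use the web interface, here is a minimal sketch using the public OSF v2 REST API, assuming the node and file-listing endpoints documented at https://developer.osf.io/. The five-character node id is a placeholder, not a real project; substitute the GUID from any public project's osf.io URL.

```python
# Minimal sketch: read metadata and list OSF Storage files for a public OSF project.
# NODE_ID is a hypothetical placeholder; replace it with a real public project GUID.
import requests

NODE_ID = "abcde"
BASE = f"https://api.osf.io/v2/nodes/{NODE_ID}/"

node = requests.get(BASE, timeout=30).json()
print("Title:", node["data"]["attributes"]["title"])

files = requests.get(BASE + "files/osfstorage/", timeout=30).json()
for item in files["data"]:
    attrs = item["attributes"]
    print(attrs["kind"], "-", attrs["name"])    # folders and files stored on OSF Storage
```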
Workflow: a level of abstraction that is shared across areas of scholarly inquiry, particularly the sciences. Functions that support that workflow.
OpenSesame
Find this presentation at https://osf.io/rwtyf/ Questions: tim@cos.io
Resources: Reproducible research practices? stats-consulting@cos.io The OSF? support@osf.io Have feedback on how we could support you more? contact@cos.io or feedback@cos.io