Improving Openness and Reproducibility of Scientific Research. Tim Errington, Center for Open Science, http://cos.io/ We have an opportunity in front of us to make some real changes in how science is done. I’m going to talk about a slightly different approach to supporting data sharing requirements, one made possible through collaborations and partnerships with others, including, perhaps most prominently, those who manage and preserve scientific knowledge: publishers and librarians. COS is a non-profit technology company providing free and open services to increase inclusivity and transparency of research. COS supports shifting incentives and practices to align more closely with scientific values. I will cover the challenges I face when working to advance scientific knowledge and my career at the same time, and how my scientific practices can be adapted to meet my scientific values.
Mission: Improve openness, integrity, and reproducibility of scientific research. First, let me give you a bit of background on the Center for Open Science. COS is a non-profit technology start-up in Charlottesville, VA, with the mission of improving openness, integrity, and reproducibility of scientific research. Founded in March of 2013, we’ve grown to over 30 full-time employees (and 25+ interns), working in 3 main areas: infrastructure, community, and metascience. Most of what I’m going to talk about today falls under the community umbrella: bringing together communities of researchers, funders, editors, and other stakeholders interested in improving the reproducibility of research.
Norms vs. Counternorms
Communality (open sharing) vs. Secrecy (closed)
Universalism (evaluate research on its own merit) vs. Particularism (evaluate research by reputation)
Disinterestedness (motivated by knowledge and discovery) vs. Self-interestedness (treat science as a competition)
Organized skepticism (consider all new evidence, even against one’s prior work) vs. Organized dogmatism (invest career promoting one’s own theories, findings)
Quality vs. Quantity
Communality: open sharing with colleagues; Secrecy. Universalism: research evaluated only on its merit; Particularism: research evaluated by reputation/past productivity. Disinterestedness: scientists motivated by knowledge and discovery, not by personal gain; Self-interestedness: treat science as a competition with other scientists. Organized skepticism: consider all new evidence, theory, data, even if it contradicts one’s prior work/point-of-view; Organized dogmatism: invest career in promoting one’s own most important findings, theories, innovations. Quality: seek quality contributions; Quantity: seek high volume. Communality refers to the shared ownership of scientific methods and findings. Universalism is the principle that a scientist’s findings and body of work should be judged on the basis of their merit, without reference to the scientist’s personal or other irrelevant characteristics. Disinterestedness represents the understanding that scientists’ work should be free of self-interested motivation and pursuit of wealth. Organized skepticism requires that scientific findings be made available for scrutiny and that such scrutiny be performed in accordance with accepted scientific standards. Merton, 1942;
Anderson, Martinson, & DeVries, 2007 3,247 mid- and early-career scientists who had research funding from NIH. Three measures: the ideal to which most scientists subscribe; scientists’ perceptions of their own behavior; scientists’ perceptions of their peers’ behaviors. Self-regulation, substantial autonomy, the complexity of scientific projects, professional expertise, innovative work on cutting-edge problems, and a system of largely voluntary compliance with regulation and codes of ethics all point to the futility and inadvisability of direct administrative control over scientists’ behavior. Anderson, Martinson, & DeVries, 2007
Barriers Perceived norms (Anderson, Martinson, & DeVries, 2007) Motivated reasoning (Kunda, 1990) Minimal accountability (Lerner & Tetlock, 1999) Concrete rewards beat abstract principles (Trope & Liberman, 2010) I am busy (Me & You, 2016) We can understand the nature of the challenge with existing psychological theory. For example: 1. I have beliefs, ideologies, and achievement motivations that influence how I interpret and report my research (motivated reasoning; Kunda, 1990). And even if I am trying to resist this motivated reasoning, I may simply be unable to detect it in myself, even when I can see those biases in others. 2. And what biases might influence me? Well, pick your favorite. My favorite in this context is the hindsight bias. 3. What’s more, we face these potential biases in a context of minimal accountability. What you know of my laboratory work is only what you get in the published report. … 4. The goals and rewards of publishing are immediate and concrete; the rewards of getting it right are distal and abstract (Trope & Liberman, 2010). 5. Finally, even if I am prepared to accept that I have these biases and am motivated to address them so that I can get it right, I am busy. So are you. If I introduce a whole bunch of new things that I must now do to check and correct for my biases, I will kill my productivity and that of my collaborators. So, the incentives lead me to think that my best course of action is to just do the best I can and hope that I’m doing it okay.
Incentives for individual success are focused on getting it published, not getting it right These perceived norms are a challenge to openness in scientific research. So, too is motivation: What am I incentivized to do? Publish or perish - less so to get it right. What you know about my research process is what is published in the report, nothing else. Lack of transparency makes me less accountable. And finally - I’m busy, and changing my workflow to adopt better practices takes time - I’ll just do the best I can and hope it’s okay. Despite being a defining feature of science, reproducibility is more an assumption than a practice in the present scientific ecosystem (Collins, 1985; Schmidt, 2009). Incentives for scientific achievement prioritize innovation over replication (Alberts et al., 2014; Nosek, et al., 2012). Peer review tends to favor manuscripts that contain new findings over those that improve our understanding of a previously published finding. Moreover, careers are made by producing exciting new results at the frontiers of knowledge, not by verifying prior discoveries. Nosek, Spies, & Motyl, 2012
Central Features of Science. Scientific ideals: innovative ideas; reproducible results; accumulation of knowledge. Central features: transparency; reproducibility. Science operates under three basic ideals. Science aims to make new discoveries. The findings of science should be reproducible: I should be able to find an effect multiple times, and other researchers should also be able to find the same effects. The findings of individual studies should build on one another, with each study being a reliable piece of evidence that builds towards some broader understanding of a true phenomenon. Two central features of science are transparency and reproducibility (Bacon, 1267/1859; Jasny et al., 2011; Kuhn, 1962; Merton, 1942; Popper, 1934/1992). Transparency requires scientists to publish their methodology and data so that the merit of a claim can be assessed on the basis of the evidence rather than the reputation of those making the claim. Reproducibility can refer both to the ability of others to reproduce the findings, given the original data, and to the generation of new data that supports the same conclusions. If all published results were true and their effect sizes estimated precisely, then a singular focus on innovation over verification might be inconsequential, because the effect size would be reliable. In such a context, the most efficient means of knowledge accumulation would be to spend all resources on discovery and trust that each published result provided an accurate estimate of effects on which to build or extend. However, if not all published results are true and if effect sizes are misestimated, then an absence of replication and verification will lead to a published literature that misrepresents reality. The consequences of that scenario would depend on the magnitude of the mis-estimation.
What is reproducibility?
Computational Reproducibility: If we took your data and code/analysis scripts and reran them, we could reproduce the numbers/graphs in your paper.
Empirical Reproducibility: We have enough information to rerun the experiment or survey the way it was originally conducted.
Replicability: We use your exact methods and analyses, but collect new data, and we get the same statistical results.
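The first definition can be made concrete as a rerun check: run the same analysis script on the same data and confirm that the reported numbers come out identical. The following is a minimal Python sketch under stated assumptions; the `analyze` function, the seed, and the data values are hypothetical illustrations and do not come from the talk.

```python
import hashlib
import json
import random
import statistics


def analyze(data, seed):
    """Hypothetical analysis: a mean plus a bootstrap standard error."""
    rng = random.Random(seed)  # fixed seed makes the resampling repeatable
    boot_means = [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(200)
    ]
    return {
        "mean": round(statistics.fmean(data), 6),
        "boot_se": round(statistics.stdev(boot_means), 6),
    }


def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative data, not taken from any study.
data = [2.1, 3.4, 1.9, 4.2, 3.3, 2.8, 3.9, 2.5]

published = analyze(data, seed=42)  # numbers as they would appear in a paper
rerun = analyze(data, seed=42)      # a reader rerunning the same data and code

reproduced = fingerprint(published) == fingerprint(rerun)
print("computationally reproduced:", reproduced)
```

The check only works if the data, the code, and any random seed are all shared; an unrecorded seed alone is enough to make the bootstrap numbers drift between runs.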
Why should we care? To increase the efficiency of your own work Hard to build off our own work, or work of others in our lab We may not have the knowledge we think we have Hard to even check this if reproducibility low
Problems Flexibility in analysis Selective reporting Ignoring nulls Lack of replication Examples from: Button et al. (neuroscience); Ioannidis, why most results are false (medicine); GWAS; biology. Two possibilities are that the percentage of positive results is inflated because negative results are much less likely to be published, and that we are pursuing our analysis freedoms to produce positive results that are not really there. These would lead to an inflation of false-positive results in the published literature. Some evidence from bio-medical research suggests that this is occurring. Two different industrial laboratories attempted to replicate 40 or 50 basic science studies that showed positive evidence for markers for new cancer treatments or other issues in medicine. They did not select at random. Instead, they picked studies considered landmark findings. The success rates for replication were about 25% in one study and about 10% in the other. Further, some of the findings they could not replicate had spurred large literatures of hundreds of articles following up on the finding and its implications, without ever testing whether the evidence for the original finding was solid. This is a massive waste of resources. Across the sciences, evidence like this has spurred lots of discussion and proposed actions to improve research efficiency and avoid the waste linked to erroneous results getting in and staying in the literature, and discussion about the culture of scientific practices that rewards publishing, perhaps at the expense of knowledge building. There have been a variety of suggestions for what to do. For example, the Nature article on the right suggests that publishing standards should be increased for basic science research. [It is not in my interest to replicate, myself or others, to evaluate validity and improve precision in effect estimates (redundant). Replication is worth next to zero (Makel data on published replications); motivated to not call it replication; novelty is supreme; zero “error checking”; not in my interest to check my work, and not in your interest to check my work (let’s just each do our own thing and get rewarded for that).] Irreproducible results will get in and stay in the literature (examples from bio-med). Prinz and Begley articles (make sure to summarize accurately). The Nature article by folks in bio-medicine is great. The solution they offer is a popular one among commentators from the other sciences: raise publishing standards. Sterling, 1959; Cohen, 1962; Lykken, 1968; Tukey, 1969; Greenwald, 1975; Meehl, 1978; Rosenthal, 1979
Researcher Degrees of Freedom All data processing and analytical choices made after seeing and interacting with your data Should I collect more data? Which observations should I exclude? Which conditions should I compare? What should be my main DV?
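To illustrate why a choice such as "Should I collect more data?" matters, here is a minimal simulation sketch in Python. It assumes a deliberately simple setup that is not from the talk: no true effect exists, observations are standard normal with known variance, and the researcher tests once at 20 observations and, only if the result is not significant, adds 10 more and tests again. All names and sample sizes are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_stat(sample):
    """z statistic for H0: mean = 0 with known standard deviation 1."""
    n = len(sample)
    return (sum(sample) / n) * math.sqrt(n)


def simulate(n_sims=5000, n_start=20, n_extra=10, alpha=0.05, seed=1):
    """Return (fixed-n, optional-stopping) false-positive rates under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
        first_significant = two_sided_p(z_stat(sample)) < alpha
        fixed_hits += first_significant
        if first_significant:
            flexible_hits += 1  # stop and report as soon as p < alpha
        else:
            # "Should I collect more data?" decided after seeing the result
            sample += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            flexible_hits += two_sided_p(z_stat(sample)) < alpha
    return fixed_hits / n_sims, flexible_hits / n_sims


fixed_rate, flexible_rate = simulate()
print(f"fixed sample size:  {fixed_rate:.3f}")
print(f"optional stopping:  {flexible_rate:.3f}")
```

With a single data-dependent choice the false-positive rate already exceeds the nominal 5%; combining several such choices (exclusions, conditions, dependent variables) compounds the inflation.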
http://compare-trials.org
Button et al., 2013, Nature Reviews Neuroscience Figure 3 | Median power of studies included in neuroscience meta-analyses. The figure shows a histogram of median study power calculated for each of the n = 49 meta-analyses included in our analysis, with the number of meta-analyses (N) on the left axis and percent of meta-analyses (%) on the right axis. There is a clear bimodal distribution; n = 15 (31%) of the meta-analyses comprised studies with median power of less than 11%, whereas n = 7 (14%) comprised studies with high average power in excess of 90%. Despite this bimodality, most meta-analyses comprised studies with low statistical power: n = 28 (57%) had median study power of less than 31%. The meta-analyses (n = 7) that comprised studies with high average power in excess of 90% had their broadly neurological subject matter in common. Simultaneously, across disciplines, the average power of studies to detect positive results is quite low (Button et al., 2013; Cohen, 1962; Ioannidis, 2005). In neuroscience, for example, Button et al. observed the median power of studies to be 21% (Button et al., 2013), which means that assuming the finding being investigated is true and accurately estimated, then only 21 of every 100 studies investigating that effect would detect statistically significant evidence for the effect. Most studies would miss detecting the true effect. The implication of very low power is that the research literature would be filled with lots of negative results, regardless of whether the effects actually exist or not. In the case of neuroscience, assuming all investigated effects in the published literature are true, only 21% of the studies should have obtained a significant, positive result detecting that effect. However, Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). This discrepancy between observed power and observed positive results is not statistically possible. 
Instead, it suggests systematic exclusion of negative results (Greenwald, 1975) and possibly the exaggeration of positive results by employing flexibility in analytic and reporting practices that inflate the likelihood of false positives (Simmons et al., 2011).
There is evidence that our published literature is too good to be true. Daniele Fanelli did an analysis of what gets published across scientific disciplines and found that all disciplines had positive result rates of 70% or higher. From physics through psychology, the rates were 85-92%. Consider our field’s 92% positive result rate in comparison to the average power of published studies. Estimates suggest that the average psychology study has a power of somewhere around .5 to .6 to detect its effects. So, if all published results were true, we’d expect somewhere between 50% and 60% of the critical tests to reject the null hypothesis. But we get 92%. That does not compute. Something is askew in the accumulating evidence. [It is not in my interest to write up negative results, even if they are true, because they are less likely to be published (negative): the file-drawer effect.] The accumulating evidence suggests an alarming degree of mis-estimation. Across disciplines, most published studies demonstrate positive results, that is, results that indicate an expected association between variables or a difference between experimental conditions (Fanelli, 2010, 2012; Sterling, 1959). Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). Fanelli, 2010
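The arithmetic behind "that does not compute" can be written out. The sketch below is an illustration under simplifying assumptions that are not from the talk: every study has the same power, the significance threshold is 0.05, and there is no selective reporting. Under those assumptions the expected share of significant results is a weighted average of power (for true effects) and alpha (for null effects), so it can never exceed the larger of the two. The function names are hypothetical; the power and positive-rate figures are the ones quoted in these notes.

```python
def expected_positive_rate(power, share_true, alpha=0.05):
    """Expected share of significant results with no reporting bias.

    share_true is the fraction of tested effects that really exist.
    """
    return share_true * power + (1.0 - share_true) * alpha


def ceiling(power, alpha=0.05):
    """Largest positive rate reachable for any share of true effects."""
    return max(power, alpha)


# (field, assumed power, observed positive rate) as quoted in the notes
cases = [
    ("neuroscience", 0.21, 0.85),
    ("psychology, low estimate", 0.50, 0.92),
    ("psychology, high estimate", 0.60, 0.92),
]

for field, power, observed in cases:
    best_case = expected_positive_rate(power, share_true=1.0)
    gap = observed - ceiling(power)
    print(f"{field}: best case {best_case:.2f}, "
          f"observed {observed:.2f}, unexplained gap {gap:.2f}")
```

Even in the most generous case, where every investigated effect is real, the observed rates sit well above what the stated power allows, which is why the notes point to excluded negative results and analytic flexibility.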
Problems Flexibility in analysis Selective reporting Ignoring nulls Lack of replication Disorganization Loss of Materials and Data Infrequent Sharing
Unique identification of research resources in the biomedical literature Reproducing prior results is challenging because of insufficient, incomplete, or inaccurate reporting of methodologies (Hess, 2011; Prinz et al., 2011; Steward et al., 2012; Hackam and Redelmeier, 2006, Landis et al., 2011). Further, a lack of information about research resources makes it difficult or impossible to determine what was used in a published study (Vasilevsky et al., 2013). These challenges are compounded by the lack of funding support available from agencies and foundations to support replication research. Finally, reproducing analyses with prior data is difficult because researchers are often reluctant to share data, even when required by funding bodies or scientific societies (Wicherts et al., 2006), and because data loss increases rapidly with time after publication (Vines et al., 2014). Two possibilities are that the percentage of positive results is inflated because negative results are much less likely to be published, and that we are pursuing our analysis freedoms to produce positive results that are not really there. These would lead to an inflation of false-positive results in the published literature. Manufacturing beauty: Flexibility in Analysis Selective Reporting Presenting Exploratory as Confirmatory Selective Reporting: File drawer phenomenon In a study published in August a team at Stanford traced the publication outcomes of 221 survey-based experiments funded by the NSF. Nearly 2/3 of the social science experiments that produced null results, those that did not support a hypothesis, were simply filed away. In contrast, researchers wrote up 96% of the studies with statistically strong results. Vasilevsky, 2013
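The file-drawer figures quoted above can be turned into a small worked calculation. This is a sketch under stated assumptions, not a result from the cited study: it takes the write-up rate for strong results as 0.96, approximates "nearly 2/3 filed away" as a write-up rate of 1/3 for null results, and asks what share of written-up studies would be positive for a few hypothetical shares of positive results among all studies conducted. Written up is not the same as published, so this covers only the first filter. The function name is hypothetical.

```python
def written_up_positive_share(
    true_positive_share,
    writeup_rate_positive=0.96,  # quoted: 96% of strong results written up
    writeup_rate_null=1 / 3,     # approximation of "nearly 2/3 filed away"
):
    """Share of written-up studies that report a positive result."""
    positives = true_positive_share * writeup_rate_positive
    nulls = (1.0 - true_positive_share) * writeup_rate_null
    return positives / (positives + nulls)


# Hypothetical shares of positive results among all studies conducted.
for share in (0.25, 0.50, 0.75):
    visible = written_up_positive_share(share)
    print(f"conducted: {share:.0%} positive -> written up: {visible:.0%} positive")
```

Because null results are written up far less often, the written record overstates the share of positive results at every starting point.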
Challenges Vines et al., 2014
Evidence to encourage change Incentives to embrace change Training to enact change Technology to enable change Improving scientific ecosystem
Evidence to encourage Metascience Incentives to embrace Community Training to enact Improving scientific ecosystem Technology to enable Infrastructure
The larger projects are the Reproducibility Projects, which empirically examine the rate and predictors of reproducibility in the published literature. These are our flagship projects, one investigating psychological science (‘RP:P’) and the other cancer biology (‘RP:CB’). The projects share the same overall design: take a sample of the published literature and perform direct replications, working with the original authors when possible to obtain all original materials and methods, performing high-powered replications, and evaluating the feasibility of conducting direct replications and the covariates of replication success. Even though the overall design and aims are the same, the projects have been executed differently because of differences between the disciplines. RP:P uses a large collaborative network of hundreds of independent researchers to design and conduct the replications (a true community effort), while RP:CB has volunteers draft replication protocols from the original paper but uses paid biomedical laboratories that are part of the Science Exchange network to conduct the replications. Both projects are ongoing, although at different stages.
97% / 37% (Open Science Collaboration, 2015, Science)
Open Science Collaboration, 2015, Science
https://osf.io/e81xl/
Registered Reports The team replicating a previously published study first submits a Registered Report that explains how it intends to replicate selected experiments from the original paper Each Registered Report is peer reviewed by several experts, including a biostatistician Once the Registered Report has been revised satisfactorily, it will be published
Registered Reports The replication team then starts to replicate the experiments, following the protocols detailed in the Registered Report Irrespective of the outcome, the results will be published as a Replication Study after peer review to check that the experiments were carried out in accordance with the protocols outlined in the Registered Report
https://cos.io/rpcb
Incentives to embrace change Supporting these behavioral changes requires improving the full scientific ecosystem. At a conference like IDCC, there are many people in the room contributing important parts to this ecosystem. I hope you leave this talk seeing the potential for how we might be able to work together on connecting tools to provide for better transparency and reproducibility in the workflow.
538 Journals 58 Organizations http://cos.io/top
TOP Design Principles Low barrier to entry: 3 levels (disclose, require, verify) Modular (8 standards) Agnostic to discipline
TOP Guidelines Data citation Design transparency Research materials transparency Data transparency Analytic methods (code) transparency Preregistration of studies Preregistration of analysis plans Replication
Transparency and Openness Promotion Standards Eight Standards Data Citation Design transparency Research materials transparency Data transparency Analytical methods transparency Preregistration of studies Preregistration of analysis plans Replications Three Tiers Disclose Require Verify
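The modular design described on these slides (eight standards, each adoptable at one of three tiers) maps naturally onto a small data structure. The sketch below is only an illustration: the class, the function, and the example journal policy are hypothetical, and treating an unlisted standard as "not adopted" is an assumption based on the guidelines being modular, not something stated on the slides.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The three tiers named on the slide, in increasing stringency."""
    DISCLOSE = 1
    REQUIRE = 2
    VERIFY = 3


# The eight standards as listed on the slide.
STANDARDS = (
    "Data citation",
    "Design transparency",
    "Research materials transparency",
    "Data transparency",
    "Analytic methods (code) transparency",
    "Preregistration of studies",
    "Preregistration of analysis plans",
    "Replication",
)


def describe_policy(policy):
    """Return one summary line per standard for a journal's policy.

    policy maps a standard name to a Tier; standards left out are
    reported as "not adopted" (an assumption of this sketch).
    """
    unknown = set(policy) - set(STANDARDS)
    if unknown:
        raise ValueError(f"unknown standards: {sorted(unknown)}")
    lines = []
    for standard in STANDARDS:
        tier = policy.get(standard)
        label = tier.name.lower() if tier is not None else "not adopted"
        lines.append(f"{standard}: {label}")
    return lines


# Hypothetical journal policy, for illustration only.
example_policy = {
    "Data citation": Tier.REQUIRE,
    "Data transparency": Tier.DISCLOSE,
    "Replication": Tier.VERIFY,
}

for line in describe_policy(example_policy):
    print(line)
```

Keeping each standard independent is what gives the low barrier to entry: a journal can start with one standard at the disclose tier and raise tiers later without touching the others.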
Why you might want to share Journal/Funder mandates Increase impact of work Recognition of good research practices
Incentives: Making Behaviors Visible Promotes Adoption Badges Open Data Open Materials Preregistration Psychological Science (Jan 2014)
Two Modes of Research Context of Discovery Exploration Data contingent Hypothesis generating Context of Justification Confirmation Data independent Hypothesis testing
Preregistration Purposes. Discoverability: Study exists. Interpretability: Distinguish exploratory and confirmatory approaches. Why needed? Mistaking exploratory as confirmatory increases publishability and decreases credibility of results.
Positive result rate dropped from 57% to 8% after preregistration was required.
The Pre-registration Challenge. Educational content and the OSF workflow can be seen by going to https://cos.io/preregprivate. Promotional video is here: https://youtu.be/SWkqdNppL-s One thousand scientists will win $1,000 each for publishing the results of their preregistered research. https://cos.io/prereg
Registered Reports PEER REVIEW Design Collect & Analyze Report Publish Review of intro and methods prior to data collection; published regardless of outcome
Who Publishes Registered Reports? So, who publishes these things? Here’s a partial (and growing!) list. You can view the complete list on the Registered Reports project page on the OSF. There’s even a table comparing features of RRs across journals. Review of intro and methods happens prior to data collection, and the report is published regardless of outcome. This addresses, just to name a few: beauty vs. accuracy of reporting; publishing negative results; conducting replications; peer review that focuses on quality of methods. See the full list and compare features: osf.io/8mpji
Training to enact change
Free training on how to make research more reproducible. Partner with others on training (librarians are great partners in this) to teach researchers basic data management skills and how to improve their research workflows for personal and sharing purposes. Software Carpentry and Data Carpentry are other great examples of efforts in this area, as are partnerships with those in libraries; we’ve done some work with them and are exploring ways to do more. http://cos.io/stats_consulting
Technology to enable change
More than just data access, sharing, and compliance. There’s more to it than sharing of discrete objects. Think about using this as an opportunity to increase transparency by capturing the entire workflow, and to do so while connecting the tools and services that make up the parts of the workflow, not requiring people to change all of their practices at once, and providing immediate efficiencies and value to the researcher AS they comply with requirements. Easy, right? Obviously not.
osf.io Open Source Application Suite. Built on it: journals, registries, preprint servers, grants management, peer review services, curation, annotation. OSF is actually an application framework. It is a public good and scholarly commons. It supports the interface you see if you google OSF, a blog engine, and slide sharing (osf.io/meetings). Components: Workflow integration; Authentication; File storage; File rendering; Database (meta-database); Metadata/Annotations/Commenting; External service integrations; Search; SHARE Data.
http://osf.io/ free, open source. The Open Science Framework, a hosted web application and public-good application framework, connects the infrastructure of science to enhance transparency, increase reproducibility, and accelerate innovation. The OSF supports project management and collaboration, connects services across the research lifecycle, and archives data, materials, and other research objects for private use or public sharing. Share data, share materials, show the research process: if a result is confirmatory, make it clear; if a discovery is exploratory, make it clear. Demonstrate the ingenuity, perspiration, and learning across false starts, errant procedures, and early hints. It doesn’t have to be written in painstaking detail in the final report; just make it available.
There’s more to it than sharing of discrete objects. Workflow: a level of abstraction that is shared across areas of scholarly inquiry, particularly the sciences. Functions that support that workflow.
OpenSesame
In early 2016, the Center for Open Science launched new features for the OSF to provide a more efficient user experience within the institutional environment. Through collaboration with institutions, COS is offering a branded, dedicated landing page to display all public projects affiliated with the institution. Connect, organize, and document research. The OSF integrates data repositories, study registries, computational services, and other services researchers use into a single workflow to increase accessibility and usability, and to increase efficiency and reproducibility simultaneously. It eliminates the need for time spent tracking data sources and collecting them to share or publish. Manage projects. Researchers create projects, add content, and document the activities of labs, teams, or larger consortia of researchers around the world. Built-in version control, granular privacy controls, integrations with other services, and citable persistent identifiers streamline researcher workflows and incentivize sharing. Share work. The OSF makes transparency and collaboration easy and can be tailored to individual needs. Share work publicly, only with collaborators, or provide access to select groups. Customizable permissions allow you to share what you want, when you want, with whom you want, and ensure that private content stays private. Preserve for the future. OSF provides highly secure, redundant archiving services for confident preservation of research materials and data.
Find this presentation at: https://osf.io/83tng/ Questions: tim@cos.io
Resources: Reproducible Research Practices? stats-consulting@cos.io The OSF? support@osf.io Have feedback for how we could support you more? contact@cos.io feedback@cos.io