1
Improving Rigor and Reproducibility of Scientific Research
Tim Errington, Center for Open Science. We have an opportunity in front of us to make some real changes in how science is done. I'm going to talk about a slightly different approach to supporting data sharing requirements, and one which is made possible through collaborations and partnerships with others, including perhaps most prominently, those who manage and preserve scientific knowledge: publishers and librarians. COS is a non-profit technology company providing free and open services to increase inclusivity and transparency of research. COS supports shifting incentives and practices to align more closely with scientific values. The challenges I face when working to advance scientific knowledge and my career at the same time, and how my scientific practices can be adapted to meet my scientific values. Two of the cornerstones of science advancement are rigor in designing and performing scientific research and the ability to reproduce biomedical research findings. The application of rigor ensures robust and unbiased experimental design, methodology, analysis, interpretation, and reporting of results. When a result can be reproduced by multiple scientists, it validates the original results and signals readiness to progress to the next phase of research.
2
Mission: Improve openness, integrity, and reproducibility of scientific research
First, let me give you a bit of background on the Center for Open Science. COS is a non-profit technology start-up in Charlottesville, VA, with the mission of improving openness, integrity, and reproducibility of scientific research. Founded in March of 2013, we've grown to over 30 full-time employees (and 25+ interns), working in 3 main areas: Infrastructure, community, and metascience. Most of what I'm going to talk about today falls under the community umbrella: bringing together communities of researchers, funders, editors, and other stakeholders interested in improving the reproducibility of research.
3
Improving the scientific ecosystem:
Evidence to encourage: Metascience
Incentives to embrace, Training to enact: Community
Technology to enable: Infrastructure
4
Improving the alignment between scientific values and scientific practices
Supporting these behavioral changes requires improving the full scientific ecosystem. At a conference like IDCC, there are many people in the room contributing important parts to this ecosystem. I hope you leave this talk seeing the potential for how we might be able to work together on connecting tools to provide for better transparency and reproducibility in the workflow.
5
Norms vs. Counternorms
Communality (open sharing) vs. Secrecy (closed)
Universalism (evaluate research on its own merit) vs. Particularism (evaluate research by reputation)
Disinterestedness (motivated by knowledge and discovery) vs. Self-interestedness (treat science as a competition)
Organized skepticism (consider all new evidence, even against one's prior work) vs. Organized dogmatism (invest career promoting one's own theories, findings)
Quality vs. Quantity
Communality – open sharing with colleagues; Secrecy. Universalism – research evaluated only on its merit; Particularism – research evaluated by reputation/past productivity. Disinterestedness – scientists motivated by knowledge and discovery, not by personal gain; Self-interestedness – treat science as a competition with other scientists. Organized skepticism – consider all new evidence, theory, data, even if it contradicts one's prior work/point-of-view; Organized dogmatism – invest career in promoting one's own most important findings, theories, innovations. Quality – seek quality contributions; Quantity – seek high volume. Communality refers to the shared ownership of scientific methods and findings. Universalism is the principle that a scientist's findings and body of work should be judged on the basis of their merit, without reference to the scientist's personal or other irrelevant characteristics. Disinterestedness represents the understanding that scientists' work should be free of self-interested motivation and pursuit of wealth. Organized skepticism requires that scientific findings be made available for scrutiny and that such scrutiny be performed in accordance with accepted scientific standards. Merton, 1942
6
Anderson, Martinson, & DeVries, 2007
3,247 mid- and early-career scientists who had research funding from NIH. The ideal to which most scientists subscribe; scientists' perceptions of their own behavior; scientists' perceptions of their peers' behaviors. Self-regulation, substantial autonomy, the complexity of scientific projects, professional expertise, innovative work on cutting-edge problems, and a system of largely voluntary compliance with regulation and codes of ethics all point to the futility and inadvisability of direct administrative control over scientists' behavior. Anderson, Martinson, & DeVries, 2007
7
Barriers Perceived norms (Anderson, Martinson, & DeVries, 2007)
Motivated reasoning (Kunda, 1990) Minimal accountability (Lerner & Tetlock, 1999) Concrete rewards beat abstract principles (Trope & Liberman, 2010) I am busy (Me & You, 2017) We can understand the nature of the challenge with existing psychological theory. For example: 1. I have beliefs, ideologies, and achievement motivations that influence how I interpret and report my research (motivated reasoning; Kunda, 1990). And, even if I am trying to resist this motivated reasoning, I may simply be unable to detect it in myself, even when I can see those biases in others. 2. And what biases might influence me? Well, pick your favorite. My favorite in this context is the hindsight bias. 3. What's more, we face these potential biases in a context of minimal accountability. What you know of my laboratory work is only what you get in the published report. … 4. The goals and rewards of publishing are immediate and concrete; the rewards of getting it right are distal and abstract (Trope & Liberman, 2010). 5. Finally, even if I am prepared to accept that I have these biases and am motivated to address them so that I can get it right, I am busy. So are you. If I introduce a whole bunch of new things that I must now do to check and correct for my biases, I will kill my productivity and that of my collaborators. So, the incentives lead me to think that my best course of action is to just do the best I can and hope that I'm doing it okay.
8
Central Features of Science
Scientific ideals: innovative ideas, reproducible results, accumulation of knowledge. Central features of science: transparency, reproducibility. Science operates under three basic ideals. Science aims to make new discoveries. The findings of science should be reproducible: I should be able to find an effect multiple times, and other researchers should also be able to find the same effects. The findings of individual studies should be able to build off one another, with each study being a reliable piece of evidence that builds towards some broader understanding of a true phenomenon. Two central features of science are transparency and reproducibility (Bacon, 1267/1859; Jasny et al., 2011; Kuhn, 1962; Merton, 1942; Popper, 1934/1992). Transparency requires scientists to publish their methodology and data so that the merit of a claim can be assessed on the basis of the evidence rather than the reputation of those making the claim. Reproducibility can refer to both the ability of others to reproduce the findings, given the original data, and to the generation of new data that supports the same conclusions. If all published results were true and their effect sizes estimated precisely, then a singular focus on innovation over verification might be inconsequential, because the effect size would be reliable. In such a context, the most efficient means of knowledge accumulation would be to spend all resources on discovery and trust that each published result provided an accurate estimate of effects on which to build or extend. However, if not all published results are true and if effect sizes are misestimated, then an absence of replication and verification will lead to a published literature that misrepresents reality. The consequences of that scenario would depend on the magnitude of the mis-estimation.
9
What is reproducibility?
Computational Reproducibility: if we took your data and code/analysis scripts and reran them, we could reproduce the numbers/graphs in your paper.
Empirical Reproducibility: we have enough information to rerun the experiment or survey the way it was originally conducted.
Replicability: we use your exact methods and analyses, but collect new data, and we get the same results.
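To make the computational-reproducibility idea concrete, here is a minimal sketch of an analysis script written so that rerunning it regenerates every reported number. The file names, column names, and the test itself are hypothetical, not from the talk.

```python
# Minimal sketch of a computationally reproducible analysis (hypothetical study).
# Rerunning this one file regenerates every number reported in the paper.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(20170101)       # fixed seed: the bootstrap reruns bit-for-bit

data = pd.read_csv("raw_measurements.csv")  # raw data archived alongside the script

# Every processing decision lives in code, not in memory or a notebook margin.
clean = data.dropna(subset=["treatment", "outcome"])
clean = clean[clean["outcome"] > 0]          # stated exclusion rule

treated = clean.loc[clean["treatment"] == 1, "outcome"].to_numpy()
control = clean.loc[clean["treatment"] == 0, "outcome"].to_numpy()

t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# Bootstrap CI for the mean difference, reproducible because of the fixed seed.
boot = [rng.choice(treated, treated.size).mean() - rng.choice(control, control.size).mean()
        for _ in range(5000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

with open("results_table1.txt", "w") as fh:  # the exact numbers cited in the manuscript
    fh.write(f"t={t_stat:.3f}, p={p_value:.4f}, diff 95% CI=({ci_low:.2f}, {ci_high:.2f})\n")
```

Sharing the raw data file and this script together is what lets someone else reproduce the numbers and graphs without having to contact the authors.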
10
Direct Replication vs. Conceptual Replication
Direct Replication: same procedure on new samples; tests the current beliefs to produce a finding; establishes that a finding is reproducible; does not guarantee validity.
Conceptual Replication: different procedure; tests the same hypothesis; evidence to converge on an explanation for a finding; does not guarantee reproducibility.
11
Why should you care? To increase the efficiency of your own work
Hard to build off our own work, or the work of others. We may not have the knowledge we think we have. Hard to even check this if reproducibility is low.
12
Unfortunately, it has become apparent over the last few years that perhaps the answer to that question is not all that much. Now, there have been some very prominent cases in the past few years of outright fraud, where people have completely fabricated their data, but I'm not talking about those cases. What I'm talking about is the general sense that many scientific findings in a wide variety of fields don't replicate, and that the published literature has a very high rate of false-positives in it. So if a large proportion of our published results aren't replicable, and are potential false positives, are we actually accumulating knowledge about real phenomena? I would suggest that the answer to this question is 'no', or at least we haven't accumulated as much knowledge as we would like to believe. These three ideals should be exemplified within the published literature; however, there is growing evidence that much of the published literature may be less reliable than we would wish, leading to an inefficient accumulation of knowledge. This unreliability, the "reproducibility crisis" in science, is evidenced by almost daily articles and news stories describing research that has failed to replicate. One factor contributing to the low reliability of scientific findings is the lack of openness in science. Lack of transparency around the research process and the research product (data!), at best, leaves other researchers and replicators in the dark about research design and data analysis decisions. Providing access to data and materials allows others to reproduce your work and build on it. Without replication, false positives can persist in the literature - inhibiting scientific progress. The Center for Open Science is dedicated to improving the reproducibility of scientific research -- but first let's examine the challenges associated with that. The small amount of direct evidence about reproducibility converges with the conclusions of these systematic reviews. A survey of faculty and trainees at the MD Anderson Cancer Center found half of those researchers reported an inability to reproduce data on at least one occasion (Mobley et al., 2013). More dramatically, two industrial laboratories, Bayer and Amgen, reported reproducibility rates of 11% and 25% in two independent efforts to reproduce findings from dozens of groundbreaking basic science studies in oncology and related areas (Begley and Ellis, 2011; Prinz et al., 2011). The available evidence suggests that published research is less reproducible than assumed and desired, perhaps because of an inflation of false positives and a culture of incentives that values publication over accuracy (Nosek et al., 2012), but the evidence is incomplete. The Bayer and Amgen reports of failing to reproduce a high proportion of results provide the most direct evidence. However, neither report made available the effects investigated, the sampling process, the methodology, or the data that comprised the replication efforts (Nature, 2012). Some evidence from bio-medical research suggests that this is occurring.
Two different industrial laboratories attempted to replicate 40 or 50 basic science studies that showed positive evidence for markers for new cancer treatments or other issues in medicine. They did not select at random. Instead, they picked studies considered landmark findings. The success rates for replication were about 25% in one study and about 10% in the other. Further, some of the findings they could not replicate had spurred large literatures of hundreds of articles following up on the finding and its implications, but never having tested whether the evidence for the original finding was solid. This is a massive waste of resources. Across the sciences, evidence like this has spurred lots of discussion and proposed actions to improve research efficiency and avoid the massive waste of resources linked to erroneous results getting in and staying in the literature, and about the culture of scientific practices that is rewarding publishing, perhaps at the expense of knowledge building. There have been a variety of suggestions for what to do. For example, the Nature article on the right suggests that publishing standards should be increased for basic science research. [It is not in my interest to replicate – myself or others – to evaluate validity and improve precision in effect estimates (redundant). Replication is worth next to zero (Makel data on published replications; motivated to not call it replication; novelty is supreme – zero “error checking”; not in my interest to check my work, and not in your interest to check my work (let’s just each do our own thing and get rewarded for that) Irreproducible results will get in and stay in the literature (examples from bio-med). Prinz and Begley articles (make sure to summarize accurately) The Nature article by folks in bio-medicine is great. The solution they offer is a popular one in commentators from the other sciences -- raise publishing standards.
13
Academic Life Science Research Process
Over the past 30 years, life science research advances have benefited the life and health of millions of people. For example, with treatment, HIV patients in the United States and Canada can expect to live an almost normal life span of over 70 years [13]. Routinely recommended genetic testing for diseases such as cystic fibrosis (CF) allows detection and estimation of disease risk prior to pregnancy, enabling couples to make informed decisions, and some CF-affected individuals to live until adulthood [14]. These advances are the result of translating basic research findings into clinical treatments or laboratory tests. And importantly, over 50% of basic research is performed in the academic setting [15]. The current practices used to maintain research quality are deeply entwined with the fundamental tenets of academic research culture and have successfully produced most biological and treatment advances that we enjoy today. However, changes in the life science landscape, including rising complexity, competition, economic challenges, and translational focus, increase the challenges associated with maintaining life science research quality through traditional systems. The general flow of the current academic life science research process is summarized in Figure 3. Currently, the maintenance of life science research quality relies on a set of key checkpoints (Figure 4). Although the process is iterative, there are currently few standards surrounding quality management in life science research. Lack of broadly-accepted standards results in extensive variability in the ways these quality checkpoints are implemented. For example, academic research leaders and researchers report large differences between laboratories as to which steps of the research process are routinely monitored by the principal investigator (PI). Only a few steps, such as manuscript preparation, are carefully checked for quality. Peer review of manuscripts prior to publication is a particularly powerful checkpoint that many stakeholders believe fundamentally protects research integrity. However, journals exhibit significant diversity in requirements for publication as well as in the criteria that different journals and reviewers use to evaluate work [16]. Currently, few broadly implemented standards exist to help journals align around requirements for publication. Although some journals may incorporate guidelines and standards into publication decisions, many do not. Publications release and disseminate results into the broader scientific community where findings and interpretation face additional scrutiny. The feedback from the life science community suggests that researchers believe that when the journal peer review process fails and erroneous data or interpretations get published, highly important or controversial results will be replicated and either confirmed or disproved by other laboratories within a few years. If findings are reproducible, other laboratories will use them as the basis for additional research, resulting in further publications and the growth of the field. If findings are not reproducible, the situation is more variable. When the original hypothesis is of tremendous significance, the irreproducibility of results is usually published, although this process can take several years. For results of less significance, the inability to replicate findings is frequently not discovered nor reported.
In some cases, reporting guidelines for certain types of studies or laboratory procedures may be developed by a professional society, and adherence to guidelines may be required by its journal as a condition for publication. For example, Cytometry A and Cytometry B, the journals of the International Society for Advancement of Cytometry (ISAC), require adherence to the MiFlowCyt standard for minimum information required to report the experimental details of flow cytometry experiments [17]. However, this type of standard is rare in life science research. The current research landscape is shaped by multiple changes, including novel technologies, increasing specialization, and imperatives to make a faster impact on patient outcomes. These changes are exerting pressure on traditional paradigms for maintaining research quality (Figure 5). GBSI, 2013
14
The hypothetico-deductive model of the scientific method can be short-circuited by a range of questionable research practices -- shown in red. Lack of replication inhibits the elimination of false discoveries. Low statistical power increases the chances of missing true discoveries, and reduces the likelihood that obtained positive results are real. P-hacking -- exploiting degrees of freedom -- manifests in two general ways: collecting data only until analyses return statistically significant effects, and selectively reporting analyses that reveal desirable outcomes. HARKing, or "Hypothesizing After Results are Known", is generating a hypothesis from the data and then presenting it as a priori. Publication bias occurs when journals reject manuscripts on the basis that they report negative or undesirable findings. So, how can we as researchers eliminate questionable research practices and prevent publication bias? Read more: osf.io/8mpji
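As an illustration of the first p-hacking route named above (collecting data only until the test is significant), the simulation below, which is not from the talk, shows how optional stopping inflates the false-positive rate even when no effect exists. The sample sizes and peeking schedule are arbitrary assumptions.

```python
# Illustrative simulation: "collect data until p < .05" when the null is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_significant(n_start=20, n_max=100, step=10, alpha=0.05):
    """Peek after every batch of new observations; stop as soon as p < alpha."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))      # same population: any "effect" is a false positive
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True
        if len(a) >= n_max:
            return False
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

runs = 2000
false_positives = sum(optional_stopping_significant() for _ in range(runs))
print(f"False-positive rate with optional stopping: {false_positives / runs:.2%}")
# A fixed-n design would sit near 5%; repeated peeking pushes the rate well above that.
```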
15
Flexibility in analysis
Problems: flexibility in analysis, selective reporting, ignoring nulls, lack of replication. Examples from: Button et al – Neuroscience; Ioannidis – why most results are false (Medicine); GWAS – Biology. Two possibilities are that the percentage of positive results is inflated because negative results are much less likely to be published, and that we are pursuing our analysis freedoms to produce positive results that are not really there. These would lead to an inflation of false-positive results in the published literature. Sterling, 1959; Cohen, 1962; Lykken, 1968; Tukey, 1969; Greenwald, 1975; Meehl, 1978; Rosenthal, 1979
16
Researcher Degrees of Freedom
All data processing and analytical choices made after seeing and interacting with your data “Does X affect Y?” Should I collect more data? Exclude outliers? Control for expression? Median or mean? Jorge Luis Borges; Gelman and Loken
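A companion sketch (again illustrative, not the speaker's analysis): each of the choices listed above is defensible on its own, but trying several and reporting the one that "works" turns a nominal 5% test into something much looser.

```python
# Illustrative sketch: many defensible analysis choices, one reported p-value.
# Data are pure noise, yet picking the smallest p across choices "finds" effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def best_p_over_choices(n=30):
    x = rng.normal(size=n)                  # group A, no true difference
    y = rng.normal(size=n)                  # group B
    pvals = []
    for trim in (None, 2.0, 2.5):           # outlier rules: none, |z| > 2, |z| > 2.5
        xa = x if trim is None else x[np.abs((x - x.mean()) / x.std()) < trim]
        ya = y if trim is None else y[np.abs((y - y.mean()) / y.std()) < trim]
        pvals.append(stats.ttest_ind(xa, ya, equal_var=False).pvalue)        # compare means
        pvals.append(stats.mannwhitneyu(xa, ya, alternative="two-sided").pvalue)  # rank-based
    return min(pvals)                        # report whichever analysis "worked"

sims = 2000
rate = np.mean([best_p_over_choices() < 0.05 for _ in range(sims)])
print(f"Nominal 5% test, actual false-positive rate: {rate:.2%}")
```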
17
Figure created by FiveThirtyEight (538); Silberzahn et al., 2015
19
Challenges Selective Reporting: File drawer phenomenon
In a study published in August 2014, a team at Stanford traced the publication outcomes of 221 survey-based experiments funded by the NSF. Nearly two-thirds of the social science experiments that produced null results, those that did not support a hypothesis, were simply filed away. In contrast, researchers wrote up 96% of the studies with statistically strong results. Franco et al., 2014
20
Button et al., 2013, Nature Reviews Neuroscience
Figure 3 | Median power of studies included in neuroscience meta-analyses. The figure shows a histogram of median study power calculated for each of the n = 49 meta-analyses included in our analysis, with the number of meta-analyses (N) on the left axis and percent of meta-analyses (%) on the right axis. There is a clear bimodal distribution; n = 15 (31%) of the meta-analyses comprised studies with median power of less than 11%, whereas n = 7 (14%) comprised studies with high average power in excess of 90%. Despite this bimodality, most meta-analyses comprised studies with low statistical power: n = 28 (57%) had median study power of less than 31%. The meta-analyses (n = 7) that comprised studies with high average power in excess of 90% had their broadly neurological subject matter in common. Simultaneously, across disciplines, the average power of studies to detect positive results is quite low (Button et al., 2013; Cohen, 1962; Ioannidis, 2005). In neuroscience, for example, Button et al. observed the median power of studies to be 21% (Button et al., 2013), which means that assuming the finding being investigated is true and accurately estimated, then only 21 of every 100 studies investigating that effect would detect statistically significant evidence for the effect. Most studies would miss detecting the true effect. The implication of very low power is that the research literature would be filled with lots of negative results, regardless of whether the effects actually exist or not. In the case of neuroscience, assuming all investigated effects in the published literature are true, only 21% of the studies should have obtained a significant, positive result detecting that effect. However, Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). This discrepancy between observed power and observed positive results is not statistically possible. Instead, it suggests systematic exclusion of negative results (Greenwald, 1975) and possibly the exaggeration of positive results by employing flexibility in analytic and reporting practices that inflate the likelihood of false positives (Simmons et al., 2011). The power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta or the probability of making a Type II error. Let's start our discussion of statistical power by recalling two definitions we learned when we first introduced to hypothesis testing: A Type I error occurs if we reject the null hypothesis H0 (in favor of the alternative hypothesis HA) when the null hypothesis H0 is true. We denote α = P(Type I Error). A Type II error occurs if we fail to reject the null hypothesis H0 when the alternative hypothesis HA is true. We denote β = P(Type II Error). In general, for every hypothesis test that we conduct, we'll want to do the following: (1) Minimize the probability of committing a Type I error. That, is minimize α = P(Type I Error). Typically, a significance level of α ≤ 0.10 is desired. (2) Maximize the power (at a value of the parameter under the alternative hypothesis that is scientifically meaningful). Typically, we desire power to be 0.80 or greater. Alternatively, we could minimize β = P(Type II Error), aiming for a type II error rate of 0.20 or less. Button et al., 2013, Nature Reviews Neuroscience
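For a concrete sense of these numbers, here is a small power calculation using statsmodels; the assumed effect size (Cohen's d = 0.5) is an illustrative choice, not a value taken from Button et al.

```python
# Sketch of a prospective power analysis for a two-group comparison.
# The assumed effect size (Cohen's d = 0.5) is hypothetical; plug in your own.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many samples per group for 80% power at alpha = 0.05?
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_per_group:.1f}")       # roughly 64 per group

# Conversely: the power achieved by a typical small study (n = 10 per group).
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=10)
print(f"Power with n = 10 per group: {achieved:.2f}")   # well under the conventional 0.80
```

Running the calculation before data collection is what keeps a study out of the low-power region the meta-analyses describe.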
21
There is evidence that our published literature is too good to be true.
Daniele Fanelli did an analysis of what gets published across scientific disciplines and found that all disciplines had positive result rates of 70% or higher. From physics through psychology, the rates were 85-92%. Consider our field's 92% positive result rate in comparison to the average power of published studies. Estimates suggest that the average psychology study has a power of somewhere around .5 to .6 to detect its effects. So, if all published results were true, we'd expect somewhere between 50-60% of the critical tests to reject the null hypothesis. But we get 92%. That does not compute. Something is askew in the accumulating evidence. [It is not in my interest to write up negative results, even if they are true, because they are less likely to be published (negative) – file-drawer effect] The accumulating evidence suggests an alarming degree of mis-estimation. Across disciplines, most published studies demonstrate positive results – results that indicate an expected association between variables or a difference between experimental conditions (Fanelli, 2010, 2012; Sterling, 1959). Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). Fanelli, 2010
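A back-of-the-envelope version of that argument, with the average power and the share of true hypotheses treated as illustrative assumptions:

```python
# Back-of-the-envelope check of the "too good to be true" argument.
# Assumed inputs (illustrative): average power, alpha, and the share of tested
# hypotheses that are actually true. The ~92% observed rate is from Fanelli (2010).
alpha = 0.05        # false-positive rate when the null is true
power = 0.55        # assumed average power of published studies (~0.5-0.6)

for share_true in (1.0, 0.75, 0.5):
    expected_positive = share_true * power + (1 - share_true) * alpha
    print(f"share of true hypotheses = {share_true:.2f} -> "
          f"expected positive-result rate = {expected_positive:.0%}")
# Even if every tested hypothesis were true, we'd expect roughly 55% positive results,
# far below the ~92% observed -- unless negative results are going unreported.
```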
22
Were experiments performed blinded?
Were basic experiments repeated? Were all the results presented? Were there positive and negative controls? Were reagents validated? Were statistical tests appropriate? The first new flag concerns the application of a multiple-hypothesis correction. The large number of statistical comparisons made in high-throughput data analyses will inflate estimates of significance by increasing the probability that an individual result at a particular significance could occur by chance. The second flag questions whether an appropriate background distribution was used. It is vital to choose a set of variables appropriate to the question that the experimental results are being tested against for significance. An inappropriate choice of background can artificially induce significance in the results or mask real results. An example would be to sample a women's basketball team to determine whether there is a significant height difference between men and women.
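To illustrate the multiple-comparison flag in code (simulated p-values, not real data; the Benjamini-Hochberg method used here is one common choice, not necessarily the right one for every study):

```python
# Sketch of the multiple-comparison flag: with thousands of tests, uncorrected
# p < 0.05 guarantees "hits" by chance alone. P-values here are simulated noise.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
pvals = rng.uniform(size=10_000)            # 10,000 tests, no real effects anywhere

raw_hits = (pvals < 0.05).sum()             # roughly 500 spurious "discoveries"
reject_bh, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(f"Uncorrected hits: {raw_hits}")
print(f"Hits after Benjamini-Hochberg FDR correction: {reject_bh.sum()}")  # near zero
```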
23
Problems
Unrecognized experimental variables
Disorganization
Poor documentation of methodology
Misinterpretation of noise as an indication of a positive finding
Loss of materials and data
Infrequent sharing
24
How quality control could save your science
Baker, 2016
25
How quality control could save your science
DISORGANIZED SAMPLE STORAGE: Clear labeling and proper organization are important for incubators and freezers. Everyone in the lab should be able to identify a sample, where it came from, who did what to it, how old it is and how it should be stored.
INADEQUATE DATA LOGGING: Data should be logged in a lab notebook, not scribbled onto memo paper or other detritus and carelessly transcribed. Notebooks should be bound or digital; loose paper can too easily be lost or removed.
VARIABLE EXPERIMENTS: Protocols should be followed to the letter or deviations documented. If reagents need to be kept on ice while in use, each lab member must comply.
UNSECURED DATA ANALYSIS: Each lab member should have their own password for accessing and working with data, to make it clear who works on what, when. Some popular spreadsheet programs can be locked down so that manipulating data, even accidentally, is difficult.
MISSED MAINTENANCE: Instruments should be calibrated and maintained according to a regular, documented schedule.
OLD AND UNDATED REAGENTS: These can affect experimental results. Scientists should specify criteria for age and storage of all important reagents.
Baker, 2016
26
Challenges in sharing
Reproducing prior results is challenging because of insufficient, incomplete, or inaccurate reporting of methodologies (Hess, 2011; Prinz et al., 2011; Steward et al., 2012; Hackam and Redelmeier, 2006; Landis et al., 2011). Further, a lack of information about research resources makes it difficult or impossible to determine what was used in a published study (Vasilevsky et al., 2013). These challenges are compounded by the lack of funding support available from agencies and foundations to support replication research. Finally, reproducing analyses with prior data is difficult because researchers are often reluctant to share data, even when required by funding bodies or scientific societies (Wicherts et al., 2006), and because data loss increases rapidly with time after publication (Vines et al., 2014). Manufacturing beauty: flexibility in analysis; selective reporting; presenting exploratory as confirmatory. Vines, 2014
27
Unique identification of research resources in the biomedical literature
Vasilevsky, 2013
28
Resource Identification Initiative
29
RRIDs
30
BioSharing.org
31
Why you might want to share
Journal/funder mandates
Increase impact of work
Others can easily replicate/understand your work
Others can reuse/build on your data/analysis/etc.
Recognition of good research practices
32
Sólymos, P. & Fehér, Z. (2008): http://biogeography.blogspot.com
Peng, R. (2011):
33
Two Modes of Research
Context of Discovery: exploration; data contingent; hypothesis generating
Context of Justification: confirmation; data independent; hypothesis testing
34
A reader quick, keen, and leery
Did wonder, ponder, and query
When results clean and tight
Fit predictions just right
If the data preceded the theory
Anonymous, quoted from Kerr (1998)
35
Preregistration
Purposes: Discoverability – the study exists. Interpretability – distinguish exploratory and confirmatory approaches.
Why needed? Mistaking exploratory for confirmatory increases publishability and decreases the credibility of results.
36
Solution: Pre-registration
Before data are collected, specify:
The what of the study/experiment: research question, population, primary outcome, general design
Pre-analysis plan: information on the exact analysis that will be conducted; sample size; data processing and cleaning procedures; exclusion criteria; statistical analyses
Registered in a read-only format and time-stamped
Decreases researcher degrees of freedom, so there is a smaller chance of obtaining a false positive through data-based decisions
Combats publication bias and selective reporting
Registration holds you accountable to yourself and to others
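As a rough sketch of what such a pre-specified plan can look like when captured in code before data collection (all field names and values below are hypothetical; the time-stamped, read-only copy would live in a registry such as an OSF registration):

```python
from scipy import stats

# Hypothetical pre-analysis plan, frozen before any data are collected.
PREREGISTRATION = {
    "research_question": "Does treatment X change colony formation in cell line Z?",
    "population": "Cell line Z, passages 5-15, independent cultures as the unit of analysis",
    "primary_outcome": "Mean colony count per plate at 72 h",
    "design": "Two-arm, randomized plate assignment, experimenter blinded to condition",
    "sample_size": {"per_arm": 12, "justification": "~80% power for d = 1.2 at alpha = 0.05"},
    "exclusion_criteria": ["contaminated plates", "assay control outside specification"],
    "data_cleaning": "Log-transform counts; no outlier removal beyond the exclusions above",
    "statistical_analysis": "Two-sided Welch t-test on log counts, alpha = 0.05",
}

def run_confirmatory_analysis(log_counts_treatment, log_counts_control):
    """Run exactly the pre-specified test -- no data-dependent switching afterwards."""
    return stats.ttest_ind(log_counts_treatment, log_counts_control, equal_var=False)
```

Anything not covered by the plan can still be reported, but as exploratory analysis rather than a confirmatory test.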
37
https://cos.io/prereg
38
Discrepancies in drug sensitivity
The authors found that the gene-expression profiles, which were obtained from microarray studies, showed quite good concordance between the two projects, whereas the pharmacological assays did not (Fig. 1). But that should come as no surprise. The pharmacological assay used by the CGP (the CellTiter 96 AQueous One Solution Cell Proliferation Assay from Promega) measures metabolic activity in terms of a reductase-enzyme product after a 72-hour incubation of cells with a drug; that used by the CCLE (the CellTiter-Glo assay from Promega) measures metabolic activity by assessing levels of the energy-transfer molecule ATP, after 72–84 hours of incubation. Both assays provide indices of the drug's activity against the cells, but they would not be expected to mirror each other across all cell and drug types, even if run in parallel (and neither may be the best indicator of cell viability). Furthermore, many variables can affect the quantitative results obtained in such assays. For example, drug sensitivities can diverge if different batches of fetal bovine serum (an ingredient of cell-culture medium that varies in its content of cytokines and other biologically active molecules) are used. The time and conditions of the cells' incubation before the drug is added, the coating on the plastic culture wells, intra-study batch or trend effects and other such arcane factors can all be influential. In this case, the intrinsic sensitivities of the two assays as analysed were also different: for 12 of the 15 drugs in question, the CCLE assay was not sensitive enough to reach its endpoint for a large fraction of the cell types; in the CGP study, a mathematical extrapolation was used to obtain quantitative results in such cases. Haibe-Kains et al. performed extensive analyses to take account of such issues, but more experimental data would be required to pin down the true reasons for the discrepancies they highlight. Overall, if there is any surprise about the discordance between the two pharmacological data sets, it is quantitative, rather than qualitative. Weinstein & Lorenzi, 2013
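The kind of concordance question at issue can be made concrete with a rank correlation between two labs' readouts for the same cell lines. The data below are simulated, and whether a given correlation counts as "discordant" is exactly the judgment the commentary argues depends on assay noise and dynamic range.

```python
# Illustrative concordance check between two labs' sensitivity readouts for the
# same cell lines (simulated data, not CGP/CCLE values).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
true_sensitivity = rng.normal(size=50)                     # 50 shared cell lines

lab_a = true_sensitivity + rng.normal(scale=0.2, size=50)  # low-noise assay
lab_b = true_sensitivity + rng.normal(scale=1.5, size=50)  # different, noisier assay

rho, p = spearmanr(lab_a, lab_b)
print(f"Spearman rho between labs: {rho:.2f} (p = {p:.3g})")
# The same underlying biology can yield a modest rho once assay noise differs,
# which is the substantive point of the Weinstein & Lorenzi commentary.
```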
39
Science is knowledge obtained by repeated experiment or observation
Fig 1 | Displaying data from replicates—what not to do. (A) Data for plate 1 only (shown in Table 1). (B) Means ± SE for replicate plates 1–3 (in Table 1), *P > (C) Means ± SE for replicate plates 1–10 (in Table 1), *P < (D) Means ± SE for HH-CSF-treated replicate plates 1–10 (in Table 1). Statistics should not be shown for replicates because they merely indicate the fidelity with which the replicates were made, and have no bearing on the hypothesis being tested. In each of these figures, n = 1 and the size of the error bars in (B), (C) and (D) reflect sampling variation of the replicates. The SDs of the replicates would be expected to be roughly the square root of the mean number of colonies. Also, axes should commence at 0, other than in exceptional circumstances, such as for log scales. SD, standard deviation; SE, standard error.
Fundamental principle 1: Science is knowledge obtained by repeated experiment or observation: if n = 1, it is not science, as it has not been shown to be reproducible. You need a random sample of independent measurements.
Fundamental principle 2: Experimental design, at its simplest, is the art of varying one factor at a time while controlling others: an observed difference between two conditions can only be attributed to Factor A if that is the only factor differing between the two conditions. We always need to consider plausible alternative interpretations of an observed result. The differences observed in Fig 1 might only reflect differences between the two suspensions, or be due to some other (of the many) differences between the two individual mice, besides the particular genotypes of interest.
Fundamental principle 3: A conclusion can only apply to the population from which you took the random sample of independent measurements: so if we have multiple measures on a single suspension from one individual mouse, we can only draw a conclusion about that particular suspension from that particular mouse. If we have multiple measures of the activity of a single vial of cytokine, then we can only generalize our conclusion to that vial.
Fundamental principle 4: Although replicates cannot support inference on the main experimental questions, they do provide important quality controls of the conduct of experiments. Values from an outlying replicate can be omitted if a convincing explanation is found, although repeating part or all of the experiment is a safer strategy. Results from an independent sample, however, can only be left out in exceptional circumstances, and only if there are especially compelling reasons to justify doing so.
Vaux, et al., 2012
40
Experimental design is the art of varying one factor at a time while controlling others
Fig 2 | Sample variation. Variation between samples can be used to make inferences about the population from which the independent samples were drawn (red arrows). For replicates, as in (A), inferences can only be made about the bone marrow suspensions from which the aliquots were taken. In (A), we might be able to infer that the plates on the left and the right contained cells from different suspensions, and possibly that the bone marrow cells came from two different mice, but we cannot make any conclusions about the effects of the different genotypes of the mice. In (B), three independent mice were chosen from each genotype, so we can make inferences about all mice of that genotype. Note that in the experiments in (B), n = 3, no matter how many replicate plates are created.
Vaux, et al., 2012
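The statistical point in these slides can be shown in a few lines: technical replicates are summarized within each independent sample, and n is the number of independent mice or cultures, not the number of plates. The counts below are invented for illustration.

```python
# Sketch of the Vaux et al. point: average technical replicates within each
# independent sample first, then test with n = number of independent samples.
# Numbers are made up; three mice per genotype, three replicate plates per mouse.
import numpy as np
from scipy import stats

colonies = {
    # genotype -> one list per independent mouse, holding that mouse's plate counts
    "wild_type": [[102, 98, 105], [88, 91, 85], [110, 107, 112]],
    "knockout":  [[71, 75, 69],   [80, 78, 83], [65, 62, 68]],
}

# One value per independent mouse (the unit of replication), not per plate.
wt = [np.mean(mouse) for mouse in colonies["wild_type"]]    # n = 3
ko = [np.mean(mouse) for mouse in colonies["knockout"]]     # n = 3

t, p = stats.ttest_ind(wt, ko, equal_var=False)
print(f"n = 3 per genotype, t = {t:.2f}, p = {p:.3f}")
# Treating all nine plates per genotype as n = 9 would understate the SE and
# overstate the evidence, because plates from the same mouse are not independent.
```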
41
A conclusion can only apply to the population from which you took the random sample of independent measurements
Fig 3 | Means of replicates compared with means of independent samples. (A) The ratios of the three-replicate Bjm PCR reactions to the three-replicate Actin PCR reactions from the six aliquots of RNA from one culture of HH-CSF-stimulated cells and one culture of unstimulated cells are shown (filled squares). The means of the ratios are shown as columns. The close correlation of the three replicate values (blue lines) indicates that the replicates were created with high fidelity and the pipetting was consistent, but is not relevant to the hypothesis being tested. It is not appropriate to show P-values here, because n = 1. (B) The ratios of the replicate PCR reactions using mRNA from the other cultures (two unstimulated, and two treated with HH-CSF) are shown as triangles and circles. Note how the correlation between the replicates (that is, the groups of three shapes) is much greater than the correlation between the mean values for the three independent untreated cultures and the three independent HH-CSF-treated cultures (green lines). Error bars indicate SE of the ratios from the three independent cultures, not the replicates for any single culture. P > SE, standard error.
Vaux, et al., 2012
42
Although they cannot support inference, replicates do provide important quality controls of the experiments
Fig 4 | Interpreting data from replicates. (A) Mean ± SE of three independent cultures each with ratios from triplicate PCR measurements. P > This experiment is much like the one in Fig 3B. However, notice in this case, for one of the sets of replicates (the circles from one of the HH-CSF-treated replicate values), there is a much greater range than for the other five sets of triplicate values. Because replicates are carefully designed to be as similar to each other as possible, finding unexpected variation should prompt an investigation into what went wrong during the conduct of the experiment. Note how in this case, an increase in variation among one set of replicates causes a decrease in the SEs for the values for the independent HH-CSF results: the SE bars for the HH-CSF condition are shorter in Fig 4A than in Fig 3B. Failure to take note of abnormal variation in replicates can lead to incorrect statistical inferences. (B) Bjm mRNA levels (relative to Actin) for three independent cultures each with ratios from triplicate PCR measurements. Means are shown by a horizontal line. The data here are the same as those for Fig 3B or Fig 4A with the aberrant value deleted. When n is as small as 3, it is better to just plot the data points, rather than showing statistics. SE, standard error.
Vaux, et al., 2012
43
Next class: training on documentation and transparency. Use the Open Science Framework. Watch the intro video. Create an account/log in with your UVA account. Come with an experiment about to start, or just started.
44
https://osf.io/ Find this presentation at Questions: tim@cos.io
45
Some additional reading: