1
Fostering openness, integrity, and reproducibility of scientific research
April, Center for Open Science
We have an opportunity in front of us to make some real changes in how science is done. I'm going to talk about a slightly different approach to supporting data-sharing requirements, one made possible through collaborations and partnerships with others, including perhaps most prominently those who manage and preserve scientific knowledge: publishers and librarians. COS is a non-profit technology company providing free and open services to increase the inclusivity and transparency of research. COS supports shifting incentives and practices to align more closely with scientific values.
2
Technology to enable change
Training to enact change
Incentives to embrace change
Supporting these behavioral changes requires improving the full scientific ecosystem.
3
Technology to enable change
Training to enact change
Incentives to embrace change
Supporting these behavioral changes requires improving the full scientific ecosystem.
4
Free training on how to make research more reproducible
We partner with others on training (librarians are great partners in this) to teach researchers basic data management skills and how to improve their research workflows, both for personal use and for sharing. Software Carpentry and Data Carpentry are other great examples of efforts in this area, and of partnerships with those in libraries; we've done some work with them and are exploring ways to do more.
5
Reproducible statistics in the health sciences
April Clyburne-Sherin, Reproducible Research Evangelist
6
Reproducible statistics in the health sciences
Today's topics:
p-values
Effect sizes and confidence intervals
Power
Reproducibility
Reporting bias
Researcher degrees of freedom
Preregistration
Open Science Framework
7
Reproducible statistics in the health sciences
Learning objectives:
A p-value is not enough to establish clinical significance
Effect sizes plus confidence intervals work better together
Low-powered studies produce inflated effect sizes
Low-powered studies produce a low chance of finding true positives
The findings of many studies cannot be reproduced
All findings should be reported
Confirmatory analyses should be distinguished from exploratory analyses
Preregistration is a simple solution for reproducible statistics
8
P-values What is a p-value?
The probability of getting your data (or data more extreme) if there is no treatment effect. An alpha level of α = 0.05 means there is a 95% probability that the researcher will correctly conclude that there is no treatment effect when there really is no treatment effect.
9
P-values
What is a p-value? Generally leads to dichotomous thinking:
Either something is significant or it is not
Influenced by the number and variability of subjects
Changes from one sample to the next
10
The dance of the p-values
From 'The Dance of the p-Values', a YouTube video by Geoff Cumming.
The p-value is less erratic when your study has more power, but it is still highly variable.
The variability of the p-value isn't conveyed in the p-value itself.
Knowing what the p-value was in one sample gives you almost no information about what it will be in another sample.
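You can see this dance for yourself with a minimal simulation sketch (not from the slides; it assumes Python with NumPy and SciPy installed). The true effect size and group size below are illustrative choices; each replication draws a fresh sample and prints its p-value.

```python
# A sketch of "the dance of the p-values": same true effect, new sample each time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, reps = 0.5, 32, 20        # illustrative effect size and group size
for i in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    result = stats.ttest_ind(treatment, control)
    print(f"replication {i + 1:2d}: p = {result.pvalue:.3f}")
```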
11
P-values A p-value is not enough to establish clinical significance
Missing clinical insight into important variables such as treatment effect size, magnitude of change, or direction of the outcome
Clinically important differences observed in studies can be statistically non-significant when the number of subjects studied is small
Even the smallest difference in measurements can be shown to be statistically significant by increasing the number of subjects in a study
Statistically significant differences can be irrelevant to patients or clinicians if the difference is not clinically important
12
P-values A p-value is not enough to establish clinical significance
P-values should be considered along with:
Effect size
Confidence intervals
Power
Study design
13
Effect Size
A measure of the magnitude of the effect of interest; tells us 'how much'
Generally leads to thinking about estimation, rather than a dichotomous decision about significance
Often combined with confidence intervals (CIs) to give us a sense of how much uncertainty there is around our estimate
14
Confidence Intervals
Provide a 'plausible' range for the effect size in the population
In 95% of the samples you draw from a population, the interval will contain the true population effect
Not the same thing as saying that 95% of sample effect sizes will fall within the interval
Can also be used for NHST: if 0 falls outside of the CI, then your test will be statistically significant
On average, about 83% of the point estimates of replications will fall within the CI from a given sample
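The coverage claim above can be checked by simulation. Here is a minimal sketch, assuming Python with NumPy and SciPy; the population mean, standard deviation, and sample size are arbitrary illustrative values.

```python
# Checking 95% CI coverage: how often does the interval contain the true mean?
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, sd, n, reps = 10.0, 2.0, 30, 10_000
covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, n)
    low, high = stats.t.interval(0.95, df=n - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    covered += low <= true_mean <= high
print(f"coverage: {covered / reps:.3f}")   # should land close to 0.95
```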
15
How to calculate CIs
SPSS: limited options (e.g., CIs around a mean difference)
Excel: ESCI spreadsheets from Geoff Cumming
R: the MBESS package; MBESS will also let you perform a precision analysis, i.e., work out the number of people you would need to get CIs of a particular width
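For readers working in Python rather than SPSS, Excel, or R, here is a rough sketch of an effect size (Cohen's d) with an approximate confidence interval. MBESS uses the noncentral t distribution for an exact interval; the normal-approximation formula below is only illustrative, and the data are simulated.

```python
# Cohen's d for two independent groups with an approximate 95% CI.
import numpy as np
from scipy import stats

def cohens_d_with_ci(x, y, conf=0.95):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    d = (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
    # Normal-approximation standard error of d (illustrative, not exact)
    se = np.sqrt((nx + ny) / (nx * ny) + d ** 2 / (2 * (nx + ny)))
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return d, d - z * se, d + z * se

rng = np.random.default_rng(3)
treatment = rng.normal(0.5, 1.0, 40)     # simulated data, true d = 0.5
control = rng.normal(0.0, 1.0, 40)
d, low, high = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```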
16
Better together
Why should you always report both effect sizes and CIs?
Effect sizes, like p-values, are bouncy
A point estimate can convey an unwarranted sense of certainty about your ES
CIs give you additional information about the plausible upper and lower bounds of bouncing ESs
17
Better together
18
So why use the ‘new statistics’?
Give you more fine-grained information about your data: point estimates, plausible values, and uncertainty
Give more information for replication attempts
Are used for meta-analytic calculations, so are more helpful for accumulating knowledge across studies
Even if you are going to use p-values, knowing the plausible values for the ES helps you with power calculations
19
Low powered studies still produce inflated effect sizes
If I use ES and CIs rather than p-values, do I still have to worry about sample size?
Underpowered studies tend to over-estimate the ES
Larger samples will lead to better estimation of the ES and smaller CIs
They will have higher levels of precision
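A minimal simulation sketch of this "winner's curse" (not from the slides; it assumes NumPy and SciPy, with an illustrative true effect of d = 0.3 and n = 20 per group): among the studies that happen to reach p < .05, the average estimated effect is noticeably larger than the true effect.

```python
# Winner's curse: significant results from low-powered studies overstate the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_d, n, reps = 0.3, 20, 5_000
significant_ds = []
for _ in range(reps):
    a = rng.normal(true_d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
        significant_ds.append((a.mean() - b.mean()) / pooled_sd)
print(f"power ~ {len(significant_ds) / reps:.2f}")
print(f"mean d among significant studies ~ {np.mean(significant_ds):.2f} "
      f"(true d = {true_d})")
```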
20
Precision isn't cheap
To get high precision (narrow CIs) in any one study, you need large samples
Example: you need about 250 people to get an accurate, stable estimate of a correlation of typical size in psychology
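One way to see why precision is expensive: compute the width of a 95% CI around a correlation at different sample sizes using the Fisher z transformation. This is a sketch under the usual normal-approximation assumptions, with r = .20 used as a typical value in psychology.

```python
# Width of a 95% CI around r = .20 at different sample sizes (Fisher z method).
import numpy as np
from scipy import stats

def r_ci_width(r, n, conf=0.95):
    z = np.arctanh(r)                      # Fisher z transform of r
    se = 1.0 / np.sqrt(n - 3)
    crit = stats.norm.ppf(1 - (1 - conf) / 2)
    return np.tanh(z + crit * se) - np.tanh(z - crit * se)

for n in (30, 100, 250, 1000):
    print(f"n = {n:4d}: 95% CI width ~ {r_ci_width(0.20, n):.2f}")
```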
21
Precision isn’t cheap
22
Alternative Approach?
23
Power in Neuroscience Button et al. (2013)
Median power is about 30%; the typical social psychology experiment has about 40% power.
Figure 3: Median power of studies included in neuroscience meta-analyses. The figure shows a histogram of median study power calculated for each of the n = 49 meta-analyses included in the analysis, with the number of meta-analyses (N) on the left axis and the percent of meta-analyses (%) on the right axis. There is a clear bimodal distribution; n = 15 (31%) of the meta-analyses comprised studies with median power of less than 11%, whereas n = 7 (14%) comprised studies with high average power in excess of 90%. Despite this bimodality, most meta-analyses comprised studies with low statistical power: n = 28 (57%) had median study power of less than 31%. The meta-analyses (n = 7) that comprised studies with high average power in excess of 90% had their broadly neurological subject matter in common.
Across disciplines, the average power of studies to detect positive results is quite low (Button et al., 2013; Cohen, 1962; Ioannidis, 2005). In neuroscience, for example, Button et al. observed the median power of studies to be 21% (Button et al., 2013), which means that, assuming the finding being investigated is true and accurately estimated, only 21 of every 100 studies investigating that effect would detect statistically significant evidence for it. Most studies would miss detecting the true effect. The implication of very low power is that the research literature should be filled with lots of negative results, regardless of whether the effects actually exist. In the case of neuroscience, assuming all investigated effects in the published literature are true, only 21% of the studies should have obtained a significant, positive result detecting that effect. However, Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010). This discrepancy between observed power and observed positive results is not statistically possible. Instead, it suggests systematic exclusion of negative results (Greenwald, 1975) and possibly the exaggeration of positive results by employing flexibility in analytic and reporting practices that inflate the likelihood of false positives (Simmons et al., 2011).
24
Low powered studies mean low chance of finding a true positive
Low replicability due to power: a 16% chance of finding the effect twice
Inflated effect size estimates
Decreased likelihood of true positives
Frame this as the researcher replicating their own work (study 2, 'replicate and extend'); this isn't just about direct replications, and even a conceptual replication will succeed only about 16% of the time. The typical study in psychology has approximately 41% power, meaning you will miss a real effect 59 times out of 100. That is bad enough for an individual study, but if you run study 1, find a significant result, and then decide to try to replicate the finding, your chance of getting a significant result twice is 0.41 × 0.41, about 16%. Even if you are not doing an 'official' power analysis (just an informal one in your head: 'I got it with 20, so I should only need 20'), you will still be underestimating the needed sample size.
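A quick way to check these numbers is sketched below. It assumes the statsmodels package is available; the effect size d = 0.4 and n = 40 per group are illustrative values chosen to give roughly 40% power. Squaring the single-study power gives the chance of two significant results in a row, which lands near the 16% quoted above.

```python
# Power of one study, the chance of two hits in a row, and n needed for 80% power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
single = analysis.power(effect_size=0.4, nobs1=40, alpha=0.05)   # ~0.4 per study
print(f"power of a single study: {single:.2f}")
print(f"chance of two significant results in a row: {single ** 2:.2f}")

n_per_group = analysis.solve_power(effect_size=0.4, power=0.80, alpha=0.05)
print(f"n per group for 80% power at d = 0.4: {n_per_group:.0f}")
```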
25
Figure 1. Positive Results by Discipline.
There is evidence that our published literature is too good to be true. Daniele Fanelli analysed what gets published across scientific disciplines and found that all disciplines had positive result rates of 70% or higher; from physics through psychology, the rates were 85-92%. Consider our field's 92% positive result rate in comparison to the average power of published studies. Estimates suggest that the average psychology study has power of somewhere around .5 to .6 to detect its effects. So, if all published results were true, we would expect somewhere between 50-60% of the critical tests to reject the null hypothesis. But we get 92%. That does not compute. Something is askew in the accumulating evidence. [It is not in my interest to write up negative results, even if they are true, because they are less likely to be published: the file-drawer effect.] The accumulating evidence suggests an alarming degree of mis-estimation. Across disciplines, most published studies demonstrate positive results, that is, results indicating an expected association between variables or a difference between experimental conditions (Fanelli, 2010, 2012; Sterling, 1959). Fanelli observed a positive result rate of 85% in neuroscience (Fanelli, 2010).
Fanelli D (2010) "Positive" Results Increase Down the Hierarchy of the Sciences. PLoS ONE 5(4).
26
What is reproducibility?
Scientific method: Observation, Question, Hypothesis, Prediction, Testing, Analysis
Systematic accumulation of knowledge
Reproducible findings
Accumulating knowledge: science as a whole is slowly accumulating knowledge. Individual studies are pieces of evidence that may support or undermine a given theory, and over time these individual pieces of evidence should accumulate into a greater understanding of which theories and phenomena are true: what is really going on in the world around us? We want to believe that science is accumulating knowledge about interesting, real phenomena. But is that really the case? How much can we trust the knowledge that has been accumulated based on published findings?
27
What is reproducibility?
Observation, Question, Hypothesis, Prediction, Testing, Analysis
Reproduction of the entire study or its findings
Requires transparency of methods
Focuses on the validity of the findings
A minimum standard for any scientific study
Reproducibility: how well scientific findings can be reproduced using the study materials. A study is considered reproducible when the original findings are produced again when the analyses are repeated.
28
Unfortunately, it has become apparent over the last few years that perhaps the answer to that question is "not all that much". There have been some very prominent cases in the past few years of outright fraud, where people have completely fabricated their data, but I'm not talking about those cases. What I'm talking about is the general sense that many scientific findings in a wide variety of fields don't replicate, and that the published literature has a very high rate of false positives in it. So if a large proportion of our published results aren't replicable, and are potential false positives, are we actually accumulating knowledge about real phenomena? I would suggest that the answer to this question is "no", or at least that we haven't accumulated as much knowledge as we would like to believe.
29
Open Science Collaboration, 2015, Science
30
The findings of many studies cannot be reproduced
Why should you care?
To increase the efficiency of your own work
It is hard to build on our own work, or the work of others in our lab
We may not have the knowledge we think we have
It is hard to even check this if reproducibility is low
31
Current barriers to reproducibility
Statistical: low power, researcher degrees of freedom, ignoring null results
Transparency: poor documentation, loss of materials and data, infrequent sharing
32
Researcher Degrees of Freedom
All data processing and analytical choices made after seeing and interacting with your data:
Should I collect more data?
Which observations should I exclude?
Which conditions should I compare?
What should be my primary outcome?
Should I look for an interaction effect?
33
False positive inflation
This is also assuming your initial false positive rate was .05, which may not be true, given that we often work with somewhat unreliable measures and low-powered studies, and don't often treat stimuli as random factors, all of which can also increase false positive rates. Now you may be saying to yourself: well, people don't really do this, do they? Simmons, Nelson, & Simonsohn (2012)
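Here is a minimal sketch of how this inflation arises (this is not the Simmons et al. code; it assumes NumPy and SciPy, and the two specific degrees of freedom below are illustrative). With no true effect at all, allowing yourself to test two outcome measures and to add more subjects whenever the first look is not significant pushes the false positive rate well above the nominal 5%.

```python
# False positive inflation from two researcher degrees of freedom, under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)
reps, n_first, n_extra = 10_000, 20, 10
hits = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, (n_first, 2))   # two outcome measures, no true effect
    b = rng.normal(0.0, 1.0, (n_first, 2))
    pvals = [stats.ttest_ind(a[:, k], b[:, k]).pvalue for k in (0, 1)]
    if min(pvals) >= 0.05:                    # not significant yet? collect more data
        a = np.vstack([a, rng.normal(0.0, 1.0, (n_extra, 2))])
        b = np.vstack([b, rng.normal(0.0, 1.0, (n_extra, 2))])
        pvals = [stats.ttest_ind(a[:, k], b[:, k]).pvalue for k in (0, 1)]
    hits += min(pvals) < 0.05
print(f"false positive rate: {hits / reps:.3f}  (nominal rate is 0.05)")
```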
34
Why does reporting matter?
35
Why does reporting matter?
“High consumption of green jelly beans was significantly associated with Acne vulgaris diagnoses (p < 0.05).”
37
Used by: clinicians, researchers, systematic reviewers, policy makers, ethics boards
"High consumption of green jelly beans was significantly associated with Acne vulgaris diagnoses (p < 0.05)."
38
"Good reports are all alike; every poor report is poor in its own way."
Non-reporting
Selective reporting
Incomplete reporting
Misleading reporting
Ambiguous reporting
If stakeholders are unable to understand exactly what was done in the study, or what was found by the study, or to utilize the results of the study, because of how it was reported, then the study is poorly reported. Poor reporting ranges from complete non-reporting to ambiguous and unclear reporting. Non-reporting is when the study is not published or reported in any way, making the study completely inaccessible and unusable.
39
Why does reporting matter?
Non-reporting: publication bias
Selective reporting: outcome reporting bias
Incomplete reporting: research waste
Misleading reporting: spin
Ambiguous reporting: ambiguity
They all have important implications.
40
Why does reporting matter?
Non-reporting: publication bias
Selective reporting: outcome reporting bias
Incomplete reporting: research waste
Misleading reporting: spin
Ambiguous reporting: ambiguity
But we are only going to discuss non-reporting and selective reporting today, in the context of understanding the results of a report.
41
"Good reports are all alike; every poor report is poor in its own way."
Non-reporting: "What jelly bean study?"
If stakeholders are unable to understand exactly what was done in the study, or what was found by the study, or to utilize the results of the study, because of how it was reported, then the study is poorly reported. Poor reporting ranges from complete non-reporting to ambiguous and unclear reporting. Non-reporting is when the study is not published or reported in any way, making the study completely inaccessible and unusable. Selective reporting is when certain aspects of the study are intentionally not reported because of their nature. Incomplete reporting is when certain aspects of the study are not reported due to ignorance of the need to report them. Misleading reporting is when aspects of the study are reported completely but not accurately. Ambiguous reporting is when what is reported in one section or document is inconsistent with or contradicts what is reported in another section or document.
42
Why does non-reporting matter?
Publication bias
Positive trials are about 2 times more likely to be published than negative trials
1/3 to 1/2 of trials are never published
Publication bias: bias with regard to what is likely to be published among what is available to be published, e.g., bias against publication of negative or inconclusive findings. Studies with negative findings are about half as likely to be published as studies with positive findings. The most common reason is researchers declining to submit negative studies for publication. Therefore, register your trial and report your results! If it is worth doing, it is worth reporting.
Parekh, S., et al. Dissemination and publication of research findings: an updated review of related biases. Prepress Projects Limited, 2010.
Macleod, Malcolm R., et al. "Biomedical research: increasing value, reducing waste." The Lancet (2014).
43
Why does non-reporting matter?
Publication bias
Bias with regard to what is likely to be published among what is available to be published, e.g., bias against publication of negative or inconclusive findings. Studies with negative findings are about half as likely to be published as studies with positive findings. The most common reason is researchers declining to submit negative studies for publication. Therefore, register your trial and report your results! If it is worth doing, it is worth reporting.
Whittington, Craig J., et al. "Selective serotonin reuptake inhibitors in childhood depression: systematic review of published versus unpublished data." The Lancet (2004).
44
"Good reports are all alike; every poor report is poor in its own way."
Selective reporting: "We studied green jelly bean consumption and acne. Just the green ones. Really."
Selective reporting is when certain aspects of the study are intentionally not reported because of their nature.
45
Why does selective reporting matter?
Outcome reporting bias
62% of trials had at least one primary outcome changed, introduced, or omitted
50%+ of pre-specified outcomes are not reported
Outcome reporting bias: selective reporting of some outcomes but not others, depending on the nature or direction of the results. E.g., of all measured outcomes, only the significant ones are reported; harm outcomes are not reported; a secondary outcome is reported as a primary outcome. Comparisons of protocols with publications showed that 62% of studies had at least one primary outcome changed, introduced, or omitted. It is best to pre-specify all outcomes in the protocol to allow readers to confirm that all outcomes are reported.
Chan, An-Wen, et al. "Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles." JAMA (2004).
Macleod, Malcolm R., et al. "Biomedical research: increasing value, reducing waste." The Lancet (2014).
46
Why does selective reporting matter?
Outcome reporting bias
Response from a trialist who had analysed data on a prespecified outcome but not reported them: "When we looked at that data, it actually showed an increase in harm amongst those who got the active treatment, and we ditched it because we weren't expecting it and we were concerned that the presentation of these data would have an impact on people's understanding of the study findings. … The argument was, look, this intervention appears to help people, but if the paper says it may increase harm, that will, it will, be understood differently by, you know, service providers. So we buried it."
Outcome reporting bias: selective reporting of some outcomes but not others, depending on the nature or direction of the results. It is best to pre-specify all outcomes in the protocol to allow readers to confirm that all outcomes are reported.
Smyth, R. M. D., et al. "Frequency and reasons for outcome reporting bias in clinical trials: interviews with trialists." BMJ 342 (2011): c7153.
47
Reporting guidelines: SPIRIT, CONSORT, SAMPL, TIDieR
EQUATOR: Enhancing the QUAlity and Transparency Of health Research
SPIRIT: Standard Protocol Items: Recommendations for Interventional Trials
CONSORT: Consolidated Standards of Reporting Trials
SAMPL: Statistical Analyses and Methods in the Published Literature
TIDieR: Template for Intervention Description and Replication (better reporting of interventions)
The EQUATOR network acts as a database of reporting guidelines and a resource for authors looking for guidance with reporting. The SPIRIT reporting guideline is evidence-based guidance for writing your study protocol. It mirrors the CONSORT reporting guideline for study reports in many of its items, so by writing a well-reported protocol you are helping yourself write a well-reported study report when the study is complete. Two reporting guidelines that are especially important for reproducible research are SAMPL and TIDieR. SAMPL provides a checklist of items needed to fully report the statistical analyses and methods in a study report for all commonly used statistical tests. TIDieR provides a checklist of items needed to fully report an intervention to allow its replication, both from a protocol and from a trial report. Both can be used alongside other study-specific reporting guidelines to ensure that the methods are transparent enough for replication and reproduction.
48
Solution: Pre-registration
Before conducting a study, register:
The 'what' of the study: research question, population and sample size, general design
A pre-analysis plan: information on the exact analyses that will be conducted
49
The positive result rate dropped from 57% to 8% after preregistration was required.
50
Pre-registration
Study pre-registration decreases file-drawer effects
Helps with discovery of unpublished, usually null, findings
Registration of pre-analysis plans helps decrease researcher degrees of freedom (RDF)
51
Pre-registered analyses
Before data are collected, specify:
Sample size
Data processing and cleaning procedures
Exclusion criteria
Statistical analyses
Registered in a read-only format so it can't be changed
Decreases RDF, so p-values are more face valid
Registration holds you accountable to yourself and to others; similar to the model used in clinical trials
52
Pre-registration in the health sciences
53
The Pre-registration Challenge
Educational content and the OSF preregistration workflow are available on the Preregistration Challenge pages, along with a promotional video. One thousand scientists will win $1,000 each for publishing the results of their preregistered research.
54
Technology to enable change
Training to enact change
Incentives to embrace change
Supporting these behavioral changes requires improving the full scientific ecosystem.
55
http://osf.io/ (free and open source)
Share data, share materials, and show the research process. If a result is confirmatory, make that clear; if it is an exploratory discovery, make that clear. Demonstrate the ingenuity, perspiration, and learning across false starts, errant procedures, and early hints. This doesn't have to be written in painstaking detail in the final report; just make it available.
56
Put data, materials, and code on the OSF
57
Manage access and permissions
58
Automate versioning with hashes
59
Get a persistent identifier
60
Register a project
61
Share your work
62
See the impact: file downloads
63
OSF extends beyond core features by connecting to other tools and services in the workflow. This allows for more incremental change while making significant gains in automation, efficiency, and reproducibility. Immediate value from this reinforces further changes.
Now
64
OSF and OpenSesame integration: coming soon
29 grants to develop open tools and services
65
Also: OSF for Meetings and OSF for Institutions
66
Technology to enable change
Training to enact change
Incentives to embrace change
Supporting these behavioral changes requires improving the full scientific ecosystem.
67
Making open practices visible to promote adoption
Open Badges: making open practices visible to promote adoption
Badges: Open Data, Open Materials, Preregistration
Adopted by Psychological Science (Jan 2014)
68
Transparency and Openness Promotion (TOP) Guidelines
Eight standards: data citation; design transparency; research materials transparency; data transparency; analytic methods (code) transparency; preregistration of studies; preregistration of analysis plans; replication
Low barrier to entry, modular, agnostic to discipline
69
Transparency and Openness Promotion (TOP) Guidelines
535+ journals and 56+ organizations
70
Technology to enable change
Training to enact change
Incentives to embrace change
All of these efforts go much deeper and allow for development of partnerships in many more directions.
71
Questions: contact@cos.io
Find this presentation online.
72
Need further help? We help researchers implement new best practices:
Consulting
One-on-one Google Hangouts
Online workshops
Sign up for our mailing list and/or follow us on Twitter.
The long-term vision is for the community to own these things; we want the community to take ownership of them and move them forward. We instigate. In building community, we aim to unite initiatives and researchers so that there is a more engaged, helpful community and greater use of and access to open science practices. Better services and new rewards will lead to a more active scientific community.