Reproducible research
C. Tobin Magle, PhD 10:00 – 11:30 a.m. Morgan Library Computer Classroom 175 Hi, and welcome to data and donuts. I’m Tobin Magle, the Cyberinfrastructure facilitator at the Morgan Library at Colorado State University. This session will cover tools that make your research more reproducible.
Outline
What is reproducible research? Why would I do that? How do I do it? Documentation, automation, version control, sharing data and code.
Specifically, we're going to: define reproducible research, discuss its advantages, and present some tools that make reproducible research easier.
The research cycle
(Diagram: Hypothesis → Experimental design → Data → Results → Article)
Normally, we think of research as the following cycle: hypothesis generation, experimental design, data collection, data analysis, and publishing the results.
The research cycle: complications
Technological advances: huge, complex digital datasets; the ability to share more. Sharing requirements: journals; funding agencies.
However, modern research methods have introduced complications. First, new technology produces huge, complex datasets that require automation to analyze. It also gives us the ability to share more than just a manuscript in a print journal. Second, journals and funding agencies are requiring researchers to make their work more transparent by sharing data and analysis methods in addition to producing peer-reviewed manuscripts.
The research cycle, expanded
(Diagram: Hypothesis → Experimental design → Raw data → Processing/Cleaning → Tidy data → Analysis → Results → Article, with code and open data as additional shared outputs)
Because of these issues, the research cycle looks more complicated: there are multiple complex steps in data processing and analysis, and multiple types of shared research outputs, like data and code.
Reproducible research
(Diagram: the same cycle, with the cleaning and analysis steps and the shared code and open data highlighted as the reproducible-research portion)
Thus, documenting your research process and sharing code and data, in addition to the research article, are all integral parts of reproducible research.
Replication vs. reproducibility
Replication: reaching the same conclusion with a new study (the gold standard). "Again, and Again, and Again…", B.R. Jasny et al., Science. Replication isn't always feasible: too big, too costly, too time-consuming, a one-time event, rare samples.
Reproducibility: getting the same results from the same data and code (the minimum standard for validity). "Reproducible Research in Computational Science", R.D. Peng, Science.
Before we get too far into the details, I want to discuss the difference between reproducibility and replication, because these words are used differently in different fields. For our purposes today: replication is answering a research question with a new experiment and coming to the same conclusion, which is the scientific gold standard. However, replication is not always feasible, as with costly clinical trials, or with measuring real-world events in real time, like climate data. Reproducibility is the minimum standard to assure scientific validity: providing enough information to get the same results from the same data and the same source code.
Types of reproducibility
Computational reproducibility: code + data. Empirical reproducibility: methods + data. Statistical reproducibility: preregistration + statistical details (tests, model parameters, threshold values, etc.).
Computational reproducibility: detailed information is provided about code, software, hardware, and implementation details. Empirical reproducibility: detailed information is provided about non-computational empirical scientific experiments and observations; in practice, this is enabled by making data freely available, along with details of how the data were collected. Statistical reproducibility: detailed information is provided about the choice of statistical tests, model parameters, threshold values, etc. This mostly relates to preregistration of study design, to prevent p-value hacking and other manipulations.
So for our purposes, reproducible research is the practice of distributing ALL data, software source code, and tools required to reproduce the results discussed in a research publication.
Reproducibility = Data (with metadata) + Code/Software
Put simply, reproducible research = data + code.
Reproducibility = Transparency
And even more simply, reproducible research is about making the process of research more transparent and reliable.
Reproducibility spectrum
I find it useful to think of reproducibility as a spectrum, from providing only the results and a description of the methods in a publication, all the way to full replication. There's no one "right" way to do things, but it's good to strive to be as close to the full-replication end of the spectrum as you can. "Reproducible Research in Computational Science", R.D. Peng, Science.
Why do reproducible research?
Show your results are true. Let others use your methods (code) and results (data). Public good. Good for you.
Now that we know what reproducible research is, let's talk about why you might want to adopt reproducible research practices.
First, it is an excellent way to show that your research results are true. When someone has access to your data and code and can get the same results, it's hard to deny that what you found is valid.
It also accelerates science by letting other researchers use your code and data for their own research. With funding being tight, being able to reuse research products efficiently is a great way to advance science.
It's also for the public good. Taxpayers largely fund our research budgets, and being efficient and transparent about our process increases trust and decreases mistakes. For example, a research group looking at cancer drugs had a mistake in their code that reversed the outcomes of drug treatment in their model, resulting in ineffective drugs being given to patients. Another research group, trying to reproduce their studies, figured out what was going on and stopped a harmful clinical trial.
Finally, it's good for you. Research products like public datasets and software are increasingly considered in grant applications and citation counts, tripling your research output for one project. And doing things reproducibly makes them easier to repeat, which is good because…
Research is repetitive
Replication Same assay, different samples Longitudinal experiments Research is repetitive by nature. We have to run independent replicates, we do the same assays on different samples, and we run longitudinal experiments.
Doing things by hand is…
Slow. Hard to document. Hard to repeat.
When you're repeating the same thing over and over, doing things by hand is really inefficient. It's slow from the outset, and it's hard to document and repeat.
Reproducible research in R
Automation. Version control. Reproducible reports.
RStudio has tools that make it easier to make your analysis reproducible. We're going to discuss using R scripts, git, and R Markdown.
Documenting analysis: Raw data → (cleaning/processing) → Processed data → (analysis) → Results
Optimal: the instructions should be an automated script file (i.e., "code"). Minimum: written instructions that allow for the complete reproduction of your analysis.
Let's talk about documentation. The goal of documentation is to accurately report how you went from the raw data that you collected to research results in the form of visualizations and statistical tests. Optimally, these instructions should be in the form of executable code, but at a minimum, a written description of the exact steps you took to clean and analyze your data is necessary.
Exercise 1: Making graphs
Download the files. Open the Excel file. Describe how to make a bar graph in Excel. Switch instructions and make a graph.
Let's use making graphs in Excel as an example. First, download the Excel spreadsheet linked on this slide, then write a verbal description of how you would make that graph. Was describing your steps easy? Do you think someone could make the exact same graph based solely on your instructions?
Automation: fast, easy to document, easy to repeat
Write scripts or save log files. By now you've probably realized that documenting how to make an Excel graph is harder than it looks. You could avoid all that hassle by automating the process. Learning how to code takes time up front, but it makes documentation fast and easy to repeat.
Example: Making graphs
Open the .Rproj file. Open the .R script file. Highlight the text. Run the code by hitting Ctrl-Enter.
I wrote a script to make a similar graph in R. So if my collaborator wants to see how it was done, I can send her the script, and she can run it on her own in about a second. Demo 1: download and open the R script file; it opens in RStudio. Highlight all the text, run the code (Ctrl-Enter), and see the graph.
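The demo script itself isn't reproduced here, but a minimal sketch of what such a script might look like is below. The file name and column names are hypothetical, and it draws a plain base-R bar graph rather than the exact graph from the demo:

```r
# Hypothetical example: recreate an Excel-style bar graph in R.
# Assumes a CSV with columns "group" and "value" -- adjust to your data.
counts <- read.csv("my_data.csv")

# One bar per group, labeled, with a title. Every step is recorded as
# code, so a collaborator can rerun the figure exactly.
barplot(height = counts$value,
        names.arg = counts$group,
        main = "Mean value by group",
        ylab = "Value")
```

Because the whole figure is produced by a few lines of code, "how did you make this graph?" is answered by the script itself.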
Details to record for processing/analysis
What software was used? (RStudio, script.) Does it support log files/scripts? (Yes!) What version number and settings were used? (e.g., R version 3.3.2.) What else does the software need to run? Computer architecture; OS/software/tools/add-ons (libraries/packages); external databases.
Easy, right? Unfortunately, it's not that simple. The R language and its packages change over time, which can cause your code to break. To achieve reproducibility, you have to record additional information about your analysis: what software did you use? What version and settings were used?
sessionInfo(): an R function that reports the environment where code was run
Lists system settings. Lists loaded packages.
Luckily, these questions can be answered using the sessionInfo() function in R. Demo 2: go to the Console window (lower left in RStudio), run sessionInfo(), and see the output in the console. As you can see, it tells me the version of R and the OS it was running on, as well as the packages that were loaded when I ran the script.
"Intuitive" version control
Can go back if you make a mistake. Consistent naming conventions are hard. Involves sending files back and forth.
Now we're going to switch gears and talk about version control. Data analysis typically evolves over the course of a project. Usually people keep a backlog of older versions by saving multiple, similar versions of the same file. If you're really good, and really consistent, this can work fine, but we're all human, and our file names can devolve into chaos when the project gets tough.
Version control tools
Keep documents in one place. You always know what the current version is, but you can go back. Can be automatic.
This is where having an automated version control system comes in handy. In general, these systems keep all the versions of a file in one place, typically under one file name. This helps you always know what the current version is, while still allowing you to revert to an older version. Many such systems will keep track of changes for you, so you don't have to remember a naming system or what you've changed. Let's look at some examples of version control in Google Docs, the Open Science Framework, and git.
Demo: google docs Open a new google doc Type some stuff
Go to file - version history: show the current doc show the blank doc Name the versions Type more
Demo: OSF https://osf.io/2fdyr/ Go to the OSF project
look at VC-demo.txt look at the revisions Edit VC-demo.txt Look at the text file again.
Demo: git, a version control system with a remote repository
Show the git tab. Add a line to the script file. See that the script file appears in the git tab. Stage the changes by checking the box. Commit them to the local repository. Show the remote repository and look at the remote script. Push.
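Outside of RStudio's git tab, the same workflow can be sketched at the command line. This is a minimal, self-contained example run in a throwaway directory; the file name and commit messages are hypothetical, and a real project would also connect a remote with `git remote add` and `git push`:

```shell
# Work in a temporary directory so nothing real is touched.
cd "$(mktemp -d)"

# Create a local repository and identify yourself to git.
git init -q .
git config user.name  "Demo User"
git config user.email "demo@example.com"

# First version of a (hypothetical) analysis script.
echo 'barplot(c(1, 2, 3))' > analysis.R
git add analysis.R                        # stage (the checkbox in RStudio)
git commit -q -m "Add bar graph script"   # commit to the local repository

# Change the file and commit again -- no "analysis_v2_final.R" needed.
echo '# tweak axis labels' >> analysis.R
git add analysis.R
git commit -q -m "Tweak axis labels"

# Every version is recorded, with a message, under one file name.
git log --oneline
```

`git log` lists both commits, and any older version of `analysis.R` can be brought back from its commit hash.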
Exercise Do you currently do version control? If so, how?
Think about the version control tools Think about your research workflow What tool would fit into your current workflow? If none, can you think of alternatives?
Version control works best with…
"Good" formats (text-based): documents (.txt, .tex, .rtf, .Rmd); tabular data (.csv); source code (.R, .py, .c, .sh).
"Bad" formats (binary): documents (.docx, .pdf, .ppt); Excel spreadsheets (.xlsx); media (.jpg, .mp3, .mp4); databases (.mdb, .sqlite).
While you can theoretically use version control with any file type, it works best with text-based files. But most documents we use to write scientific papers are in binary formats like .docx. So how do I apply version control to manuscripts and reports?
R Markdown: weave narrative text and R code
Produces documents in many formats. Reproducible.
The answer is to learn a text-based format like R Markdown. This document type lets you write text and run code interchangeably in the same document. When you render the text into a document, it can be saved in many binary formats, like Word and PDF. And because it's text-based and version-controllable, and runs your code right in the document, it makes your work very reproducible.
Demo: R Markdown
Open the R Markdown document. Show the text and code blocks. Knit the document. Show the final HTML.
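The demo file isn't reproduced here, but a minimal R Markdown skeleton of the kind being knitted might look like the following; the title and the contents of the code chunks are hypothetical:

````markdown
---
title: "Bar graph demo"
output: html_document
---

Narrative text goes here, in plain Markdown.

```{r bar-graph}
# An R code chunk: it runs when the document is knit,
# and its plot appears in the rendered HTML.
counts <- c(a = 1, b = 2, c = 3)
barplot(counts)
```

```{r session-info}
# Record the computing environment for reproducibility.
sessionInfo()
```
````

Knitting this file runs both chunks and embeds the graph and the session information directly in the output document.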
Reproducible research checklist
· Think about the entire pipeline: are all the pieces reproducible?
· Is your cleaning/analysis process automated? (Automation guarantees reproducibility.)
· Are you doing things "by hand"? (Editing tables/figures; splitting/reformatting data.)
· Does your software support log files or scripts? If not, do you have a detailed description of your process?
· Are you using version control?
· Are you keeping track of your software? Computer architecture; OS/software/tools/add-ons (libraries/packages); external databases; version numbers for everything (when available).
· Are you saving the right files? If it's not reproducible, it's not worth saving. Save the data and the code: data + code = output.
· Are your reports human- and machine-readable?
Here's a handy checklist that you can use to determine how reproducible your research workflows currently are. Adapted from:
Exercise: Assess your research, part 1: assess yourself
Fill out the reproducible research checklist with your own work in mind.
Exercise: Assess your research, part 2: brainstorm
Explain your research (elevator speech) to your table. Explain where you're good at reproducible research, and where you're not doing as well. Brainstorm ways to fix it.
Exercise: Assess your research, part 3: share with the group
Pick a representative to give their elevator speech, explain the good, identify areas of improvement, and explain how to improve.
Need help? Email: tobin.magle@colostate.edu
Data Management Services website: OSF slides: Reproducibility guide:
Thanks for listening. If you need help, you can email me, visit our Data Management Services website, or use the online content linked on this slide to learn more. Thanks for coming. I hope this session was useful.