ReproZip: Reproducibility with Ease

Rémi Rampin1 | Vicky Steeves2 | Fernando Chirigati1 | Juliana Freire1 | Dennis Shasha3
1. NYU Tandon School of Engineering | 2. NYU Division of Libraries & Center for Data Science | 3. NYU Courant Institute
Hello everyone! Thanks to Natalie and the DASPOS team for having us! My name is Vicky Steeves, and I’m the Librarian for Research Data Management and Reproducibility, a dual appointment between the NYU Division of Libraries and the Center for Data Science. My co-presenter here is Rémi Rampin, a research engineer at the NYU Tandon School of Engineering. Fernando Chirigati, a PhD student at NYU Tandon, was supposed to present with us and is very sad to have missed this.

Why Reproducibility?
“If I have seen further, it is by standing on the shoulders of giants.” Sir Isaac Newton
To build on top of previous work – science is incremental!
To verify the correctness of results
To defeat self-deception1
To help newcomers
To increase impact, visibility2 and research quality3
1http:// 2http://infoscience.epfl.ch/record/136640/files/VandewalleKV09.pdf 3http://
So before we launch into the “what is ReproZip” part of the presentation, I thought it would be good to first say a bit about why ReproZip was made, what problems it’s trying to solve, and perhaps most importantly, why we should care. Reproducibility in research has been all over the news and media in the past year, with the most famous case being the Reproducibility Project: Psychology from the Center for Open Science. For those who haven’t heard of it: a team of psychology researchers tried to reproduce 100 psychology studies from top journals in the field, and found they could only reproduce 37 of them.
But why do we care about reproducibility? Well, for all the reasons listed on this slide and many, many more that I’m sure the other presenters on the schedule will get into. The first and most obvious reason is to verify the correctness of results: we can’t just say our hypothesis was proved correct. We need to provide the evidence, and I’ll go into what that is exactly shortly. By allowing others to reproduce our studies, we also contribute to a greater body of knowledge that feeds into this beautiful, incremental cycle; see the Isaac Newton quote at the top of the slide. Beyond this, by allowing someone to walk through our workflow, and to see and work with our data and code, we are teaching methods to the next generation. Wesley Crusher would have been nothing if he couldn’t reproduce what Data and Geordi did on the Enterprise!

The Problem: even if runnable, results may differ
The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements
June 1, 2012
We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
Reproducibility has a lot of different problems, with many, many solutions being proffered, hence this wonderful program of events. ReproZip is all about reproducibility at a computational level, because even if your code or environment is runnable, that doesn’t mean that it is necessarily reproducible. This study from PLOS One evaluated FreeSurfer, a popular software package in neuroanatomical science. The authors investigated whether data processing variables such as software version, hardware, and version of OSX affected the results of the same study. They found significant differences across these variables: even OSX 10.5 and OSX 10.6 produced different results from each other.

[Figure: REPRODUCIBILITY resting on three pillars: PAPER, CODE, DATA]
Forgive my terrible lack of graphic skills on this one. Reproducibility is often thought of as having three pillars: with the paper, the code, and the data, one could reproduce any given study.

[Figure: REPRODUCIBILITY resting on four pillars: PAPER, CODE, DATA, ENVIRONMENT]
However, we need to acknowledge the fourth pillar -- the environment -- to have actual reproducibility in science.

Environments are hard to reproduce though...
~~Dependency Hell~~
You can’t just include all the scripts and the data and expect people to be able to run it! Libraries get updated, operating systems change, and software/hardware versions & configurations can disrupt reproducibility!
As I said before, environments are hard to reproduce, and that causes significant disruptions in reproducibility! This is largely due to the phenomenon called DEPENDENCY HELL: you need to account for the version of a dependency of a dependency of a dependency, which is nearly impossible to do manually. These are all environmental factors that can cause chaos for open scientists!

ReproZip, the Reproducibility Packer!
open, unpack, and reproduce anywhere, anytime!
necessary data files, libraries, environment variables, etc. required to reproduce your data analysis
So here comes ReproZip to help! ReproZip is a tool that automatically captures the provenance of experiments and packs all the necessary files, library dependencies, and variables needed to reproduce the results. Reviewers can then unpack and run the experiments without having to install any additional software.

ReproZip: Development History
2013: Fernando Chirigati wrote the first version of ReproZip & an accompanying research paper.
June 2014: Rémi Rampin makes the first revision to ReproZip.
September 2014: Docker plugin to create containers, input/output files detection (insertion + extraction).
November 2014: Integration with VisTrails.
February 2015: Support for graphical (X11) applications.
September 2015: Version 1.0 is released.
ReproZip was originally developed in 2013 at NYU by the Visualization and Data Analytics lab at NYU Tandon School of Engineering, specifically by Rémi Rampin, Fernando Chirigati, Juliana Freire, and Dennis Shasha. I joined the team in August 2015 to begin some outreach initiatives, on and off campus, and to lead trainings and disseminate knowledge on reproducibility.
Release notes:
May 21, 2014: Vagrant plugin to create virtual machines from packed experiments.
Jun 6, 2014 (0.2): Dropped the MongoDB and SystemTap dependencies; chroot and vagrant plugins to isolate the unpacked experiment. Rémi stripped away a lot of the initial dependencies (MongoDB and SystemTap, a kernel-level tool the packer needed for capturing; almost 2GB in all), rewrote a lot of the code, and wrote more unpackers (initially there was only the directory unpacker).
Sep 15, 2014 (0.4): Drive execution from reprounzip directly (no need to use chroot, vagrant, ... directly); input/output files detection and insertion/extraction; Docker plugin to create containers, which are faster and lighter than VMs.
Nov 24, 2014 (0.5): VisTrails integration.
Feb 16, 2015 (0.6): Support for graphical (X11) applications.
Sep 30, 2015: 1.0.0.

ReproZip can pack:
Data analysis scripts / software (any language, you name it!)
Graphical tools
Interactive tools
Client-server scenarios (including databases)
MPI experiments
… and many more!

ReproZip: Packing & Unpacking in Very Few Steps!
This graphic shows the explicit steps involved in making and unpacking a reproducible package. The left side shows the first step, packing. Here the original researcher runs their experiment under reprozip trace and, at the conclusion, simply packs it into a single package. They can then send that package to colleagues or reviewers, or store it as an archival snapshot. In the unpacking step, the reviewer or collaborator loads the .rpz file, optionally uploads their own input files to the environment, and reproduces the experiment. They have the option to download the output files for further inspection, and then destroy the VM or container.
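The packing and unpacking steps described above can be sketched as two short command sequences. This is a minimal sketch rather than material from the slides: the helper names `packing_commands` and `unpacking_commands`, the experiment command, and all file names are hypothetical placeholders, and the `reprozip` / `reprounzip` subcommands are written as I understand them from the ReproZip command-line documentation. The functions only build the command lists; actually running them requires ReproZip to be installed.

```python
import shlex


def packing_commands(experiment_cmd, package="experiment.rpz"):
    """Commands the original researcher would run (on Linux).

    experiment_cmd and package are hypothetical placeholders.
    """
    return [
        # Run the experiment under the tracer so its file accesses are recorded.
        ["reprozip", "trace"] + list(experiment_cmd),
        # Bundle everything the trace found into a single .rpz file.
        ["reprozip", "pack", package],
    ]


def unpacking_commands(package, target, unpacker="docker",
                       upload=None, download=None):
    """Commands a reviewer would run with the unpacker of their choice.

    upload is an optional (local_path, input_name) pair and download an
    optional (output_name, local_path) pair; both are assumptions made
    for this illustration.
    """
    base = ["reprounzip", unpacker]
    steps = [base + ["setup", package, target]]
    if upload is not None:
        local_path, input_name = upload
        # Replace one of the packed input files with the reviewer's own file.
        steps.append(base + ["upload", target, f"{local_path}:{input_name}"])
    steps.append(base + ["run", target])
    if download is not None:
        output_name, local_path = download
        # Copy an output file out of the VM or container for inspection.
        steps.append(base + ["download", target, f"{output_name}:{local_path}"])
    # Remove the VM, container, or directory once done.
    steps.append(base + ["destroy", target])
    return steps


if __name__ == "__main__":
    for cmd in packing_commands(["./run_analysis.sh", "data.csv"]):
        print(shlex.join(cmd))
    for cmd in unpacking_commands("experiment.rpz", "unpacked"):
        print(shlex.join(cmd))
```

Keeping the unpacker as a parameter mirrors the point made in the talk: the steps stay the same whether the package ends up in a directory, a virtual machine, or a container.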

Packing Experiments
Researcher | Computational Environment E (Linux)
Data Analysis; Software; Environment
reprozip steps: Executing, Tracing (ptrace + SQLite), Creating Configuration (Configuration File), Configuring, Packing
Data Analysis Provenance:
Data: Input files, output files, parameters …
Workflow: Executable programs and steps
Environment: Environment variables, dependencies, software packages, ...
Result: Reproducible Package (.rpz file)
Emphasize software packages. All the files are packed in the same structure they are found in the original environment. The output is a single reproducible .rpz file.
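To make the "ptrace + SQLite" step more concrete, here is a toy sketch of how recorded file accesses could be stored in SQLite and then queried to decide what belongs in a package. Everything in it is an assumption for illustration: the table layout, the column names, and the sample rows are hypothetical and are not ReproZip's actual trace schema, and the ptrace-based capture itself is not shown.

```python
import sqlite3


def build_toy_trace():
    """Create an in-memory database with a hypothetical trace layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE processes (
            id INTEGER PRIMARY KEY, parent INTEGER, command TEXT);
        CREATE TABLE opened_files (
            id INTEGER PRIMARY KEY, process INTEGER, name TEXT, mode TEXT);
    """)
    conn.executemany(
        "INSERT INTO processes VALUES (?, ?, ?)",
        [(1, None, "python analysis.py"), (2, 1, "gnuplot plot.gp")],
    )
    conn.executemany(
        "INSERT INTO opened_files (process, name, mode) VALUES (?, ?, ?)",
        [
            (1, "/usr/lib/python3/json/__init__.py", "read"),
            (1, "/home/alice/data/input.csv", "read"),
            (1, "/home/alice/results/table.csv", "write"),
            (2, "/home/alice/results/table.csv", "read"),
            (2, "/home/alice/results/figure.png", "write"),
        ],
    )
    return conn


def files_only_read(conn):
    """Files that were read but never written: they must ship in the package."""
    rows = conn.execute("""
        SELECT DISTINCT name FROM opened_files
        WHERE mode = 'read'
          AND name NOT IN (SELECT name FROM opened_files WHERE mode = 'write')
        ORDER BY name
    """)
    return [row[0] for row in rows]


def files_written(conn):
    """Files the experiment produced: candidates for output files."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM opened_files "
        "WHERE mode = 'write' ORDER BY name")
    return [row[0] for row in rows]


if __name__ == "__main__":
    trace = build_toy_trace()
    print("pack:", files_only_read(trace))
    print("outputs:", files_written(trace))
```

The split between files that are only read and files that are written is one simple way to picture how input and output files might be told apart; the real tool's rules may well differ.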

Unpacking Experiments
Readers | Computational Environment E’ (Linux, Windows, OS X)
Reproducible Package (.rpz file) | Unpacking (reprounzip)
Now that we have an .rpz file, we can unpack it on any OS by running reprounzip setup and then reprounzip run with the unpacker of our choice.

Unpacking Experiments
Readers | Computational Environment E’ (Linux, Windows, OS X)
Reproducible Package (.rpz file) | Unpacking (reprounzip)
directory: unpacks and reproduces from a single directory (Linux)
chroot: unpacks in a single directory and builds a full system environment (Linux)
vagrant: unpacks in a virtual machine using Vagrant (Linux | Mac OS X | Windows)
docker: unpacks in a Docker container (Linux | Mac OS X | Windows)
Also shown: Provenance Graph, VisTrails
These are all the unpackers we currently offer. The first is the directory unpacker (reprounzip directory), which allows users to unpack the entire experiment (including library dependencies) in a single directory, and to reproduce the experiment directly from that directory. It does so by automatically altering environment variables (e.g.: PATH, HOME, and LD_LIBRARY_PATH) and the command line to point the experiment execution to the created directory, which has the same structure as in the packing environment. This is unreliable if the application cannot be trusted, since nothing stops it from reaching outside the unpacked directory; hardcoded paths, for example, will still hit outside that directory.
The next is the chroot unpacker (reprounzip chroot). As with reprounzip directory, a directory is created from the experiment package; however, a full system environment is also built, which can then be run with chroot(2), a Linux mechanism that changes the root directory / for the experiment to the experiment directory. Therefore, this unpacker addresses the limitation of the directory unpacker and does not fail in the presence of hardcoded absolute paths. Note as well that it does not interfere with the current environment, since the experiment is isolated in that single directory. Although chroot offers pretty good isolation, it is not considered completely safe: malicious experiments might still escape to the host environment.
Third, the vagrant unpacker (reprounzip vagrant) allows an experiment to be unpacked and reproduced using a virtual machine created through Vagrant. Therefore, the experiment can be reproduced in any environment supported by Vagrant, i.e., Linux, Mac OS X, and Windows.
Lastly, ReproUnzip can extract and reproduce experiments as Docker containers through the docker unpacker (reprounzip docker). Docker implements a high-level API to provide lightweight containers that run processes in isolation. A Docker container, as opposed to a traditional virtual machine, does not require or include a separate operating system. Instead, it relies on the host kernel's functionality and uses resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces to isolate the application's view of the operating system, and is thus lighter and faster.
ReproZip also allows users to generate a provenance graph of the experiment execution by reading the metadata available in the .rpz package. This graph shows the experiment runs as well as the files and other dependencies they access during execution; this is particularly useful to visualize and understand the dataflow of the experiment. There is also a VisTrails plugin that creates a VisTrails workflow, which can be used to run the experiment in VisTrails.
Ease of use: Users have control over the collected trace and can customize the reproducible package; ReproZip also provides command-line interfaces that make it easier to set up, reproduce, and modify the original experiment.
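The redirection performed by the directory unpacker, as described in the notes above, can be pictured with a small sketch. This is an illustration of the idea only, not ReproZip's implementation: the function names, the choice of which variables are treated as path lists, and the example paths are all assumptions.

```python
import posixpath


def into_root(root, path):
    """Map an absolute path from the packing machine into the unpacked directory."""
    return posixpath.join(root, path.lstrip("/"))


def redirect_environment(env, root,
                         path_lists=("PATH", "LD_LIBRARY_PATH"),
                         single_paths=("HOME",)):
    """Return a copy of env whose path-like variables point inside root."""
    new_env = dict(env)
    for name in path_lists:
        if name in env:
            # Colon-separated search paths: rewrite every entry.
            entries = [p for p in env[name].split(":") if p]
            new_env[name] = ":".join(into_root(root, p) for p in entries)
    for name in single_paths:
        if name in env:
            new_env[name] = into_root(root, env[name])
    return new_env


def redirect_argv(argv, root):
    """Rewrite command-line arguments that look like absolute paths.

    A path hardcoded inside the program never appears here, which is why
    such paths still resolve outside the unpacked directory.
    """
    return [into_root(root, a) if a.startswith("/") else a for a in argv]


if __name__ == "__main__":
    env = {"PATH": "/usr/bin:/bin", "HOME": "/home/alice", "LANG": "C"}
    print(redirect_environment(env, "/tmp/exp/root"))
    print(redirect_argv(["/usr/bin/python", "analysis.py"], "/tmp/exp/root"))
```

Because only the environment and the command line are rewritten, nothing confines the process itself, which is the gap the chroot, vagrant, and docker unpackers close to varying degrees.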

Workflow Visualization with VisTrails
VisTrails integration provides a graphical view of the workflow of an unpacked ReproZip package.
VisTrails drives the execution of this unpacked experiment or environment.
VisTrails can:
expose the input and output files
make it easier for the user to add or remove input files
The user can create any other block: this could be a script or data file they want to run with the packed experiment/environment.
VisTrails is an open-source, provenance-aware scientific workflow management system that provides support for exploratory computational tasks, such as simulations, data analysis, and visualization. VisTrails can drive the execution of the unpacked experiment, exposing both input and output files as ports. This provides an easy interface for users to view the experiment’s provenance, change input files, visualize their experiment, and integrate it with other tools and packages supported by VisTrails.

Download the .rpz file: stacked-up.rpz
Stacked Up: Do Philly Kids Have the Books They Need?
GitHub Repository: Data: Website:
Demo Time!
Stacked Up is a website created to explore the textbook inventory of Philadelphia public schools, where citizens can check the book records at neighborhood schools. This web application was written using the Django web framework, and all the data is stored in a local PostgreSQL database. It can easily be run again on different machines or in the cloud.

Current Use Cases
Academic Publications
ReproZip packages (.rpz) to be included with each publication and cited as data, no different than other datasets.
1st Case: Information Systems journal (Reproducibility Section): Authors included a DOI to their .rpz package in a shared Mendeley repository available to be cited as a dataset.
Recommended by the ACM SIGMOD Reproducibility Review
Listed on the Artifact Evaluation Process Guidelines
ReproZip-Examples: a GitHub repository where users deposit their .rpz packages with directions on how to reproduce.
The Bechdel Test Article in FiveThirtyEight - example trying to reproduce published results
Bonneau Lab (NYU): Comp. Biology using .rpz to make archival snapshots of research
There are currently a few use cases that are really exciting. With regard to the current user base, outside of computer science and data science, there are a few labs on campus which use ReproZip. The first is Rich Bonneau’s computational biology lab. His research focuses on two main categories of computational biology: learning networks from functional genomics data, and predicting and modeling protein structure. They are currently using ReproZip to make archival snapshots of their research, keeping them as backups and reproducible packages in case something is severely messed up down the line.
Additionally, we are getting some computational journalists set up to use ReproZip to create reproducible packages of their client-server environments and backend databases for the visualizations and data that help journalists seek out and create stories via a web interface.
For publications, we are hoping that ReproZip packages will be used and cited like any other dataset, but also that they will be included for reviewers to vet experiments and data before publication. We have the first case of this in the reproducibility section of the Information Systems journal.
Additionally, the ACM’s AEC (which manages ingest into the ACM digital library) included ReproZip in their guidelines on how authors should package artifacts for submission, and it is a tool recommended by the reproducibility review of the ACM’s Special Interest Group on Management of Data.

Current & Future Work
Distributed experiments (MPI): works, though setting up the experiment for reproduction is involved
Packing support for OS X
Pointing to remote files during reproduction
Graphical UI to reproduce without touching the command-line
Integration with the Jupyter Notebook
tmpnb: if we don't have an MPI experiment on reprozip-examples in time for the workshop, please mention that we will soon post detailed examples on how to pack and unpack MPI experiments with ReproZip.
"large remote files" is related to identifying and reproducing the experiment when it points to a remote file that is large. Please rephrase this as you like, but it is a good point to talk about (and it was one of the things that Tanu asked me before).
Don't forget to emphasize the usability of ReproZip, e.g.: users don't need to understand vagrant and docker to use ReproZip!

Thank You! Questions? ReproZip Info
Website: GitHub: Examples: Repro News: Contact Info Rémi Rampin: Vicky Steeves: Fernando Chirigati: Mailing list:

Limitations
WARNING
Only packs experiments in Linux distros
Only detects information about software packages in Debian and Fedora-based environments (all the required files are captured regardless of the Linux system)
Does not allow reproducibility of non-deterministic processes
Does not save state
Proprietary software… ReproZip can pack it, but share your packages at your own risk! To get around this… Use open source software!!!!!

ReproZip vs. Existing Packing Systems
Packing Systems: CDE, PTU, CARE
ReproZip adds important features and contributions:
Portability: Linux experiments can be unpacked in different OSes
Extensibility: Developers can easily implement new unpackers for other environments / systems
Reusability: ReproZip automatically identifies input files, parameters, and output files, allowing users to easily modify these for reuse purposes
Ease of use: Users have control over the collected trace and can customize the reproducible package; ReproZip also provides command-line interfaces that make it easier to set up, reproduce, and modify the original experiment

Workflow & Provenance Graphs [Stacked Up]
$ reprounzip graph --packages drop --otherfiles drop --processes thread prov.dot stacked-up.rpz
--packages drop will entirely hide the packages, removing all their files from the graph
--otherfiles drop will ignore all the files
--processes thread will show every process and thread
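The command above writes a Graphviz DOT file (prov.dot). As a rough picture of what such a provenance graph contains, the sketch below turns a list of file accesses into DOT text, with boxes for processes and ellipses for files. It is not how reprounzip graph is implemented: the function name `provenance_dot`, the record format, and the sample accesses are hypothetical.

```python
def quote(name):
    """Quote a node name for DOT, escaping embedded double quotes."""
    return '"%s"' % name.replace('"', '\\"')


def provenance_dot(accesses):
    """Build DOT text from (process, filename, mode) records.

    mode is 'read' or 'write'; reads become file -> process edges and
    writes become process -> file edges, so the arrows follow the dataflow.
    """
    accesses = list(accesses)
    lines = ["digraph provenance {"]
    for process in sorted({p for p, _, _ in accesses}):
        lines.append("    %s [shape=box];" % quote(process))
    for filename in sorted({f for _, f, _ in accesses}):
        lines.append("    %s [shape=ellipse];" % quote(filename))
    for process, filename, mode in accesses:
        if mode == "read":
            lines.append("    %s -> %s;" % (quote(filename), quote(process)))
        else:
            lines.append("    %s -> %s;" % (quote(process), quote(filename)))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("python manage.py", "settings.py", "read"),
        ("python manage.py", "report.html", "write"),
    ]
    print(provenance_dot(sample))
```

The resulting text can be rendered with Graphviz in the same way as the prov.dot file produced by the command.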

ReproZip: Workflow & Provenance Graphs [ENHANCE]
Black Boxes: Run
Dark Grey Boxes: Thread
