Using Provenance to Enable Reproducible Science Juliana Freire NYU Poly.



2 Dagstuhl 2012 Juliana Freire Science and Reproducibility
• Reproducibility is the cornerstone of science

3 Science and Reproducibility

4 Science and Reproducibility
• Reproducibility is the cornerstone of science
• Science is a self-correcting process: many hypotheses are initially wrong
• Without reproducibility, people die! (John Wilbanks, AMPS Workshop on Reproducibility 2011)

5 Computational Science and Reproducibility
• Many disciplines have been transformed by computation: data- and computing-intensive
  – Computations can (often) be reproduced!
• Publications also need to change and include a complete and trustworthy scientific record
• Authors are being encouraged (or required) to submit reproducible results
  – ACM SIGMOD, VLDB, Journal of Biostatistics, Science, IEEE Transactions on Signal Processing, Freedom of Information Act and climate science

6 Computational Science and Reproducibility
• Many disciplines have been transformed by computation: data- and computing-intensive
• Publications are changing: need to provide a complete and trustworthy scientific record
• Authors are being encouraged to submit reproducible results
  – ACM SIGMOD, VLDB, Journal of Biostatistics, Science, IEEE Transactions on Signal Processing, …

7 Computational Science and Reproducibility
• Many disciplines have been transformed by computation: data- and computing-intensive
• Publications are changing: need to provide a complete and trustworthy scientific record
  – Results must be reproducible
• Authors are being encouraged to publish reproducible results
  – ACM SIGMOD, VLDB, Journal of Biostatistics, Science, IEEE Transactions on Signal Processing, Freedom of Information Act and climate science
  – ETH’s ethics guidelines
  – NSF data policies

8 Reproducible Publications: Benefits
• Produce more re-usable knowledge, not just text
• Allow scientists to stand on the shoulders of giants, and on their own shoulders!
• Science can move faster
• Higher-quality publications
  – Authors will be more careful
  – Many eyes to check results
• Describe more of the discovery process: learn from successes and mistakes
• Expose scientific community to different techniques and tools: expedite their training, reduce time to insight
• More impact, more citations

9 A Reproducible Paper
[Figure labels: ALPS 2.0; matplotlib; Libraries; Simulation Results; Data; Provenance; Workflow. Reproducible result in paper and on the Web.] [Freedman et al., Phys. Rev. 2012]

10 A Provenance-Rich Paper
[Figure labels: ALPS 2.0; matplotlib; Libraries; Simulation Results; Data; Provenance; Workflow. Reproducible result in paper and on the Web.] [Freedman et al., Phys. Rev. 2012]

11 A Reproducible Paper: ALPS2.0 [Bauer et al., JSTAT 2011]

12 Reproducible Publications: Challenges
• It is hard for authors to package and publish experiments
• It is hard for reviewers to run and evaluate experiments
• Level of reproducibility:
  – Full reproducibility may not be possible, or needed!
  – Depth: figures → scripts → data → experiments → source code
  – Portability: original environment → … → different environments
• Requirements:
  – Represent computational experiments
  – Capture environment information (OS, library versions, …)
  – Large (or proprietary) data
  – Special hardware
  – …

13 Reproducibility Infrastructure
• Working with scientists and real requirements
  – SIGMOD authors
  – Physicists (Quantum and Astronomy)
  – Climate scientists: UVCDAT
M. Troyer, ETH; T. Maxwell, NASA

14 Reproducibility Infrastructure
• Working with scientists and real requirements
  – SIGMOD authors
  – Physicists (Quantum and Astronomy)
  – Climate scientists: UVCDAT
• Started with VisTrails
  – Data exploration and visualization tool that provides support for provenance capture and re-use
• Depth: from workflow, to libraries, to source code
• Portability: Workflow is not enough!
[Koop et al., ICCS 2011]

15 Reproducibility Infrastructure: Portability
• Need ‘more’ provenance: computational environment (OS, library versions, etc.)
  – Also use virtual machines, CDEPack
• Need better file management
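
As a rough illustration of recording the computational environment (OS, library versions) alongside a result, the following Python sketch collects that information using only the standard library. The function name `capture_environment` and the layout of the returned dictionary are assumptions made for this example; they are not part of VisTrails or CDEPack.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return a dictionary describing the OS, interpreter, and library versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Store this record next to the workflow so a later run can compare environments.
    print(json.dumps(capture_environment(["numpy", "matplotlib"]), indent=2))
```

A record like this only describes the environment; reproducing it elsewhere still requires a packaging step such as a virtual machine.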

16 Provenance Links to Data
[Diagram labels: DATA; FIGURE; SIMULATION; ANALYSIS; files input.txt and figure.png]
• Which workflow derived the file?
• Has the data changed?
• Which parameters were used?

17 Identification Methods
• Upstream signature
  – Identify workflow outputs by the computational steps and parameters that (may have) had an effect on the output
• Content hashing
  – Identify files by their content
  – Mirrored by version information from version control systems
• Universally Unique Identifiers (UUID)
  – Serve as “filenames”
  – Could use some other scheme here
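
The two file-level methods on this slide, content hashing and UUIDs, can be sketched in a few lines of Python. This is a minimal sketch: the slides do not name a hash function, so SHA-256 is an assumption, and the helper names `content_hash`, `new_file_reference`, and `demo` are hypothetical.

```python
import hashlib
import os
import tempfile
import uuid


def content_hash(path, chunk_size=65536):
    """Identify a file by the SHA-256 digest of its bytes (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def new_file_reference():
    """Create a random UUID to serve as a location-independent 'filename'."""
    return str(uuid.uuid4())


def demo():
    """Two files with different names but equal bytes get the same identity."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "input.txt")
        renamed = os.path.join(tmp, "newfilename.dat")
        for path in (first, renamed):
            with open(path, "wb") as handle:
                handle.write(b"salinity,depth\n31.2,5\n")
        return content_hash(first) == content_hash(renamed)
```

Because the hash depends only on the bytes, renaming or moving a file does not break its link to provenance, while any change to the data produces a different identifier.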

18 Strong Links: Workflows and Outputs
• Identify computational steps and parameters that had an effect on the output
• Also useful for caching
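
One way to picture an upstream signature, and why it helps with caching, is the sketch below. It assumes a workflow is an acyclic dictionary of modules, each with a name, parameters, and a list of upstream module ids; this representation and the names `upstream_signature`, `SignatureCache`, and `example_workflow` are illustrative assumptions, not the VisTrails data model.

```python
import hashlib
import json


def upstream_signature(workflow, module_id):
    """Hash a module's name, parameters, and the signatures of everything upstream."""
    module = workflow[module_id]
    payload = {
        "name": module["name"],
        "params": module["params"],
        "inputs": [upstream_signature(workflow, up) for up in module["inputs"]],
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class SignatureCache:
    """Reuse an output when the same upstream computation was already executed."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, workflow, module_id, execute):
        key = upstream_signature(workflow, module_id)
        if key not in self._results:
            self.executions += 1
            self._results[key] = execute(module_id)
        return self._results[key]


def example_workflow(bins):
    """Hypothetical two-step workflow: read a file, then plot a histogram."""
    return {
        "load": {"name": "ReadCSV", "params": {"path": "input.txt"}, "inputs": []},
        "plot": {"name": "Histogram", "params": {"bins": bins}, "inputs": ["load"]},
    }
```

Changing the `bins` parameter changes the signature of `plot` but not of `load`, so a cache keyed this way would rerun only the plotting step.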

19 Linking Data to Computations
[Diagram steps: newfilename.dat; HASH CONTENTS; QUERY FILE STORE; OBTAIN FILE REFERENCE; 12ab3-45ef2...; QUERY PROVENANCE; 0ab678cd...; OBTAIN WORKFLOWS]
• Content hashing: identify files by their content
• Universally Unique Identifiers (UUID) serve as file names

20 Linking Computations to Data
[Diagram steps: QUERY PROVENANCE; OBTAIN INPUT REFS; 12ab3-45ef2...; QUERY FILE STORE; 12ab3-45ef2...; OBTAIN INPUT FILES; input files]
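
The two lookups on slides 19 and 20 can be sketched with an in-memory store. This is a sketch under stated assumptions: the class `ProvenanceStore`, its method names, and the use of SHA-256 are all hypothetical, and a real system would use a persistent file store and provenance database.

```python
import hashlib
import uuid


class ProvenanceStore:
    """Toy file store plus provenance log, kept in memory for illustration."""

    def __init__(self):
        self.ref_by_hash = {}   # content hash -> file reference (UUID)
        self.data_by_ref = {}   # file reference -> stored bytes
        self.runs = []          # one record per workflow execution

    def store_file(self, data):
        """Store bytes once per distinct content and return their reference."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.ref_by_hash:
            ref = str(uuid.uuid4())
            self.ref_by_hash[digest] = ref
            self.data_by_ref[ref] = data
        return self.ref_by_hash[digest]

    def record_run(self, workflow, inputs, outputs):
        """Log which file references a workflow execution read and wrote."""
        self.runs.append({
            "workflow": workflow,
            "inputs": [self.store_file(d) for d in inputs],
            "outputs": [self.store_file(d) for d in outputs],
        })

    def workflows_for_file(self, data):
        """Data to computations: hash contents, obtain reference, query provenance."""
        ref = self.ref_by_hash.get(hashlib.sha256(data).hexdigest())
        if ref is None:
            return []
        return [r["workflow"] for r in self.runs if ref in r["outputs"]]

    def inputs_for_workflow(self, workflow):
        """Computations to data: obtain input references, then fetch the files."""
        refs = [ref for r in self.runs if r["workflow"] == workflow for ref in r["inputs"]]
        return [self.data_by_ref[ref] for ref in refs]
```

For example, after `store.record_run("simulation-v5", [b"input"], [b"figure"])`, calling `store.workflows_for_file(b"figure")` answers "which workflow derived the file?" regardless of what the file is currently named.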

21 Reproducibility Infrastructure: Portability
• Workflow is not enough
• Need ‘more’ provenance: computational environment (OS, library versions, etc.)
  – Also use virtual machines, CDEPack
• Need better file management
  – Designed support for strong links between data and their provenance
  – Use versioning servers (e.g., Git, SVN, Oracle DBFS)
  – Implemented in VisTrails (Persistence Package)
• Portability, maintenance and longevity
  – Workflows need to be run in different environments
  – Support for different execution models (local, remote, mixed)
  – Software evolves: need to upgrade workflows/experiments

22 Workflow Upgrades
[Figure labels: matplotlib 1.0, csv 0.1; matplotlib 1.1, csv 1.0]

23 Workflow Upgrades
• Implementation Change
• Interface Change
• Deprecation, Addition, or Replacement

Two versions of the same module:

    def compute(self):
        fig = pylab.figure()
        ...
        pylab.scatter(x_data, y_data)
        ...

    def compute(self):
        fig = pylab.figure()
        pylab.setp(fig, facecolor='w')
        ...
        pylab.scatter(x_data, y_data)
        ...

24 Our Approach
• Maintain information about library versions
• Identify all incompatible modules in the workflow
• Group all incompatible modules by package
• For each module group:
  – If developer added logic for upgrades, use those routines
  – Otherwise try to perform automatic upgrades on all modules
  – On failure, show the incompatible workflow, notify the user, and allow the user to make changes
• Learn from ‘fixes’
  – VisTrails keeps provenance of workflow evolution
  – Use provenance of manual upgrades to automatically patch other workflows
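
The upgrade procedure above can be sketched roughly as follows. The data layout (modules as dictionaries with `id`, `package`, and `version`), the function names, and the use of `ValueError` to signal a failed upgrade are assumptions for illustration only; the "learn from fixes" step is not covered here.

```python
from collections import defaultdict


def upgrade_workflow(modules, installed, developer_upgrades, automatic_upgrade):
    """Upgrade modules whose recorded package version differs from the installed one.

    modules: list of dicts with "id", "package", and "version" keys.
    installed: dict mapping package name to installed version.
    developer_upgrades: dict mapping package name to function(module, new_version).
    automatic_upgrade: fallback function(module, new_version).
    Upgrade functions raise ValueError when they cannot upgrade a module.
    Returns (upgraded_modules, failures); failures need the user's attention.
    """
    # Identify incompatible modules and group them by package.
    incompatible = defaultdict(list)
    upgraded = []
    for module in modules:
        if installed.get(module["package"]) != module["version"]:
            incompatible[module["package"]].append(module)
        else:
            upgraded.append(module)

    # Prefer developer-supplied upgrade logic, otherwise try the automatic path.
    failures = []
    for package, group in incompatible.items():
        handler = developer_upgrades.get(package, automatic_upgrade)
        for module in group:
            try:
                upgraded.append(handler(module, installed.get(package)))
            except ValueError as error:
                failures.append((module["id"], str(error)))
    return upgraded, failures


def replace_version(module, new_version):
    """Naive automatic upgrade: keep the module and only bump its version."""
    if new_version is None:
        raise ValueError("package is not installed")
    return {**module, "version": new_version}
```

The returned `failures` list corresponds to the cases where the user would be shown the incompatible workflow and asked to make changes by hand.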

25 Reproducibility Infrastructure
• Connect results to their provenance
  – Support LaTeX, Word, PowerPoint, HTML, wiki
• Support for reviewers

Local:

    \begin{figure}
    \begin{center}
    \vistrail[filename=ladder_dyl_gap_theta-2.xml,version=5,pdf, buildalways, getvtl, embedworkflow, execute]{width=8cm}
    \caption{(color online) Ground-state degeneracy splitting of the non-Hermitian doubled Yang-Lee model when perturbed by a string tension ($\theta \neq 0$).}
    \label{fig:figure}
    \end{center}
    \end{figure}

Remote:

    \vistrail[host=alps.ethz.ch,db=vistrails,vtid=10,version=169,pdf]{width=8cm}

26 Reproducibility Infrastructure
• Connect results to their provenance
  – Support LaTeX, Word, PowerPoint, HTML, wiki
• Support for reviewers
  – Explore parameters and configurations, keep provenance of this too!
  – Provenance as means for reviewers and authors to communicate
• Enable interaction with results: the VisMashup system
  – Publish using different media, not just documents

27 Workflow Mashups

28 Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011]
  – Pilot started in [...]; now a permanent feature at the conference
• New at VLDB 2013!
• IEEE VisWeek (work in progress)
• Quantum Physics community
• Astrophysics
• Climate Science

29 Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011]
  – Since 2008
  – Papers submitted for reproducibility evaluation: submissions; submissions
  – In 2011, laid out a set of guidelines to simplify and expedite the reviewing process; provided tutorials
  – Review was still challenging
    » Common problem: setup failed due to implicit dependencies
  – Reasons for not submitting:
    » Intellectual property rights on software
    » Sensitive data and specific hardware requirements
research_repeatability.shtml


31 Data Life Cycle
[Diagram: Obtain Data (user studies, sensors, Web, databases, simulations, sequencing machines, particle colliders); Analyze/Visualize (AVS, Taverna, VisTrails); Publish/Share]

32 Vision: Provenance-Rich Science
[Diagram labels: Obtain Data; Analyze/Visualize; Publish/Share; Provenance; Collaborate]

33 Vision: Provenance-Rich Science
• All tools should capture and store provenance
  – GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
  – Oracle Total Recall
  – Non-linear history for GIMP [Chen et al., Siggraph 2011]
• Simplify the addition of provenance to existing tools [Callahan et al., IPAW 2008]

34 Provenance Enabling 3rd-Party Tools
Tools shown: Autodesk Maya, ParaView, VisIt, ImageVis3d [Callahan et al., IPAW 2008]

35 Provenance Plugin for ParaView

36 Vision: Provenance-Rich Science
• All tools should capture and store provenance
  – GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
  – Oracle Total Recall
  – Non-linear history for GIMP [Chen et al., Siggraph 2011]
• Simplify the addition of provenance to existing tools [Callahan et al., IPAW 2008]
• Provenance SDK in beta
  – General provenance framework for interactive applications
• Experiments published in shared repositories create new opportunities and challenges…

37 Challenges and Opportunities
• Support for searching, comparing and analyzing experiments and results
  – Can we find existing approaches to a problem?
  – Can we discover better approaches to a given problem?
  – Are we duplicating existing work?
Example query: Find an experiment that uses MySQL as a back-end to store salinity information about the Columbia River and that outputs a volume rendering of the data.

38 Challenges and Opportunities
• Support for vague queries and approximate matching
  – E.g., repository may not contain experiments that perform volume rendering, but only simple 2D plots
• Experiments are described at different levels of granularity
  – Discover relationships between modules
  – Infer workflows from lower-level specifications

39 Challenges and Opportunities
• Different kinds of data, i.e., structured workflows, metadata, unstructured source code, raw data
  – Can we support queries that straddle the different kinds of data?
  – What is the right language?
  – Can we design an intuitive query interface, usable by non-experts in databases?
• Standing queries: get notifications about new, related work
• Track real impact: how many times has an experiment/module/dataset been used?

40 Challenges and Opportunities
• Synthesizing new results: Given several relevant entries from the repository, can they be combined into a coherent whole?

41 Challenges and Opportunities
• Synthesizing new results: Given several relevant entries from the repository, can they be combined into a coherent whole?
  – Automatically find correspondences between computational modules (e.g., module A performs a function that is similar to module B)
  – Determine compatibility between modules (e.g., can module A be connected to module B?)
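
To make the two questions on this slide concrete, here is a deliberately crude Python sketch that compares modules only by the declared types of their ports. Treating matching port types as evidence of compatibility or correspondence is an assumption of this example, not a method from the talk, and the module descriptions are hypothetical.

```python
def connectable_ports(producer, consumer):
    """List (output port, input port) pairs whose declared types match."""
    return [
        (out_name, in_name)
        for out_name, out_type in producer["outputs"].items()
        for in_name, in_type in consumer["inputs"].items()
        if out_type == in_type
    ]


def port_signature(module):
    """Summarize a module by the sorted types of its inputs and outputs."""
    return (
        tuple(sorted(module["inputs"].values())),
        tuple(sorted(module["outputs"].values())),
    )


def may_correspond(module_a, module_b):
    """Crude similarity test: the two modules consume and produce the same types."""
    return port_signature(module_a) == port_signature(module_b)


# Hypothetical module descriptions used only for illustration.
READ_CSV = {"inputs": {"path": "str"}, "outputs": {"table": "Table"}}
READ_SQL = {"inputs": {"query": "str"}, "outputs": {"rows": "Table"}}
SCATTER = {"inputs": {"data": "Table"}, "outputs": {"figure": "Image"}}
```

Here `connectable_ports(READ_CSV, SCATTER)` finds one candidate connection, and `may_correspond(READ_CSV, READ_SQL)` flags the two readers as possible substitutes; a real repository would need semantic information beyond types to confirm either.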

42 Discussion
• There is no one-size-fits-all solution: build components!
• Many remaining problems…
• Need to integrate provenance from multiple tools
• Use OS-based provenance (PASS) to help fill the gaps (?)
• Need to make it even easier for authors and reviewers, but also need discipline [SIGMOD 2012 Tutorial]
  – Researchers need to learn how to manage provenance
  – Version control, virtual machines, etc.
• Need better incentives: credit, citations
• And maybe a whip too…

43 A Little History and a Challenge
• A long time ago, when I was a PhD student, generating the reference list for papers was time consuming
  – Find proceedings on the shelf (or walk to library), obtain page numbers, type (title, authors, proceedings name, etc.)
• But today
  – Google/Bing author or part of paper title, DBLP, ACM DL, IEEE Xplore
  – Copy bib entry in one of many formats (bibtex, EndNote, plain text), paste in paper, voilà!
• Can we do the same for scientific experiments?

44 Acknowledgments
• Thanks to: Philippe Bonnet, David Koop, Philip Mates, Matthias Troyer, Emanuele Santos, Dennis Shasha, Claudio Silva, Joel Tohline, Huy T. Vo, and the VisTrails team
• This work is partially supported by the National Science Foundation, the Department of Energy, and IBM Faculty Awards.

45 Additional Information
• The VisTrails System
• An infrastructure to support the creation, review and re-use of reproducible papers
• Some videos:
  – Editing an executable paper written using LaTeX and VisTrails http://
  – Exploring a Web-hosted paper using server-based computation
  – An interactive paper on a Wiki

Danke Obrigada Merci Ευχαριστω Thank you

47 Reproducibility Tools and Repositories
• VisTrails [Koop et al., ICCS 2011]
  – Help authors to create reproducible papers: package the results and link from publication
  – Support testers in the reviewing process, to repeat and validate results
• CDE: package software dependencies [Philip Guo, Stanford]
• Madagascar: publication of reproducible multidimensional data analysis [Sergey Fomel, UT Austin]
• GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
• Some repositories: nanoHub.org, crowdLabs.org, myexperiment.org

48 Vision: Provenance-Rich Science
[Diagram labels: Obtain Data; Analyze/Visualize; Publish/Share; Collaborate; Provenance Repository]

49 Conclusions and Future Work
• Provenance is crucial for science and an enabler for executable papers
• Provenance must be at the center of the scientific process!
• Built an end-to-end solution based on VisTrails; currently working on integrating infrastructure with other systems
  – Provenance-enabling other tools
• Many challenges and several open research questions
• Great opportunity to have impact in science

50 Vision: Provenance-Rich Science
[Diagram labels: Obtain Data; Analyze/Visualize; Publish/Share; Collaborate; Provenance; Provenance Repository]

51 Challenges and Opportunities
• Experiments are described at different levels of granularity
  – Infer workflows from lower-level specifications
  – Support for vague queries
• Different kinds of data, i.e., structured workflows and metadata, unstructured source code, raw data
• Synthesizing new results
• Track impact: how many times has an experiment (or dataset) been used?
• Desiderata
  – Intuitive query interface, usable by non-experts in databases
  – Support for approximate queries

52 Current Uses and Experiences
• ALPS community: ETH group has published a number of reproducible papers!
• Simulations of computational fluid dynamics
• Database research:
  – Experiments using distributed database systems, querying Wikipedia

53 Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011 to appear]
  – Since 2008 verifies the experiments published in accepted papers
  – Papers submitted for reproducibility evaluation: submissions; submissions
  – In 2011, laid out a set of guidelines to simplify and expedite the reviewing process; provided tutorials
  – Review was still challenging
    » Common problem: setup failed due to implicit dependencies
    » Easy to solve with a virtual machine…
  – Reasons for not submitting:
    » Intellectual property rights on software
    » Sensitive data
    » Specific hardware requirements
research_repeatability.shtml


55 Going Forward
• Need more and better incentives:
  – Seal of quality, higher quality software/experiments, easier for newcomers in a project, citations, recognition
• Need a whip(?): Some disciplines require data for publications, should we require computational experiments too?
• Need better tools
  – There is no one-size-fits-all solution
  – Many groups building tools; we should join forces and build a Reproducibility Toolkit
• Need standards and guidelines for authors and tool developers
• Need provenance support in applications
  – Integrate provenance from different sources, connect the results
ETH does!

56 Provenance Everywhere
[Diagram labels: eBird; STEM; Provenance-rich presentation; Bird_table version; predictions; results]
  – Bird_table version in Oracle
  – STEM v predictions
  – R Script+results
  – BirdVis vis spec
  – matplotlib script+results
