Using Provenance to Enable Reproducible Science Juliana Freire NYU Poly
2 Dagstuhl 2012 Juliana Freire Science and Reproducibility u Reproducibility is the cornerstone of science
Science and Reproducibility
• Reproducibility is the cornerstone of science
• Science is a self-correcting process---many hypotheses are initially wrong
• Without reproducibility, people die!
John Wilbanks, AMPS Workshop on Reproducibility 2011
Computational Science and Reproducibility
• Many disciplines have been transformed by computation: data and computing-intensive
  – Computations can (often) be reproduced!
• Publications also need to change and include a complete and trustworthy scientific record
• Authors are being encouraged (or required) to submit reproducible results
  – ACM SIGMOD, VLDB, Journal of Biostatistics, Science, IEEE Transactions on Signal Processing, Freedom of Information Act and climate science
Computational Science and Reproducibility
• Many disciplines have been transformed by computation: data and computing-intensive
• Publications are changing: need to provide a complete and trustworthy scientific record
  – Results must be reproducible
• Authors are being encouraged to publish reproducible results
  – ACM SIGMOD, VLDB, Journal of Biostatistics, Science, IEEE Transactions on Signal Processing, Freedom of Information Act and climate science
  – ETH’s ethics guidelines
  – NSF data policies
Reproducible Publications: Benefits
• Produce more re-usable knowledge---not just text
• Allow scientists to stand on the shoulders of giants and on their own shoulders!
• Science can move faster
• Higher-quality publications
  – Authors will be more careful
  – Many eyes to check results
• Describe more of the discovery process: learn from successes and mistakes
• Expose scientific community to different techniques and tools: expedite their training, reduce time to insight
• More impact, more citations
A Reproducible Paper
[Figure labels: ALPS 2.0, matplotlib, Libraries, Simulation Results, Data, Provenance, Workflow. Caption: Reproducible result in paper and on the Web] [Freedman et al., Phys. Rev. 2012]
A Provenance-Rich Paper
[Figure labels: ALPS 2.0, matplotlib, Libraries, Simulation Results, Data, Provenance, Workflow. Caption: Reproducible result in paper and on the Web] [Freedman et al., Phys. Rev. 2012]
A Reproducible Paper: ALPS 2.0 [Bauer et al., JSTAT 2011]
Reproducible Publications: Challenges
• It is hard for authors to package and publish experiments
• It is hard for reviewers to run and evaluate experiments
• Level of reproducibility:
  – Full reproducibility may not be possible, or needed!
  – Depth: figures, scripts, data, experiments, source code
  – Portability: original environment … different environments
• Requirements:
  – Represent computational experiments
  – Capture environment information (OS, library versions, …)
  – Large (or proprietary) data
  – Special hardware
  – …
Reproducibility Infrastructure
• Working with scientists and real requirements
  – SIGMOD authors
  – Physicists (Quantum and Astronomy)
  – Climate scientists: UVCDAT
M. Troyer, ETH; T. Maxwell, NASA
Reproducibility Infrastructure
• Working with scientists and real requirements
  – SIGMOD authors
  – Physicists (Quantum and Astronomy)
  – Climate scientists: UVCDAT
• Started with VisTrails
  – Data exploration and visualization tool that provides support for provenance capture and re-use
• Depth: from workflow, to libraries, to source code
• Portability: Workflow is not enough! [Koop et al., ICCS 2011]
Reproducibility Infrastructure: Portability
• Need ‘more’ provenance: computational environment (OS, library versions, etc.)
  – Also use virtual machines, CDEPack
• Need better file management
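One way to record the 'more' provenance mentioned on this slide is to snapshot the computational environment at run time. The sketch below is illustrative only and is not VisTrails code: the function name `capture_environment` and the layout of the returned dictionary are assumptions, and it uses only Python's standard library (`platform`, `importlib.metadata`).

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return a dictionary describing the OS, the interpreter, and package versions.

    `packages` is a list of distribution names chosen by the caller (hypothetical input).
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Store this record next to the results so a reviewer can compare environments.
    record = capture_environment(["matplotlib", "numpy"])
    print(json.dumps(record, indent=2, sort_keys=True))
```

A record like this does not make an experiment portable by itself; it only documents what would have to be recreated, for instance inside a virtual machine.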
Provenance Links to Data
[Diagram labels: DATA, SIMULATION, ANALYSIS, FIGURE; files: input.txt, figure.png]
Which workflow derived the file?
Has the data changed?
Which parameters were used?
Identification Methods
• Upstream signature
  – Identify workflow outputs by the computational steps and parameters that (may have) had an effect on the output
• Content hashing
  – Identify files by their content
  – Mirrored by version information from version control systems
• Universally Unique Identifiers (UUID)
  – Serve as “filenames”
  – Could use some other scheme here
Strong Links: Workflows and Outputs
• Identify computational steps and parameters that had an effect on the output
• Also useful for caching
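The idea of an upstream signature can be sketched as a recursive hash over a workflow graph. This is a minimal illustration, not the VisTrails implementation: the dictionary layout for `modules` and `connections`, the function name, and the choice of SHA-1 are all assumptions, and the sketch ignores which input port each connection uses.

```python
import hashlib
import json


def upstream_signature(module_id, modules, connections, _cache=None):
    """Hash a module together with every step and parameter upstream of it.

    modules:     {module_id: {"name": str, "params": dict}}   (hypothetical layout)
    connections: {module_id: [ids of modules feeding into it]}
    The workflow is assumed to be acyclic.
    """
    if _cache is None:
        _cache = {}
    if module_id in _cache:
        return _cache[module_id]
    module = modules[module_id]
    # Signatures of all direct inputs; sorting makes the result order-independent.
    upstream = sorted(
        upstream_signature(src, modules, connections, _cache)
        for src in connections.get(module_id, [])
    )
    payload = json.dumps(
        {"name": module["name"], "params": module["params"], "upstream": upstream},
        sort_keys=True,
    )
    signature = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    _cache[module_id] = signature
    return signature


if __name__ == "__main__":
    modules = {
        "read": {"name": "ReadFile", "params": {"path": "input.txt"}},
        "plot": {"name": "Scatter", "params": {"facecolor": "w"}},
    }
    connections = {"plot": ["read"]}
    print(upstream_signature("plot", modules, connections))
```

Because the signature changes only when an upstream step or parameter changes, it can double as a cache key: an output stored under an unchanged signature does not need to be recomputed.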
Linking Data to Computations
[Diagram: newfilename.dat; HASH CONTENTS; QUERY FILE STORE; OBTAIN FILE REFERENCE 12ab3-45ef2...; QUERY PROVENANCE 0ab678cd...; OBTAIN WORKFLOWS]
• Content hashing: identify files by their content
• Universally Unique Identifiers (UUID) serve as file names
Linking Computations to Data
[Diagram: QUERY PROVENANCE; OBTAIN INPUT REFS 12ab3-45ef2...; QUERY FILE STORE 12ab3-45ef2...; OBTAIN INPUT FILES; input files]
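The two lookups on these slides (from a file to the workflows that used it, and from a workflow back to its input files) can be sketched with an in-memory store. Everything here is hypothetical: the class `ProvenanceStore`, its method names, and the use of SHA-256 are assumptions made for illustration, standing in for a real file store and provenance database.

```python
import hashlib
import uuid


class ProvenanceStore:
    """In-memory stand-in for a file store plus a provenance log."""

    def __init__(self):
        self.ref_by_hash = {}         # content hash -> file reference (UUID string)
        self.content_by_ref = {}      # file reference -> file contents
        self.workflows_by_ref = {}    # file reference -> set of workflow ids
        self.inputs_by_workflow = {}  # workflow id -> list of file references

    def add_file(self, content: bytes) -> str:
        """Store content once and return its UUID-based reference."""
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.ref_by_hash:
            ref = str(uuid.uuid4())
            self.ref_by_hash[digest] = ref
            self.content_by_ref[ref] = content
        return self.ref_by_hash[digest]

    def record_input(self, workflow_id: str, ref: str) -> None:
        """Log that a workflow read the file identified by `ref`."""
        self.workflows_by_ref.setdefault(ref, set()).add(workflow_id)
        self.inputs_by_workflow.setdefault(workflow_id, []).append(ref)

    def workflows_for_file(self, content: bytes) -> set:
        """Data to computations: hash contents, find the reference, query provenance."""
        ref = self.ref_by_hash.get(hashlib.sha256(content).hexdigest())
        if ref is None:
            return set()  # unknown or modified file
        return set(self.workflows_by_ref.get(ref, set()))

    def input_files(self, workflow_id: str) -> list:
        """Computations to data: query provenance for references, then fetch the files."""
        return [self.content_by_ref[r] for r in self.inputs_by_workflow.get(workflow_id, [])]


if __name__ == "__main__":
    store = ProvenanceStore()
    ref = store.add_file(b"salinity,depth\n30.1,5\n")
    store.record_input("workflow-1", ref)
    print(store.workflows_for_file(b"salinity,depth\n30.1,5\n"))
```

Note that the lookup is independent of the file name: a renamed copy still resolves to the same reference, while an edited copy resolves to nothing, which answers the "has the data changed?" question.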
Reproducibility Infrastructure: Portability
• Workflow is not enough
• Need ‘more’ provenance: computational environment (OS, library versions, etc.)
  – Also use virtual machines, CDEPack
• Need better file management
  – Designed support for strong links between data and their provenance
  – Use versioning servers (e.g., Git, SVN, Oracle DBFS)
  – Implemented in VisTrails (Persistence Package)
• Portability, maintenance and longevity
  – Workflows need to be run in different environments
  – Support for different execution models (local, remote, mixed)
  – Software evolves: need to upgrade workflows/experiments
Workflow Upgrades
matplotlib 1.0, csv 0.1 → matplotlib 1.1, csv 1.0
Workflow Upgrades
• Implementation Change
• Interface Change
• Deprecation, Addition, or Replacement
def compute(self):
    fig = pylab.figure()
    ...
    pylab.scatter(x_data, y_data)
    ...

def compute(self):
    fig = pylab.figure()
    pylab.setp(fig, facecolor='w')
    ...
    pylab.scatter(x_data, y_data)
    ...
Our Approach
• Maintain information about library versions
• Identify all incompatible modules in the workflow
• Group all incompatible modules by package
• For each module group:
  – If developer added logic for upgrades, use those routines
  – Otherwise try to perform automatic upgrades on all modules
  – On failure, show the incompatible workflow, notify the user, and allow the user to make changes
• Learn from ‘fixes’
  – VisTrails keeps provenance of workflow evolution
  – Use provenance of manual upgrades to automatically patch other workflows
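The upgrade procedure listed above can be sketched as follows. This is an illustration under stated assumptions, not the VisTrails upgrade code: the module dictionaries, `UpgradeError`, `automatic_upgrade`, and in particular the rule that an automatic upgrade succeeds only when the major version is unchanged are all invented for the example. The "learn from fixes" step is not modeled.

```python
from collections import defaultdict


class UpgradeError(Exception):
    """Raised when a module cannot be upgraded without user intervention."""


def automatic_upgrade(module, new_version):
    """Naive automatic upgrade (assumed rule): safe only within one major version."""
    if module["version"].split(".")[0] != new_version.split(".")[0]:
        raise UpgradeError(f"{module['package']}: interface may have changed")
    return {**module, "version": new_version}


def upgrade_workflow(workflow, installed, upgrade_hooks):
    """Upgrade incompatible modules, package by package.

    workflow:      list of {"id", "package", "version"} dictionaries
    installed:     {package: installed version}
    upgrade_hooks: {package: developer-supplied function(module, new_version)}
    Returns the upgraded workflow and the ids of modules that need the user.
    """
    # Identify incompatible modules and group them by package.
    groups = defaultdict(list)
    for module in workflow:
        if installed[module["package"]] != module["version"]:
            groups[module["package"]].append(module)

    upgraded = {m["id"]: m for m in workflow}
    needs_user = []
    for package, modules in groups.items():
        # Prefer developer-supplied upgrade logic, else try the automatic path.
        hook = upgrade_hooks.get(package, automatic_upgrade)
        for module in modules:
            try:
                upgraded[module["id"]] = hook(module, installed[package])
            except UpgradeError:
                needs_user.append(module["id"])  # show to the user for a manual fix
    return [upgraded[m["id"]] for m in workflow], needs_user


if __name__ == "__main__":
    workflow = [
        {"id": 1, "package": "matplotlib", "version": "1.0"},
        {"id": 2, "package": "csv", "version": "0.1"},
    ]
    installed = {"matplotlib": "1.1", "csv": "1.0"}
    print(upgrade_workflow(workflow, installed, upgrade_hooks={}))
```

With the versions from the earlier slide, the matplotlib module upgrades automatically while the csv module is reported for manual attention unless a hook for `csv` is supplied.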
Reproducibility Infrastructure
• Connect results to their provenance
  – Support LaTeX, Word, PowerPoint, HTML, wiki
• Support for reviewers
Local:
\begin{figure}
\begin{center}
\vistrail[filename=ladder_dyl_gap_theta-2.xml,version=5,pdf,buildalways,getvtl,embedworkflow,execute]{width=8cm}
\caption{(color online) Ground-state degeneracy splitting of the non-Hermitian doubled Yang-Lee model when perturbed by a string tension ($\theta \neq 0$).}
\label{fig:figure}
\end{center}
\end{figure}
Remote:
\vistrail[host=alps.ethz.ch,db=vistrails,vtid=10,version=169,pdf]{width=8cm}
Reproducibility Infrastructure
• Connect results to their provenance
  – Support LaTeX, Word, PowerPoint, HTML, wiki
• Support for reviewers
  – Explore parameters and configurations, keep provenance of this too!
  – Provenance as means for reviewers and authors to communicate
• Enable interaction with results: the VisMashup system
  – Publish using different media, not just documents
Workflow Mashups
Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011]
  – Pilot started in …; now a permanent feature at the conference
• New at VLDB 2013!
• IEEE VisWeek (work in progress)
• Quantum Physics community
• Astrophysics
• Climate Science
Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011]
  – Since 2008
  – Papers submitted for reproducibility evaluation: … submissions; … submissions
  – In 2011, laid out a set of guidelines to simplify and expedite the reviewing process; provided tutorials
  – Review was still challenging
    » Common problem: setup failed due to implicit dependencies
  – Reasons for not submitting:
    » Intellectual property rights on software
    » Sensitive data and specific hardware requirements
research_repeatability.shtml
Data Life Cycle
[Diagram stages: Obtain Data, Analyze/Visualize, Publish/Share. Data sources: user studies, sensors, Web, databases, simulations, sequencing machines, particle colliders. Tools: AVS, Taverna, VisTrails]
Vision: Provenance-Rich Science
[Diagram: Obtain Data, Analyze/Visualize, Publish/Share, Provenance, Collaborate]
Vision: Provenance-Rich Science
• All tools should capture and store provenance
  – GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
  – Oracle Total Recall
  – Non-linear history for GIMP [Chen et al., SIGGRAPH 2011]
• Simplify the addition of provenance to existing tools [Callahan et al., IPAW 2008]
Provenance Enabling 3rd-Party Tools
Autodesk Maya, ParaView, VisIt, ImageVis3D [Callahan et al., IPAW 2008]
Provenance Plugin for ParaView
Vision: Provenance-Rich Science
• All tools should capture and store provenance
  – GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
  – Oracle Total Recall
  – Non-linear history for GIMP [Chen et al., SIGGRAPH 2011]
• Simplify the addition of provenance to existing tools [Callahan et al., IPAW 2008]
• Provenance SDK in beta
  – General provenance framework for interactive applications
• Experiments published in shared repositories create new opportunities and challenges…
Challenges and Opportunities
• Support for searching, comparing and analyzing experiments and results
  – Can we find existing approaches to a problem?
  – Can we discover better approaches to a given problem?
  – Are we duplicating existing work?
Example query: Find an experiment that uses MySQL as a back-end to store salinity information about the Columbia River and that outputs a volume rendering of the data.
Challenges and Opportunities
• Support for vague queries and approximate matching
  – E.g., repository may not contain experiments that perform volume rendering, but only simple 2D plots
• Experiments are described at different levels of granularity
  – Discover relationships between modules
  – Infer workflows from lower-level specifications
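Approximate matching can be illustrated by ranking experiments by how much their descriptions overlap with the query instead of demanding an exact match. The sketch below uses Jaccard similarity over keyword sets; this is one simple choice made for illustration and not a method proposed in the talk, and the repository entries and keywords are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of keywords."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_experiments(query_terms, repository):
    """Return (score, name) pairs, best match first; ties are broken by name."""
    scored = [(jaccard(query_terms, terms), name) for name, terms in repository.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    # Hypothetical repository: experiment name -> descriptive keywords.
    repository = {
        "salinity-2d": {"mysql", "salinity", "plot2d"},
        "protein-volume": {"hdf5", "protein", "volume_rendering"},
    }
    query = {"mysql", "salinity", "volume_rendering"}
    for score, name in rank_experiments(query, repository):
        print(f"{score:.2f}  {name}")
```

Here no experiment satisfies the whole query, yet the 2D salinity plot still ranks first because it shares the data source and subject with the request.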
Challenges and Opportunities
• Different kinds of data, i.e., structured workflows, metadata, unstructured source code, raw data
  – Can we support queries that straddle the different kinds of data?
  – What is the right language?
  – Can we design an intuitive query interface---usable by non-experts in databases?
• Standing queries: get notifications about new, related work
• Track real impact: how many times has an experiment/module/dataset been used?
Challenges and Opportunities
• Synthesizing new results: Given several relevant entries from the repository, can they be combined into a coherent whole?
  – Automatically find correspondences between computational modules (e.g., module A performs a function that is similar to module B)
  – Determine compatibility between modules (e.g., can module A be connected to module B?)
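The compatibility question (can module A be connected to module B?) can be made concrete with typed ports. The sketch below is only one possible formulation: `ModuleSpec`, the example data types, and the subclass-based rule in `can_connect` are assumptions for illustration and do not describe how any particular workflow system decides compatibility.

```python
from dataclasses import dataclass


class Table:
    """Generic tabular data (hypothetical port type)."""


class CSVTable(Table):
    """Tabular data read from a CSV file."""


class Image:
    """Rendered figure."""


@dataclass(frozen=True)
class ModuleSpec:
    name: str
    inputs: tuple  # types accepted on the module's input ports
    output: type   # type produced on the module's output port


def can_connect(a: ModuleSpec, b: ModuleSpec) -> bool:
    """A can feed B if A's output is an instance of a type that B accepts."""
    return any(issubclass(a.output, accepted) for accepted in b.inputs)


if __name__ == "__main__":
    reader = ModuleSpec("ReadCSV", inputs=(), output=CSVTable)
    plot = ModuleSpec("ScatterPlot", inputs=(Table,), output=Image)
    print(can_connect(reader, plot))  # reader's CSVTable is accepted as a Table
    print(can_connect(plot, reader))  # reader accepts no inputs at all
```

Finding modules that perform similar functions is a harder problem than this type check and is not addressed by the sketch.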
Discussion
• There is no one-size-fits-all solution --- build components!
• Many remaining problems…
• Need to integrate provenance from multiple tools
• Use OS-based provenance (PASS) to help fill the gaps (?)
• Need to make it even easier for authors and reviewers, but also need discipline [SIGMOD 2012 Tutorial]
  – Researchers need to learn how to manage provenance
  – Version control, virtual machines, etc.
• Need better incentives: credit, citations
• And maybe a whip too…
A Little History and a Challenge
• A long time ago, when I was a PhD student, generating the reference list for papers was time consuming
  – Find proceedings on the shelf (or walk to library), obtain page numbers, type (title, authors, proceedings name, etc.)
• But today
  – Google/Bing author or part of paper title, DBLP, ACM DL, IEEE Xplore
  – Copy bib entry in one of many formats (BibTeX, EndNote, plain text), paste in paper, voilà!
• Can we do the same for scientific experiments?
Acknowledgments
• Thanks to: Philippe Bonnet, David Koop, Philip Mates, Matthias Troyer, Emanuele Santos, Dennis Shasha, Claudio Silva, Joel Tohline, Huy T. Vo, and the VisTrails team
• This work is partially supported by the National Science Foundation, the Department of Energy, and IBM Faculty Awards.
Additional Information
• The VisTrails System
• An infrastructure to support the creation, review and re-use of reproducible papers
• Some videos:
  – Editing an executable paper written using LaTeX and VisTrails http://
  – Exploring a Web-hosted paper using server-based computation
  – An interactive paper on a Wiki
Danke Obrigada Merci Ευχαριστω Thank you
Reproducibility Tools and Repositories
• VisTrails [Koop et al., ICCS 2011]
  – Help authors to create reproducible papers: package the results and link from publication
  – Support testers in the reviewing process, to repeat and validate results
• CDE: package software dependencies [Philip Guo, Stanford]
• Madagascar: publication of reproducible multi-dimensional data analysis [Sergey Fomel, UT Austin]
• GenePattern: reproducible computational genomics in Word documents [Jill Mesirov, Science 2010]
• Some repositories: nanoHub.org, crowdLabs.org, myexperiment.org
Vision: Provenance-Rich Science
[Diagram: Obtain Data, Analyze/Visualize, Publish/Share, Collaborate, Provenance Repository]
Conclusions and Future Work
• Provenance is crucial for science and an enabler for executable papers
• Provenance must be at the center of the scientific process!
• Built an end-to-end solution based on VisTrails---currently working on integrating infrastructure with other systems
  – Provenance-enabling other tools
• Many challenges and several open research questions
• Great opportunity to have impact in science
Challenges and Opportunities
• Experiments are described at different levels of granularity
  – Infer workflows from lower-level specifications
  – Support for vague queries
• Different kinds of data, i.e., structured workflows and metadata, unstructured source code, raw data
• Synthesizing new results
• Track impact: how many times has an experiment (or dataset) been used?
• Desiderata
  – Intuitive query interface---usable by non-experts in databases
  – Support for approximate queries
Current Uses and Experiences
• ALPS community: ETH group has published a number of reproducible papers!
• Simulations of computational fluid dynamics
• Database research:
  – Experiments using distributed database systems, querying Wikipedia
Current Uses and Experiences
• ACM SIGMOD repeatability effort [Bonnet et al., SIGMOD Record 2011, to appear]
  – Since 2008, verifies the experiments published in accepted papers
  – Papers submitted for reproducibility evaluation: … submissions; … submissions
  – In 2011, laid out a set of guidelines to simplify and expedite the reviewing process; provided tutorials
  – Review was still challenging
    » Common problem: setup failed due to implicit dependencies
    » Easy to solve with a virtual machine…
  – Reasons for not submitting:
    » Intellectual property rights on software
    » Sensitive data
    » Specific hardware requirements
research_repeatability.shtml
Going Forward
• Need more and better incentives:
  – Seal of quality, higher quality software/experiments, easier for newcomers in a project, citations, recognition
• Need a whip(?): Some disciplines require data for publications, should we require computational experiments too?
• Need better tools
  – There is no one-size-fits-all solution
  – Many groups building tools---we should join forces and build a Reproducibility Toolkit
• Need standards and guidelines for authors and tool developers
• Need provenance support in applications
  – Integrate provenance from different sources, connect the results
ETH does!
Provenance Everywhere
[Diagram: eBird, STEM, provenance-rich presentation]
– Bird_table version in Oracle
– STEM v predictions
– R Script+results
– BirdVis vis spec
– matplotlib script+results
[Diagram labels: Bird_table version, predictions, results, predictions]