
Reproducible Science Gordon Watts (University of Washington/Seattle) @SeattleGordon 2017-09-22 Center for modeling complex interactions G. Watts (UW/Seattle)




1 Reproducible Science. Gordon Watts (University of Washington/Seattle), @SeattleGordon. 2017-09-22. Center for modeling complex interactions.

2 Outline
Introduction and Motivation
Work at the UW eScience Center
Work from DASPOS
Work from CERN
Funding Agencies
Conclusions & Links

3 Nullius in Verba: "take no one's word for it." The motto of the Royal Society (founded 1660).

4 The Royal Society: gatherings where you would see the experiments reproduced. Robert Boyle established documentation standards so that a reader could imagine being in the room. Reproduction provides evidence that the claim is true, checks for fraud and error, and acts as a "springboard for progress" by enabling replication of both the knowledge and the techniques used to gain that knowledge.

5 Is this science today?

6 For a search at the world’s largest experiment
19 pages of text and plots 1 title page 4 pages of references 13 pages of author list For a search at the world’s largest experiment G. Watts (UW/Seattle)


8 Why is reproduction hard today? Big Science. Big Datasets. The complexity of modern analysis software. Page limits on articles.

9 Can we do better? We cannot build a second Large Hadron Collider to reproduce a result, but perhaps some of the same tools that enable Big Science and Big Datasets can be used to address this.

10 This is a many-year effort! We have not yet codified best practices for this new environment, as Boyle did for his.

11 UW eScience Center

12 An adaptation of the standard eScience Road Show (workshops for students): https://gordonwatts.github.io/ros-roadshow

13 Badges (https://github.com/uwescience-open-badges)
A small icon to be displayed on your poster, talk, or paper.
Open Data Badge: your data and your analysis code are open and available.
Open Materials Badge: your analysis code is open and available (used when the data itself cannot be shared, e.g. medical data).
Badges are unique for each of your submissions (DOIs). "Signaling that you value openness will help to shift the expected norms of behavior in science." This is a grassroots effort.

14 Badges: Submit, Review, Result
Submit: include a disclosure statement (where to find materials, etc.); request the badge (pull request, etc.); include the PDF of your talk, paper, or poster.
Review: two reviewers follow the COS disclosure process. They confirm that the links lead to data and code, but will not attempt to run the code or to ensure that the data is the relevant data. A loop addresses reviewer comments.
Result: reviews and result are published anonymously in a GitHub repo for this submission. A DOI is assigned to the repo (which includes a copy of your submission), and a badge image embedding a DOI link to the repo is created.

15 The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). A massive book written by members of the eScience Center. Part I: Practicing Reproducibility. Part II: High-Level Case Studies. Part III: Low-Level Case Studies (a sampling of topics from the last two sections).

16 DASPOS

17 Data And Software Preservation for Open Science
Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL). Initial goal: how to preserve knowledge from particle physics experiments. This $3M grant started from the premise that backing up the data is easy; (re)using it is hard. It brings together physicists, digital librarians, and computer scientists, linking High Energy Physics to Biology, Astrophysics, Digital Curation, etc.

18 Data + Software + Knowledge
The workflow at an LHC experiment is more than a simple linear sequence: data from the LHC is processed, selected, and simplified into reduced data, then selected and simplified again into twice-reduced data; the twice-reduced data is combined with more inputs and simulated data; further analysis yields the result (the Higgs discovery!). How to preserve that, and make it reproducible? All of it must be preserved: data and metadata, instructions on how to process the data, lab notebooks, web pages, WebAPIs.

19 Metadata (https://github.com/Vocamp/ComputationalActivity)
Use the Web Ontology Language (OWL) to describe metadata. Part of the semantic web: searchable, discoverable, with common tools to manipulate it. The ontology encodes the computation: the OS (v21.6A), the executable and its version (my.exe v4.6.31), input files 1..N, output files 1..M, library versions (libAA v3.12.6, libBB v6.1.3, libCC v1.1.9), the configuration (Config v0.1), and external databases.
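To make the idea concrete, here is a minimal sketch of such a metadata record in JSON-LD (one of the standard RDF serializations). The vocabulary terms and the `prog:` namespace are invented placeholders for illustration, not the actual ComputationalActivity ontology IRIs; the file and library names come from the slide's diagram.

```python
import json

def computation_record(program, version, os_name, inputs, outputs, libraries, config):
    """Build a JSON-LD-style metadata record for one computation step.

    The "prog:" terms below are hypothetical, standing in for a real
    ontology such as ComputationalActivity.
    """
    return {
        "@context": {"prog": "http://example.org/computation#"},  # placeholder namespace
        "@type": "prog:ComputationalActivity",
        "prog:program": {"name": program, "version": version},
        "prog:operatingSystem": os_name,
        "prog:inputFiles": inputs,
        "prog:outputFiles": outputs,
        "prog:libraries": [{"name": n, "version": v} for n, v in libraries],
        "prog:configuration": config,
    }

record = computation_record(
    program="my.exe", version="v4.6.31", os_name="v21.6A",
    inputs=["input-1.dat"], outputs=["output-1.dat"],
    libraries=[("libAA", "v3.12.6"), ("libBB", "v6.1.3"), ("libCC", "v1.1.9")],
    config="Config v0.1",
)
print(json.dumps(record, indent=2))
```

Because the record is plain RDF-compatible JSON, it can be loaded into standard semantic-web tooling, indexed, and queried alongside records from other experiments.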

20 Capturing a Computation Step (as an example)
Containers have replaced virtual machines as the preferred way: lightweight, low resource usage, very fast to start, and composable. Along with the container itself, one must capture the software and its build environment, the runtime environment, and the instructions to run it (metadata). But current containerization tools require expert usage: what should be captured, and what not? Permissions? Who builds the container? Smart containers can be linked in a knowledge graph to compose a full analysis.
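A tiny sketch of what "capture along with the container" might mean in practice: assemble the `docker run` invocation for a step and, next to it, a metadata record with the image name, the command, and content hashes of the inputs. The image and command names are made up for illustration, and nothing is executed here; a real tool would also record the build environment and permissions questions the slide raises.

```python
import hashlib
import json
import shlex

def sha256_of(data: bytes) -> str:
    # A content hash lets a later reader verify the exact inputs used.
    return hashlib.sha256(data).hexdigest()

def capture_step(image, command, input_blobs):
    """Return (docker argv, metadata) for one containerized step.

    This only assembles the run instruction and the metadata that should
    be preserved alongside the container image; it does not run anything.
    """
    argv = ["docker", "run", "--rm", image] + shlex.split(command)
    meta = {
        "image": image,
        "command": command,
        "inputs": {name: sha256_of(blob) for name, blob in input_blobs.items()},
    }
    return argv, meta

argv, meta = capture_step(
    image="analysis/step1:1.0",                 # hypothetical image name
    command="my.exe --config run.cfg",          # hypothetical command
    input_blobs={"events.dat": b"raw detector data"},
)
print(argv)
print(json.dumps(meta, indent=2))
```

The point of the sketch is the pairing: the container makes the step re-runnable, while the metadata record makes it discoverable and verifiable later.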

21 Work at CERN: putting this together

22 Workflow for an Analysis
An "analysis" is data plus workflow. Can we easily re-run an analysis when a new model becomes available? Repeat the data/theory comparison forever. Preserve the analysis for reuse.

23 Wouldn’t it be nice if you had a tool that:
Automatically captured provenance and processing details for everything you did could manage the bookkeeping for thousands of analysis jobs provided a way of saving snapshots of work could tell you “what did I do last week?” G. Watts (UW/Seattle)
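The wished-for automatic provenance capture can be sketched in a few lines: a decorator that records what was run, with what arguments, and when. This is a toy illustration of the bookkeeping such a tool would automate, with invented function and field names; answering "what did I do last week?" then becomes a filter on the timestamps.

```python
import datetime
import functools
import json

LOG = []  # in-memory provenance log; a real tool would persist this

def provenance(func):
    """Record each call's step name, arguments, and timestamp."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        LOG.append({
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "when": datetime.datetime.now().isoformat(),
        })
        return result
    return wrapper

@provenance
def select_events(threshold):
    # Stand-in for a real analysis step: keep events above a threshold.
    return [e for e in (5, 12, 40) if e > threshold]

selected = select_events(10)
print(json.dumps(LOG, indent=2))
```

With every step wrapped this way, the log doubles as the bookkeeping for thousands of jobs and as a snapshot of what was done.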

24 Internal: envisioned as a tool for analysts to preserve, share, and re-use analysis techniques, executables, and code.
External: outreach and research, open to the public. CMS has released AOD-level data and VMs with analysis code for 2010 and 2011, plus analysis examples (2 external papers). Starting to look for partners outside of HEP.

25 Link existing pieces of infrastructure: CERN GitLab connects to a computational backend.

26 Workflow Capture
Each computation's environment and software is specified in a container; JSON chains the steps into a workflow and provides the required metadata. Complete flexibility, with a backend provided by CERN IT.
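A minimal sketch of "JSON to chain the steps into a workflow": each step names a container image and the steps it depends on, and a small topological sort produces a valid run order. The schema and step names are invented for illustration; CERN's actual tooling (e.g. REANA/yadage) uses its own richer workflow formats, and a real runner would also detect dependency cycles.

```python
import json

# Hypothetical workflow description: three containerized steps with dependencies.
WORKFLOW_JSON = """
{
  "steps": [
    {"name": "fit",    "image": "analysis/fit:1.0",    "needs": ["select"]},
    {"name": "select", "image": "analysis/select:1.0", "needs": []},
    {"name": "plot",   "image": "analysis/plot:1.0",   "needs": ["fit"]}
  ]
}
"""

def run_order(spec):
    """Topologically sort the steps so each runs after its dependencies."""
    steps = {s["name"]: s for s in spec["steps"]}
    order, done = [], set()

    def visit(name):
        if name in done:
            return
        for dep in steps[name]["needs"]:
            visit(dep)  # ensure dependencies are scheduled first
        done.add(name)
        order.append(name)

    for name in steps:
        visit(name)
    return order

order = run_order(json.loads(WORKFLOW_JSON))
print(order)  # each step appears after everything it depends on
```

Because the chaining lives in plain JSON rather than in anyone's head, the same description can be re-submitted to the backend years later, which is exactly the preservation goal.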

27 Funding Agencies

28 Policy Questions (mpsopendata.crc.nd.edu)
What data should be saved? What else should be saved besides the data to enable re-use? Where should it be stored? Who pays for the storage? (Federal grants typically run 3 years; after that, the money is spent.) Who pays to allow the public to access the data? (Network and storage infrastructure isn't free.)
Funding agencies are developing policies and need guidance. The NSF (and DOE) want to update their Data Management Plan guidelines and are looking for input from the communities they serve.

29 Policy Responses
A community exercise to provide feedback to the NSF on knowledge-preservation questions from the physical sciences (input from APS, ACS, surveys, workshops, etc.). The final report was just issued; some conclusions:
Data and other digital artifacts upon which publications are based should be made publicly available in a digital, machine-readable format, and persistently linked to the relevant publications.
Different disciplines have a wide variety of current practices and expectations for appropriate levels of sharing of data and other scientific results. A discipline-specific policy discussion will be required to determine an appropriate level of preservation and re-use.
The provision of public access to data entails costs in infrastructure and human effort, and some types of data may be impractical to archive, annotate, and share. Cost-benefit analyses should be conducted to set the level of expectations for the researcher, his or her institution, and the funding agency.
Creating incentives toward the sharing of data is the primary way to make broad open access to data the norm.
Exploring and understanding these ingredients are the next steps for the science disciplines.

30 Conclusions
Tools: good tools are already there (Git/GitHub, containers and VMs, notebooks, web tools). Favor common, composable techniques: the command line, with all things scriptable and controllable from a common script, and as many non-proprietary tools as possible.
Next big task: improve the tools! Do not hesitate to publish and share new tools; this is where the frontier of this effort currently exists. Tools are almost certainly going to be domain specific.
Get started! There are lots of personal and career benefits to using open-science and reproducible techniques in our daily work.

31 References
Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.
Mike Hildreth (Notre Dame), "Data, Software, and Knowledge Preservation of Scientific Results", ACAT 2017 (Seattle).
UW eScience Center: the reproducibility working group. DASPOS. The eScience slides contain direct links as well.


