Reproducible Science Gordon Watts (University of Washington/Seattle) @SeattleGordon 2017-09-22 Center for modeling complex interactions G. Watts (UW/Seattle)


Outline: Introduction and Motivation; Work at the UW eScience Center; Work from DASPOS; Work from CERN; Funding Agencies; Conclusions & Links.

Nullius in Verba: "take no one's word for it". The motto of the Royal Society (founded 1660).

The Royal Society held gatherings where you would see the experiments reproduced. Robert Boyle set documentation standards so you could imagine being in the room. Reproduction provides evidence the claim is true, checks for fraud and error, and serves as a "springboard for progress" by enabling replication: it preserves both the knowledge and the techniques used to gain that knowledge.

Is this science today?

For a search at the world's largest experiment: 19 pages of text and plots, 1 title page, 4 pages of references, and 13 pages of author list.


Big Science, Big Datasets, the complexity of modern analysis software, and page limits on articles.

Can we do better? We might not be able to build a second Large Hadron Collider, but perhaps some of the same tools that enable Big Science and Big Datasets can be used to address this.

This is a many-year effort! We have not yet codified best practices for this new environment, as Boyle did for his.

UW eScience Center

https://gordonwatts.github.io/ros-roadshow: an adaptation of the standard eScience Road Show (workshops for students).

Badges (https://github.com/uwescience-open-badges): a small icon displayed on your poster, talk, or paper. Open Data Badge: your data and your analysis code are open and available. Open Materials Badge: your analysis code is open and available (e.g., when the data itself, such as medical data, cannot be shared). Badges are unique for each of your submissions (DOIs). "Signaling that you value openness will help to shift the expected norms of behavior in science." This is a grassroots effort.

Badges: Submit, Review, Result. Submit: include a disclosure statement (where to find materials, etc.), request the badge via email, pull request, etc., and include the PDF of your talk, paper, or poster. Review: two reviewers follow the COS disclosure process and confirm that the links lead to data and code; they will not attempt to run the code or to ensure that the data is the relevant data, and there is a loop to address reviewer comments. Result: the reviews and result are published anonymously in a GitHub repo for the submission, a DOI is assigned to that repo (which includes a copy of your submission), and a badge image carrying the DOI linked to the repo is created.

The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Kitzes, J., Turek, D., & Deniz, F. (Eds.) (2018). https://www.practicereproducibleresearch.org/ A massive book written by members of the eScience Center. Part I: Practicing Reproducibility; Part II: High-Level Case Studies; Part III: Low-Level Case Studies (a sampling of topics from the last two parts).

DASPOS

Data And Software Preservation for Open Science: Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU (Fermilab, BNL). Initial goal: how to preserve knowledge from particle physics experiments. This $3M grant started from the premise that backing up the data is easy; (re)using it is hard. It brings together physicists, digital librarians, and computer scientists, linking High Energy Physics to Biology, Astrophysics, Digital Curation, etc.

Data + Software + Knowledge. Workflow at an LHC experiment is more than a simple linear sequence: data from the LHC is processed, selected, and simplified into reduced data; further selection, analysis, and simplification produce twice-reduced data, which is combined with more inputs and simulated data in further analysis, leading (for example) to the Higgs discovery. How do we preserve that and make it reproducible? The data and metadata, the instructions on how to process the data, lab notebooks, web pages, and WebAPIs: all must be preserved!

Metadata (https://github.com/Vocamp/ComputationalActivity, https://github.com/Vocamp/DetectorFinalState): use the Web Ontology Language (OWL) to describe metadata. Part of the semantic web: searchable, discoverable, with common tools to manipulate it. [Diagram: a computational activity encodes the OS, the executable and its version, the input files (1..N), library versions (libAA v3.12.6, libBB v6.1.3, libCC v1.1.9), the configuration (v0.1), external databases, and the output files (1..M).]
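The kind of machine-readable record the diagram describes can be sketched in a few lines. This is an illustrative toy, not the actual Vocamp OWL ontology: the term names ("ComputationalActivity" and the field names) and the file names are placeholders chosen to mirror the diagram.

```python
import json
import platform

def computational_activity(exe, version, inputs, libraries, outputs):
    """Describe one computation step as machine-readable metadata.
    Vocabulary terms here are illustrative, not the real ontology's."""
    return {
        "@type": "ComputationalActivity",  # hypothetical term
        "executable": {"name": exe, "version": version},
        "os": platform.platform(),
        "inputFiles": inputs,
        "libraries": [{"name": n, "version": v} for n, v in libraries.items()],
        "outputFiles": outputs,
    }

record = computational_activity(
    "my.exe", "v4.6.31",
    inputs=["input_001.root"],
    libraries={"libAA": "v3.12.6", "libBB": "v6.1.3"},
    outputs=["reduced_001.root"],
)
print(json.dumps(record, indent=2))
```

Because the record is plain structured data, it round-trips through standard serializers, which is what makes it searchable and discoverable by common tools.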

Capturing a Computation Step (as an example). Containers have replaced Virtual Machines as the preferred way: lightweight, low resource usage, very fast to start, and composable. Along with the container you must capture the software and build environment, the runtime environment, and the instructions to run it (metadata). But current containerization tools require expert usage: what should be captured, and what not? Permissions? Building the container? Smart containers can be linked in a knowledge graph to compose a full analysis.
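One way to pair a container with its run instructions is to generate both the container recipe and a metadata file from a single description. A minimal sketch, assuming hypothetical image, script, and file names (this is not any specific DASPOS tool):

```python
import json
from pathlib import Path

# One analysis step described once; both artifacts are derived from it.
# All names (image, base, scripts, files) are illustrative placeholders.
step = {
    "image": "analysis-step1",
    "base": "python:3.11-slim",
    "command": ["python", "select_events.py", "--input", "data.root"],
    "inputs": ["data.root"],
    "outputs": ["reduced.root"],
}

# Container recipe: the build environment.
dockerfile = "\n".join([
    f"FROM {step['base']}",
    "COPY select_events.py /work/",
    "WORKDIR /work",
])
Path("Dockerfile").write_text(dockerfile + "\n")

# Run instructions: the metadata that travels with the container.
Path("run-metadata.json").write_text(json.dumps(step, indent=2))
print(dockerfile)
```

Keeping the metadata alongside the recipe means a later user (or a smart-container system) can discover how the image is meant to be run without reading the code inside it.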

Work at CERN: putting this together.

Workflow for an Analysis (https://recast.perimeterinstitute.ca/). An "analysis" is a workflow applied to data. Can we easily re-run an analysis when a new model becomes available? Repeat the data/theory comparison forever; preserve the analysis for reuse.

Wouldn't it be nice if you had a tool that: automatically captured provenance and processing details for everything you did; could manage the bookkeeping for thousands of analysis jobs; provided a way of saving snapshots of work; and could tell you "what did I do last week?"
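The first item on that wish list, automatic provenance capture, can be illustrated with a small decorator that records what ran, with which arguments, and when. This is a toy sketch of the idea, not any specific tool's API; the function and log names are invented for the example.

```python
import functools
import json
import time

# Illustrative provenance log: one entry per tracked computation step.
PROVENANCE_LOG = []

def track(func):
    """Record name, arguments, and timing of each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "started": start,
            "duration_s": time.time() - start,
        })
        return result
    return wrapper

@track
def select_events(threshold):
    # Stand-in for a real selection step.
    return [e for e in range(10) if e > threshold]

select_events(6)
print(json.dumps(PROVENANCE_LOG, indent=2))
```

Answering "what did I do last week?" then reduces to filtering the log by its timestamps.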

http://opendata.cern.ch/ Internal: envisioned as a tool for analysts to preserve, share, and re-use analysis techniques, executables, and code. External: outreach and research, open to the public. CMS has released AOD-level data and VMs with analysis code for 2010 and 2011, plus analysis examples (2 external papers). Starting to look for partners outside of HEP.

Link existing pieces of infrastructure: CERN GitLab connects to a computational backend.

Workflow Capture. Each computation's environment and software is specified in a container; JSON chains the steps into a workflow and provides the required metadata. Complete flexibility. CERN IT provides the backend.
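The JSON chaining of containerized steps can be sketched as below. The schema is illustrative only (real workflow engines define their own, richer schemas), and the image names, scripts, and file names are placeholders; the point is that each stage names its container and that outputs of one stage feed the inputs of the next.

```python
import json

# Two containerized steps chained into a workflow; all names are
# hypothetical placeholders for this sketch.
workflow = {
    "stages": [
        {
            "name": "select",
            "image": "analysis/select:v1",
            "command": "select.py raw.root selected.root",
            "inputs": ["raw.root"],
            "outputs": ["selected.root"],
        },
        {
            "name": "fit",
            "image": "analysis/fit:v2",
            "command": "fit.py selected.root result.json",
            "inputs": ["selected.root"],
            "outputs": ["result.json"],
        },
    ]
}

# Sanity-check the chaining: each stage's inputs must already exist
# (either external, or produced by an earlier stage).
produced = set(workflow["stages"][0]["inputs"])
for stage in workflow["stages"]:
    assert set(stage["inputs"]) <= produced, f"missing inputs for {stage['name']}"
    produced |= set(stage["outputs"])

print(json.dumps(workflow, indent=2))
```

A backend can execute such a description by launching each stage's container in dependency order, which is what gives the "complete flexibility" noted above.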

Funding Agencies

Policy Questions (mpsopendata.crc.nd.edu). What data should be saved? What else should be saved besides the data to enable re-use? Where should it be stored? Who pays for the storage? Federal grants typically run 3 years; after that, the money is spent. Who pays to allow the public to access the data? Network and storage infrastructure isn't free. Funding agencies are developing policies and need guidance: the NSF (and DOE) want to update their Data Management Plan guidelines and are looking for input from the communities they serve.

Policy Responses. A community exercise provided feedback to the NSF on knowledge-preservation questions from the Physical Sciences (with input from APS, ACS, surveys, workshops, etc.). The final report was just issued; some conclusions: Data and other digital artifacts upon which publications are based should be made publicly available in a digital, machine-readable format, and persistently linked to those publications. Different disciplines have a wide variety of current practices and expectations for appropriate levels of sharing of data and other scientific results; a discipline-specific policy discussion will be required to determine an appropriate level of preservation and re-use. The provision of public access to data entails costs in infrastructure and human effort, and some types of data may be impractical to archive, annotate, and share; cost-benefit analyses should be conducted to set the level of expectations for the researcher, his or her institution, and the funding agency. Creating incentives toward the sharing of data is the primary way to make open access to data the norm. Exploring and understanding these ingredients are the next steps for the science disciplines.

Conclusions. Tools: good tools are already there (Git/GitHub, containers and VMs, notebooks, web tools). Common techniques: favor composable techniques; use the command line, with all things scriptable and controllable from a common script; use as many non-proprietary tools as possible. Next big task: improve the tools! Do not hesitate to publish and share new tools; this is where the frontier of the effort currently lies, and tools are almost certainly going to be domain-specific. There are lots of personal and career benefits to using Open Science and reproducible techniques in our daily work. Get started!

References. Kitzes, J., Turek, D., & Deniz, F. (Eds.) (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. Mike Hildreth (Notre Dame), "Data, Software, and Knowledge Preservation of Scientific Results", ACAT 2017 (Seattle), https://indico.cern.ch/event/567550/contributions/2656689/. UW eScience Center: http://escience.washington.edu/; the reproducibility working group: http://uwescience.github.io/reproducible/. DASPOS: http://daspos.org/. The eScience slides contain direct links as well.