DASPOS Update
Mike Hildreth, representing the DASPOS project
DASPOS: Data And Software Preservation for Open Science
- A multi-disciplinary effort: Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
- Links the HEP effort (DPHEP + experiments) to Biology, Astrophysics, Digital Curation, and other disciplines
- Includes physicists, digital librarians, and computer scientists
- Aims to achieve some commonality across disciplines in the metadata descriptions of archived data:
  - What's in the data, and how can it be used?
  - Computational description (ontology/metadata development): how was the data processed? Can computational replication be automated?
  - Impact of access policies on preservation infrastructure
DASPOS
- In parallel, will build test technical infrastructure to implement a knowledge preservation system
- A "scouting party" to figure out where the most pressing problems lie, and some solutions
  - Incorporates input from multi-disciplinary dialogue, use-case definitions, and policy discussions
- Will translate the needs of analysts into a technical implementation of a metadata specification
- Will develop means of specifying processing steps and the requirements of external infrastructure (databases, etc.)
- Will implement "physics query" infrastructure across a small-scale distributed network
- End result: a "template architecture" for data/software/knowledge preservation systems
DASPOS Overview
- Digital librarian expertise: how to catalogue and share data; how to curate and archive large digital collections
- Computer science expertise: how to build databases and query infrastructure; how to develop distributed storage networks
- Science expertise: What does the data mean? How was it processed? How will it be re-used?
DASPOS Process
- Multi-pronged approach for individual topics:
  - NYU/Nebraska: RECAST and other developments
  - UIUC/Chicago: workflows, containers
  - ND: metadata, containers, workflows, environment specification
- Shared validation & examples
- Workshops & all-hands meetings
- Shared collaboration with CERN, DPHEP
- Outreach to other disciplines
Prototype Architecture
- Container Cluster (test bed): capable of running containerized processes
- "Containerizer Tools" (PTU, Parrot scripts): used to capture processes; deliverable stored in DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: data
- Tools: run containers/workflows; discovery/exploration; unpack/analyze
- Policy & Curation: access policies; public archives? domain-specific (Inspire)
[Diagram: components connected by data paths and metadata links, with status marked as "~ done", "under development", or "not done"]
Infrastructure I: Environment Capture
Umbrella
Umbrella
The current version of Umbrella can work with:
- Docker: create container, mount volumes
- Parrot: download tarballs, mount at run-time
- Amazon: allocate a VM, copy and unpack tarballs
- Condor: request a compatible machine
- Open Science Framework: deploy uploaded containers
Example Umbrella apps:
- Povray ray-tracing application: http://dx.doi.org/doi:10.7274/R0BZ63ZT
- OpenMalaria simulation: http://dx.doi.org/doi:10.7274/R03F4MH3
- CMS high energy physics simulation: http://dx.doi.org/doi:10.7274/R0765C7T
A sketch of an Umbrella-style specification follows.
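To make the idea concrete, here is a minimal sketch of what an Umbrella-style environment specification might look like, assuming a JSON spec with hardware, OS, software, data, and environment sections; the field names and archive URLs are illustrative assumptions, not the definitive Umbrella schema.

import json

# Illustrative sketch of an Umbrella-style environment specification.
# Field names and values are assumptions for illustration; consult the
# Umbrella documentation for the actual schema.
spec = {
    "comment": "Environment for a CMS simulation job",
    "hardware": {"arch": "x86_64", "cores": "2", "memory": "2GB", "disk": "10GB"},
    "kernel": {"name": "linux", "version": ">=2.6.32"},
    "os": {"name": "redhat", "version": "6.5"},
    "software": {
        "cmssw-5.2.5": {
            "mountpoint": "/software/cmssw-5.2.5",
            "source": ["http://example.org/archives/cmssw-5.2.5.tar.gz"],  # hypothetical URL
        }
    },
    "data": {
        "input.root": {
            "mountpoint": "/data/input.root",
            "source": ["http://example.org/archives/input.root"],  # hypothetical URL
        }
    },
    "environ": {"CMS_VERSION": "CMSSW_5_2_5"},
}

with open("cms_job.umbrella", "w") as f:
    json.dump(spec, f, indent=2)

Given such a spec, Umbrella can materialize the same environment through whichever back end is available (Docker, Parrot, Amazon, Condor, or OSF).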
Infrastructure II: Workflow Capture
PRUNE
PRUNE
- Works across multiple workflow repositories
- Is interfaced with Umbrella for environment specification on multiple platforms
- Result: reproducible, flexible workflow preservation (see the sketch below)
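As an illustration of the PRUNE model (immutable files, tasks as pure functions over them, exportable provenance), here is a hedged sketch patterned on the examples shipped with CCTools; the method names, signatures, and file names are assumptions, not a definitive API reference.

# Sketch of a PRUNE session; the API shown here is an assumption
# patterned on the CCTools examples, and the files are hypothetical.
from prune import client

prune = client.Connect()

# Register immutable inputs: the data and the script that processes it.
data = prune.file_add("events.csv")
script = prune.file_add("select.py")

# Define a task as a pure function of its inputs; PRUNE records the
# full provenance of the result.
out, = prune.task_add(
    returns=["selected.csv"],
    env=prune.nil,  # could instead be an Umbrella-specified environment
    cmd="python select.py events.csv selected.csv",
    args=[script, data],
    params=["select.py", "events.csv"],
)

prune.execute(worker_type="local", cores=4)
prune.export(out, "selected.csv")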
Infrastructure III: Metadata
HEP Data Model Workshop ("VoCamp15ND")
- Participants from HEP, libraries, & the ontology community (new collaborations for DASPOS)
- Defined preliminary data models for the CERN Analysis Portal, describing:
  - the main high-level elements of an analysis
  - the main research objects
  - the main processing workflows and products
  - the main outcomes of the research process
- Re-uses components of developed formal ontologies: PROV, Computational Observation Pattern, HEP Taxonomy, etc.
- Patterns implemented in JSON-LD format for use in the CERN Analysis Portal; will enable discovery and cross-linking of analysis descriptions (see the sketch below)
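For flavor, here is a minimal sketch of a JSON-LD record describing one analysis step with the W3C PROV vocabulary, in the spirit of the patterns above; the record structure and identifiers are illustrative assumptions, not the actual CERN Analysis Portal schema.

import json

# Illustrative JSON-LD fragment describing an analysis step with the
# W3C PROV vocabulary. The record structure and IDs are assumptions.
record = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "title": "http://purl.org/dc/terms/title",
    },
    "@id": "https://example.org/analysis/higgs-to-4l",  # hypothetical identifier
    "@type": "prov:Activity",
    "title": "H -> ZZ -> 4l selection",
    "prov:used": {"@id": "https://example.org/data/aod-2012"},       # input dataset
    "prov:generated": {"@id": "https://example.org/data/ntuple-v3"}  # output ntuple
}
print(json.dumps(record, indent=2))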
Detector Final State Description
- Published a paper at the International Conference on Knowledge Engineering and Knowledge Management: http://ekaw2016.cs.unibo.it
- Extraction (https://github.com/gordonwatts/HEPOntologyParserExperiments) of test data sets from CMS and ATLAS publications to examine pattern usability and the ability to facilitate data access across experiments
Computational Activity
- Continued testing and validation of the Computational Activity and Computational Environment patterns (https://github.com/Vocamp/ComputationalActivity)
- Work on aligning the pattern with other vocabularies for software annotation and attribution, including the GitHub- and Mozilla Science-led "Code as a research object" effort (https://github.com/codemeta/codemeta); a sketch follows
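As an example of that alignment target, here is a minimal sketch of a CodeMeta-style software description; the property names follow the published CodeMeta/schema.org terms as I understand them, and the project values are hypothetical placeholders.

import json

# Minimal sketch of a CodeMeta software description (JSON-LD).
# Project name, repository, and author are hypothetical placeholders.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-analysis-code",
    "codeRepository": "https://github.com/example/analysis",
    "programmingLanguage": "C++",
    "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}],
}
with open("codemeta.json", "w") as f:
    json.dump(codemeta, f, indent=2)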
Overall Metadata Work Structure
Integration of the patterns into a knowledge-flow system that captures provenance and reproducibility information from a computational perspective, as well as links to "higher-level" metadata descriptions of the data in terms of physics vocabularies.
Technology I: Containers
Portability = Preservation!
- Tools like chroot and Docker sandbox the execution of an application
- Offer the ability to convert an application to a container/image
- Virtualize only the essential functions of the compute-node environment, allowing the local system to provide the rest
  - much faster computation; becoming the preferred solution over VMs for many computing environments
Comparison of execution time for an ATLAS application using PTU (packaged environment, redirecting system calls) or a Docker container (a capture/re-run sketch follows):
  Native execution time:                   49m02s
  PTU capture time:                       122m53s
  PTU re-run time:                        114m05s
  Native execution in container (Docker):  58m40s
[Diagram: Docker stack — Server, Host OS, Docker Engine, Bins/Libs, App A / App B]
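To show the container side of the comparison concretely, here is a hedged sketch of capturing a configured analysis environment as a Docker image and re-running it later; the container, image, and script names are hypothetical placeholders.

import subprocess

# Sketch: capture and re-run an analysis as a Docker container.
# Container, image, and script names are hypothetical.
# 1. Snapshot a configured analysis environment as an image.
subprocess.run(
    ["docker", "commit", "analysis-dev", "daspos/atlas-analysis:v1"],
    check=True,
)
# 2. Later (or on another machine), re-run the preserved computation.
subprocess.run(
    ["docker", "run", "--rm", "daspos/atlas-analysis:v1", "./run_analysis.sh"],
    check=True,
)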
Technology II: Smart Containers
Smart Containers
- Search
- Add machine-readable labels
- API to write metadata
- Metadata storage and standardization
- Specification of data location
- Link things together into a knowledge graph
A labeling sketch follows.
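In the Smart Containers spirit, here is a minimal sketch of attaching machine-readable JSON-LD metadata to a Docker image via a build label, so the container carries its own provenance; the label key, image name, and ORCID are hypothetical assumptions.

import json
import subprocess

# Sketch: embed JSON-LD provenance in a Docker image as a build label.
# The label key, image name, and ORCID are placeholders.
metadata = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@type": "prov:Entity",
    "prov:wasAttributedTo": {"@id": "https://orcid.org/0000-0000-0000-0000"},
}
subprocess.run(
    ["docker", "build",
     "--label", "science.metadata=" + json.dumps(metadata),
     "-t", "daspos/smart-example:v1", "."],
    check=True,
)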
Containers Workshop
- Captured surging interest in container technologies for all manner of applications
- Attendees included:
  - Principal Software Engineer at Red Hat
  - Senior Software Engineer at Docker
  - ReproZip developers
  - CS specialists in containers and virtualization
- Preservation examples:
  - OpenMalaria
  - Bertini (numerical algebraic geometry)
  - HEP analysis
Technology III: CDF/DØ
As part of the effort to preserve software, executables, and data for the Tevatron experiments, we performed the pilot installation of the DØ code base outside of Fermilab:
- uses CVMFS to deliver code to any node outside of FNAL, with executables running under VMs integrated into the batch system
- data delivered remotely by the SAM protocol
- used the W mass measurement as the template analysis
The full analysis chain, including remote access to data, code, and executable versions of the software, was demonstrated (a job-side sketch follows).
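As a minimal sketch of what a job on a remote node might do, assuming the DØ code base is published in a CVMFS repository (the repository path and script name here are hypothetical):

import os
import subprocess

# Sketch: a job-side check that the preserved code base is visible via
# CVMFS before launching the analysis. Repository path is assumed.
repo = "/cvmfs/dzero.opensciencegrid.org"  # hypothetical repository name
if not os.path.isdir(repo):
    raise RuntimeError("CVMFS repository not mounted: " + repo)

# Run a preserved executable from the read-only CVMFS tree (illustrative path).
subprocess.run([os.path.join(repo, "bin", "run_wmass.sh")], check=True)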
CERN Analysis Preservation Portal
- Container Cluster (CERN OpenStack): runs the containerized processes
- "Containerizer Tools" (PTU, Parrot scripts): used to capture processes; deliverable stored in DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: metadata, data
- Tools: run containers/workflows
[Diagram: the cluster runs captured processes and stores the results in the archives]
RECAST + CERN Analysis Portal
Streamlined demonstration of the full preservation chain.
- RECAST: re-use of a completed/preserved analysis with different inputs for data comparison
- A special schema was developed specifically to describe these steps:
  - "packtivity": bundles an executable (Docker container), environment, and executable description; specifies individual processing stages
  - "yadage": captures how the pieces fit together into a parametrized workflow
- The CERN Analysis Preservation Portal can store the descriptions of these processes, allowing re-use of the stored processing chain (see the schematic sketch below)
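A schematic sketch of the two-layer description, rendered as Python structures: each "packtivity"-like object bundles a container image, a command template, and its outputs, and a "yadage"-style workflow wires the stages together. The field names and image names are illustrative assumptions, not the exact packtivity/yadage schema.

# Schematic of RECAST's two-layer description; structures and images
# are illustrative, not the actual packtivity/yadage YAML schema.
select = {
    "environment": {"type": "docker", "image": "example/atlas-select:v1"},  # hypothetical
    "process": {"cmd": "./select {events} {selected}"},
    "publisher": {"outputs": ["selected"]},
}
fit = {
    "environment": {"type": "docker", "image": "example/atlas-fit:v1"},  # hypothetical
    "process": {"cmd": "./fit {selected} {limits}"},
    "publisher": {"outputs": ["limits"]},
}

# A parametrized workflow: stages and the dependencies between them.
workflow = {
    "stages": [
        {"name": "select", "packtivity": select, "depends_on": ["init"]},
        {"name": "fit", "packtivity": fit, "depends_on": ["select"]},
    ]
}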
RECAST + CERN Analysis Portal
[Figure: workflow schematic, as stored in CAP]
RECAST + CERN Analysis Portal
- Instructions and workflow descriptions can be extracted from CAP and used to instantiate jobs based on the stored information
- The prototype RECAST cluster infrastructure (website, result storage, message passing, job queue, workflow engine) is itself fully dockerized
- A RECAST service instance can therefore be deployed to any Docker Swarm cloud (Carina, Google Container Engine, CERN Container Project); each of these deployments is a re-execution of a preserved ATLAS analysis (a deployment sketch follows)
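Since the infrastructure is dockerized, deployment reduces to standard Docker commands; a hedged sketch, assuming a Swarm-mode cluster and a hypothetical image and service name:

import subprocess

# Sketch: deploy a dockerized front end as a replicated service on a
# Docker Swarm cluster. Image and service names are placeholders.
subprocess.run(
    ["docker", "service", "create",
     "--name", "recast-web",
     "--replicas", "2",
     "-p", "8080:8080",
     "example/recast-frontend:latest"],
    check=True,
)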
Collaborations/Spin-offs
- RDA: preservation/reproducibility
- Open Science Framework: pioneering campus/OSF interactions
- Wright State: ontology specialists
- National Data Service: dashboard, archived computations, containers
- DIANA: collaboration on goals, some preservation efforts
Next Steps: Another Scouting Expedition?
Our ultimate goal is to change how science is done in a computing context so that it has greater integrity and productivity. In DASPOS1 we developed prototype techniques that improve the expression and archival of artifacts. Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large, multidisciplinary body of work. For this to have impact, we will develop small communities of practice that apply these techniques using the archives and tools relevant to their discipline. Put another way: we want to study and prototype the kinds of knowledge preservation tools that would make doing science easier and enable broader, better science.