1 DASPOS Update
Mike Hildreth representing the DASPOS project

2 DASPOS: Data And Software Preservation for Open Science
A multi-disciplinary effort: Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
- Links the HEP effort (DPHEP + experiments) to Biology, Astrophysics, Digital Curation, and other disciplines
- Includes physicists, digital librarians, and computer scientists
- Aims to achieve some commonality across disciplines in the metadata descriptions of archived data:
- What's in the data, and how can it be used?
- Computational description (ontology/metadata development): how was the data processed? Can computational replication be automated?
- Impact of access policies on preservation infrastructure

3 DASPOS
In parallel, will build a test technical infrastructure to implement a knowledge preservation system:
- A "scouting party" to figure out where the most pressing problems lie, and some solutions; incorporates input from multi-disciplinary dialogue, use-case definitions, and policy discussions
- Will translate the needs of analysts into a technical implementation of the metadata specification
- Will develop means of specifying processing steps and the requirements of external infrastructure (databases, etc.)
- Will implement a "physics query" infrastructure across a small-scale distributed network
End result: a "template architecture" for data/software/knowledge preservation systems

4 DASPOS Overview
Digital Librarian Expertise: how to catalogue and share data; how to curate and archive large digital collections
Computer Science Expertise: how to build databases and query infrastructure; how to develop distributed storage networks
Science Expertise: what does the data mean? How was it processed? How will it be re-used?

5 DASPOS Process
Multi-pronged approach for individual topics:
- NYU/Nebraska: RECAST and other developments
- UIUC/Chicago: workflows, containers
- ND: metadata, containers, workflows, environment specification
Shared validation & examples; workshops & all-hands meetings; shared collaboration with CERN, DPHEP; outreach to other disciplines

6 Prototype Architecture
Diagram of the prototype architecture:
- Container Cluster: test bed capable of running containerized processes
- "Containerizer Tools": PTU, Parrot scripts, used to capture processes; deliverable stored in DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: data
- Tools: run containers/workflows; discovery/exploration; unpack/analyze
- Policy & Curation: access policies; public archives? domain-specific? Inspire
The cluster runs captured processes and stores results in the archives; arrows distinguish the data path from metadata links.

7 Prototype Architecture
Same architecture as the previous slide, annotated with implementation status: ~done, under development, or not done.

8 Infrastructure I: Environment Capture

9 Umbrella

10 Umbrella
The current version of Umbrella can work with:
- Docker: create a container, mount volumes
- Parrot: download tarballs, mount at runtime
- Amazon: allocate a VM, copy and unpack tarballs
- Condor: request a compatible machine
- Open Science Framework: deploy uploaded containers
Example Umbrella apps: the Povray ray-tracing application, an OpenMalaria simulation, and a CMS high-energy-physics simulation. A sketch of an Umbrella-style specification follows.
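Umbrella drives all of these backends from a single environment specification. Below is a minimal sketch of what such a spec and invocation might look like; the JSON keys, CLI flags, and CMS environment details are assumptions for illustration (consult the CCTools/Umbrella documentation for the authoritative schema), not taken from the slides.

```python
# Minimal sketch: composing an Umbrella-style environment specification.
import json
import subprocess

spec = {
    "comment": "toy CMS simulation environment (hypothetical)",
    "hardware": {"arch": "x86_64", "cores": "2", "memory": "2GB", "disk": "10GB"},
    "kernel":   {"name": "linux", "version": ">=2.6.32"},
    "os":       {"name": "redhat", "version": "6.5"},
    "software": {
        # each entry points at a tarball Umbrella can fetch and mount
        "cmssw-5.2.5": {"mountpoint": "/cvmfs/cms.cern.ch"}
    },
    "environ": {"CMS_VERSION": "CMSSW_5_2_5"},
}

with open("cms_sim.umbrella", "w") as f:
    json.dump(spec, f, indent=2)

# Hypothetical invocation: ask Umbrella to realize the environment with
# Docker as the sandbox and run a command inside it.
subprocess.run(["umbrella", "--spec", "cms_sim.umbrella",
                "--sandbox_mode", "docker", "run", "cmsRun sim_cfg.py"],
               check=True)
```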

11 Infrastructure II: Workflow Capture

12 PRUNE

13 PRUNE
- Works across multiple workflow repositories
- Interfaced with Umbrella for environment specification on multiple platforms
- Result: reproducible, flexible workflow preservation (see the sketch below)
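To make the PRUNE/Umbrella pairing concrete, here is a minimal sketch of preserving a one-step workflow, using a hypothetical Python client API modeled loosely on PRUNE's published examples; the function names and signatures are assumptions, not the real API.

```python
# Sketch: register an environment, an input, and a task, then execute and
# export -- PRUNE records the whole provenance graph as it goes.
from prune import client

prune = client.Connect(base_dir="./prune_data")   # local preservation store

# Register an Umbrella environment spec so the step is reproducible
# anywhere Umbrella can realize it (Docker, Parrot, cloud, Condor).
env = prune.envi_add(engine="umbrella", spec="cms_sim.umbrella")

# Register an input file; storing it by content keeps the provenance
# graph valid even if the original file later moves.
data = prune.file_add("events.csv")

# Describe the task: command, environment, inputs, and declared outputs.
out, = prune.task_add(returns=["selected.csv"], env=env,
                      cmd="python select.py events.csv > selected.csv",
                      args=[data], params=["events.csv"])

prune.execute(worker_type="local", cores=4)   # run anything not yet cached
prune.export(out, "selected.csv")             # materialize the preserved result
```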

14 Infrastructure III: Metadata
HEP Data Model Workshop ("VoCamp15ND"):
- Participants from HEP, libraries, and the ontology community (new collaborations for DASPOS)
- Defined preliminary data models for the CERN Analysis Portal, describing: the main high-level elements of an analysis, the main research objects, the main processing workflows and products, and the main outcomes of the research process
- Re-used components of developed formal ontologies: PROV, Computational Observation Pattern, HEP Taxonomy, etc.
- Patterns implemented in JSON-LD format for use in the CERN Analysis Portal; will enable discovery and cross-linking of analysis descriptions (a sketch follows)
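As a concrete illustration of the JSON-LD approach, here is a minimal sketch of an analysis description that re-uses the W3C PROV vocabulary named above. The identifiers and title field are invented for illustration; the actual CERN Analysis Portal patterns are considerably richer.

```python
# Sketch: a toy analysis described as a PROV activity in JSON-LD.
import json

analysis = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "title": "http://purl.org/dc/terms/title",
    },
    "@id": "urn:example:higgs-analysis-001",   # hypothetical identifier
    "@type": "prov:Activity",                  # the analysis as a PROV activity
    "title": "Toy dimuon analysis",
    "prov:used": {"@id": "urn:example:dataset-aod-2012"},    # input research object
    "prov:generated": {"@id": "urn:example:limit-plot-v3"},  # main outcome
}

# Cross-linking falls out for free: any other description can point at
# these @id values, building up the discovery graph the slide mentions.
print(json.dumps(analysis, indent=2))
```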

15 Detector Final State Description
- Published a paper at the International Conference on Knowledge Engineering and Knowledge Management
- Extracted test data sets from CMS and ATLAS publications to examine the pattern's usability and its ability to facilitate data access across experiments

16 Computational Activity
- Continued testing and validation of the Computational Activity and Computational Environment patterns
- Work on aligning the pattern with other vocabularies for software annotation and attribution, including the GitHub- and Mozilla Science-led "Code as a research object" effort

17 Overall Metadata work structure
Integration of the patterns into a knowledge-flow system that captures provenance and reproducibility information from a computational perspective, together with links to "higher-level" metadata descriptions of the data in terms of physics vocabularies.

18 Technology I: Containers
- Tools like chroot and Docker sandbox the execution of an application
- They offer the ability to convert an application to a container/image
- They virtualize only the essential functions of the compute-node environment and allow the local system to provide the rest: much faster computation; becoming the preferred solution over VMs for many computing environments

Comparison of execution times for an ATLAS application using PTU (packaged environment, redirecting system calls) versus a Docker container:
- Native execution time: 49m02s
- PTU capture time: 122m53s
- PTU re-run time: 114m05s
- Native execution in a container (Docker): 58m40s

(Diagram: server stack of Host OS, Docker Engine, Bin/Libs, App A, App B.) A capture sketch follows.
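To make the "convert an application to a container/image" step concrete, here is a minimal sketch of one common capture pattern using the Docker CLI from Python. The image names and run script are hypothetical; PTU's system-call-redirection approach, by contrast, captures the environment without a container engine.

```python
# Sketch: run an application once, then freeze and export its environment.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Run the analysis once inside a base image, leaving its outputs and
#    installed dependencies in the container's filesystem.
run(["docker", "run", "--name", "atlas-job",
     "scientificlinux:6", "/bin/sh", "-c", "./run_analysis.sh"])

# 2. Snapshot the stopped container as a new image: this freezes the
#    exact environment the job ran in.
run(["docker", "commit", "atlas-job", "daspos/atlas-analysis:v1"])

# 3. Export the image as a tarball for deposit in a preservation archive.
run(["docker", "save", "-o", "atlas-analysis-v1.tar",
     "daspos/atlas-analysis:v1"])
```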

19 Technology I: Containers
Portability = Preservation! (Otherwise identical to the previous slide.)

20 Technology II: Smart Containers

21 Smart Containers
1. API to write metadata
2. Metadata storage and standardization
3. Specification of data location
4. Add machine-readable labels
5. Link things together into a knowledge graph, enabling search
A labeling sketch follows.
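As an illustration of step 4, the sketch below attaches a JSON-LD provenance fragment to a Docker image as a label, so that other tools can recover it without running the image. The label key and image names are assumptions for illustration, not the Smart Containers implementation.

```python
# Sketch: bake a machine-readable provenance label into a Docker image.
import json
import subprocess

prov = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@type": "prov:Entity",
    "prov:wasGeneratedBy": {"@type": "prov:Activity",
                            "prov:used": "urn:example:input-dataset"},
}

# Layer the label onto an existing image via a one-line Dockerfile on stdin.
subprocess.run(
    ["docker", "build", "-t", "daspos/labeled:v1",
     "--label", "org.example.prov=" + json.dumps(prov),
     "-"],  # read the Dockerfile from stdin
    input=b"FROM daspos/atlas-analysis:v1\n",
    check=True,
)

# Later, any tool can recover the provenance without running the image:
#   docker inspect --format '{{ index .Config.Labels "org.example.prov" }}' daspos/labeled:v1
```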

22 Containers Workshop

23 Containers Workshop

24 Containers Workshop
Captured surging interest in container technologies for all manner of applications. Attendees included:
- a Principal Software Engineer from Red Hat
- a Senior Software Engineer from Docker
- the ReproZip developers
- CS specialists in containers and virtualization
Preservation examples: OpenMalaria; Bertini (numerical algebraic geometry); an HEP analysis

25 Technology III: CDF/DØ
As part of the effort to preserve software, executables, and data for the Tevatron experiments, we performed a pilot installation of the DØ code base outside of Fermilab:
- CVMFS is used to deliver code to any node outside of FNAL, with executables running under VMs integrated into the batch system
- data is delivered remotely by the SAM protocol
- the W mass measurement served as the template analysis
The full analysis chain, including remote access to data, code, and executable versions of the software, was demonstrated. A sketch of the delivery pattern follows.
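A minimal sketch of the delivery pattern described above: code arrives on any node via a read-only CVMFS mount, and the job simply runs against it. The repository name, setup script, and analysis command are hypothetical, not the actual DØ deployment.

```python
# Sketch: run a CVMFS-delivered analysis executable on an arbitrary node.
import os
import subprocess

CVMFS_REPO = "/cvmfs/dzero.example.org"   # hypothetical repository name
SETUP = os.path.join(CVMFS_REPO, "setup.sh")

if not os.path.isdir(CVMFS_REPO):
    raise SystemExit("CVMFS repo not mounted; is the cvmfs client configured?")

# Source the experiment environment, then run the analysis executable that
# CVMFS delivered; data access (e.g. via SAM) happens inside the job.
subprocess.run(["/bin/bash", "-c",
                f"source {SETUP} && wmass_analysis --input sam://dataset"],
               check=True)
```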

26 CERN Analysis Preservation Portal
Diagram of the portal architecture:
- Container Cluster: CERN OpenStack, running containerized processes
- "Containerizer Tools": PTU, Parrot scripts, used to capture processes; deliverable stored in DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: metadata, data
- Tools: run containers/workflows
The cluster runs the captured processes and stores the results in the archives.

27 RECAST + CERN Analysis Portal
Streamlined demonstration of the full preservation chain:
- "RECAST": re-use of a completed/preserved analysis with different inputs for data comparison
- a special schema was developed specifically to describe these steps:
- "packtivity" bundles an executable (Docker container), its environment, and an executable description, specifying individual processing stages
- "yadage" captures how the pieces fit together into a parametrized workflow
- the CERN Analysis Preservation Portal can store the descriptions of these processes, allowing re-use of the stored processing chain (a sketch follows)
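To make the packtivity idea concrete, here is a minimal sketch of a single stage description expressed as a Python dict. The field names are modeled loosely on published packtivity examples (a command template, a Docker-encapsulated environment, and a publisher) and should be read as illustrative, not authoritative; the image and script names are hypothetical.

```python
# Sketch: one processing stage bundling WHAT to run, WHERE to run it,
# and WHAT it publishes.
import json

step = {
    "process": {
        "process_type": "interpolated-script-cmd",
        "script": "./fit.py --data {input_file} --out {result_file}",
    },
    "environment": {
        "environment_type": "docker-encapsulated",
        "image": "daspos/atlas-analysis",   # hypothetical image
        "imagetag": "v1",
    },
    "publisher": {
        "publisher_type": "interpolated-pub",
        "publish": {"result": "{result_file}"},
    },
}

# A yadage workflow would wire several such steps together, binding each
# step's published outputs to the next step's parameters.
print(json.dumps(step, indent=2))
```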

28 RECAST + CERN Analysis Portal
Workflow schematic: As stored in CAP

29 RECAST + CERN Analysis Portal
- Instructions and workflow descriptions can be extracted from CAP and used to instantiate jobs based on the stored info
- the prototype RECAST cluster infrastructure (website, result storage, message passing, job queue, workflow engine) is itself fully dockerized
- a RECAST service instance can therefore be deployed to any Docker Swarm cloud (Carina, Google Container Engine, CERN Container Project); each of these is a re-execution of a preserved ATLAS analysis (see the sketch below)
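Because every RECAST component ships as a Docker image, standing up a service instance on a swarm reduces to a handful of service-creation calls. A minimal sketch, with hypothetical service and image names:

```python
# Sketch: deploy a dockerized RECAST-style service stack to a Docker Swarm.
import subprocess

SERVICES = {
    "recast-web":    "recast/website:latest",        # hypothetical images
    "recast-queue":  "recast/job-queue:latest",
    "recast-engine": "recast/workflow-engine:latest",
}

for name, image in SERVICES.items():
    subprocess.run(["docker", "service", "create",
                    "--name", name, "--replicas", "1", image],
                   check=True)
```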

30 Collaborations/Spin-offs
- RDA: preservation/reproducibility
- Open Science Framework: pioneering Campus/OSF interactions
- Wright State: ontology specialists
- National Data Service: dashboard, archived computations, containers
- DIANA: collaboration on goals, some preservation efforts

31 Next Steps: Another scouting expedition?
Our ultimate goal is to change how science is done in a computing context so that it has greater integrity and productivity. In DASPOS1 we developed prototype techniques that improve the expression and archiving of artifacts. Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large, multidisciplinary body of work. For this to have impact, we will develop small communities of practice that apply these techniques using the archives and tools relevant to their discipline. Another way to phrase this: study and prototype the kinds of knowledge-preservation tools that would make doing science easier and enable broader, better science.

