Mike Hildreth representing the DASPOS Team


1 The DASPOS Project
Mike Hildreth, representing the DASPOS Team

2 DASPOS Data And Software Preservation for Open Science
Multi-disciplinary effort
  Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
  Links HEP effort (DPHEP + experiments) to Biology, Astrophysics, Digital Curation, and other disciplines
  Includes physicists, digital librarians, computer scientists
Aim to achieve some commonality across disciplines in
  meta-data descriptions of archived data
    What’s in the data, how can it be used?
  computational description (ontology/metadata development)
    How was the data processed?
    Can computation replication be automated?
  impact of access policies on preservation infrastructure

3 DASPOS
In parallel, will build test technical infrastructure to implement a knowledge preservation system
  “Scouting party” to figure out where the most pressing problems lie, and some solutions
  Incorporate input from multi-disciplinary dialogue, use-case definitions, policy discussions
Will translate needs of analysts into a technical implementation of meta-data specification
Will develop means of specifying processing steps and the requirements of external infrastructure (databases, etc.)
Will implement “physics query” infrastructure across small-scale distributed network
End result: “template architecture” for data/software/knowledge preservation systems

4 DASPOS Overview
Digital Librarian Expertise
  How to catalogue and share data
  How to curate and archive large digital collections
Computer Science Expertise
  How to build databases and query infrastructure
  How to develop distributed storage networks
Science Expertise
  What does the data mean?
  How was it processed?
  How will it be re-used?

5 DASPOS Process
Multi-pronged approach for individual topics
  NYU/Nebraska: RECAST and other developments
  UIUC/Chicago: Workflows, Containers
  ND: Metadata, Containers, Workflows, Environment specification
Shared validation & examples
Workshops & All-hands meetings
Shared collaboration with CERN, DPHEP
Outreach to other disciplines

6 Prototype Architecture
Container Cluster (test bed)
  Capable of running containerized processes
“Containerizer Tools”
  PTU, Parrot scripts
  Used to capture processes
  Deliverable: stored in DASPOS git
Preservation Archive
  Metadata
  Container images
  Workflow images
  Instructions to reproduce
  Data?
Data Archive
  Data
Tools
  Run containers/workflows
  Discovery/exploration
  Unpack/analyze
Policy & Curation
  Access Policies
  Public archives?
  Domain-specific
  Inspire
Arrow labels: run, store
Line legend: Data path, Metadata links

7 Prototype Architecture
Same architecture diagram as slide 6, with components marked by status: ~Done, Under development, Not done

8 Infrastructure I: Environment Capture

9 Umbrella

10 Umbrella
Current version of Umbrella can work with:
  Docker – create container, mount volumes.
  Parrot – download tarballs, mount at runtime.
  Amazon – allocate VM, copy and unpack tarballs.
  Condor – request compatible machine.
  Open Science Framework – deploy uploaded containers.
Example Umbrella apps:
  Povray ray-tracing application
  OpenMalaria simulation
  CMS high energy physics simulation
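The idea of one environment description being realized on several back-ends can be sketched as follows. This is only an illustration under stated assumptions: `EnvSpec`, `BACKEND_STEPS`, and `plan` are hypothetical names, the spec fields are a simplification and not Umbrella's actual specification format, and the per-back-end steps are simply paraphrased from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class EnvSpec:
    """Hypothetical, simplified environment description (not Umbrella's real schema)."""
    os_image: str
    software: list = field(default_factory=list)
    data: list = field(default_factory=list)


# Back-end name -> provisioning steps, paraphrased from the slide (assumption: order matters).
BACKEND_STEPS = {
    "docker": ["create container", "mount volumes"],
    "parrot": ["download tarballs", "mount at runtime"],
    "amazon": ["allocate VM", "copy and unpack tarballs"],
    "condor": ["request compatible machine"],
}


def plan(spec, backend):
    """Return human-readable steps for realizing the same spec on one back-end."""
    if backend not in BACKEND_STEPS:
        raise ValueError(f"unknown back-end: {backend}")
    items = ", ".join([spec.os_image, *spec.software, *spec.data])
    return [f"{backend}: {step} for {items}" for step in BACKEND_STEPS[backend]]


if __name__ == "__main__":
    # Placeholder component names, chosen only for illustration.
    spec = EnvSpec(os_image="example-os", software=["example-app"], data=["example-input"])
    for backend in BACKEND_STEPS:
        for line in plan(spec, backend):
            print(line)
```

The point of the sketch is that the specification stays fixed while only the provisioning steps change per back-end.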

11 Infrastructure II: Workflow Capture

12 PRUNE

13 PRUNE
Works across multiple workflow repositories
Is interfaced with Umbrella for environment specification on multiple platforms
Reproducible, flexible workflow preservation

14 Infrastructure III: Metadata
HEP Data Model Workshop (“VoCamp15ND”)
  Participants from HEP, Libraries, & Ontology Community*
    *new collaborations for DASPOS
  Define preliminary Data Models for CERN Analysis Portal
    describe:
      main high-level elements of an analysis
      main research objects
      main processing workflows and products
      main outcomes of the research process
    re-use components of developed formal ontologies
      PROV, Computational Observation Pattern, HEP Taxonomy, etc.
Patterns implemented in JSON-LD format for use in CERN Analysis Portal
  will enable discovery, cross-linking of analysis descriptions
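To give a feel for what a JSON-LD analysis description that re-uses PROV might look like, here is a minimal sketch. It is an assumption-laden illustration, not the data model defined at the workshop or used in the CERN Analysis Portal: the `ex:` identifiers and the `analysis_description` helper are hypothetical, and only the W3C PROV terms (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`) are taken from an existing vocabulary.

```python
import json


def analysis_description(analysis_id, dataset_id, result_id):
    """Build a small JSON-LD document linking a dataset, an analysis, and its result."""
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "http://example.org/",  # placeholder namespace (assumption)
        },
        "@graph": [
            {"@id": dataset_id, "@type": "prov:Entity"},
            {
                "@id": analysis_id,
                "@type": "prov:Activity",
                "prov:used": {"@id": dataset_id},
            },
            {
                "@id": result_id,
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": analysis_id},
            },
        ],
    }


if __name__ == "__main__":
    doc = analysis_description("ex:analysis-1", "ex:dataset-1", "ex:result-1")
    print(json.dumps(doc, indent=2))
```

Because every node carries an `@id`, separate descriptions that mention the same dataset or result can be cross-linked, which is what makes discovery across analyses possible.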

15 Detector Final State Description
Published paper at “International Conference on Knowledge Engineering and Knowledge Management”
Extraction of test data sets from CMS and ATLAS publications to examine pattern usability and ability to facilitate data access across experiments

16 Computational Activity
Continued testing and validation of the Computational Activity and Computational Environment patterns
Work on aligning pattern with other vocabularies for software annotation and attribution, including the GitHub and Mozilla Science-led “Code as a research object” effort

17 Overall Metadata work structure
Integration of patterns into a knowledge flow system that captures provenance and reproducibility information from a computational perspective, as well as links to “higher-level” metadata descriptions of the data in terms of physics vocabularies

18 Technology I: Containers
Tools like chroot and Docker sandbox the execution of an application
  Offer the ability to convert application to a container/image
  Virtualize only essential functions of the compute node environment, allow local system to provide the rest
    much faster computation
  Becoming the preferred solution over VMs for many computing environments
Execution times:
  Native execution: 49m2s
  PTU capture: 122m53s
  PTU re-run: 114m05s
  Native execution in container [Docker]: 58m40s
Diagram (container stack): Server, Host OS, Docker Engine, Bin/Libs, App A, App B
Caption: Comparison of execution time for an ATLAS application using PTU (packaged environment, redirecting system calls) or a Docker container
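The timing comparison is easier to read as slowdown factors relative to native execution. The short sketch below does that arithmetic on the four durations quoted on the slide; the helper names `to_seconds` and `slowdown` and the dictionary keys are introduced here only for illustration. With these numbers, the Docker run comes out at roughly 1.2 times the native time, while PTU capture and re-run are roughly 2.5 and 2.3 times native.

```python
import re

# Durations as quoted on the slide for the ATLAS application.
TIMES = {
    "native": "49m2s",
    "ptu_capture": "122m53s",
    "ptu_rerun": "114m05s",
    "docker": "58m40s",
}


def to_seconds(text):
    """Convert a duration written like '49m2s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def slowdown(times, baseline="native"):
    """Return each run time divided by the baseline run time."""
    base = to_seconds(times[baseline])
    return {name: to_seconds(value) / base for name, value in times.items()}


if __name__ == "__main__":
    for name, ratio in slowdown(TIMES).items():
        print(f"{name}: {ratio:.2f}x native")
```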

19 Technology I: Containers
Portability = Preservation!

20 Technology II: Smart Containers

21 Smart Containers
Search
API to write metadata
Metadata storage and standardization
Specification of data location
4. Add machine-readable labels
5. Link things together into a knowledge graph

22 Containers Workshop

23 RECAST
Diagram labels: “Analysis”, Data, Workflow, New Models
Preserved workflows can be used to compare new models with a published analysis
Reinterpretation possible with full detector simulation, analysis chain
“Folding” rather than “Unfolding” like in HEPData
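The difference between folding and unfolding can be illustrated with a toy calculation. In folding, a new model's truth-level prediction is pushed through the detector response and compared with what was observed, instead of correcting the observed data back to truth level. The sketch below is a deliberately simplified stand-in: RECAST relies on the full detector simulation and analysis chain, not on a response matrix, and the function names and every number here are made-up toy values for illustration only.

```python
def fold(response, truth):
    """Apply a detector response: reco[i] = sum over j of response[i][j] * truth[j]."""
    return [sum(r_ij * t_j for r_ij, t_j in zip(row, truth)) for row in response]


def chi_square(predicted, observed):
    """Simple Pearson chi-square between folded prediction and observed yields."""
    return sum((o - p) ** 2 / p for p, o in zip(predicted, observed))


if __name__ == "__main__":
    # Toy values (assumptions): 2 truth bins migrating into 2 reconstructed bins.
    response = [[0.8, 0.1],
                [0.2, 0.7]]
    new_model_truth = [100.0, 50.0]
    observed = [90.0, 52.0]

    folded = fold(response, new_model_truth)
    print("folded prediction:", folded)
    print("chi-square vs observed:", chi_square(folded, observed))
```

The comparison happens at the reconstructed level, so the published analysis selection can be re-used unchanged for each new model.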

24 CERN Analysis Portal & REANA

25 REANA

26 Workflow Preservation
Individual processing steps: packtivity
  bundles executable (docker container), environment, executable description
  working on implementation of step description with umbrella
  either create containers for submission or run on separate back-end
yadage captures how pieces fit together into a parametrized workflow
  allows for re-use of stored processing chain, component by component
JSON specification of workflow
Much of original infrastructure developed by Lukas Heinrich
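The split between self-contained steps and a parametrized workflow that wires them together can be sketched as below. This is not the packtivity or yadage schema: the JSON layout, the step names, the image names, and the `execution_plan` helper are all hypothetical and chosen only to show the idea. The sketch assumes Python 3.9 or later for `graphlib`.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical workflow description: each step names a container image and a
# command template; "depends_on" records how the pieces fit together.
WORKFLOW_JSON = """
{
  "parameters": {"nevents": 1000},
  "steps": [
    {"name": "analyze",  "image": "example/analysis",  "command": "analyze sim.out",        "depends_on": ["simulate"]},
    {"name": "generate", "image": "example/generator", "command": "generate --n {nevents}", "depends_on": []},
    {"name": "simulate", "image": "example/simulation", "command": "simulate gen.out",      "depends_on": ["generate"]}
  ]
}
"""


def execution_plan(spec):
    """Return (step name, image, concrete command) tuples in dependency order."""
    params = spec["parameters"]
    steps = {step["name"]: step for step in spec["steps"]}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(step["depends_on"]) for name, step in steps.items()}
    order = TopologicalSorter(graph).static_order()
    return [
        (name, steps[name]["image"], steps[name]["command"].format(**params))
        for name in order
    ]


if __name__ == "__main__":
    for name, image, command in execution_plan(json.loads(WORKFLOW_JSON)):
        print(f"{name}: run '{command}' in {image}")
```

Changing a value under `parameters` re-uses the same stored chain with different inputs, and any single step can be swapped without touching the others.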

27 REANA Workflows
Workflow schematic: As stored in CAP

28 Next Steps: DASPOS 2.0? Another scouting expedition?
Our goal is ultimately to change how science is done in a computing context so that it has greater integrity and productivity. We have developed some prototype techniques (in DASPOS1) that improve the expression and archival of artifacts. Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large, multidisciplinary body of work. For this to have impact, we will develop small communities of practice that will apply these techniques using the archives and tools relevant to their discipline. Another way to phrase this might be: to study and prototype the kinds of knowledge preservation tools that might make doing science easier and would enable broader and better science.

29 References
Douglas Thain, Peter Ivie, and Haiyan Meng, Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?, 12th International Conference on Digital Preservation (iPres), November, DOI: /R0CZ353M
Umbrella:
Haiyan Meng and Douglas Thain, Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids, Workshop on Virtualization Technologies in Distributed Computing (VTDC) at HPDC, June, DOI: /
Haiyan Meng, Rupa Kommineni, Quan Pham, Robert Gardner, Tanu Malik, and Douglas Thain (2015), An Invariant Framework for Conducting Reproducible Computational Science, Journal of Computational Science, April, DOI: /j.jocs
And the parrot packaging work as well:
Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, and Douglas Thain, A Case Study in Preserving a High Energy Physics Application with Parrot, Journal of Physics: Conference Series (CHEP 2015), December, 2015.
RECAST demo:
Metadata work:
K. Janowicz, P. Hitzler, B. Adams, D. Kolas, and C. Vardeman II (2014), Five Stars of Linked Data Vocabulary Use, Semantic Web Journal, 5 (3), 173–176
Charles Vardeman II, Adila Krisnadhi, Michelle Cheatham, Krzysztof Janowicz, Holly Ferguson, Pascal Hitzler, and Aimee P. C. Buccellato (2015), An Ontology Design Pattern and Its Use Case for Modeling Material Transformation, Semantic Web Journal, to appear.

