Mike Hildreth representing the DASPOS project

Presentation transcript:

DASPOS Update
Mike Hildreth, representing the DASPOS project

DASPOS: Data And Software Preservation for Open Science
A multi-disciplinary effort: Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
Links the HEP effort (DPHEP + experiments) to Biology, Astrophysics, Digital Curation, and other disciplines; includes physicists, digital librarians, and computer scientists
Aims to achieve some commonality across disciplines in:
- meta-data descriptions of archived data: what's in the data, and how can it be used?
- computational description (ontology/metadata development): how was the data processed? can computation replication be automated?
- the impact of access policies on preservation infrastructure

DASPOS
In parallel, DASPOS will build test technical infrastructure to implement a knowledge preservation system:
- a "scouting party" to figure out where the most pressing problems lie, and some solutions
- incorporating input from multi-disciplinary dialogue, use-case definitions, and policy discussions
- will translate the needs of analysts into a technical implementation of the meta-data specification
- will develop means of specifying processing steps and the requirements of external infrastructure (databases, etc.)
- will implement "physics query" infrastructure across a small-scale distributed network
End result: a "template architecture" for data/software/knowledge preservation systems

DASPOS Overview
- Digital Librarian Expertise: how to catalogue and share data; how to curate and archive large digital collections
- Computer Science Expertise: how to build databases and query infrastructure; how to develop distributed storage networks
- Science Expertise: what does the data mean? how was it processed? how will it be re-used?

DASPOS Process
A multi-pronged approach to individual topics:
- NYU/Nebraska: RECAST and other developments
- UIUC/Chicago: workflows, containers
- ND: metadata, containers, workflows, environment specification
Shared validation & examples; workshops & all-hands meetings; shared collaboration with CERN and DPHEP; outreach to other disciplines

Prototype Architecture (diagram)
- Container Cluster: test bed capable of running containerized processes
- "Containerizer Tools": PTU, Parrot scripts used to capture processes; deliverables stored in the DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: data
- Tools: run containers/workflows, discovery/exploration, unpack/analyze
- Policy & Curation: access policies; public archives? domain-specific (Inspire)
- Connections in the diagram indicate the data path and metadata links between components

Prototype Architecture (same diagram, with each component marked as approximately done, under development, or not done)

Infrastructure I: Environment Capture

Umbrella

Umbrella
The current version of Umbrella can work with:
- Docker: create a container, mount volumes
- Parrot: download tarballs, mount at run-time
- Amazon: allocate a VM, copy and unpack tarballs
- Condor: request a compatible machine
- Open Science Framework: deploy uploaded containers
Example Umbrella apps:
- Povray ray-tracing application: http://dx.doi.org/doi:10.7274/R0BZ63ZT
- OpenMalaria simulation: http://dx.doi.org/doi:10.7274/R03F4MH3
- CMS high energy physics simulation: http://dx.doi.org/doi:10.7274/R0765C7T
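
To make the environment-capture idea concrete, below is a minimal sketch of an Umbrella-style specification, written as a Python dict and dumped to JSON. The section names loosely follow Umbrella's published examples (hardware, kernel, os, software, data, environ), but the exact field names and all values are illustrative assumptions, not a validated Umbrella spec.

```python
# Illustrative sketch only: an Umbrella-style environment specification,
# written as a Python dict and saved to JSON. Field names and values are
# assumptions for illustration, not the exact Umbrella schema.
import json

spec = {
    "comment": "CMS simulation environment (illustrative)",
    "hardware": {"arch": "x86_64", "cores": "2", "memory": "2GB", "disk": "10GB"},
    "kernel":   {"name": "linux", "version": ">=2.6.32"},
    "os":       {"name": "redhat", "version": "6.5"},
    "software": {
        "cmssw-5.2.5": {
            "mountpoint": "/cvmfs/cms.cern.ch",
            "source": "cvmfs://cvmfs/cms.cern.ch",
        }
    },
    "data": {
        "input.root": {
            "mountpoint": "/tmp/input.root",
            "source": "http://example.org/archive/input.root",  # hypothetical URL
        }
    },
    "environ": {"CMS_VERSION": "CMSSW_5_2_5"},
}

with open("cms_simulation.umbrella", "w") as f:
    json.dump(spec, f, indent=2)
```

Given a specification like this, the same execution can be materialized through Docker, Parrot, a cloud VM, or Condor, which is what makes the captured environment portable across the platforms listed above.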

Infrastructure II: Workflow Capture

PRUNE

PRUNE
- Works across multiple workflow repositories
- Is interfaced with Umbrella for environment specification on multiple platforms
- Result: reproducible, flexible workflow preservation
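
The sketch below illustrates the general idea behind PRUNE-style workflow preservation: every task is archived together with its command, inputs, outputs, and an environment specification (such as the Umbrella file above), so the whole chain can be replayed later. The record_task helper and its fields are hypothetical; this is not the actual PRUNE API.

```python
# Hypothetical sketch of PRUNE-style workflow preservation: each task is
# archived with its command, inputs, outputs, and environment spec so the
# full provenance chain can be replayed. Not the actual PRUNE API.
import hashlib
import json


def record_task(archive, command, inputs, outputs, environment_spec):
    """Store one workflow step; the returned id lets later steps refer to it."""
    entry = {
        "command": command,
        "inputs": inputs,                   # ids of previously recorded artifacts
        "outputs": outputs,                 # names of files the command produces
        "environment": environment_spec,    # e.g. an Umbrella spec file
    }
    task_id = hashlib.sha1(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    archive[task_id] = entry
    return task_id


archive = {}
sim = record_task(archive, "cmsRun simulate.py", [], ["events.root"],
                  "cms_simulation.umbrella")
ana = record_task(archive, "python analyze.py events.root", [sim],
                  ["histograms.root"], "cms_simulation.umbrella")
```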

Infrastructure III: Metadata
HEP Data Model Workshop ("VoCamp15ND"), with participants from HEP, Libraries, and the Ontology Community* (*new collaborations for DASPOS)
Goal: define preliminary data models for the CERN Analysis Portal that describe:
- the main high-level elements of an analysis
- the main research objects
- the main processing workflows and products
- the main outcomes of the research process
Re-use components of developed formal ontologies: PROV, the Computational Observation Pattern, the HEP Taxonomy, etc.
Patterns are implemented in JSON-LD format for use in the CERN Analysis Portal and will enable discovery and cross-linking of analysis descriptions.
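
As a concrete illustration of the JSON-LD point, here is a small analysis-description fragment written as a Python dict. The prov: terms come from the standard PROV-O vocabulary; the daspos: vocabulary URL, identifiers, and values are invented for illustration and are not the actual CERN Analysis Portal schema.

```python
# Illustrative JSON-LD fragment (as a Python dict) for an analysis description.
# The PROV terms are standard PROV-O; everything else (field names, URLs,
# values) is a made-up example, not the CAP schema.
import json

analysis = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "daspos": "http://example.org/daspos/vocab#",  # hypothetical vocabulary URL
    },
    "@id": "daspos:analysis/example-dilepton-search",
    "@type": "prov:Activity",
    "daspos:finalState": "two leptons + missing transverse energy",
    "prov:used": [
        {"@id": "daspos:dataset/2012-run-b", "@type": "prov:Entity"},
    ],
    "daspos:workflow": {"@id": "daspos:workflow/selection-and-fit"},
    "daspos:outcome": {
        "@id": "daspos:result/limit-plot",
        "prov:wasGeneratedBy": {"@id": "daspos:analysis/example-dilepton-search"},
    },
}

print(json.dumps(analysis, indent=2))
```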

Detector Final State Description
- Published a paper at the International Conference on Knowledge Engineering and Knowledge Management (http://ekaw2016.cs.unibo.it)
- Extraction of test data sets from CMS and ATLAS publications (https://github.com/gordonwatts/HEPOntologyParserExperiments) to examine pattern usability and the ability to facilitate data access across experiments

Computational Activity
- Continued testing and validation of the Computational Activity and Computational Environment patterns (https://github.com/Vocamp/ComputationalActivity)
- Work on aligning the pattern with other vocabularies for software annotation and attribution, including the GitHub- and Mozilla Science-led "Code as a research object" effort (https://github.com/codemeta/codemeta)
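
For a sense of what these software-annotation vocabularies look like in practice, here is a small CodeMeta-style record as a Python dict. The property names (name, codeRepository, programmingLanguage, author, license) are standard CodeMeta/schema.org terms; the project name, repository URL, and author are invented examples, and the context URL should be checked against the current CodeMeta release.

```python
# Small CodeMeta-style software description (values are invented examples).
# CodeMeta builds on schema.org's SoftwareSourceCode type, which is what lets
# archived software be annotated and attributed in a machine-readable way.
import json

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",  # CodeMeta 2.0 context; verify version
    "@type": "SoftwareSourceCode",
    "name": "example-analysis-code",                        # hypothetical project
    "codeRepository": "https://github.com/example/analysis",  # hypothetical URL
    "programmingLanguage": "C++",
    "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}],
    "license": "https://spdx.org/licenses/MIT",
}

with open("codemeta.json", "w") as f:
    json.dump(codemeta, f, indent=2)
```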

Overall Metadata Work Structure
Integration of the patterns into a knowledge-flow system that captures provenance and reproducibility information from a computational perspective, together with links to "higher-level" metadata descriptions of the data in terms of physics vocabularies.

Technology I: Containers
Tools like chroot and Docker sandbox the execution of an application and offer the ability to convert the application into a container/image. They virtualize only the essential functions of the compute-node environment and allow the local system to provide the rest, giving much faster computation; containers are becoming the preferred solution over VMs for many computing environments.
Comparison of execution time for an ATLAS application using PTU (packaged environment, redirecting system calls) or a Docker container:
- Native execution: 49m02s
- PTU capture: 122m53s
- PTU re-run: 114m05s
- Native execution in a container [Docker]: 58m40s
(Figure: Docker Engine on the server host OS running Apps A and B with their bins/libs.)

Technology I: Containers
Portability = Preservation!
(Same content and timing comparison as the previous slide.)
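
A minimal sketch of the container-based capture step using the Docker SDK for Python: run an analysis inside a container, commit the resulting state as an image, and export it to a tarball that could be deposited in a preservation archive. The image name and command are placeholders, not part of the DASPOS tooling.

```python
# Minimal sketch (Docker SDK for Python): run an analysis step in a container,
# commit the resulting state as an image, and save it to a tar archive that can
# be deposited in a preservation archive. Image/command names are placeholders.
import docker

client = docker.from_env()

# Run the application inside a sandboxed container environment.
container = client.containers.run(
    "example/analysis-env:1.0",      # hypothetical preserved environment image
    "python run_analysis.py",        # hypothetical analysis command
    detach=True,
)
container.wait()

# Freeze the post-run state as a new image and export it for archiving.
image = container.commit(repository="example/analysis-result", tag="v1")
with open("analysis-result-v1.tar", "wb") as f:
    for chunk in image.save():
        f.write(chunk)
```

The exported tarball can later be re-imported with docker load on any host, which is the sense in which portability doubles as preservation.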

Technology II: Smart Containers

Smart Containers
- Add machine-readable labels to containers
- API to write metadata
- Metadata storage and standardization
- Specification of data location
- Search
- Link things together into a knowledge graph
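
A sketch of the machine-readable-label idea: attach a JSON-LD provenance record to a Docker image as an image label, so the metadata travels with the preserved artifact and can be harvested into a knowledge graph. This illustrates the general approach rather than the actual Smart Containers implementation; the label key, vocabulary URL, and image names are placeholders.

```python
# Sketch of the "machine-readable labels" idea: attach a JSON-LD provenance
# record to a Docker image as a label so the metadata travels with the
# preserved artifact. This illustrates the general approach, not the actual
# Smart Containers implementation; names and paths are placeholders.
import json
import docker

client = docker.from_env()

provenance = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@type": "prov:Entity",
    "prov:wasGeneratedBy": "daspos:analysis/example-dilepton-search",  # from earlier sketch
}

image, _ = client.images.build(
    path=".",                                   # directory containing a Dockerfile
    tag="example/analysis-env:labeled",
    labels={"org.example.provenance": json.dumps(provenance)},  # hypothetical label key
)

# Any consumer can later read the label back and follow the knowledge graph.
print(image.labels["org.example.provenance"])
```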

Containers Workshop
Captured surging interest in container technologies for all manner of applications.
Attendees included:
- the Principal Software Engineer of Red Hat
- a Senior Software Engineer from Docker
- the ReproZip developers
- CS specialists in containers and virtualization
Preservation examples: OpenMalaria, Bertini (numerical algebraic geometry), HEP analysis

Technology III: CDF/DØ
As part of the effort to preserve software, executables, and data for the Tevatron experiments, we performed a pilot installation of the DØ code base outside of Fermilab:
- CVMFS is used to deliver code to any node outside of FNAL, with executables running under VMs integrated into the batch system
- data is delivered remotely via the SAM protocol
- the W mass measurement was used as the template analysis
The full analysis chain, including remote access to data, code, and executable versions of the software, was demonstrated.

CERN Analysis Preservation Portal (diagram)
- Container Cluster: CERN OpenStack
- "Containerizer Tools": PTU, Parrot scripts used to capture processes; deliverables stored in the DASPOS git
- Preservation Archive: metadata, container images, workflow images, instructions to reproduce, data(?)
- Data Archive: metadata, data
- Tools: run containers/workflows

RECAST + CERN Analysis Portal
A streamlined demonstration of the full preservation chain:
- "RECAST": re-use of a completed/preserved analysis with different inputs for data comparison
- A special schema was developed specifically to describe these steps:
  - "packtivity": bundles an executable (a Docker container), its environment, and an executable description; specifies individual processing stages
  - "yadage": captures how the pieces fit together into a parametrized workflow
- The CERN Analysis Preservation Portal can store the descriptions of these processes, allowing re-use of the stored processing chain (see the sketch below)
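
A rough sketch of what a packtivity-style step description might contain, written here as a Python dict: one processing stage bundles the command to run, the Docker image providing its environment, and the outputs it publishes. The field names only paraphrase the general shape of such a description and are not guaranteed to match the actual packtivity schema; image and file names are placeholders.

```python
# Rough sketch of a packtivity-style step description (as a Python dict):
# one processing stage bundles a command, the Docker image that provides its
# environment, and a description of what it publishes. Field names here only
# paraphrase the general shape; they are not guaranteed to match the actual
# packtivity schema. Image and file names are placeholders.
step = {
    "process": {
        "process_type": "string-interpolated-cmd",
        "cmd": "run_selection --input {input_file} --output {output_file}",
    },
    "environment": {
        "environment_type": "docker-encapsulated",
        "image": "example/atlas-analysis",   # hypothetical preserved image
        "imagetag": "v1",
    },
    "publisher": {
        "publisher_type": "interpolated-pub",
        "publish": {"selected_events": "{output_file}"},
    },
}
```

A yadage workflow then wires several such steps together by feeding the published outputs of one stage into the parameters of the next, and CAP stores both levels of description so the chain can be re-instantiated.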

RECAST + CERN Analysis Portal
Workflow schematic, as stored in CAP (figure).

RECAST + CERN Analysis Portal
- Instructions and workflow descriptions can be extracted from CAP and used to instantiate jobs based on the stored information
- The prototype RECAST cluster infrastructure (website, result storage, message passing, job queue, workflow engine) is itself fully dockerized
- A RECAST service instance can therefore be deployed to any Docker Swarm cloud (Carina, Google Container Engine, CERN Container Project)
- Each of these is a re-execution of a preserved ATLAS analysis

Collaborations/Spin-offs
- RDA: preservation/reproducibility
- Open Science Framework: pioneering Campus/OSF interactions
- Wright State: ontology specialists
- National Data Service: dashboard, archived computations, containers
- DIANA: collaboration on goals, some preservation efforts

Next Steps: Another scouting expedition?
Our goal is ultimately to change how science is done in a computing context so that it has greater integrity and productivity. We have developed some prototype techniques (in DASPOS1) that improve the expression and archiving of artifacts. Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large, multidisciplinary body of work.
For this to have impact, we will develop small communities of practice that will apply these techniques using the archives and tools relevant to their discipline. Another way to phrase this: we want to study and prototype the kinds of knowledge-preservation tools that might make doing science easier and would enable broader and better science.