Long Term Data Preservation Frank Berghaus On Behalf of the DPHEP Collaboration 06/03/15 Data Preservation - Frank Berghaus1.

Presentation transcript:

Long Term Data Preservation
Frank Berghaus, on behalf of the DPHEP Collaboration

Objectives
Preserve data, software, and know-how in the collaborations:
- Data/bit preservation (MoU between CERN and the Tier-1s)
- Analysis preservation
- Preserve software evolution alongside data
- Timescale: “Forever”, 30+ years past experiment life
Share data, software, and know-how:
- Larger scientific community
- Education and outreach
CernVM is well placed:
- Many experiments (LHC and others) already use CernVM and CVMFS
- Virtualization is ubiquitous, and probably won’t go away

Data Preservation & Sharing
- LHC experiments defined strategy and scope in these policy documents
- Many overlapping requirements and tasks!
Example experiments preserving data:
Experiment: Approach
BaBar: Virtual machines & infrastructure servers
HERMES: Web/wiki, logbook, mailing lists, dCache, AFS, BIRD batch, GRID
ALEPH: Running code in an SL6 VM, open to CernVM & CVMFS
Belle: Reformatting data for Belle II software
CDF: Plans to preserve all data & software, R&D stage

Analysis Preservation
Goal: Reproducibility for the collaboration
- Capture physics analysis code (snapshot?)
  - Libraries and compiler -> depend on hardware
  - Virtualization easier to port than individual software
- Analysis metadata
  - Experiment conditions
  - Software provenance
  - Input data

Analysis Preservation: Data
Input data:
- RAW data and reconstruction code
  - Software provenance & full database (alignment, conditions, etc.)
- Or capture input data with analysis, per analysis
  - Limits scientific reach
  - Data volume may not be feasible to capture
Capture analysis and production software environment

Capture Approaches
- Store static virtual machine with analysis
  - Capture all code, database, and environment in a single large (many GB) VM image
- Use a set of Docker containers for each analysis
  - Create system containers, each hosting a part of the necessary services
- Idea: use Docker to capture analysis infrastructure and CernVM as a consistent host?
  - Contextualize CernVM from analysis metadata
  - Use CVMFS to provide operating system, compilers, experiment software & databases, and external libraries for physics analysis code
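A minimal sketch of what "contextualize CernVM from analysis metadata" could look like, assuming the metadata arrives as a small dictionary. The metadata keys, the experiment name, and the repository names are hypothetical. The section and key names (`amiconfig`, `cernvm`, `organisations`, `repositories`) are modeled on CernVM's INI-style user data and should be checked against the CernVM contextualization documentation before use.

```python
import configparser
import io


def build_user_data(metadata):
    """Render hypothetical analysis metadata as INI-style user data."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep key case exactly as written
    # Section and key names are assumptions, not a verified CernVM schema.
    config["amiconfig"] = {"plugins": "cernvm"}
    config["cernvm"] = {
        "organisations": metadata["experiment"],
        "repositories": ",".join(metadata["cvmfs_repositories"]),
    }
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical metadata record, as it might be exported from a capture portal.
example_metadata = {
    "experiment": "EXAMPLE",
    "cvmfs_repositories": ["example-sw", "example-conditions"],
}
user_data = build_user_data(example_metadata)
print(user_data)
```

The point of the sketch is that the VM image stays generic: only this small text file changes per analysis, and the software itself would come from CVMFS.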

Analysis Capture Framework
- Invenio as mechanism to capture publications and metadata: invenio-software.org
- Contextualize or create service orchestration from metadata
- Demonstration portal for analysis and metadata capture: analysis-preservation.cern.ch

Analysis Preservation Demo Portal (slides 8 and 9)

Metadata capture
Goal: script to capture analysis environment information
- Upload information to the portal
- Retrieve software provenance
- Parse input data through existing experiment provenance systems (AMI, etc.)
- Could this be used with cvmfs tagging?
Analysis preservation portal to allow additions, modifications, review, and archival
What helper scripts/API already exist?
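A rough sketch of the capture script described above, using only the Python standard library. The record layout, the environment-variable prefixes, and the assumption that CVMFS repositories appear as directories under `/cvmfs` are illustrative choices, not an existing portal schema. The upload step is left out because the portal API is not described here.

```python
import json
import os
import platform
import sys


def capture_environment(cvmfs_root="/cvmfs", env_prefixes=("ROOT", "LD_LIBRARY")):
    """Collect basic facts about the environment an analysis runs in."""
    try:
        # With autofs, only currently mounted repositories may be listed.
        repositories = sorted(os.listdir(cvmfs_root))
    except OSError:
        repositories = []  # no CVMFS on this host, or not readable
    environment = {
        key: value
        for key, value in os.environ.items()
        if key.startswith(env_prefixes)
    }
    return {
        "os": platform.platform(),
        "architecture": platform.machine(),
        "python_version": sys.version.split()[0],
        "cvmfs_repositories": repositories,
        "environment": environment,
    }


record = capture_environment()
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing the record as JSON keeps it easy to attach to a portal entry later, whatever upload mechanism is chosen.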

Running Captured Analyses
- Start with RECAST
- Capture analysis for Model A with a given final state
- Model B has the same final state
- Reinterpret captured analysis under the new model
From Kyle Cranmer & Lukas Heinrich
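A toy illustration of the reinterpretation idea, not the RECAST interface: the selection captured for Model A is kept fixed and applied unchanged to events from Model B, which populates the same final state. The event fields, the cut values, and the event lists are all invented for this sketch.

```python
def captured_selection(event):
    """Hypothetical preserved selection for the shared final state."""
    return event["n_leptons"] == 2 and event["missing_et"] > 100.0


def signal_efficiency(events, selection):
    """Fraction of events that pass the selection."""
    if not events:
        raise ValueError("need at least one event")
    passed = sum(1 for event in events if selection(event))
    return passed / len(events)


# Invented toy events for the original model (A) and the new model (B).
model_a_events = [
    {"n_leptons": 2, "missing_et": 150.0},
    {"n_leptons": 2, "missing_et": 80.0},
    {"n_leptons": 1, "missing_et": 200.0},
    {"n_leptons": 2, "missing_et": 120.0},
]
model_b_events = [
    {"n_leptons": 2, "missing_et": 300.0},
    {"n_leptons": 2, "missing_et": 90.0},
    {"n_leptons": 2, "missing_et": 250.0},
    {"n_leptons": 2, "missing_et": 110.0},
]

eff_a = signal_efficiency(model_a_events, captured_selection)
eff_b = signal_efficiency(model_b_events, captured_selection)
print(f"Model A efficiency: {eff_a}, Model B efficiency: {eff_b}")
```

Only the signal sample changes between the two calls. The selection, and in a real analysis the data and background estimates, are reused from the captured analysis.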

RECAST
- Frontend accepts requests for processing: recast.perimeterinstitute.ca
- Collaboration approves requests for batch processing at the control center: recast-demo.cern.ch
- Backend processing is done on CernVM running on CERN OpenStack
- Software requirements (repository suggestions?):
  - Rivet: analysis infrastructure
  - Fast simulation: FastSim, ATOM, etc.
  - Full experiment simulation
- Scalable processing backend:
  - Condor/cloud scheduler: proven to work with CernVM
  - OpenStack HEAT template?
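The request flow on this slide (frontend submission, collaboration approval, backend processing) can be pictured as a small state machine. The sketch below is hypothetical: the state names, the class, and its fields are invented for illustration and do not reflect the actual RECAST frontend or control-center code.

```python
from dataclasses import dataclass, field

# Allowed transitions; the state names are assumptions for this sketch.
ALLOWED_TRANSITIONS = {
    "submitted": {"approved", "rejected"},
    "approved": {"processing"},
    "processing": {"done"},
}


@dataclass
class ReinterpretationRequest:
    analysis: str
    model: str
    state: str = "submitted"
    history: list = field(default_factory=list)

    def advance(self, new_state):
        """Move to new_state if the transition is allowed."""
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.history.append(self.state)
        self.state = new_state


request = ReinterpretationRequest(analysis="hypothetical-analysis", model="Model B")
request.advance("approved")    # collaboration approves at the control center
request.advance("processing")  # backend picks the job up
request.advance("done")
print(request.state, request.history)
```

Keeping the approval step as an explicit state means nothing reaches the processing backend without the collaboration signing off.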

Ideas from yesterday

Open Access: Science
- Reinterpreting existing analysis for new model
  - Easy to use
  - Needs interface & resources
  - ATLAS: Developing RECAST
- Access to experiment software, data, and simulation
  - Introduced in Kati Lassila-Perini’s talk on Thursday
  - Follow “Research” at:
  - Ioannes’ WebAPI would be amazing here!
  - CMS: 50% (~27 TB) of 2010 data released
  - ALICE: Will release 10 TB of 2010 data this year

Summary/Questions
- CernVM is a great candidate for preservation: LHC experiments and some others leverage CernVM and CVMFS already
- What about supporting other disciplines (i.e. non-SL distributions)?
- Could we distribute small (analysis-level) databases via cvmfs?
- What about large databases needed for production?
- Where would cvmfs versioning be useful?

Backup

Aside - Zenodo: Fringe science
- Stores publication with data and code
- Often generic code using open source tools, e.g. R, SciPy
- CernVM WebApp could be useful to create test analysis environment?

Data Preservation outside HEP
Examples from outside HEP:
- Space & Astronomy: NVO, EURO-VO
- Earth & Ocean: PANGAEA
- Life Sciences: ELIXIR
Minimize software dependence by using ubiquitous and well documented standard formats
- No current standard data format for HEP
- HEP experiments, simulation, and analysis require complex tools

CVMFS For Preservation
Pros:
- Many experiments already use cvmfs
- CERN and Tier-1s agreed to maintain infrastructure (MoU)
Cons:
- Requires infrastructure
- Experiment must use CVMFS
Use beyond LHC experiments:
- Is it reasonable to expect CVMFS adoption?
- Ease of using non-SL, non-RPM Linuxes?

Existing Approaches in HEP (from: DPHEP)
Experiments (table columns): BaBar, H1, ZEUS, HERMES, Belle, BESIII, CDF, D0
Table rows, values in column order (cell boundaries are not preserved in the transcript):
End of DAQ: (no values in the transcript)
OS: SL 3/5 RHEL 3/5 SL 5 SL 3/5 SL 5 RHEL 5 SL 5 SL 5/6 SL 5
Languages: C++ Python Java C C++ Fortran Python C++ C C++ Fortran Python C C++ Fortran C++ C C++ Python C++
Simulation: GEANT 4 GEANT 3 GEANT 4 GEANT 3
External dependencies: ACE CERNLIB CLHEP CMLOG Flex GNU Bison MySQL Oracle ROOT TCL XRootD CERNLIB FastJet NeuroBayes Oracle ROOT ADAMO CERNLIB ROOT Boost CERNLIB CLHEP NeuroBayes PostgreSQL ROOT CASTPR CERNLIB CLHEP HepMC ROOT CERNLIB NeuroBayes Oracle ROOT Oracle ROOT

Existing Approaches at CERN (from: DPHEP)
Experiment  OS      Languages             External dependencies
ALICE       SL 6?   C++, Python           ROOT, ?
ATLAS       SL 6    C++, Fortran, Python  ROOT, ?
CMS         SL 5/6  C++, Python           ROOT, ?
LHCb        SL 6?   C++, Python           ROOT, ?
ALEPH       ?       ?                     CERNLIB, ?
DELPHI      ?       ?                     ?
L3          ?       ?                     ?
OPAL        ?       ?                     ?
End of DAQ: ~ (no per-experiment values in the transcript)

Open Data: Open Access
- Access to experiment software and data
  - See “Research” at:
  - Note: Ioannes’ WebAPI would be amazing here!
- HEP data requires custom and complex code
  - For each experiment
  - Code usually not portable, requires:
    - Specific libraries
    - Compiler version
    - Operating system & architecture
- Virtualization is ubiquitous and provides:
  - Libraries, compilers, and software tools

Existing Approaches in HEP (from: DPHEP)
Experiment: Approach
BaBar: Virtual machines & infrastructure servers
H1: People doing analysis, data stored
ZEUS: People doing analysis, data stored
HERMES: Web/wiki, logbook, mailing lists, dCache, AFS, BIRD batch, GRID
Belle: Developing Belle II code to be backwards compatible
BESIII: Thinking about preservation
CDF: Plans to preserve all data & software, R&D stage
D0: Under discussion, fraction of an FTE working on it
BaBar showed that virtualization works for preservation.

Existing Approaches at CERN
Experiment: Status
ALICE: Interested in CernVM & cvmfs
ATLAS: Investigating Docker, forward porting data, open to CernVM
CMS: Interested in CernVM & cvmfs, forward porting data
LHCb: Interested in CernVM & cvmfs
ALEPH: Porting code to modern OS & compilers. Building a VM, open to CernVM & CVMFS
DELPHI: ?
L3: ?
OPAL: Porting code to modern OS & compilers. Building a VM, open to CernVM & CVMFS
Even forward porting data requires validation against the old release.