Mirjam van Daalen (Stephan Egli, Derek Feichtinger) :: Paul Scherrer Institut
Status Report PSI
PaNDaaS2 meeting, Grenoble, 12 – 13 December 2016
Current projects at PSI
- Data Analysis Service
- Data Policy
- Remote access
- Metadata catalogue
- Petabyte archive
- Remote data transfer
PSI, 10 April 2019
Covering larger parts of the life cycle
Project Overview: Data Analysis Service
- SUK Project 142-004
- Project Managers: Dr. Stephan Egli, Dr. Derek Feichtinger, Paul Scherrer Institut
- Partner: ETHZ
- Project Duration: 04.2015 - 04.2017
- Financing Support (50% matching funds): CHF 1'618'000
- Team members: 16 (including 3 new positions financed by the project)
Work packages
- WP1: Common Tools and Services
- WP2: Data Analysis Environments for major use cases
- WP3: Identity Management, DUO, Authentication and Authorization
- WP4: Integration and development of scientific analysis codes
- WP5: Procurement, installation, operation of analysis cluster infrastructure
- WP6: Infrastructure sharing with other institutions
- WP7: Project Management
DaaS Project Status
- Main purpose: provide an integrated solution for all SLS users to do offline analysis of data taken at SLS (and later SwissFEL)
- Cluster of moderate size (~900 cores, 2 PB storage)
- Hired 3 people dedicated to this project
- Currently about 50% of the foreseen hardware is installed and in operation
- Now in test phase with invited external users and internal users
- Adjusting the system and software to the concrete use cases of these users; very good feedback so far
- Planning a storage upgrade to a total of about 3 PB by mid-2017
- Option to extend the cluster with “dedicated” resources (for paying customers), within the same infrastructure and using centrally provided hardware choices
Data Policy Status
- Data policy based on the PaNdata framework
- Policy was adopted by the Directorate in October 2016
- Policy applies not only to the large research facilities at PSI, but to all research activities
- Embargo period of 3 years, with easy extension to 5 years
- Implementation will be a long-term effort, proceeding stepwise per facility and beamline
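As a rough illustration of the embargo arithmetic, the Python sketch below computes when a dataset would leave its embargo. It assumes, hypothetically, that the embargo starts on the day the measurement ends and that an extension replaces the 3-year period with a 5-year one. The function name and the start-date convention are illustrative and not taken from the PSI policy text.

```python
from datetime import date

DEFAULT_EMBARGO_YEARS = 3   # embargo period stated on the slide
EXTENDED_EMBARGO_YEARS = 5  # extended period stated on the slide


def embargo_end(measurement_end: date, extended: bool = False) -> date:
    """Return the date on which the embargo ends.

    Assumption (not from the policy): the embargo is counted from the
    end of the measurement.
    """
    years = EXTENDED_EMBARGO_YEARS if extended else DEFAULT_EMBARGO_YEARS
    try:
        return measurement_end.replace(year=measurement_end.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year;
        # fall back to 28 February.
        return measurement_end.replace(year=measurement_end.year + years, day=28)


print(embargo_end(date(2016, 12, 13)))
print(embargo_end(date(2016, 12, 13), extended=True))
```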
Remote Access
- Use cases: online and offline analysis, remote measurements, shift operation, sharing of sessions for support tasks, sharing of sessions for collaboration
- Support for 3D hardware acceleration
- Access to the beamlines and to the DaaS cluster through a common gateway
- Architecture based on the separation of “server” and “node” processes in the NoMachine software, version 5
- Added a graphical management tool to define (time-based) access to beamline resources and the offline compute cluster, with role-based management delegated to the person responsible for each resource
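The time-based access model with delegated management can be pictured with a small Python sketch. It is not the PSI management tool: the `AccessGrant` record, the `may_grant` and `has_access` helpers, and all user and resource names are hypothetical, and the rule that only a resource's responsible person may create grants is an assumed reading of "delegated management".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessGrant:
    """A time-limited permission for one user on one resource (hypothetical model)."""
    user: str
    resource: str
    start: datetime
    end: datetime
    granted_by: str


def may_grant(manager: str, resource: str, responsibles: dict) -> bool:
    """Delegation check: only people registered as responsible for a resource may grant access."""
    return manager in responsibles.get(resource, set())


def has_access(user: str, resource: str, when: datetime, grants: list) -> bool:
    """True if any grant covers this user and resource at the given time (end is exclusive)."""
    return any(
        g.user == user and g.resource == resource and g.start <= when < g.end
        for g in grants
    )


# Placeholder names, for illustration only.
responsibles = {"beamline-A": {"alice"}}
grants = [
    AccessGrant("bob", "beamline-A", datetime(2016, 12, 12, 8), datetime(2016, 12, 13, 8), "alice")
]
```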
Data Catalogue
- Decision for an approach based on NoSQL document databases (MongoDB), taking advantage of recent developments in middleware (LoopBack) and component-based graphical user interfaces (Angular 2)
- Need to cover an extended set of use cases and the long-term evolution at PSI and ESS; the flexibility of the solution is therefore mission critical
- Currently preparing the production environment for data ingestion and recruiting for a developer position (3 years, 2017-2020)
- First 3 beamlines should be connected within the DaaS project timeframe, in spring 2017
- This is also a decision for continued collaboration with the ICAT community! E.g. working on a common API to aim for interoperability of current and future products
- Evaluate the potential to develop software (components) that can be used in a “product”-independent fashion
- Open to further suggestions…
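To illustrate why a document model suits this need for flexibility, here is a minimal Python sketch. It is not the PSI catalogue schema: the field names (`owner`, `beamline`, `created`, `scientific_metadata`), the sample values, and the `validate_dataset` helper are all hypothetical. The idea is that a small common core is checked for every dataset while beamline-specific metadata stays free-form, which a document database such as MongoDB permits.

```python
import json

# Hypothetical common core that every dataset document must carry.
REQUIRED_FIELDS = {"owner", "beamline", "created"}


def validate_dataset(doc: dict) -> list:
    """Return the sorted names of missing required fields (empty list if valid)."""
    return sorted(REQUIRED_FIELDS - set(doc))


# Two datasets with the same core but different technique-specific metadata.
tomography = {
    "owner": "example-group",
    "beamline": "beamline-A",
    "created": "2016-12-01T10:00:00Z",
    "scientific_metadata": {"rotation_steps": 1800, "exposure_ms": 50},
}
diffraction = {
    "owner": "example-group",
    "beamline": "beamline-B",
    "created": "2016-12-02T08:30:00Z",
    "scientific_metadata": {"wavelength_angstrom": 1.0, "detector": "example"},
}

for doc in (tomography, diffraction):
    missing = validate_dataset(doc)
    # Documents serialise directly to JSON, the exchange format of a REST-style API.
    print(missing, len(json.dumps(doc)))
```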
Interactive and Batch Data Analysis
- Support for interactive data analysis (e.g. Matlab) on the cluster; nodes can be reserved for interactive work
- Standard batch processing based on Slurm
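Batch work on a Slurm cluster typically goes through a small job script. The Python sketch below only renders such a script as text. The job name, partition, and command are placeholders rather than values from the DaaS cluster, and the helper `render_slurm_script` is hypothetical. The `#SBATCH` options used (`--job-name`, `--ntasks`, `--time`, `--partition`) are standard Slurm options.

```python
def render_slurm_script(job_name, command, ntasks=1, time_limit="01:00:00", partition=None):
    """Render a minimal Slurm batch script as a string (sketch, placeholder values)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
    ]
    if partition is not None:
        # Partition names are site-specific; none is assumed here.
        lines.append(f"#SBATCH --partition={partition}")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


print(render_slurm_script("reco", "python reconstruct.py", ntasks=4))
```

The resulting text would be saved to a file and submitted with `sbatch`.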
Petabyte Archive
- PSI must prepare for archiving the large amounts of data expected from SLS and SwissFEL over the next decades
- Strategic collaboration of PSI with the Swiss National Supercomputing Center (CSCS) in Lugano to build a petabyte tape archive solution at CSCS
- Project initiated by a PoC within the DaaS project
- Volume increase driven by detector and instrumentation advances
- Planning to leverage IBM Spectrum Scale (GPFS) AFM technology for the asynchronous data transfers between the sites
- Dataflow orchestration and packaging tools are being evaluated; the selected candidate is Arema from IBM
- Definition of interfaces from and to the data catalogue is ongoing
Remote Data Transfer
- Support for rsync/scp and GridFTP (Globus Online)
- Also evaluated the Aspera solution from IBM; it could be added, but only if a (paying) customer requests it
- Integration with the long-term archive will create additional requirements
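For the rsync path, a transfer over SSH could be assembled as in the Python sketch below. Host name, user, and paths are placeholders, and the helper `build_rsync_command` is hypothetical. The flags are standard rsync options: `-a` for archive mode, `-z` for compression, `--partial` to keep partially transferred files, and `-e ssh` to select the remote shell. The helper only builds the argument list; running it is left to the caller, for example via `subprocess.run`.

```python
def build_rsync_command(source, user, host, dest, compress=True):
    """Build an rsync-over-SSH argument list (sketch; all names are placeholders)."""
    cmd = ["rsync", "-a", "--partial", "-e", "ssh"]
    if compress:
        cmd.append("-z")
    cmd.append(source)
    # Remote destination in rsync's user@host:path syntax.
    cmd.append(f"{user}@{host}:{dest}")
    return cmd


print(build_rsync_command("/data/run42/", "user", "transfer.example.org", "/home/user/run42/"))
```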
We create knowledge – today for tomorrow
My thanks go to Stephan Egli, Derek Feichtinger, and Gerd Mann