DaaS and Kubernetes at PSI

DaaS and Kubernetes at PSI
Stephan Egli :: Paul Scherrer Institut :: Photon Science Department
CALIPSOplus JRA2 Meeting, May 23rd 2018

Experiences Gained at PSI
Purpose of this talk:
- What existing solutions do we have so far?
- What new options might we explore further in the future?
Intention: provide input for the discussion:
- How do we merge the best ideas and experiences gained at our different sites?
- Which building blocks should be part of a new blueprint?
- Illustrate the need for extensive tests and exploration of options so that we can make the right decisions in time; comparing each other's experiences is therefore important.
Disclaimer: I merely summarize the situation; all errors and omissions are my fault. The results achieved are due to the long-term commitment and the tremendous efforts invested by the colleagues from the Science IT department of PSI and the colleagues from the ESS / Data Archive project!

Data Analysis as a Service Project
See webpage: https://www.psi.ch/photon-science-data-services/offline-computing-facility-for-sls-and-swissfel-data-analysis
Main goal: make offline data analysis of large datasets easier for researchers.
Requires a sophisticated, high-performance storage infrastructure with:
- Good connectivity to the online systems
- Good software environments and support
Current typical usage: between 30,000 and 60,000 CPU hours per group and month for the main user groups cSAXS, TOMCAT, MX and SwissFEL.

DaaS Infrastructure Overview

Online-Offline connectivity SLS

Spectrum Scale GPFS Active File Management

Software Environment, Expert Support
- Provide standard software for interactive analysis and visualization (Matlab, Mathematica, iPython environments, etc.) in different versions, as well as domain-specific packages.
- Extended environment module system (p-modules) to mitigate the problem of providing different software versions and development environments to different researchers and for different architectures.
- Provide ready-to-use scientific software packages, e.g. MX: solving protein structures from SLS and FEL data collected with both conventional methods (rotating sample) and serial crystallography methods.
- Provide software development environments that allow researchers to build, develop and refine their scientific codes.
- Provide support for different compiler chains (gcc, Intel, OpenMP, MPI, CUDA).
- Provide help to scientists in tuning and optimizing their algorithms; this often gives the largest overall performance boost, but it needs local experts knowledgeable in both the science and in IT algorithms and code optimization, e.g. for running parallelized ptychographic reconstruction codes.
- Jupyter Notebooks for web-based interactive work.

Interactive Analysis with Jupyter Notebooks: environment, cluster, queues

Further Components in Use
- Batch system: SLURM as batch scheduler; resource management is done by integrating with Linux cgroups (a minimal submission sketch follows this slide).
- Remote access via NoMachine: classical GUI login; all users see the same environment and then either work interactively or submit batch jobs.
- Remote data transfer: Globus Online (GridFTP based); rsync for special use cases.
- Integration with the data catalog and archive system (see later).
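
As a rough illustration of the SLURM-based batch workflow mentioned above, here is a minimal sketch of submitting a job from Python. The partition name, resource limits and script contents are placeholders, not actual PSI settings:

```python
# Minimal sketch: submit a batch job to SLURM by writing a job script
# and calling sbatch. All directives below are placeholder values.
import subprocess
from pathlib import Path

script = """#!/bin/bash
#SBATCH --job-name=analysis-demo
#SBATCH --partition=daily        # placeholder partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

python run_analysis.py           # placeholder analysis command
"""

job_file = Path("analysis.sbatch")
job_file.write_text(script)

# sbatch prints the job id on success; check=True raises if submission fails.
subprocess.run(["sbatch", str(job_file)], check=True)
```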

Data Catalog and Container Orchestration
- The data catalog is an important component of the overall data management life cycle:
  - Gateway to the archive system for long-term storage
  - Necessary component for implementing the data policy
- Challenge: integration into existing and historically grown environments demands a flexible framework.
- We use the SciCat data catalog, see https://github.com/SciCatProject
- The architecture is based on microservices, which are very well suited to running in containers.
- This needs a container orchestration platform; we chose Kubernetes.
- Experience with Kubernetes is very good, both in terms of functionality and operational stability. It was initially built for long-running web-service-type applications.
- Persistence layer implemented with MongoDB.

Overall Data Catalog Architecture

Kubernetes Dashboard: overview of all test and production environments

Single Pod Details

Beamline Ingestors Based on Node-RED

Data catalog GUI, user view

Scientific Metadata View

Access to the Data Catalog via the OpenAPI REST API
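
This slide shows the API explorer; as a rough illustration, a catalog query might look like the following Python sketch. The base URL, endpoint path and token handling are assumptions for a generic SciCat-style deployment, not the exact PSI interface:

```python
# Minimal sketch: query a SciCat-style REST API with the requests library.
# BASE_URL, the endpoint path and the token handling are placeholders;
# consult the OpenAPI documentation of the actual deployment.
import requests

BASE_URL = "https://scicat.example.org/api/v3"   # hypothetical deployment URL
TOKEN = "..."                                     # token obtained after login

resp = requests.get(
    f"{BASE_URL}/Datasets",
    params={"filter": '{"limit": 5}'},            # assumed filter syntax
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for ds in resp.json():
    print(ds.get("pid"), ds.get("datasetName"))
```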

Containers for Data Analysis
Disclaimer: only minimal own experience so far.
Potential advantages:
- Adaptability to existing environments at different sites
- Containers allow providing OS environments tailored to the needs of the different scientist groups
- Containers make it easier to share full work environments
New container implementations for better HPC support:
- Shifter-NG: Linux containers for HPC (NERSC, CSCS; tested with an HEP application together with Science IT). Allows an HPC system to efficiently and safely let end users run a Docker image. Integration with batch scheduler systems, security oriented towards HPC systems, native performance of custom HPC hardware, compatible with Docker.
- Singularity: "Mobility of Compute", see http://singularity.lbl.gov/. Leverages resources like HPC interconnects, resource managers, file systems, GPUs and/or accelerators, etc. (a usage sketch follows this slide).
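
As an illustration of the Singularity workflow mentioned above, here is a minimal sketch of wrapping an analysis step in a container image from Python. The image file, bind path and command are placeholders, not a documented PSI setup:

```python
# Minimal sketch: run an analysis command inside a Singularity image.
# "analysis.sif" is a placeholder image; a Docker image could first be
# converted with "singularity pull docker://<image>".
import subprocess

subprocess.run(
    [
        "singularity", "exec",
        "--bind", "/data:/data",   # expose the shared filesystem inside the container
        "analysis.sif",            # placeholder image file
        "python", "run_analysis.py",
    ],
    check=True,
)
```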

Kubernetes for Data Analysis?
- Originally, the main application area was (long-running) web services.
- Can be exploited for Jupyter notebooks (ready-to-use Helm charts).
- Meanwhile the Kubernetes concepts have been extended: Job/Batch resources (a minimal example follows this slide).
- Ideas for integration with Shifter/Singularity-type containers in Kubernetes (OCI-compliant runtimes): https://www.sylabs.io/2018/03/singularity-oci-cloud-native-enterprise-performance-computing/
- Remark: Kubernetes is also planned to be used by the Controls colleagues for the machine and beamline control system infrastructure.
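
As an illustration of the Job/Batch resources mentioned above, here is a minimal sketch using the official Kubernetes Python client. The namespace, image and command are placeholders, not an actual PSI configuration:

```python
# Minimal sketch: submit a containerized analysis step as a Kubernetes Job.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="analysis-demo"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a failed pod at most twice
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="analysis",
                        image="registry.example.org/analysis:latest",  # placeholder image
                        command=["python", "run_analysis.py"],         # placeholder command
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="daas-test", body=job)
```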

Some Open Points and Questions
- If we make use of container technology in an HPC/HTC environment, which container image type(s) should we use? Should it be Docker compatible in any case?
- How do we overcome Docker limitations? Docker's main design goal is to provide completely independent container images, while an HPC cluster is always built on sharing some especially efficient hardware components. Is there inefficiency on parallel filesystems due to its stacked container format?
- How do we handle storage resources efficiently for HTC applications? This implies integration of parallel file systems, network performance and security aspects.
- How do we manage resources (batch systems vs. container orchestration, HPC cluster vs. "cloud"), see e.g. https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/? Do we choose one or the other, or both merged in some way?
- Do containers make the virtualization layer unnecessary, or do we still need it, e.g. for optimal reproducibility?

Tools (in Use) at Other Sites
- CERN, for HEP use cases:
  - Reusable analysis platform REANA/RECAST (http://github.com/reanahub, http://github.com/recast-hep): a workflow engine where each step is a Kubernetes Job.
  - HTCondor and Docker/Kubernetes: https://zenodo.org/record/1042367/files/clenimar-report-cern-final.pdf
- SDSC/EPFL: Renga (http://renga.readthedocs.io/en/latest/): securely manage, share and process large-scale data across untrusted parties operating in a federated environment; automatically capture complete lineage up to the original raw data for detailed traceability, auditability and reproducibility.
- AiiDA (http://www.aiida.net/): Automated Interactive Infrastructure and Database for Computational Science.
- Materials Cloud (https://www.materialscloud.org/home): a platform for open science.

Summary
- This is just a sketch of the situation as far as I am aware of it.
- There are a lot of interesting developments currently ongoing.
- The whole topic is work in progress, constantly moving and adapting.
- The future paths still need to be explored by all of us, and sharing our experiences will help.
- Finding good solutions while minimizing the risks favors an iterative approach and the resources and willingness to test, implement (or abandon) solutions.

Acknowledgements
Thanks go to all the colleagues in the IT department involved, in particular Science IT and the colleagues from the ESS project.