Accelerated Computing in EGI

Presentation transcript:

Accelerated Computing in EGI
Viet Tran, Jan Astalos, Miroslav Dobrucky (Institute of Informatics, Slovak Academy of Sciences)
Marco Verlato, Paolo Andreetto, Lisa Zangrando, David Rebatto (INFN)

Accelerated computing
Goals:
- Implement support in the information system to expose correct information about the accelerated computing technologies available at site level, both software and hardware, by developing a common extension of the information system structure based on the OGF GLUE standard.
- Extend the HTC and cloud middleware support for co-processors, where needed, in order to provide a transparent and uniform way to allocate these resources, together with CPU cores, efficiently to the users.
Accelerators:
- GPGPU (General-Purpose computing on Graphics Processing Units): NVIDIA GPU/Tesla/GRID, AMD Radeon/FirePro, Intel HD Graphics, ...
- Intel Many Integrated Core (MIC) architecture: Xeon Phi coprocessor
- Specialized PCIe cards with accelerators: DSP (Digital Signal Processors), FPGA (Field Programmable Gate Arrays)

Accelerated computing in HTC
- Builds on the previous work of the EGI Virtual Team (2012) and the GPGPU Working Group (2013-2014).
- CREAM-CE has for many years been the most popular grid interface (Computing Element) in EGI to a number of LRMSes (Torque, LSF, Slurm, SGE, HTCondor).
- The most recent versions of these LRMSes natively support GPUs (and MIC cards), i.e. servers hosting these cards can be selected by specifying LRMS directives (see the sketch below).
- CREAM must be enabled to publish this information and to support these directives.
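
For illustration, a minimal sketch of how a GPU server can be selected with native LRMS directives; the GRES name "gpu" and the resource counts are assumptions that depend on the site configuration:

  #!/bin/bash
  # Slurm: request 2 GPUs on one node via generic resources (GRES)
  #SBATCH --gres=gpu:2
  #SBATCH --ntasks=1
  # Torque/PBS equivalent directive, kept here only for comparison:
  ##PBS -l nodes=1:ppn=4:gpus=2

  nvidia-smi   # show the GPUs assigned to the job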

Implementation
- The GPU-enabled CREAM prototype was tested at 5 sites with the Torque, LSF, HTCondor, Slurm and SGE LRMSes.
- 3 new JDL attributes were defined: GPUNumber, GPUModel, MICNumber.
- At 3 sites the prototype runs in "production": QMUL and ARNES (Slurm) and CIRMMP (Torque/Maui).
- New classes and attributes describing accelerators were proposed and included in the GLUE 2.1 draft after discussion with the OGF working group.
- A major release of CREAM is ready:
  - with GPU/MIC support for most LRMSes
  - with the GLUE 2.1 draft prototype as information system (official approval of GLUE 2.1 is expected after the specification is revised based on the lessons learned from the prototype)
  - with a new Puppet module for the CE site with support for GPU/MIC and GLUE 2.1
  - its CentOS 7 version will be included in the next UMD 4.6.0, scheduled for mid November 2017.

Example of submission to a Slurm CE
User job JDL:

  [
    executable = "disvis.sh";
    arguments = "10.0 2";
    stdoutput = "out.txt";
    stderror = "err.txt";
    inputSandbox = { "disvis.sh", "O14250.pdb", "Q9UT97.pdb", "restraints.dat" };
    outputsandboxbasedesturi = "gsiftp://localhost";
    outputsandbox = { "out.txt", "err.txt", "results.tgz" };
    GPUNumber = 2;
    GPUModel = "teslaK80";
  ]

Definitions in the Slurm gres.conf and slurm.conf configuration files:

  NodeName=cn456 Name=gpu Type=teslaK40c File=/dev/nvidia0
  NodeName=cn290 Name=gpu Type=teslaK80 File=/dev/nvidia[0-3]
  NodeName=cn456 CPUs=8 Gres=gpu:teslaK40c:1 RealMemory=11902 Sockets=1 CoresPerSocket=4 ...
  NodeName=cn290 CPUs=32 Gres=gpu:teslaK80:4 RealMemory=128935 Sockets=2 CoresPerSocket=8 ...

On the worker node:

  $ lspci | grep NVIDIA
  0a:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0b:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  86:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  87:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  $ echo $CUDA_VISIBLE_DEVICES
  0,1
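
For context, a hedged sketch of how such a JDL is typically submitted to a CREAM CE with the standard CLI; the CE host, queue name and the file name disvis.jdl are placeholders, not the real site values:

  # assumes a valid VOMS proxy and the JDL above saved as disvis.jdl
  glite-ce-job-submit -a -r cream-ce.example.org:8443/cream-slurm-gpu disvis.jdl
  # the returned job ID can then be monitored and retrieved with
  glite-ce-job-status <jobid>
  glite-ce-job-output <jobid>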

Accelerated computing in Cloud
To use GPGPUs in clouds, they must be supported at all layers, including virtualization, the cloud management framework (CMF) and the higher-level tools.
Virtualization technologies:
- KVM with PCI passthrough: widely used, but with limitations (see the sketch below)
- Virtualized GPUs are at an early stage:
  - NVIDIA GRID vGPU (XenServer and VMware hypervisors only)
  - SR-IOV based AMD MxGPU (VMware hypervisor only)
  - Intel GVT-g, recently added to the Linux 4.10 kernel
Container-based technologies:
- LXD with GPU support is promising, but support in cloud middleware is still limited.
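
As an illustration of what KVM PCI passthrough involves on the hypervisor side, a minimal sketch (run as root) for an Intel host with an NVIDIA Tesla card; the PCI vendor:device ID 10de:102d is an assumption and must be replaced with the value reported by lspci -nn on the actual host:

  # enable the IOMMU in the kernel command line and reboot
  #   GRUB_CMDLINE_LINUX="... intel_iommu=on"
  # keep the host driver away from the card
  echo "blacklist nouveau" >> /etc/modprobe.d/blacklist-nouveau.conf
  # bind the GPU to the vfio-pci stub driver using its vendor:device ID (hypothetical value)
  echo "options vfio-pci ids=10de:102d" >> /etc/modprobe.d/vfio.conf
  echo "vfio-pci" >> /etc/modules-load.d/vfio.conf
  # after a reboot, verify that the card is claimed by vfio-pci
  lspci -nnk | grep -A3 NVIDIA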

IISAS-GPUCloud site
OpenStack GPGPU site at IISAS, integrated into the EGI Federated Cloud and in production.
Hardware:
- IBM dx360 M4 servers with 2x Intel Xeon E5-2650v2
- 16 CPU cores, 64 GB RAM, 1 TB storage on each WN
- 2x NVIDIA Tesla K20m on each WN
Software:
- Base OS: Ubuntu 16.04 LTS
- KVM hypervisor with PCI passthrough virtualisation of the GPU cards
- OpenStack Mitaka middleware
- Newest Federated Cloud tools
GPU-enabled machine types (see the flavor sketch below):
- gpu1cpu6 (1 GPU + 6 CPU cores)
- gpu2cpu12 (2 GPUs + 12 CPU cores)
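
A minimal sketch of how such GPU flavors can be defined in OpenStack, assuming a PCI alias named K20m has been configured in nova.conf for the passed-through Tesla cards; the alias name, RAM and disk sizes are assumptions, not the real site values:

  # hypothetical flavor matching gpu1cpu6: 6 vCPUs and 1 GPU via the PCI alias
  openstack flavor create gpu1cpu6 --vcpus 6 --ram 24576 --disk 40
  openstack flavor set gpu1cpu6 --property "pci_passthrough:alias"="K20m:1"

  # hypothetical flavor matching gpu2cpu12: 12 vCPUs and 2 GPUs
  openstack flavor create gpu2cpu12 --vcpus 12 --ram 49152 --disk 80
  openstack flavor set gpu2cpu12 --property "pci_passthrough:alias"="K20m:2"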

Using the IISAS-GPUCloud site
- Via the rOCCI command-line client: simply choose a GPU-enabled flavor (e.g. gpu2cpu12) as resource template (see the sketch below).
- Or via the OpenStack Horizon portal: graphical interface, with added support for EGI users to log in via token (no username/password).
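
A hedged sketch of a rOCCI create call selecting a GPU-enabled resource template; the endpoint URL and the os_tpl identifier are placeholders, the real values come from the site and AppDB:

  occi --endpoint https://occi.example.org:8787/occi1.1/ \
       --auth x509 --user-cred $X509_USER_PROXY --voms \
       --action create --resource compute \
       --mixin os_tpl#<image_template_id> \
       --mixin resource_tpl#gpu2cpu12 \
       --attribute occi.core.title="gpu-test-vm"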

IISAS-Nebula site
OpenNebula site at IISAS, integrated into the EGI Federated Cloud and in production.
Hardware:
- IBM dx360 M4 servers with 2x Intel Xeon E5-2650v2
- 16 CPU cores, 64 GB RAM, 1 TB storage on each WN
- 2x NVIDIA Tesla K20m on each WN
Software:
- OpenNebula 5.0
- Fully integrated into EGI FedCloud, certified
- Nearly the same image list and flavors as IISAS-GPUCloud
- Access via the rOCCI client

Docker support
Docker containers using GPGPUs can be executed on the GPU sites:
- Create a VM with a GPGPU-enabled flavor and image
- Run docker with the proper device mapping to access the GPUs:

  docker run --name=XXXXXX \
    --device=/dev/nvidia0:/dev/nvidia0 \
    --device=/dev/nvidia1:/dev/nvidia1 \
    --device=/dev/nvidiactl:/dev/nvidiactl \
    --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
    ...

The INDIGO-DataCloud "udocker" tool (user-space container support) allows the same GPGPU-enabled images to be executed on grid sites too (and is used in production by the MoBrain CC):

  ./udocker run --hostenv --volume=$WDIR:/home XXXXXX ...
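
A small sketch showing how the device list can be built automatically instead of being hard-coded; the image name gpu-app is hypothetical:

  # map every NVIDIA device node present in the VM into the container
  DEV_OPTS=""
  for d in /dev/nvidia*; do
    DEV_OPTS="$DEV_OPTS --device=$d:$d"
  done
  docker run --name=gpu-job $DEV_OPTS gpu-app nvidia-smi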

Info system: GLUE 2.1 draft
- ExecutionEnvironment class: represents a set of homogeneous WNs; it is usually defined statically during the deployment of the service. These WNs can however host different types/models of accelerators.
- AcceleratorEnvironment class: represents a set of homogeneous accelerator devices and can be associated with one or more ExecutionEnvironments.
  - New attributes: PhysicalAccelerators, Vendor, Type, Model, Memory, ClockSpeed
  - Driver info is published in the ApplicationEnvironment
- The GLUE 2.1 draft is the starting point for an accelerator-aware info provider:
  - Static info provider based on the GLUE 2.1 AcceleratorEnvironment class
  - The info can be obtained in Torque e.g. from the pbsnodes command (see the sketch below)
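
For illustration, a hedged sketch of how a static info provider could pull the accelerator count from Torque; it assumes a Torque build with NVIDIA GPU support, where pbsnodes reports a "gpus = N" field per node, and the node name cn456 is hypothetical:

  # count the physical GPUs reported by Torque for one worker node
  pbsnodes cn456 | awk '/gpus = / {print "PhysicalAccelerators: " $3}'
  # the GPU model, memory and clock can be taken from nvidia-smi on the node:
  nvidia-smi --query-gpu=name,memory.total,clocks.max.sm --format=csv,noheader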

Information system
GLUE 2.1 as starting point for the accelerator-aware info provider:
- For HTC: a new GLUE object, the Accelerator Environment, representing a set of homogeneous accelerator devices; new attributes and associations for existing objects such as Computing Manager, Computing Share and Execution Environment.
- For Cloud: a new entity, the Cloud Compute Virtual Accelerator; new attributes and associations for existing objects such as Cloud Compute Service.

Info system: static info
Example of GLUE 2.1 static info publication:

  $ ldapsearch -x -LLL -h cegpu.cerm.unifi.it -p 2170 -b o=glue '(objectClass=GLUE2AcceleratorEnvironment)'
  GLUE2AcceleratorEnvironmentMemory: 5120
  GLUE2AcceleratorEnvironmentID: tesla.cegpu.cerm.unifi.it
  GLUE2AcceleratorEnvironmentModel: Tesla K20m
  objectClass: GLUE2Entity
  objectClass: GLUE2AcceleratorEnvironment
  GLUE2EntityCreationTime: 2015-05-04T16:31:18Z
  GLUE2AcceleratorEnvironmentExecutionEnvironmentForeignKey: cegpu.cerm.unifi.it
  GLUE2AcceleratorEnvironmentVendor: NVIDIA
  GLUE2AcceleratorEnvironmentPhysicalAccelerators: 2
  GLUE2AcceleratorEnvironmentType: GPU
  GLUE2EntityName: tesla.cegpu.cerm.unifi.it
  GLUE2AcceleratorEnvironmentLogicalAccelerators: 2
  GLUE2AcceleratorEnvironmentClockSpeed: 706

  $ ldapsearch -x -LLL -h cegpu.cerm.unifi.it -p 2170 -b o=glue '(&(objectClass=GLUE2ApplicationEnvironment)(GLUE2EntityName=nvidia-driver))'
  GLUE2ApplicationEnvironmentAppName: nvidia-driver
  GLUE2ApplicationEnvironmentDescription: NVidia driver for CUDA
  GLUE2ApplicationEnvironmentExecutionEnvironmentForeignKey: cegpu.cerm.unifi.it
  GLUE2ApplicationEnvironmentID: nvidia-driver
  GLUE2ApplicationEnvironmentAppVersion: 352.93
  GLUE2ApplicationEnvironmentComputingManagerForeignKey: cegpu.cerm.unifi.it_ComputingElement_Manager
  GLUE2EntityName: nvidia-driver

  $ ldapsearch -x -LLL -h cegpu.cerm.unifi.it -p 2170 -b o=glue '(&(objectClass=GLUE2ExecutionEnvironment)(GLUE2EntityName=cegpu.cerm.unifi.it))'
  GLUE2ExecutionEnvironmentCPUModel: Xeon
  [...]
  GLUE2ExecutionEnvironmentAcceleratorEnvironmentForeignKey: tesla.cegpu.cerm.unifi.it
  GLUE2ExecutionEnvironmentApplicationEnvironmentForeignKey: nvidia-driver
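
As a usage note, the same objects can in principle be discovered infrastructure-wide through a top BDII once it publishes the GLUE 2.1 draft objects; a hedged sketch, with the top-BDII host as a placeholder:

  # list the accelerator models published by all sites known to a top BDII (hypothetical host)
  ldapsearch -x -LLL -h topbdii.example.org -p 2170 -b o=glue \
    '(objectClass=GLUE2AcceleratorEnvironment)' \
    GLUE2AcceleratorEnvironmentModel GLUE2AcceleratorEnvironmentPhysicalAccelerators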

Accounting
- For clouds, extended accounting records with new items describing accelerators: AcceleratorCount, AcceleratorProcessors, AcceleratorDuration, AcceleratorWallDuration, AcceleratorBenchmarkType, AcceleratorBenchmark, AcceleratorType.
- Modification of the accounting tools to support the new accelerator-related items.

Documentation, tutorials and user support
User tutorial:
- How to use GPGPUs on IISAS-GPUCloud/IISAS-Nebula
- Access via the rOCCI client
- Access via the OpenStack dashboard with token
- How to create your own GPGPU server in the cloud
Site admin guide:
- How to enable GPGPU passthrough in OpenStack
Additional tools, automation via scripts (see the sketch below):
- NVIDIA + CUDA installer
- Keystone-VOMS client for getting a token
- Keystone-VOMS module for the OpenStack Horizon portal
All of this is documented in the wiki:
- https://wiki.egi.eu/wiki/GPGPU-FedCloud
- https://wiki.egi.eu/wiki/GPGPU-OpenNebula
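
A hedged sketch of what an NVIDIA + CUDA installer script for an Ubuntu 16.04 VM typically boils down to; the repository package version is an assumption and may differ from what the site scripts actually use:

  #!/bin/bash
  # add the NVIDIA CUDA repository (assumed CUDA 8.0 repo package for Ubuntu 16.04)
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  # install the driver and toolkit, then verify that the passed-through GPU is visible
  sudo apt-get update
  sudo apt-get install -y cuda
  nvidia-smi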

GPU-specific VO
The acc-comp.egi.eu VO has been established for testing and development with GPGPUs:
- A VO image list with preinstalled GPU drivers and CUDA libraries is available via AppDB
- Supported only at sites with GPGPU hardware
- More info at https://wiki.egi.eu/wiki/Accelerated_computing_VO
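
After joining the VO, a VOMS proxy is enough to access the supporting sites; a minimal sketch using the standard client:

  # create a 24-hour VOMS proxy for the accelerated computing VO
  voms-proxy-init --voms acc-comp.egi.eu --valid 24:00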

Accelerated computing website

More information
- https://wiki.egi.eu/wiki/GPGPU-CREAM
- https://mobrain.egi.eu/technical
- https://wiki.egi.eu/wiki/GPGPU-FedCloud
- https://wiki.egi.eu/wiki/GPGPU-OpenNebula
- https://wiki.egi.eu/wiki/Accelerated_computing_VO
- https://accelerated.ui.sav.sk/?page_id=21
- https://horizon.ui.savba.sk/

GROMACS
- GPU acceleration introduced in version 4.5
- The grid portal runs it in multi-threading mode
- No significant cloud overhead measured for the GPU speedups

Simulation performance in ns/day, with GPU acceleration given as the speedup of "N cores + GPU" over "N cores" (blank cells were not reported):

  Dataset   Size (aa)  1 core  4 cores  8 cores  1c+GPU  4c+GPU  8c+GPU  Accel 1c  Accel 4c  Accel 8c
  Villin         35       61      193      301     264     550     650     4.3x      2.8x      2.2x
  RNAse         126       19       64      105      87     185     257     4.6x      2.9x      2.4x
  ADH         1,408      3.1       11       18      13      38      50     4.2x      3.5x
  Ferritin    4,200        -      1.1      2.1             4.1     5.9               3.7x
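
For reference, a hedged sketch of how a GROMACS run of this kind is typically launched with GPU offload using the 4.6/5.x command-line syntax; the input name villin is hypothetical:

  # run with 8 OpenMP threads on the CPU and offload non-bonded interactions to GPU 0
  mdrun -ntomp 8 -gpu_id 0 -deffnm villin
  # in GROMACS 5.x and later the same is invoked as: gmx mdrun -ntomp 8 -gpu_id 0 -deffnm villin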

AMBER
- Restrained (rMD) energy minimization on NMR structures: a) ~100x gain, b) 94-150x gain
- Free MD simulations of ferritin: 4.7x gain
- Dynamic power strongly reduced: 8% of the 64-core Opteron server

DisVis and PowerFit

DisVis and PowerFit on the EGI Accelerated Platform
Solution for grid and cloud computing:
- Docker containers built with the proper libraries and OpenCL support for the application requirements
- Docker engine not required on grid WNs: the udocker tool runs docker containers in user space (https://github.com/indigo-dc/udocker)

disvis.sh job example

  #!/bin/sh
  # Driver identification: get the NVIDIA driver version on the worker node
  version=$(nvidia-smi | awk '/Driver Version/ {print $6}')
  export WDIR=`pwd`
  # Install the udocker tool
  git clone https://github.com/indigo-dc/udocker
  cd udocker
  # Load (from CVMFS) or pull the DisVis image matching the driver version
  image=/cvmfs/wenmr.egi.eu/BCBR/DisVis/disvis-nvdrv_$version.tar
  [ -f $image ] && ./udocker load -i $image
  [ -f $image ] || ./udocker pull indigodatacloudapps/disvis:nvdrv_$version
  # Create the container
  rnd=$RANDOM
  ./udocker create --name=disvis-$rnd indigodatacloudapps/disvis:nvdrv_$version
  mkdir $WDIR/out
  # Run the container executing DisVis
  ./udocker run --hostenv --volume=$WDIR:/home disvis-$rnd disvis \
    /home/O14250.pdb /home/Q9UT97.pdb /home/restraints.dat -g -a $1 -vs $2 \
    -d /home/out
  # Clean up and collect the results
  ./udocker rm disvis-$rnd
  ./udocker rmi indigodatacloudapps/disvis:nvdrv_$version
  cd $WDIR
  tar zcvf results.tgz out/