Building an Elastic Batch System with Private and Public Clouds

Building an Elastic Batch System with Private and Public Clouds. Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki. Computing Research Center, KEK, Japan. International Symposium on Grids & Clouds 2019.

Projects at KEK. The electron accelerator at Tsukuba hosts Belle II (e-/e+ collisions) and the Photon Factory; the proton accelerator at Tokai hosts T2K (neutrino experiment), the hadron experiment, and MLF (Materials and Life science). Image credit: KEK.

KEK Batch System. Used by 14 projects and 1,200 users; 10,000 CPU cores; Scientific Linux 6; IBM Spectrum LSF as the batch job scheduler. Users log in remotely to work servers for interactive work and job submission, and LSF dispatches jobs from the job queues to the calc. servers. [Diagram: work servers, LSF scheduler, job queues, and calc. servers.]

Challenges for the Batch System: Piled-up Waiting Jobs. Available job slots: 10,000, limited by the number of CPU cores. At times of congestion, user jobs stay in a job queue for a long time. [Plot: job slot usage and waiting jobs, 2018/9/1 - 2018/9/30.]

Challenges for the Batch System: Requests for Custom Environments. Experiment groups have requirements for specific systems: developing applications on a different OS, testing newer OSes and libraries, or staying on an old OS. Taking advantage of cloud computing addresses both challenges: expanding computing resources to clouds resolves the piled-up-jobs problem, and providing heterogeneous clusters resolves the various requests for custom environments.

Overview of the Cloud-integrated Batch Job System. Cloud resources are used through the ordinary batch job submission command, e.g. $ bsub -q aws /bin/hostname. LSF Resource Connector[1] extends the on-premise resources (the SL6 cluster and the OpenStack private cloud) to off-premise resources (AWS and other clouds), and the target resource is selected by queue. [1] https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_kc_resource_connector.html
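A minimal sketch of the queue-based selection from the user's point of view; the script name and the "openstack" queue name are illustrative assumptions (only the aws queue is shown explicitly above):
$ bsub ./analysis.sh                  # default queue: on-premise SL6 cluster
$ bsub -q openstack /bin/hostname     # OpenStack-backed queue (name assumed)
$ bsub -q aws /bin/hostname           # AWS-backed queue, as on the slide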

Integration with OpenStack. Normal jobs are dispatched to the physical SL6 calc. servers as before. For cloud jobs: (1) a project manager creates a custom image from a base image; (2) the cloud admin creates a Resource Connector template that refers to the image; (3) an end user submits a job; (4) LSF Resource Connector launches an instance on OpenStack; (5) the job is dispatched to the calc. server VM. Example Resource Connector template:
{
  "Name": "CentOS7_01",
  "Attributes": {
    "type": ["String", "X86_64"],
    "openstackhost": ["Numeric", "1"],
    "template": ["CentOS7_01"]
  },
  "Image": "generic-cent7-01",
  "Flavor": "c04-m016G"
}
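A hedged example of how a job could target these VMs from the batch side, using the numeric attribute defined in the template above; the queue name and the exact resource-requirement string are assumptions:
$ bsub -q openstack -R "select[openstackhost==1]" /bin/hostname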

Integration with the Existing System: LDAP. LDAP authentication is already used for the cluster, so the same LDAP serves as the OpenStack authentication backend and provides the Linux accounts inside the VMs. Keystone domains allow multiple identity backends: a default domain backed by a database for the service accounts (Nova, Neutron, Glance), and an ldap domain backed by LDAP for the users.
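A minimal sketch of a domain-specific Keystone LDAP backend, assuming the standard domain-configuration layout; the domain name, server URL, and DNs are placeholders, not KEK's actual values:
$ cat /etc/keystone/domains/keystone.users.conf
[identity]
driver = ldap

[ldap]
url = ldap://ldap.example.jp
suffix = dc=example,dc=jp
user_tree_dn = ou=People,dc=example,dc=jp
user_objectclass = posixAccount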

Share GPFS between the Local Batch System and OpenStack. Each OpenStack compute node mounts GPFS and exposes the directories to its VMs via NFS, so the calc. server VMs see the same filesystem as the physical calc. servers. [Diagram: physical calc. servers mount GPFS directly; compute nodes mount GPFS and serve it to the calc. server VMs over NFS.]
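A minimal sketch of this re-export, assuming a standard NFS setup; paths, the network range, and export options are placeholders:
# On each OpenStack compute node: GPFS is mounted locally and exported over NFS
compute$ cat /etc/exports
/gpfs/group  10.0.0.0/16(rw,sync,no_root_squash)
# Inside each calc. server VM: mount the compute node's export
vm$ mount -t nfs <compute-node>:/gpfs/group /gpfs/group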

Integration with AWS. EC2 instances are launched on demand when jobs arrive in the AWS queue and are reached from KEK over a VPN connection. Unlike the physical SL6 machines and OpenStack VMs behind the other queues, the AWS side does not share the KEK filesystem (NFS/GPFS); S3 object storage is used for sharing input/output data between KEK and AWS. [Diagram: work server and LSF queues at KEK; the OpenStack queue and the other queues serve on-premise calc. servers, while the AWS queue serves EC2 calc. servers over the VPN, with S3 for data exchange.]
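For comparison with the OpenStack template shown earlier, a hedged sketch of what the AWS-side Resource Connector template could look like; the field names follow the pattern of IBM's awsprov_templates.json documentation, and all values are placeholders rather than the ones used at KEK:
{
  "templates": [
    {
      "templateId": "aws-c4-xlarge",
      "maxNumber": 100,
      "imageId": "ami-xxxxxxxx",
      "vmType": "c4.xlarge",
      "subnetId": "subnet-xxxxxxxx",
      "keyName": "lsf-keypair",
      "securityGroupIds": ["sg-xxxxxxxx"],
      "attributes": {
        "type": ["String", "X86_64"],
        "awshost": ["Boolean", "1"]
      }
    }
  ]
}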

Use AWS S3 Object Storage for Sharing Data between KEK and AWS. The KEK batch system and OpenStack share the GPFS filesystem at KEK, but the AWS environment is independent of the KEK system. S3FS[3] or goofys[4] allows Linux to mount an AWS S3 bucket via FUSE. Workflow: (1) put the input data at KEK; (2) copy the input data via the S3 bucket; (3) submit the job, which runs on the AWS calc. server; (4) copy the output data back via the bucket; (5) get the output data at KEK. [3] https://github.com/s3fs-fuse/s3fs-fuse [4] https://github.com/kahing/goofys
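A minimal sketch of the FUSE-mount approach; the bucket name and data paths are placeholders, and credentials are assumed to be configured already:
# goofys: mount the bucket, then stage data with ordinary cp
$ goofys my-bucket /goofys
$ cp /gpfs/group/exp1/input.dat /goofys/
# s3fs: same idea; more POSIX-compatible but slower uploads (next slide)
$ s3fs my-bucket /s3fs -o passwd_file=${HOME}/.passwd-s3fs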

Upload/Download Speed Comparison between S3FS and Goofys. We measured cp command execution time for 1 MB x 1000 files, 10 MB x 100, 100 MB x 10, and 1000 MB x 1: $ cp -r /local/1mb_files_dir/ /s3fs/ and $ cp -r /local/1mb_files_dir/ /goofys/. Goofys upload performance is better than S3FS, while S3FS has more POSIX compatibility than Goofys.
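A sketch of how the measurement can be scripted; the directory names for the larger file sizes are assumed to follow the same pattern as 1mb_files_dir:
for d in 1mb_files_dir 10mb_files_dir 100mb_files_dir 1000mb_files_dir; do
  time cp -r /local/$d /s3fs/      # upload through s3fs
  time cp -r /local/$d /goofys/    # upload through goofys
done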

Monitoring Resource Transition on AWS. [Plots: number of instances on AWS and total number of cores on AWS after jobs are submitted.]

Scalability Test: Geant4-based Particle Therapy Monte Carlo Simulation Jobs on AWS. The simulation models a treatment head with patient data obtained from CT images and computes the dose distribution along the particle beam direction. The Monte Carlo simulation shoots 2,000,000 protons in total, divided over N CPU cores; if N = 10, each of the 10 CPU cores simulates 200,000 events.
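A hedged sketch of splitting the 2,000,000 protons over N cores as N batch jobs; the simulation script and its options are hypothetical:
N=10
EVENTS_PER_JOB=$((2000000 / N))   # 200,000 events per core when N=10
for i in $(seq 1 $N); do
  bsub -q aws ./run_g4sim.sh --events $EVENTS_PER_JOB --seed $i
done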

Scalability Test: Geant4-based Particle Therapy Monte Carlo Simulation Jobs on AWS (results). Scalability comparison between KEK and AWS: the AWS result shows the same tendency as the KEK one, and in both cases NFS degrades the performance. [Plots: scalability on KEK and on AWS.]

Scalability Test: Image Classification by Deep Learning on AWS. The task is to classify CIFAR-10 images[5] into 10 categories. We built a convolutional neural network (conv1, pool1, conv2, pool2, FC1, FC2 layers) and trained it for the classification using TensorFlow[6]. [5] https://www.cs.toronto.edu/~kriz/cifar.html [6] https://www.tensorflow.org/tutorials/deep_cnn

Scalability Test: Multi-node Deep Learning Image Classification on AWS. We submitted TensorFlow jobs to the AWS queue and measured the scalability by changing the number of workers. The TensorFlow cluster consists of a parameter server, which stores and updates the parameters, and workers, which calculate the loss. Training time dropped from about 23,000 sec (6.5 hours) with 1 worker (64 cores) to about 1,000 sec with 30 workers (1,920 cores); toward 57 workers (3,648 cores) the scaling flattens, possibly due to network traffic congestion. [Plot: training time vs. number of workers.]
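A hedged sketch of how such a cluster might be launched through the AWS queue; the start_ps.sh and start_worker.sh scripts and their options are hypothetical, and in practice each process also needs the cluster specification (e.g. via TF_CONFIG):
bsub -q aws ./start_ps.sh                      # one parameter server
for i in $(seq 0 29); do                       # 30 workers
  bsub -q aws ./start_worker.sh --task-index $i
done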

Another Use Case: Automatic Offloading to the Cloud. We submitted 3,000 jobs to a mixed-resources (KEK and AWS) queue. Over time: (1) some jobs are dispatched to KEK servers; (2) when no free resources remain at KEK, AWS instances are launched and some jobs are dispatched to them; (3) as free resources appear again at KEK, some jobs are dispatched to KEK servers; (4) the remaining jobs are dispatched to AWS servers. [Plot: PEND/RUN status of each job over time.]
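A hedged sketch of this test; the "mixed" queue name and the job script are assumptions:
for i in $(seq 1 3000); do bsub -q mixed ./job.sh; done
bjobs | grep -c RUN     # count of running jobs
bjobs | grep -c PEND    # count of pending jobs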

Summary. We have succeeded in integrating the OpenStack and AWS clouds with the LSF batch system using Resource Connector. This expands the computing resources to clouds, reducing job turnaround times at times of congestion, and provides any kind of job-processing environment by choosing a different instance image. The Monte Carlo simulation worked well on AWS, with a small performance degradation due to NFS. The deep learning training speed on AWS scaled well up to about 2,000 CPU cores. We have also succeeded in offloading some batch workloads to the AWS cloud automatically. The cloud resources used in this work were provided through the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).