Building an Elastic Batch System with Private and Public Clouds
Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki
Computing Research Center, KEK, Japan
International Symposium on Grids & Clouds 2019

Projects in KEK
Electron Accelerator (Tsukuba): Belle II (e-, e+ collision), Photon Factory
Proton Accelerator (Tokai): T2K (Neutrino experiment), Hadron experiment, MLF (Material and Life science)
Image credit: KEK

KEK Batch System
Used by 14 projects and 1200 users
10000 CPU cores
Scientific Linux 6
IBM Spectrum LSF
[Diagram: users log in remotely to work servers for interactive work and job submission. The LSF batch job scheduler holds jobs in job queues and dispatches them to the calc. servers of the batch service.]

Challenges for the Batch System: Piled-up Waiting Jobs
Available job slots: 10000, limited by the number of CPU cores.
During congestion, user jobs wait in the job queue for a long time.
[Plot covering 2018/9/1 to 2018/9/30]

Challenges for the Batch System: Requests for Custom Environments
Experiment groups request specific systems:
  Develop an application on another OS.
  Test newer OS/libraries.
  Stick to an old OS.
Take advantage of cloud computing:
  Expand computing resources to clouds: resolves the piled-up jobs problem.
  Provide heterogeneous clusters: resolves the various requests for custom environments.

Overview of Cloud-integrated Batch Job System
Use cloud resources via the batch job submission command.
$ bsub -q aws /bin/hostname
Queue-based resource selection
[Diagram: LSF, through Resource Connector[1], selects among the SL6 cluster, OpenStack, AWS, and other clouds. Resources are labeled as on-premise or off-premise.]
[1] https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_kc_resource_connector.html

Integration with OpenStack
1. Create image: the project manager creates a custom image from a base image.
2. Create Resource Connector template: done by the cloud admin.
3. Submit job: done by the end user.
4. Launch instance: LSF Resource Connector launches a calc. server (VM) on OpenStack.
5. Dispatch: LSF dispatches the job to the calc. server (VM).
Normal jobs are dispatched to the calc. servers of the batch service (physical machines, SL6).
Resource Connector template:
  {
    "Name": "CentOS7_01",
    "Attributes": {
      "type": ["String", "X86_64"],
      "openstackhost": ["Numeric", "1"],
      "template": ["CentOS7_01"]
    },
    "Image": "generic-cent7-01",
    "Flavor": "c04-m016G"
  }

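To show how the template ties a template name to an image and a flavor, here is a minimal Python sketch that parses the JSON from the slide and prints a one-line summary. The key names are copied from the slide as shown. The helper `summarize` is hypothetical, and a production LSF Resource Connector configuration file may use different keys.

```python
import json

# Template text copied from the slide; key names are as shown there and
# may differ from a production Resource Connector configuration.
TEMPLATE_TEXT = """
{
  "Name": "CentOS7_01",
  "Attributes": {
    "type": ["String", "X86_64"],
    "openstackhost": ["Numeric", "1"],
    "template": ["CentOS7_01"]
  },
  "Image": "generic-cent7-01",
  "Flavor": "c04-m016G"
}
"""


def summarize(template):
    """Return a one-line summary of a parsed template (hypothetical helper)."""
    return f"{template['Name']}: image={template['Image']}, flavor={template['Flavor']}"


template = json.loads(TEMPLATE_TEXT)
print(summarize(template))  # CentOS7_01: image=generic-cent7-01, flavor=c04-m016G
```
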
Integration with Existing System: LDAP
LDAP authentication is used for the cluster.
Use the LDAP as the OpenStack authentication backend.
Use the LDAP for Linux accounts inside VMs.
Keystone domains for multiple backends:
  default domain: DB backend, service accounts (Nova, Neutron, Glance)
  ldap domain: LDAP backend, users

Share GPFS between Local Batch and OpenStack
Each compute node mounts GPFS and exposes the directories to its VMs via NFS.
[Diagram: calc. servers of the batch service mount GPFS directly. Calc. server VMs on an OpenStack compute node NFS-mount the directories that the compute node exports from its GPFS mount.]

Integration with AWS
EC2 instances are launched on demand.
The filesystem is not shared with the KEK batch system.
S3 object storage is used for sharing input/output data between KEK and AWS.
[Diagram: the KEK side (work server, LSF, NFS) connects to AWS over a VPN connection. AWS queue: LSF calc. servers on EC2. OpenStack queue: LSF calc. servers on OpenStack. The other queues: LSF calc. servers on physical machines (SL6).]

Use AWS S3 Object Storage for Sharing Data between KEK and AWS
The KEK batch system and OpenStack share the GPFS filesystem in KEK.
The AWS environment is independent of the KEK system.
S3FS[3] or goofys[4] allows Linux to mount an AWS S3 bucket via FUSE.
Workflow between the KEK work server, the S3 bucket, and the calc. servers on AWS:
  1. Put input data
  2. Copy input data
  3. Submit job
  4. Copy output data
  5. Get output data
[3] https://github.com/s3fs-fuse/s3fs-fuse
[4] https://github.com/kahing/goofys

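Because a FUSE-mounted bucket looks like an ordinary directory, the "put input data" and "submit job" steps can be sketched with plain file copies and a bsub command line. The Python sketch below is illustrative only: the `input/` layout and the helper names are hypothetical, a temporary directory stands in for the bucket mount point, and the bsub command is built but not executed.

```python
import shlex
import shutil
import tempfile
from pathlib import Path


def put_input(local_file, bucket_mount):
    """Copy one input file under <bucket_mount>/input/ (hypothetical layout)."""
    dest = Path(bucket_mount) / "input" / Path(local_file).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    # copyfile copies file contents only, without permission bits or timestamps.
    shutil.copyfile(local_file, dest)
    return dest


def bsub_command(queue, job_command):
    """Build (but do not run) a bsub command line for the given queue."""
    return ["bsub", "-q", queue] + shlex.split(job_command)


with tempfile.TemporaryDirectory() as tmp:
    # A temporary directory stands in for the S3FS/goofys mount point.
    mount = Path(tmp) / "bucket"
    local = Path(tmp) / "events.dat"
    local.write_text("input data\n")
    staged = put_input(local, mount)
    print(staged.name)  # events.dat
    print(" ".join(bsub_command("aws", "/bin/hostname")))  # bsub -q aws /bin/hostname
```

Fetching output data would be the same copy in the opposite direction, from the mounted bucket to local storage.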
Upload/Download Speed Comparison between S3FS and Goofys
Measured cp command execution time for four data sets: 1 MB x 1000 files, 10 MB x 100, 100 MB x 10, 1000 MB x 1.
$ cp -r /local/1mb_files_dir/ /s3fs/
$ cp -r /local/1mb_files_dir/ /goofys/
Goofys upload performance is better than that of S3FS.
S3FS offers more POSIX compatibility than Goofys.

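A measurement of this kind can be sketched in Python by generating a directory of fixed-size files and timing a recursive copy. This is not the authors' benchmark script: the helper names are hypothetical, the file sizes are scaled down, and a temporary directory stands in for the /s3fs or /goofys mount point.

```python
import shutil
import tempfile
import time
from pathlib import Path


def make_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` zero bytes each in `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        (directory / f"file_{i:04d}.bin").write_bytes(b"\0" * size_bytes)


def time_copy(src, dst):
    """Recursively copy src to dst and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "small_files_dir"
    make_files(src, count=10, size_bytes=1024)  # scaled-down stand-in data set
    # In a real run, dst would be a directory under the mounted bucket.
    elapsed = time_copy(src, Path(tmp) / "mounted_bucket" / "small_files_dir")
    print(f"copied 10 files in {elapsed:.4f} s")
```
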
Monitoring Resource Transition on AWS
[Plot: transition of the total number of cores after jobs are submitted. Series: number of instances on AWS, number of total cores on AWS.]

Scalability Test: Run Geant4 based Particle Therapy Monte Carlo Simulation Jobs on AWS
The Monte Carlo simulation shoots 2,000,000 protons in total on N CPU cores.
If N=10, each of the 10 CPU cores simulates 200,000 events.
[Figure labels: particle beam direction; treatment head with patient data obtained from CT images; simulated dose distribution.]

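The per-core workload on this slide is a simple division of the total event count. A minimal Python sketch follows. How a remainder is spread when N does not divide the total is an assumption here, since the slide only gives the N=10 case.

```python
def events_per_core(total_events, n_cores):
    """Split total_events across n_cores as evenly as possible.

    Spreading the remainder over the first cores is an assumption;
    the slide only states the evenly divisible N=10 case.
    """
    base, extra = divmod(total_events, n_cores)
    return [base + 1 if i < extra else base for i in range(n_cores)]


shares = events_per_core(2_000_000, 10)
print(shares[0], sum(shares))  # 200000 2000000
```
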
Scalability Test: Run Geant4 based Particle Therapy Monte Carlo Simulation Jobs on AWS
Scalability comparison between KEK and AWS
NFS degrades the performance.
The AWS result shows the same tendency as the KEK result.
[Plot series: AWS, KEK]

Scalability Test: Image Classification by Deep Learning on AWS
Classify CIFAR-10 images[5] into 10 categories.
We built a convolutional neural network and trained it for the classification using TensorFlow[6].
[Diagram: convolutional neural network with conv1 layer, pool1 layer, conv2 layer, pool2 layer, FC1 layer, FC2 layer; example output category "automobile"; feedback to the network.]
[5] https://www.cs.toronto.edu/~kriz/cifar.html
[6] https://www.tensorflow.org/tutorials/deep_cnn

Scalability Test: Image Classification Multi-node Deep Learning on AWS
We submitted TensorFlow jobs to the AWS queue and measured scalability by changing the number of workers.
TensorFlow cluster:
  Parameter server: stores and updates parameters.
  Workers: calculate loss.
[Plot annotations: 1 worker (64 cores), 23,000 sec (6.5 hours); 1,000 sec; 30 workers (1920 cores); 57 workers (3648 cores); "Traffic congestion?"]

Another Use Case: Automatic Offloading to Cloud
Submit 3000 jobs to the mixed-resources (KEK and AWS) queue.
Sequence over time (each job's status changes from PEND to RUN):
  1. Some jobs are dispatched to KEK servers.
     No more free resources on KEK.
  2. AWS instances are launched, and some jobs are dispatched to the AWS instances.
     Free resources are found on KEK.
  3. Some jobs are dispatched to KEK servers.
  4. Some jobs are dispatched to AWS servers.

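The offloading sequence above can be illustrated with a toy model: in each scheduling round, pending jobs first take whatever slots are free on KEK, and the overflow goes to AWS up to a per-round limit. This is only a sketch of the idea and not how LSF or Resource Connector is implemented. The function name and every number except the 3000 submitted jobs are hypothetical.

```python
def offload(pending, kek_free_per_round, aws_slots_per_round):
    """Toy model: fill free KEK slots first, then send overflow to AWS.

    Returns (history, remaining), where history is a list of
    (round, jobs_to_kek, jobs_to_aws) tuples.
    """
    history = []
    for rnd, kek_free in enumerate(kek_free_per_round, start=1):
        if pending == 0:
            break
        to_kek = min(pending, kek_free)
        pending -= to_kek
        to_aws = min(pending, aws_slots_per_round)
        pending -= to_aws
        history.append((rnd, to_kek, to_aws))
    return history, pending


# 3000 jobs as on the slide; slot counts per round are made-up values.
history, remaining = offload(3000, kek_free_per_round=[1000, 0, 500],
                             aws_slots_per_round=400)
print(history)    # [(1, 1000, 400), (2, 0, 400), (3, 500, 400)]
print(remaining)  # 300
```
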
Summary
We have succeeded in integrating OpenStack and AWS clouds with the LSF batch job system by using Resource Connector.
  It expands computing resources to clouds to reduce job turnaround times during congestion.
  It provides a variety of job processing environments through the choice of instance image.
The Monte Carlo simulation worked well on AWS, with slight performance degradation due to NFS.
The deep learning training speed on AWS scaled well up to about 2000 CPU cores.
We have succeeded in offloading some batch workloads to the AWS cloud automatically.
Cloud resources used in this work were provided in the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).