Case Study of Elastic Slurm
Rajul Kumar, Northeastern University (kumar.raju@husky.neu.edu)
Evan Weinberg, Boston University (weinbe2@bu.edu)
Chris Hill, Massachusetts Institute of Technology (cnh@mit.edu)

Hello everyone, I am Rajul, and I'll be speaking on a case study of Elastic Slurm.

Goal
Grow and shrink the cluster dynamically: add or remove nodes on the fly based on some trigger; when a job arrives, provision resources for it.
Hold the state of jobs in progress to save the computing cycles already invested: to move a job away from a set of resources, instead of killing or requeuing it, hold its state and continue from that point later.
Run multiple jobs on the same resources, driven by implicit or explicit priority. Implicit means the actual workload takes precedence over HPC jobs; explicit means different partitions, or OSG. An example is a low-priority job running and moving out when a high-priority job arrives.

Approach to proof of concept
We have built a proof of concept around this. Our approach was to use an OpenStack cloud to spawn VMs that are configured to act as compute nodes of the Slurm cluster, and to use the suspend and resume features of the VMs to hold and restore the state of jobs. As a real workload we use Open Science Grid (OSG) jobs: single-core, self-contained, high-throughput jobs that don't require explicitly high-performance resources. The supporting components were either already in line with the existing cluster or modified the existing setup as little as possible, providing a path of least resistance.

Existing Slurm cluster
Let's say we have an existing Slurm cluster with a controller and multiple compute nodes, and an OpenStack cloud subnet on which to provision VMs and extend this cluster. These are two independent clusters on isolated networks that may or may not reside within the same datacenter.

Existing Slurm cluster
These clusters generally have a shared file system to share the job-processing bits, such as datasets, logs, and results, across the nodes; here it is hosted over NFS. We could use it as-is, since the jobs we'll be running are self-contained and don't depend on a high-performance file system.
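For illustration, a newly added node would mount the cluster's NFS export before running jobs. This is a minimal sketch, and the export path and mount point are hypothetical:

```python
import subprocess

# Hypothetical NFS export on the controller and a local mount point.
NFS_EXPORT = "controller:/export/slurm-shared"
MOUNT_POINT = "/shared"

def mount_shared_fs():
    """Mount the cluster's shared NFS file system on this node (needs root)."""
    subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
    subprocess.run(["mount", "-t", "nfs", NFS_EXPORT, MOUNT_POINT], check=True)

if __name__ == "__main__":
    mount_shared_fs()
```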

Existing Slurm cluster
Along with this, we have a DNS server to locate the nodes in the cluster by hostname. This is the existing Slurm cluster. Now, what do we need to make this cluster dynamic?

Monitor job queue
To make this cluster dynamic, we have developed a monitor daemon that waits for an event configured to trigger the dynamic provisioning of nodes and the prioritized execution of jobs. In this case, it listens to the job queue for jobs that are eligible to run on the virtual compute nodes, i.e. the specific workload of OSG jobs.
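As a rough sketch of the idea (not the actual daemon from the talk), such a monitor could poll Slurm's queue for pending jobs in a designated partition and fire a provisioning hook; the partition name and the hook are assumptions:

```python
import subprocess
import time

PARTITION = "osg"          # hypothetical partition holding the OSG workload
POLL_INTERVAL_SEC = 30

def pending_job_ids(partition):
    """List the IDs of pending jobs in the partition via squeue."""
    out = subprocess.run(
        ["squeue", "--noheader", "--format=%i",
         "--states=PENDING", "--partition", partition],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def provision_node():
    """Placeholder hook: spawn a VM compute node (see the salt-cloud step)."""
    print("trigger: provisioning a new compute node")

if __name__ == "__main__":
    while True:
        if pending_job_ids(PARTITION):
            provision_node()
        time.sleep(POLL_INTERVAL_SEC)
```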

Dynamic resource provisioning
When a job arrives in the queue, the monitor calls a provisioning agent that spawns a VM. Here we have used salt-cloud, the cloud provisioning agent from the configuration management tool Salt. We chose Salt because it was already in use by the existing cluster, so we could leverage it; any tool that does this job would work. Salt-cloud connects to OpenStack, provisions a VM, and installs the Salt agent so that the node can later be configured using Salt.
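In practice the provisioning call can be as small as a single salt-cloud invocation against a pre-defined profile; the profile and node names below are hypothetical:

```python
import subprocess

CLOUD_PROFILE = "openstack-slurm-compute"  # hypothetical salt-cloud profile

def spawn_compute_node(node_name):
    """Ask salt-cloud to create a VM from the profile; salt-cloud also
    installs the Salt minion on the new node so it can be configured later."""
    subprocess.run(["salt-cloud", "-p", CLOUD_PROFILE, node_name], check=True)

spawn_compute_node("vcompute-01")
```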

Channel between clusters
The nodes need a channel to reach across the two networks. OpenVPN provides a secured channel to join the two networks: it supports different protocols and allows access using private network addresses. We run the OpenVPN service on gateway nodes with floating IPs and route the traffic destined for the other cluster via the gateway. A common channel like this is easier to configure and maintain than point-to-point connectivity.
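As a sketch, each node sends traffic for the other cluster's subnet through its local gateway; the subnet and gateway addresses below are made up:

```python
import subprocess

REMOTE_SUBNET = "10.20.0.0/24"  # hypothetical subnet of the other cluster
LOCAL_GATEWAY = "10.10.0.5"     # hypothetical OpenVPN gateway on this side

def route_via_gateway():
    """Route traffic for the remote cluster through the OpenVPN gateway."""
    subprocess.run(
        ["ip", "route", "add", REMOTE_SUBNET, "via", LOCAL_GATEWAY],
        check=True,
    )

if __name__ == "__main__":
    route_via_gateway()
```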

Configure new compute node
The configuration tool (Salt) then configures the node for Slurm. It connects the node to the NFS file system, from which the necessary configuration files are pulled, and it updates the hosts file with the IP address of the new node. The hosts file is used by DNS to locate the nodes, and the controller in turn uses DNS to identify the new node on a configuration reload.
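A minimal sketch of the registration step follows; the host entry format is standard, but the assumption here is that slurm.conf already describes the node, so a reconfigure is enough for the controller to pick it up:

```python
import subprocess

def register_node(hostname, ip_addr):
    """Add the new node to /etc/hosts, then ask the Slurm controller
    to reread its configuration so the node is recognized."""
    with open("/etc/hosts", "a") as hosts:
        hosts.write(f"{ip_addr} {hostname}\n")
    # scontrol reconfigure tells slurmctld to reload slurm.conf.
    subprocess.run(["scontrol", "reconfigure"], check=True)

register_node("vcompute-01", "10.20.0.11")
```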

Running low priority job
OSG jobs are then allocated to the newly federated node. Now suppose a high-priority job comes in that is eligible to use the same resources as the compute node running the OSG job. We have already invested some compute cycles in the OSG job, so we want to save its state rather than kill or requeue it.

Suspend the VM for high priority job
We suspend the job via Slurm and update its state, then suspend the node at the OpenStack level, which persists the VM state. KVM releases the resources, and we leverage that to spawn another VM on the same resources, freeing them for the higher-priority job.
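Roughly, the suspend step chains a Slurm job suspend with an OpenStack server suspend; the job ID and server name are placeholders:

```python
import subprocess

def suspend_for_high_priority(job_id, server_name):
    """Pause the low-priority job, then suspend its VM so that KVM
    releases the underlying resources while the VM state persists."""
    subprocess.run(["scontrol", "suspend", str(job_id)], check=True)
    subprocess.run(["openstack", "server", "suspend", server_name], check=True)

suspend_for_high_priority(12345, "vcompute-01")
```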

Running high priority job
We spawn a new VM through the same process, utilizing the same resources used by the previous VM, configure it as a Slurm compute node, and run the high-priority job on it. The monitor keeps checking the state of that job for completion.

Resume the VM/job
On completion of the high-priority job, the monitor removes the high-priority VM and resumes the suspended VM. The resumed VM starts with the clock it had at the moment it was suspended, which causes a time difference, so we sync its system clock with an NTP server to keep it in sync with the cluster and avoid token expiration and authentication issues. This demonstrates dynamically provisioning a node, running a job on it, saving the job's state when a higher-priority job arrives, and resuming it afterwards.
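The resume path mirrors the suspend path, with a clock sync added; the names are placeholders, and the chrony call is an assumption about how the clock is stepped:

```python
import subprocess

def resume_after_high_priority(hp_server, suspended_server, job_id):
    """Remove the high-priority VM, resume the suspended one, step its
    clock, then let the suspended Slurm job continue."""
    subprocess.run(["openstack", "server", "delete", hp_server], check=True)
    subprocess.run(["openstack", "server", "resume", suspended_server], check=True)
    # On the resumed node, force an immediate clock correction so that
    # Slurm and auth tokens stay valid (assumes chrony on the node).
    subprocess.run(["ssh", suspended_server, "chronyc", "makestep"], check=True)
    subprocess.run(["scontrol", "resume", str(job_id)], check=True)
```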

Conclusion and next steps
So where are we with this? The system is implemented in a staging environment. We are actively extending it to the existing Slurm cluster on Engage1, and we are modifying Slurm to incorporate the essential features, with the aim of transferring those modifications to the Slurm core.

Future work
More extensive monitor capabilities.
Harden the setup and utilize full datacenter performance (hardware, network, etc.).
Enhance the implementation for more complex, high-performance applications; it currently runs single-node compute jobs that focus on throughput rather than performance.
Extend the idea to physical nodes.
Experiment with container frameworks.
Add more triggers to drive the cluster and better driving algorithms to move resources.

Conclusions
Add or remove nodes on demand.
Effectively increase the throughput of the cluster.
Increase the utilization of the cluster and the cloud.
http://info.massopencloud.org