
1 Case Study of Elastic Slurm
Hello everyone, I am Rajul and I'll be speaking on a case study of elastic Slurm.
Rajul Kumar, Northeastern University (kumar.raju@husky.neu.edu); Evan Weinberg, Boston University; Chris Hill, Massachusetts Institute of Technology

2 Goal
Grow and shrink the cluster dynamically: add or remove nodes on the fly based on some trigger; when jobs arrive, provision resources.
Hold the state of the jobs being processed to save the invested compute cycles: don't lose the cycles already spent on a job. To move a job away from a set of resources, instead of killing or requeueing it, hold its state and continue from that point later.
Run multiple jobs on the same resources driven by implicit or explicit priority. Implicit: the actual workload takes priority over opportunistic HPC jobs. Explicit: different partitions, or OSG jobs. For example, a low-priority job runs and is moved out when a high-priority job arrives.

3 Approach to proof of concept
OpenStack cloud to spawn VMs as compute nodes. The suspend and resume feature of the VMs holds the state of the jobs. Open Science Grid jobs as the workload: single core, self-contained, high throughput, and not requiring explicitly high-performance resources. Use components that modify the existing system as little as possible.
We have built a proof of concept around this. Our approach was to use an OpenStack cloud to spawn VMs that are configured to act as compute nodes of the Slurm cluster, and to use the properties of the VMs to hold and restore the state of jobs. To test a real workload, we use single-node, self-contained Open Science Grid jobs that don't need high-performance resources, along with supporting components that were either already in line with the existing cluster or required the least modification to the existing setup, providing a path of least resistance.

4 Existing Slurm cluster
[Diagram: existing Slurm cluster (controller and compute nodes) alongside an OpenStack cloud subnet]
So let's say we have an existing Slurm cluster with a controller and multiple compute nodes, and an OpenStack cloud subnet to provision VMs and extend this cluster onto. These are two independent clusters on isolated networks that may be within the same data center or in different ones.
Existing Slurm cluster with controller and compute nodes. OpenStack cloud subnet to extend the existing cluster. Isolated clusters may or may not reside within the same data center.

5 Existing Slurm cluster
[Diagram: Slurm controller and compute nodes with a shared NFS file system; OpenStack cloud subnet]
These clusters generally have a shared file system, hosted on NFS, to share the job-processing bits such as datasets, logs, and results across the nodes. We can use it here because the jobs we'll be running are self-sufficient and don't depend on high-performance file systems.
Shared file system across the cluster using NFS to share the relevant bits for job processing.

6 Existing Slurm cluster
[Diagram: Slurm controller, compute nodes, NFS, and DNS server; OpenStack cloud subnet]
Along with this we have a DNS server to locate the nodes in the cluster by hostname. This is the existing Slurm cluster. Now, what do we need to make this cluster dynamic?
DNS server to locate the nodes in the cluster.

7 Monitor job queue
[Diagram: monitor daemon attached to the Slurm controller; compute nodes, NFS, DNS; OpenStack cloud subnet]
To make this cluster dynamic, we have developed a monitor that waits for an event configured to trigger dynamic provisioning of nodes and prioritized execution of jobs. In this case, it listens on the job queue for jobs that are eligible to run on the virtual compute nodes.
Monitor daemon listens for a trigger to spawn nodes (the job queue for the specific workload, i.e. OSG jobs).
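As an illustration of this step only, here is a minimal sketch of such a monitor in Python. It assumes the trigger is simply a pending job in a dedicated partition for OSG work; the partition name, polling interval, and the provisioning hook are all assumptions, not details from the original implementation.

```python
#!/usr/bin/env python3
"""Minimal monitor daemon sketch: watch the Slurm queue for pending jobs
in a dedicated partition and trigger node provisioning."""
import subprocess
import time

PARTITION = "osg"       # hypothetical partition holding the OSG/low-priority jobs
POLL_INTERVAL = 30      # seconds between queue checks (arbitrary)


def pending_jobs(partition):
    """Return the IDs of pending jobs in the given partition via squeue."""
    out = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-p", partition, "-o", "%i"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def provision_node():
    """Placeholder for the provisioning step (see the salt-cloud sketch on the next slide)."""
    print("trigger: would provision a new virtual compute node here")


if __name__ == "__main__":
    while True:
        jobs = pending_jobs(PARTITION)
        if jobs:
            print(f"pending jobs in '{PARTITION}': {jobs}")
            provision_node()
        time.sleep(POLL_INTERVAL)
```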

8 Dynamic resource provisioning
[Diagram: monitor calls a provisioning agent, which spawns a new compute VM in the OpenStack cloud subnet]
When a job arrives in the queue, the monitor calls a provisioning agent that spawns a VM. Here we used salt-cloud, the cloud-provisioning agent from the configuration management tool Salt. We used Salt because it is already used by the existing cluster and we could leverage that; it could be any tool that does this job. Salt-cloud connects to OpenStack, provisions a VM, and installs the Salt agent so the VM can later be configured with Salt.
Monitor calls the provisioning agent (salt-cloud) to spawn the node/VM when a job comes in.
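A hedged sketch of the provisioning call the monitor might make, assuming a salt-cloud profile for the OpenStack provider has already been defined in the Salt configuration; the profile name slurm-compute and the node name are illustrative, not from the original setup.

```python
import subprocess


def spawn_compute_node(node_name, profile="slurm-compute"):
    """Ask salt-cloud to create a VM from the given profile; the new VM comes
    up with the Salt minion installed so it can be configured afterwards."""
    subprocess.run(["salt-cloud", "-p", profile, node_name], check=True)


# Example usage: spawn_compute_node("vcompute-01")
```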

9 Channel between clusters
[Diagram: gateway nodes with floating IPs on both sides running OpenVPN, linking the Slurm cluster and the OpenStack cloud subnet]
The nodes need a channel to reach across the clusters. OpenVPN provides a secure channel, supports different protocols, and allows access using private network addresses. We run the OpenVPN service on gateway nodes with floating IPs and route the traffic between the clusters via the gateways. A common channel like this is easier to configure and maintain than point-to-point connectivity.
OpenVPN provides a secure channel to join the two networks. Route the traffic for the other cluster to the gateway node.
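As a sketch of the routing part only (the OpenVPN setup itself is omitted), the snippet below adds a route that sends traffic for the cloud subnet through the local gateway node. All addresses are illustrative assumptions, and in practice a configuration tool would push this to every node on each side.

```python
import subprocess

# Assumed addressing: the OpenStack subnet is 10.0.0.0/24 and the local
# OpenVPN gateway on the Slurm side is 192.168.10.254 (both illustrative).
REMOTE_SUBNET = "10.0.0.0/24"
LOCAL_GATEWAY = "192.168.10.254"

# On a node of the existing cluster, send traffic destined for the cloud
# subnet through the gateway instead of maintaining point-to-point links.
subprocess.run(["ip", "route", "add", REMOTE_SUBNET, "via", LOCAL_GATEWAY], check=True)
```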

10 Configure new compute node
[Diagram: configuration tool (Salt) configures the new compute VM; configuration files on NFS; hosts file used by DNS]
Salt configures the node: it connects to the NFS file system, from which the necessary configuration files are pulled, and updates the hosts file with the node's network address. The hosts file is used by DNS to locate the nodes, which in turn lets the controller identify the new node on a configuration reload.
Configuration tool (Salt) configures the node for Slurm. Pulls the necessary configuration files from NFS. Updates the hosts file with the IP address of the new node for DNS.
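In the talk this step is done by Salt; purely as an illustration, the sketch below performs the equivalent actions by hand. The NFS mount point, file locations, node name, and address are assumptions, and in practice the steps run on different hosts (the hosts file on the DNS host, the config files and slurmd on the new node, scontrol on the controller).

```python
import subprocess

NEW_NODE = "vcompute-01"     # assumed name of the freshly provisioned VM
NEW_NODE_IP = "10.0.0.21"    # assumed VPN-side address of the VM

# 1. Make the node resolvable: append it to the hosts file consumed by DNS
#    (runs on the host serving DNS/hosts data).
with open("/etc/hosts", "a") as hosts:
    hosts.write(f"{NEW_NODE_IP} {NEW_NODE}\n")

# 2. Pull the Slurm configuration from the NFS share (assumed mount point),
#    on the new node itself.
subprocess.run(["cp", "/mnt/shared/slurm/slurm.conf", "/etc/slurm/slurm.conf"], check=True)
subprocess.run(["cp", "/mnt/shared/slurm/munge.key", "/etc/munge/munge.key"], check=True)

# 3. Start the compute daemon on the node, then have the controller re-read
#    its configuration so it picks up the new node.
subprocess.run(["systemctl", "restart", "slurmd"], check=True)
subprocess.run(["scontrol", "reconfigure"], check=True)
```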

11 Running low priority job
[Diagram: OSG job running on the newly federated compute VM in the OpenStack cloud subnet]
OSG jobs are allocated to the newly federated node. Now suppose a high-priority job comes in that is eligible to use the same resources as the compute node running the OSG job. We have already invested some compute cycles in the OSG job, so we want to save its state rather than kill or requeue it.
Low-priority (OSG) job is running on the newly federated compute node.

12 Suspend the VM for high priority job
[Diagram: the compute VM running the OSG job is suspended in the OpenStack cloud subnet]
Suspend the job and update its state, then suspend the node at the OpenStack level to persist the VM state. KVM releases the resources, and we leverage this to spawn another VM on the same resources.
Suspend the job running on the compute node via Slurm. Call OpenStack to suspend the VM. Free the resources for the higher-priority job.
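A minimal sketch of this preemption step, assuming we already know the Slurm job ID of the low-priority job and the OpenStack server name of the VM it runs on (both hypothetical here).

```python
import subprocess


def preempt(job_id, server_name):
    """Save the low-priority job's state: suspend the job in Slurm first,
    then suspend the VM in OpenStack so KVM releases its resources."""
    subprocess.run(["scontrol", "suspend", str(job_id)], check=True)
    subprocess.run(["openstack", "server", "suspend", server_name], check=True)


# Example usage: preempt(12345, "vcompute-01")
```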

13 Running high priority job
[Diagram: a second compute VM, spawned on the same resources, runs the high-priority job while the suspended VM is held]
A new VM is spawned by the same process, utilizing the same resources used by the previous VM, and the high-priority job runs on this new VM. The monitor keeps checking the state of the job for completion.
Spawn a new VM as a Slurm compute node. Configure it and run the high-priority job on it.
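The completion check can be as simple as polling squeue until the high-priority job disappears from the queue; a minimal sketch, with an arbitrary polling interval.

```python
import subprocess
import time


def wait_for_completion(job_id, poll=60):
    """Block until the given job no longer appears in the Slurm queue."""
    while True:
        out = subprocess.run(
            ["squeue", "-h", "-j", str(job_id)],
            capture_output=True, text=True,
        )
        if not out.stdout.strip():   # job has left the queue (finished or cancelled)
            return
        time.sleep(poll)
```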

14 Resume the VM/job
[Diagram: the high-priority VM is removed, the suspended VM is resumed, and its clock is synced against an NTP server]
On completion of the high-priority job, the monitor removes the high-priority VM and resumes the suspended VM. The VM starts with the clock it had at the point when it was suspended, which causes a time difference, so it syncs its clock with NTP to stay in step with the cluster and avoid token-expiry and authentication issues. This walkthrough shows dynamically provisioning a node, running a job on it, saving its state when a higher-priority job arrives, and resuming the job afterwards.
On high-priority job completion, remove the priority VM. Resume the suspended VM. Sync the system clock with the NTP server to avoid token expiration.
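A hedged sketch of the resume sequence, reusing the hypothetical names from the earlier sketches. The clock-sync command assumes chrony is the NTP client on the node and that the node is reachable over SSH; neither detail is stated in the original.

```python
import subprocess


def resume(job_id, suspended_server, priority_server):
    """Tear down the high-priority VM, wake the suspended VM, fix its clock,
    and let the held low-priority job continue."""
    # Remove the VM that ran the high-priority job.
    subprocess.run(["openstack", "server", "delete", priority_server], check=True)
    # Resume the suspended VM; it wakes with the clock it had at suspend time.
    subprocess.run(["openstack", "server", "resume", suspended_server], check=True)
    # Step the clock on the resumed node (assumes chrony, run remotely over SSH)
    # so tokens and authentication stay valid across the cluster.
    subprocess.run(["ssh", suspended_server, "chronyc", "makestep"], check=True)
    # Let the held low-priority job continue from where it stopped.
    subprocess.run(["scontrol", "resume", str(job_id)], check=True)


# Example usage: resume(12345, "vcompute-01", "vcompute-02")
```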

15 Conclusion and next steps
So, where are we with this? Implemented in a staging environment. Actively extending this to the existing Slurm cluster on Engage1. Modify Slurm to incorporate the essential features and transfer the modifications to the Slurm core.

16 Future work
More extensive monitoring capabilities. Harden the setup and utilize full data-center performance (hardware, network, etc.). Enhance the implementation for more complex, high-performance applications; it currently runs single-node compute jobs that focus on throughput rather than performance. Extend the idea to physical nodes. Experiment with container frameworks. More triggers to drive the cluster. Better driving algorithms to move resources.

17 Conclusions
Add or remove nodes on demand. Effectively increase the throughput of the cluster. Increase the utilization of the cluster/cloud.

