Kubernetes Modifications for GPUs


1 Kubernetes Modifications for GPUs
Sanjeev Mehrotra

2 Kubernetes resource scheduling
Terminology:
Allocatable – what is available at the node
Used – what is already being used on the node (called RequestedResource)
Requests – what is requested by the container(s) for the pod

Diagram: kubelets on Worker 1 … Worker N send "Allocatable" resources for their nodes; the scheduler keeps track of "Used" and receives the scheduling request from the pod (container) spec, which carries the container requests.

3 Resources
All resources (allocatable, used, and requests) are represented as a "ResourceList", which is simply a list of key-value pairs, e.g.
memory: 64GiB
cpu: 8
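As a rough illustration (a minimal sketch, not the actual Kubernetes API types, which use resource.Quantity values), a ResourceList can be thought of as a map from resource names to quantities:

    package main

    import "fmt"

    // ResourceList is a simplified stand-in for the Kubernetes ResourceList:
    // a set of key-value pairs naming each resource and its quantity.
    type ResourceList map[string]int64

    func main() {
        allocatable := ResourceList{
            "cpu":    8,        // cores
            "memory": 64 << 30, // 64GiB, in bytes
        }
        fmt.Println(allocatable)
    }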

4 Simple scheduling
1. Find worker nodes that can "fit" a pod spec
   plugin/pkg/scheduler/algorithm/predicates
2. Prioritize the list of nodes
   plugin/pkg/scheduler/algorithm/priorities
3. Try to schedule the pod on a node – the node may have an additional admission policy, so the pod may still fail
4. If it fails, try the next node on the list
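The sketch below is a minimal, self-contained illustration of this flow (it does not use the real scheduler interfaces or package layout): filter the nodes with a fit predicate, score the survivors, and try them best-first, falling back to the next candidate if binding fails.

    package main

    import (
        "fmt"
        "sort"
    )

    type Node struct {
        Name       string
        FreeCPU    int64 // millicores
        FreeMemory int64 // bytes
    }

    type Request struct {
        CPU    int64
        Memory int64
    }

    // fits is a stand-in for the predicate step.
    func fits(n Node, r Request) bool {
        return r.CPU <= n.FreeCPU && r.Memory <= n.FreeMemory
    }

    // score is a stand-in for the priority step: prefer the node with the most free CPU.
    func score(n Node) int64 { return n.FreeCPU }

    func schedule(nodes []Node, r Request, bind func(Node) error) (string, error) {
        // 1. Find nodes that fit.
        var feasible []Node
        for _, n := range nodes {
            if fits(n, r) {
                feasible = append(feasible, n)
            }
        }
        // 2. Prioritize (highest score first).
        sort.Slice(feasible, func(i, j int) bool { return score(feasible[i]) > score(feasible[j]) })
        // 3 and 4. Try nodes in order; if binding fails (e.g. node admission), try the next one.
        for _, n := range feasible {
            if err := bind(n); err == nil {
                return n.Name, nil
            }
        }
        return "", fmt.Errorf("no node fits the request")
    }

    func main() {
        nodes := []Node{{"worker1", 4000, 8 << 30}, {"worker2", 16000, 64 << 30}}
        name, err := schedule(nodes, Request{CPU: 8000, Memory: 16 << 30}, func(Node) error { return nil })
        fmt.Println(name, err)
    }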

5 Find nodes that fit
For simple scheduling, a node will NOT fit if
Allocatable < Request + Used

Example:

    if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
    }
    if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
    }
    if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU, podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
    }

6 Why do we need modifications?
The current scheduler only allows constraints like the following in a pod spec:
"Need 4 GPUs"
It does NOT allow constraints like the following:
"Need 4 GPUs with minimum memory 12GiB", or
"Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB", or
"Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)"

7 Solution 1 – Label nodes and use node selector
However, this is not optimal in cases with heterogeneous configurations. For example, one machine may have GPUs of several types, some with large amounts of memory and some with little. If a label is used, we don't know which GPUs will get assigned, so the node can only be labeled with its minimally performant GPU. Also, even in homogeneous configurations, the kubelet running on the worker node still has to do the bookkeeping of which GPUs are in use.
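A minimal sketch of the limitation, using made-up labels and GPU data (not Kubernetes API objects): a nodeSelector-style label match can only say "this node has K80s"; it says nothing about which of the node's GPUs, or how much memory, the container will actually get.

    package main

    import "fmt"

    // GPU describes one device on a node (hypothetical fields, for illustration only).
    type GPU struct {
        ID       string
        Type     string
        MemoryGB int
    }

    func main() {
        // Heterogeneous node: a label can only describe the node as a whole,
        // so it is forced to advertise the least capable GPU.
        nodeLabels := map[string]string{"gpu-type": "k80"}
        nodeGPUs := []GPU{
            {"gpu0", "k80", 12},
            {"gpu1", "m60", 8},
        }

        // A pod nodeSelector matches the node...
        nodeSelector := map[string]string{"gpu-type": "k80"}
        matches := true
        for k, v := range nodeSelector {
            if nodeLabels[k] != v {
                matches = false
            }
        }
        fmt.Println("node matches selector:", matches)

        // ...but nothing here says which of these GPUs the container will be given;
        // that bookkeeping is left to the kubelet on the worker node.
        fmt.Println("GPUs on node:", nodeGPUs)
    }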

8 Solution 2 – Group Scheduler
Define a richer syntax on ResourceLists to allow such constraints to be scheduled.
Example – instead of:
NvidiaGPU: 2
use something like the following, where the memory for each GPU is clearly specified:
Gpu/0/cards: 1
Gpu/0/memory: 12GiB
Gpu/1/cards: 1
Gpu/1/memory: 6GiB
The "cards" entry is present to prevent sharing of GPU cards.
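A rough sketch of how such a request could be built (simplified types, not the actual scheduler code): the hierarchical keys are still plain strings in a ResourceList, so the machinery that passes key-value pairs around does not have to change.

    package main

    import "fmt"

    // ResourceList is a simplified key-value resource list (string keys, integer quantities).
    type ResourceList map[string]int64

    const GiB = int64(1) << 30

    // addGPURequest adds the hierarchical keys for one requested GPU.
    func addGPURequest(rl ResourceList, idx int, memory int64) {
        rl[fmt.Sprintf("Gpu/%d/cards", idx)] = 1       // "cards" prevents sharing of a physical GPU
        rl[fmt.Sprintf("Gpu/%d/memory", idx)] = memory // per-GPU minimum memory
    }

    func main() {
        req := ResourceList{}
        addGPURequest(req, 0, 12*GiB)
        addGPURequest(req, 1, 6*GiB)
        for k, v := range req {
            fmt.Println(k, v)
        }
    }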

9 Example – GPU with NVLink
For 4 GPUs in two groups, where each GPU is connected via NVLink to the other GPU in its group:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
Diagram: GpuGrp0 contains Gpu0 and Gpu1; GpuGrp1 contains Gpu2 and Gpu3.

10 Group scheduler
All resource lists (allocatable, used, and requests) are specified in this manner. Scheduling can no longer simply compare values with the same key to check "fit", e.g.:
allocatable["memory"] < used["memory"] + requested["memory"]
Example – Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB

11 Group scheduler
The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for "fit" and "allocation". An "allocation" is a string-to-string key-value mapping from "Requests" to "Allocatable".
Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
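A toy sketch of the idea (a flat greedy matcher, not the actual hierarchical allocator with scorers): walk the requested GPUs, find an unused allocatable GPU whose memory is at least the requested minimum, and record the request-key to allocatable-key pairs that become the "allocation".

    package main

    import "fmt"

    type GPU struct {
        Key      string // e.g. "GpuGrp/0/Gpu/0"
        MemoryGB int
    }

    // allocate greedily maps each requested GPU to a free allocatable GPU with enough memory.
    // It returns the request-key -> allocatable-key mapping, or ok=false if the node does not fit.
    func allocate(requested, allocatable []GPU) (alloc map[string]string, ok bool) {
        used := make([]bool, len(allocatable))
        alloc = map[string]string{}
        for _, r := range requested {
            found := false
            for i, a := range allocatable {
                if !used[i] && a.MemoryGB >= r.MemoryGB {
                    used[i] = true
                    alloc[r.Key] = a.Key
                    found = true
                    break
                }
            }
            if !found {
                return nil, false
            }
        }
        return alloc, true
    }

    func main() {
        requested := []GPU{{"GpuGrp/A/Gpu/0", 10}, {"GpuGrp/B/Gpu/1", 10}}
        allocatable := []GPU{
            {"GpuGrp/0/Gpu/0", 12}, {"GpuGrp/0/Gpu/1", 12},
            {"GpuGrp/1/Gpu/2", 8}, {"GpuGrp/1/Gpu/3", 8},
        }
        alloc, ok := allocate(requested, allocatable)
        fmt.Println(ok, alloc) // both 10GiB requests map onto the 12GiB GPUs in GpuGrp/0
    }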

12 Group Allocation
Allocatable:
Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1
Requests:
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1
Diagram: arrows map each entry under "Requests" to an entry under "Allocatable".

13 Main Modifications – scheduler side
Addition of an AllocateFrom field in the pod specification. This is a list of key-value pairs which specifies the mapping from "Requests" to "Allocatable".
pkg/api/types.go
Addition of the group scheduler code.
plugin/pkg/scheduler/algorithm/predicates/grpallocate.go
plugin/pkg/scheduler/algorithm/scorer
Modification of the scheduler to write the pod update after scheduling and to call the group allocator.
plugin/pkg/scheduler/generic_scheduler.go
plugin/pkg/scheduler/scheduler.go
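A minimal sketch of what such a field addition could look like (illustrative only; apart from the AllocateFrom name itself, the types and fields below are simplified and are not the actual pkg/api/types.go):

    // Package api here is only a stand-in for the real Kubernetes API package.
    package api

    // ResourceLocation names one allocatable resource on a node,
    // e.g. "GpuGrp/0/Gpu/1/cards".
    type ResourceLocation string

    type ContainerSpec struct {
        Name     string
        Requests map[string]int64 // resource requests, keyed by hierarchical resource name
        // AllocateFrom maps each requested resource key to the allocatable resource
        // key on the chosen node; it is written by the scheduler after group
        // allocation succeeds, so the kubelet no longer has to pick GPUs itself.
        AllocateFrom map[string]ResourceLocation
    }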

14 Kubelet modifications
The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and uses /dev/nvidia* to count the devices; both of these are hacks. With the addition of the "AllocateFrom" field, the scheduler decides which GPUs to use and keeps track of which ones are in use.
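For reference, the device-counting hack amounts to something like the following sketch (a simplification, not the actual kubelet code):

    package main

    import (
        "fmt"
        "path/filepath"
        "regexp"
    )

    // discoverGPUs counts GPU device nodes by globbing /dev/nvidia*, keeping only the
    // numbered devices (/dev/nvidia0, /dev/nvidia1, ...) and skipping control nodes
    // such as /dev/nvidiactl. Note that this yields device paths only: nothing about
    // GPU memory, type, or topology.
    func discoverGPUs() ([]string, error) {
        matches, err := filepath.Glob("/dev/nvidia*")
        if err != nil {
            return nil, err
        }
        numbered := regexp.MustCompile(`^/dev/nvidia[0-9]+$`)
        var devices []string
        for _, m := range matches {
            if numbered.MatchString(m) {
                devices = append(devices, m)
            }
        }
        return devices, nil
    }

    func main() {
        devs, err := discoverGPUs()
        fmt.Println(devs, err)
    }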

15 Main Modifications – kubelet side
Use of AllocateFrom to decide which GPUs to use.
Use of nvidia-docker-plugin to find GPUs (instead of looking at /dev/nvidia*). This is also needed to get richer information such as GPU memory, GPU type, and topology information (i.e. NVLink).
Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver).
Allow specification of the driver when specifying a mount – needed to use the nvidia-docker driver.
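As a sketch of what querying nvidia-docker-plugin looks like (the port 3476 and the gpu/info/json path are assumed defaults here, and error handling is minimal):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // nvidia-docker-plugin exposes a local REST API; the GPU info endpoint
        // returns JSON with per-GPU memory, model, and topology details.
        resp, err := http.Get("http://localhost:3476/gpu/info/json")
        if err != nil {
            fmt.Println("nvidia-docker-plugin not reachable:", err)
            return
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("read error:", err)
            return
        }
        fmt.Println(string(body))
    }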

16 Integration with community
Eventual goal: kubelets know nothing about GPUs.
Diagram: device plugins (e.g. for GPUs) tell the kubelet which resources to advertise and supply resource usage / docker parameters; kubelets on Worker 1 … Worker N send "Allocatable" resources for their nodes; the scheduler keeps track of "Used" and asks a scheduler extender for fit; the extender performs group allocation and writes the update to the pod spec with the allocation; the scheduling request originates from the pod (container) spec, which carries the container requests.
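A rough illustration of the scheduler-extender half of this picture (the request/response types below are simplified stand-ins, not the actual Kubernetes extender API structs): the extender receives the pod's requests and the candidate nodes, runs group allocation, and reports which nodes fit along with the allocation it would make.

    package main

    import (
        "encoding/json"
        "net/http"
    )

    // Simplified stand-ins for the extender filter request/response payloads.
    type FilterArgs struct {
        PodName   string           `json:"podName"`
        Requests  map[string]int64 `json:"requests"`  // hierarchical resource requests
        NodeNames []string         `json:"nodeNames"` // candidate nodes from the scheduler
    }

    type FilterResult struct {
        NodeNames   []string                     `json:"nodeNames"`   // nodes that fit
        Allocations map[string]map[string]string `json:"allocations"` // per-node request -> allocatable mapping
        FailedNodes map[string]string            `json:"failedNodes"` // node -> reason
    }

    // groupAllocate is a placeholder for the real group-allocation logic.
    func groupAllocate(requests map[string]int64, node string) (map[string]string, bool) {
        return map[string]string{}, true // assume everything fits in this sketch
    }

    func filterHandler(w http.ResponseWriter, r *http.Request) {
        var args FilterArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        result := FilterResult{Allocations: map[string]map[string]string{}, FailedNodes: map[string]string{}}
        for _, node := range args.NodeNames {
            if alloc, ok := groupAllocate(args.Requests, node); ok {
                result.NodeNames = append(result.NodeNames, node)
                result.Allocations[node] = alloc
            } else {
                result.FailedNodes[node] = "insufficient GPU resources"
            }
        }
        json.NewEncoder(w).Encode(result)
    }

    func main() {
        http.HandleFunc("/filter", filterHandler)
        http.ListenAndServe(":8888", nil) // the endpoint the scheduler is configured to call
    }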

17 Needed in Kubernetes core
We will need a few things in order to achieve separation from the core, which will allow the latest Kubernetes binaries to be used directly:
Resource Class (scheduled for v1.9) will allow for non-identity mappings between requests and allocatable.
Device plugins and native Nvidia GPU support – v1.13 for now.

18 Other future Kubernetes/Scheduler work
Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training) – for example, networking constraints when scheduling distributed training.
Container networking for faster cross-pod communication (e.g. using RDMA / IB).

