Kubernetes Modifications for GPUs
Sanjeev Mehrotra
Kubernetes resource scheduling
Terminology:
- Allocatable: what is available at the node
- Used: what is already being used on the node (called RequestedResource)
- Requests: what is requested by the container(s) of the pod
[Diagram: kubelets on Worker 1, Worker 2, ..., Worker N send “Allocatable” resources for their nodes to the scheduler, which keeps track of “Used”. A scheduling request carries the pod (container) spec with the container requests.]
Resources
All resources (allocatable, used, and requests) are represented as a “ResourceList”, which is simply a list of key-value pairs, e.g.
    memory: 64GiB
    cpu: 8
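As a rough illustration, a ResourceList can be modeled as a map from resource name to quantity. This is a sketch under simplifying assumptions: the real Kubernetes type maps resource names to resource.Quantity values, whereas here plain int64 values are used (bytes for memory, whole cores for cpu), and the helper name slideExample is hypothetical.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for the Kubernetes type of the same
// name: key-value pairs from resource name to quantity.
// Assumption: quantities are plain int64 values, not resource.Quantity.
type ResourceList map[string]int64

// GiB is the number of bytes in one gibibyte.
const GiB int64 = 1 << 30

// slideExample (hypothetical helper) builds the example: 64GiB memory, 8 CPUs.
func slideExample() ResourceList {
	return ResourceList{
		"memory": 64 * GiB,
		"cpu":    8,
	}
}

func main() {
	rl := slideExample()
	fmt.Printf("memory=%d bytes, cpu=%d\n", rl["memory"], rl["cpu"])
}
```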
Simple scheduling
1. Find worker nodes that can “fit” a pod spec
    plugin/pkg/scheduler/algorithm/predicates
2. Prioritize list of nodes
    plugin/pkg/scheduler/algorithm/priorities
3. Try to schedule pod on node; the node may have an additional admission policy, so the pod may fail
4. If it fails, try the next node on the list
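The four steps above can be sketched as a small loop. This is not the Kubernetes scheduler code; the Node type, its fields, and the "most free CPU first" priority are assumptions chosen only to make the flow concrete.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Node is a hypothetical, minimal view of a worker node.
type Node struct {
	Name      string
	FreeCPU   int64 // allocatable minus used
	AdmitsPod bool  // stands in for the node's own admission policy
}

// schedule follows the four steps: filter, prioritize, try, fall back.
func schedule(nodes []Node, cpuRequest int64) (string, error) {
	// Step 1: keep only the nodes that can fit the request.
	var fits []Node
	for _, n := range nodes {
		if n.FreeCPU >= cpuRequest {
			fits = append(fits, n)
		}
	}
	// Step 2: prioritize; most free CPU first (an arbitrary choice here).
	sort.SliceStable(fits, func(i, j int) bool {
		return fits[i].FreeCPU > fits[j].FreeCPU
	})
	// Steps 3 and 4: try each node in order until one admits the pod.
	for _, n := range fits {
		if n.AdmitsPod {
			return n.Name, nil
		}
	}
	return "", errors.New("no node could run the pod")
}

func main() {
	nodes := []Node{
		{Name: "worker-1", FreeCPU: 2, AdmitsPod: true},
		{Name: "worker-2", FreeCPU: 8, AdmitsPod: false},
		{Name: "worker-3", FreeCPU: 4, AdmitsPod: true},
	}
	name, err := schedule(nodes, 3)
	fmt.Println(name, err)
}
```

In this example worker-1 is filtered out, worker-2 ranks first but rejects the pod at admission, and the pod lands on worker-3.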
Find nodes that fit
For simple scheduling, a node will NOT fit if Allocatable < Request + Used
Example:
    // Record one failure per resource whose request does not fit on the node.
    if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
    }
    if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
    }
    if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
        predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU, podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
    }
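The excerpt above depends on scheduler internals. A self-contained sketch of the same rule (a resource fails when Allocatable < Request + Used) might look like the following; the Resource struct, the function name, and the returned labels are assumptions for illustration, not the Kubernetes types.

```go
package main

import "fmt"

// Resource is a hypothetical flat view of the three quantities checked above.
type Resource struct {
	MilliCPU  int64
	Memory    int64
	NvidiaGPU int64
}

// insufficientResources returns a label for every resource where
// allocatable < request + used. An empty result means the pod fits.
func insufficientResources(allocatable, used, request Resource) []string {
	var failed []string
	if allocatable.MilliCPU < request.MilliCPU+used.MilliCPU {
		failed = append(failed, "cpu")
	}
	if allocatable.Memory < request.Memory+used.Memory {
		failed = append(failed, "memory")
	}
	if allocatable.NvidiaGPU < request.NvidiaGPU+used.NvidiaGPU {
		failed = append(failed, "nvidia-gpu")
	}
	return failed
}

func main() {
	const GiB = int64(1) << 30
	allocatable := Resource{MilliCPU: 8000, Memory: 64 * GiB, NvidiaGPU: 4}
	used := Resource{MilliCPU: 2000, Memory: 16 * GiB, NvidiaGPU: 3}
	request := Resource{MilliCPU: 1000, Memory: 8 * GiB, NvidiaGPU: 2}
	// Only the GPU check fails: 4 < 2 + 3.
	fmt.Println(insufficientResources(allocatable, used, request))
}
```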
Why do we need modifications?
The existing scheduler only allows constraints like the following in a pod spec:
- Need 4 GPUs
It does NOT allow constraints like the following in a pod spec:
- Need 4 GPUs with minimum memory 12GiB, OR
- Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
- Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)
Solution 1
Label nodes and use a node selector
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
However, this is not optimal in cases with heterogeneous configurations:
- For example, one machine may have GPUs of several types, some with large amounts of memory and some with small
- If a label is used, there is no way to know which GPUs will get assigned, so the node can only be labeled with its least capable GPU
- Even in homogeneous configurations, the kubelet running on each worker node has to do the bookkeeping of which GPUs are in use
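To make the limitation concrete, here is a minimal sketch of node-selector matching, where a node qualifies only if it carries every key-value pair in the selector. The label key gpu-memory and its values are hypothetical.

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels contain every key-value
// pair of the pod's node selector (exact string equality).
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical node with both 12GiB and 4GiB GPUs. Because the label
	// cannot say which GPU a pod will get, it advertises the smaller one.
	node := map[string]string{"gpu-memory": "4GiB"}

	fmt.Println(matchesSelector(node, map[string]string{"gpu-memory": "4GiB"}))
	fmt.Println(matchesSelector(node, map[string]string{"gpu-memory": "12GiB"}))
}
```

A pod that needs a 12GiB GPU is rejected by this node even though the node physically has one, which is the inefficiency described above.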
Solution 2: Group Scheduler
Define a richer syntax on ResourceLists to allow such constraints to be scheduled
Example: instead of
    NvidiaGPU: 2
use something like the following, where the memory for each GPU is clearly specified:
    Gpu/0/cards: 1
    Gpu/0/memory: 12GiB
    Gpu/1/cards: 1
    Gpu/1/memory: 6GiB
The “cards” key is present to prevent sharing of GPU cards
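One way to read such path-style keys is to treat everything before the last "/" as the device and the final segment as an attribute of that device. The sketch below groups a flat list by device under that assumption; the function name and the int64 quantities are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// GiB is the number of bytes in one gibibyte.
const GiB int64 = 1 << 30

// groupByDevice splits keys such as "Gpu/0/memory" into a device path
// ("Gpu/0") and an attribute ("memory"). Flat keys like "cpu" are skipped.
func groupByDevice(rl map[string]int64) map[string]map[string]int64 {
	devices := make(map[string]map[string]int64)
	for key, qty := range rl {
		idx := strings.LastIndex(key, "/")
		if idx < 0 {
			continue // not a hierarchical key
		}
		dev, attr := key[:idx], key[idx+1:]
		if devices[dev] == nil {
			devices[dev] = make(map[string]int64)
		}
		devices[dev][attr] = qty
	}
	return devices
}

func main() {
	rl := map[string]int64{
		"Gpu/0/cards":  1,
		"Gpu/0/memory": 12 * GiB,
		"Gpu/1/cards":  1,
		"Gpu/1/memory": 6 * GiB,
	}
	for dev, attrs := range groupByDevice(rl) {
		fmt.Println(dev, "cards:", attrs["cards"], "memory:", attrs["memory"])
	}
}
```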
Example: GPU with NVLink
For 4 GPUs in two groups, where the GPUs within each group are connected to each other via NVLink:
    GpuGrp/0/Gpu/0/cards: 1
    GpuGrp/0/Gpu/0/memory: 12GiB
    GpuGrp/0/Gpu/1/cards: 1
    GpuGrp/0/Gpu/1/memory: 12GiB
    GpuGrp/1/Gpu/2/cards: 1
    GpuGrp/1/Gpu/2/memory: 8GiB
    GpuGrp/1/Gpu/3/cards: 1
    GpuGrp/1/Gpu/3/memory: 8GiB
[Diagram: GpuGrp0 contains Gpu0 and Gpu1; GpuGrp1 contains Gpu2 and Gpu3]
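The group membership encoded in these keys can be recovered mechanically. The following sketch assumes keys of exactly the form GpuGrp/<group>/Gpu/<gpu>/<attribute> and returns the GPUs in each group; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gpusByGroup parses keys like "GpuGrp/0/Gpu/1/memory" and returns, for each
// group ("GpuGrp/0"), the sorted list of GPUs it contains ("Gpu/1").
// Keys that do not have this five-segment shape are ignored.
func gpusByGroup(keys []string) map[string][]string {
	seen := make(map[string]bool)
	groups := make(map[string][]string)
	for _, key := range keys {
		parts := strings.Split(key, "/")
		if len(parts) != 5 || parts[0] != "GpuGrp" || parts[2] != "Gpu" {
			continue
		}
		group := parts[0] + "/" + parts[1]
		gpu := parts[2] + "/" + parts[3]
		// Each GPU appears once per attribute; record it only once.
		if !seen[group+"|"+gpu] {
			seen[group+"|"+gpu] = true
			groups[group] = append(groups[group], gpu)
		}
	}
	for g := range groups {
		sort.Strings(groups[g])
	}
	return groups
}

func main() {
	keys := []string{
		"GpuGrp/0/Gpu/0/cards", "GpuGrp/0/Gpu/0/memory",
		"GpuGrp/0/Gpu/1/cards", "GpuGrp/0/Gpu/1/memory",
		"GpuGrp/1/Gpu/2/cards", "GpuGrp/1/Gpu/2/memory",
		"GpuGrp/1/Gpu/3/cards", "GpuGrp/1/Gpu/3/memory",
	}
	groups := gpusByGroup(keys)
	fmt.Println("GpuGrp/0:", groups["GpuGrp/0"])
	fmt.Println("GpuGrp/1:", groups["GpuGrp/1"])
}
```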
Group scheduler
All resource lists (allocatable, used, and requests) are specified in this manner
Scheduling can no longer compare values with the same key to check “fit”, e.g.:
    allocatable[“memory”] < used[“memory”] + requested[“memory”]
Example
Requested (two GPUs with minimum memory 10GiB, NVLink not required):
    GpuGrp/A/Gpu/0/cards: 1
    GpuGrp/A/Gpu/0/memory: 10GiB
    GpuGrp/B/Gpu/1/cards: 1
    GpuGrp/B/Gpu/1/memory: 10GiB
Allocatable:
    GpuGrp/0/Gpu/0/cards: 1
    GpuGrp/0/Gpu/0/memory: 12GiB
    GpuGrp/0/Gpu/1/cards: 1
    GpuGrp/0/Gpu/1/memory: 12GiB
    GpuGrp/1/Gpu/2/cards: 1
    GpuGrp/1/Gpu/2/memory: 8GiB
    GpuGrp/1/Gpu/3/cards: 1
    GpuGrp/1/Gpu/3/memory: 8GiB
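A short sketch shows why the same-key comparison breaks down. The naiveFit function below is a hypothetical stand-in for the simple check; applied to the example, it reports no fit because the request keys (GpuGrp/A/...) never appear in the allocatable list, even though two 12GiB GPUs are free.

```go
package main

import "fmt"

// GiB is the number of bytes in one gibibyte.
const GiB int64 = 1 << 30

// naiveFit applies the simple rule key by key: the request fails if
// allocatable[key] < used[key] + requested[key] for any requested key.
// Missing keys read as zero, as with any Go map lookup.
func naiveFit(allocatable, used, requested map[string]int64) bool {
	for key, want := range requested {
		if allocatable[key] < used[key]+want {
			return false
		}
	}
	return true
}

func main() {
	requested := map[string]int64{
		"GpuGrp/A/Gpu/0/cards": 1, "GpuGrp/A/Gpu/0/memory": 10 * GiB,
		"GpuGrp/B/Gpu/1/cards": 1, "GpuGrp/B/Gpu/1/memory": 10 * GiB,
	}
	allocatable := map[string]int64{
		"GpuGrp/0/Gpu/0/cards": 1, "GpuGrp/0/Gpu/0/memory": 12 * GiB,
		"GpuGrp/0/Gpu/1/cards": 1, "GpuGrp/0/Gpu/1/memory": 12 * GiB,
		"GpuGrp/1/Gpu/2/cards": 1, "GpuGrp/1/Gpu/2/memory": 8 * GiB,
		"GpuGrp/1/Gpu/3/cards": 1, "GpuGrp/1/Gpu/3/memory": 8 * GiB,
	}
	used := map[string]int64{}
	// Reports false: no requested key exists in the allocatable list.
	fmt.Println(naiveFit(allocatable, used, requested))
}
```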
Group scheduler
- The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for “fit” and “allocation”
- “Allocation” is a string-to-string key-value map which specifies a mapping from “Requests” to “Allocatable”
Requested (two GPUs with minimum memory 10GiB, NVLink not required):
    GpuGrp/A/Gpu/0/cards: 1
    GpuGrp/A/Gpu/0/memory: 10GiB
    GpuGrp/B/Gpu/1/cards: 1
    GpuGrp/B/Gpu/1/memory: 10GiB
Allocatable:
    GpuGrp/0/Gpu/0/cards: 1
    GpuGrp/0/Gpu/0/memory: 12GiB
    GpuGrp/0/Gpu/1/cards: 1
    GpuGrp/0/Gpu/1/memory: 12GiB
    GpuGrp/1/Gpu/2/cards: 1
    GpuGrp/1/Gpu/2/memory: 8GiB
    GpuGrp/1/Gpu/3/cards: 1
    GpuGrp/1/Gpu/3/memory: 8GiB
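The sketch below is a deliberately simplified stand-in for the group allocator, not the hierarchical algorithm with scorers described above. It ignores group structure (the example request does not need NVLink), treats each GPU as exclusive (one card), and assigns the most demanding requests first to the first free GPU with enough memory. The GPU type and function name are assumptions, and the result maps request paths to allocatable paths rather than individual keys.

```go
package main

import (
	"fmt"
	"sort"
)

// GiB is the number of bytes in one gibibyte.
const GiB int64 = 1 << 30

// GPU is a hypothetical record for one requested or allocatable card.
type GPU struct {
	Path   string // e.g. "GpuGrp/0/Gpu/0"
	Memory int64  // minimum memory (request) or actual memory (allocatable)
}

// allocate maps every requested GPU to a distinct allocatable GPU with at
// least the requested memory. It returns false if some request cannot be met.
func allocate(requests, allocatable []GPU) (map[string]string, bool) {
	// Handle the largest memory requests first so that small requests
	// cannot take the only GPUs big enough for large ones.
	ordered := append([]GPU(nil), requests...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Memory > ordered[j].Memory
	})
	taken := make(map[string]bool)
	allocation := make(map[string]string)
	for _, req := range ordered {
		assigned := false
		for _, dev := range allocatable {
			if !taken[dev.Path] && dev.Memory >= req.Memory {
				taken[dev.Path] = true
				allocation[req.Path] = dev.Path
				assigned = true
				break
			}
		}
		if !assigned {
			return nil, false
		}
	}
	return allocation, true
}

func main() {
	requests := []GPU{
		{Path: "GpuGrp/A/Gpu/0", Memory: 10 * GiB},
		{Path: "GpuGrp/B/Gpu/1", Memory: 10 * GiB},
	}
	allocatable := []GPU{
		{Path: "GpuGrp/0/Gpu/0", Memory: 12 * GiB},
		{Path: "GpuGrp/0/Gpu/1", Memory: 12 * GiB},
		{Path: "GpuGrp/1/Gpu/2", Memory: 8 * GiB},
		{Path: "GpuGrp/1/Gpu/3", Memory: 8 * GiB},
	}
	allocation, ok := allocate(requests, allocatable)
	fmt.Println(ok, allocation)
}
```

For the example, both requests land on the two 12GiB GPUs in GpuGrp/0, since the 8GiB GPUs do not meet the 10GiB minimum.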
Group Allocation
Allocatable:
    Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1
Requests:
    Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
    Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
    Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
    Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
    Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
    Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1
[Diagram: Requests mapped onto Allocatable]
Main Modifications: scheduler side
- Addition of the AllocateFrom field in the pod specification. This is a list of key-value pairs which specifies the mapping from “Requests” to “Allocatable”
    pkg/api/types.go
- Addition of group scheduler code
    plugin/pkg/scheduler/algorithm/predicates/grpallocate.go
    plugin/pkg/scheduler/algorithm/scorer
- Modification in the scheduler to write a pod update after scheduling and to call the group allocator
    plugin/pkg/scheduler/generic_scheduler.go
    plugin/pkg/scheduler/scheduler.go
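As a sketch of what the AllocateFrom addition could look like, the following uses a hypothetical, trimmed-down container spec; the actual field definition lives in pkg/api/types.go of the modified tree and may differ. The recordAllocation helper is also an assumption, showing the scheduler writing its decision into the spec after allocation.

```go
package main

import "fmt"

// ContainerSpec is a hypothetical, trimmed-down container spec.
type ContainerSpec struct {
	Name     string
	Requests map[string]int64
	// AllocateFrom maps each request key to the allocatable key chosen
	// for it by the scheduler (assumed shape: string to string).
	AllocateFrom map[string]string
}

// recordAllocation copies the scheduler's decision into the spec. It rejects
// a mapping for any key that the container never requested.
func recordAllocation(c *ContainerSpec, allocation map[string]string) error {
	for reqKey := range allocation {
		if _, ok := c.Requests[reqKey]; !ok {
			return fmt.Errorf("allocation for unknown request %q", reqKey)
		}
	}
	c.AllocateFrom = make(map[string]string, len(allocation))
	for reqKey, allocKey := range allocation {
		c.AllocateFrom[reqKey] = allocKey
	}
	return nil
}

func main() {
	c := &ContainerSpec{
		Name:     "trainer",
		Requests: map[string]int64{"GpuGrp/A/Gpu/0/cards": 1},
	}
	err := recordAllocation(c, map[string]string{
		"GpuGrp/A/Gpu/0/cards": "GpuGrp/0/Gpu/1/cards",
	})
	fmt.Println(err, c.AllocateFrom)
}
```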
Kubelet modifications
- The existing multi-GPU code makes the kubelet keep track of which GPUs are available, and looks at /dev/nvidia* to count devices; both are hacks
- With the addition of the “AllocateFrom” field, the scheduler decides which GPUs to use and keeps track of which ones are in use
Main Modifications: kubelet side
- Use of AllocateFrom to decide which GPUs to use
- Use of nvidia-docker-plugin to find GPUs (instead of looking at /dev/nvidia*)
  - This is also needed to get richer information such as GPU memory, GPU type, and topology information (i.e. NVLink)
- Use of nvidia-docker-plugin to find the correct location for the Nvidia drivers inside the container (in conjunction with the nvidia-docker driver)
- Allow specification of a driver when specifying a mount; needed to use the nvidia-docker driver
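On the kubelet side, reading AllocateFrom could reduce to collecting the distinct devices named by its values. The sketch below assumes AllocateFrom is a string-to-string map whose values are allocatable keys ending in an attribute such as cards or memory; how a device path is then translated into an actual device for the container is not shown and is left to the nvidia-docker-plugin integration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// allocatedDevices returns the sorted, de-duplicated device paths referenced
// by the values of an AllocateFrom map, dropping the trailing attribute
// segment ("cards", "memory", ...).
func allocatedDevices(allocateFrom map[string]string) []string {
	seen := make(map[string]bool)
	var devices []string
	for _, allocKey := range allocateFrom {
		idx := strings.LastIndex(allocKey, "/")
		if idx < 0 {
			continue // not a hierarchical key
		}
		dev := allocKey[:idx]
		if !seen[dev] {
			seen[dev] = true
			devices = append(devices, dev)
		}
	}
	sort.Strings(devices)
	return devices
}

func main() {
	allocateFrom := map[string]string{
		"GpuGrp/A/Gpu/0/cards":  "GpuGrp/0/Gpu/1/cards",
		"GpuGrp/A/Gpu/0/memory": "GpuGrp/0/Gpu/1/memory",
		"GpuGrp/B/Gpu/1/cards":  "GpuGrp/1/Gpu/2/cards",
	}
	fmt.Println(allocatedDevices(allocateFrom))
}
```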
Integration with community
Eventual goal: kubelets know nothing about GPUs
[Diagram:
- Device plugins (e.g. GPU) on each worker: resources to advertise; resource usage / docker params
- Kubelets on Worker 1, Worker 2, ..., Worker N send “Allocatable” resources for their nodes
- Scheduler keeps track of “Used” and asks the scheduler extender for fit
- Scheduler extender performs group allocation and writes an update to the pod spec with the allocation
- Scheduling request: pod (container) spec with container requests]
Needed in Kubernetes core
We will need a few things in order to achieve separation from core, which will allow directly using the latest Kubernetes binaries:
- Resource Class, scheduled for v1.9, will allow non-identity mappings between requests and allocatable
- Device plugins and native Nvidia GPU support are scheduled for v1.13 for now
https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw
Other future Kubernetes/Scheduler work
- Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training)
  - For example, networking constraints for distributed training when scheduling
- Container networking for faster cross-pod communication (e.g. using RDMA / IB)