Kubernetes Modifications for GPUs


Sanjeev Mehrotra

Kubernetes resource scheduling

Terminology:
- Allocatable – what is available at a node
- Used – what is already being used from a node (called RequestedResource in the code)
- Requests – what is requested by the container(s) for the pod

The kubelet on each worker (Worker 1 … Worker N) sends its node's "Allocatable" resources to the scheduler, which keeps track of "Used". A scheduling request consists of the pod (container) spec, which carries the container requests.

Resources

All resources (allocatable, used, and requests) are represented as a "ResourceList", which is simply a list of key-value pairs, e.g.

- memory: 64GiB
- cpu: 8
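As a concrete illustration, here is a minimal, stand-alone sketch of a ResourceList and the per-key fit check used in simple scheduling. Kubernetes itself uses api.ResourceList (a map from resource name to resource.Quantity); the plain string keys and int64 quantities below are simplifications for this sketch.

```go
package main

import "fmt"

// ResourceList models a resource list as key-value pairs.
type ResourceList map[string]int64

// fitsOn is the simple per-key fit check: Allocatable >= Used + Requests.
func fitsOn(allocatable, used, requests ResourceList) bool {
	for key, req := range requests {
		if allocatable[key] < used[key]+req {
			return false
		}
	}
	return true
}

func main() {
	allocatable := ResourceList{"memory": 64 << 30, "cpu": 8000} // 64GiB; 8 cores in millicores
	used := ResourceList{"memory": 16 << 30, "cpu": 2000}
	requests := ResourceList{"memory": 8 << 30, "cpu": 1000}
	fmt.Println(fitsOn(allocatable, used, requests)) // true
}
```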

Simple scheduling

1. Find worker nodes that can "fit" the pod spec (plugin/pkg/scheduler/algorithm/predicates)
2. Prioritize the list of nodes (plugin/pkg/scheduler/algorithm/priorities)
3. Try to schedule the pod on a node – the node may have an additional admission policy, so the pod may still fail
4. If it fails, try the next node on the list
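A minimal, stand-alone sketch of the four steps above, with hypothetical fits() and admit() callbacks standing in for the predicate and admission checks and a precomputed score standing in for the priority functions (the real code lives under plugin/pkg/scheduler):

```go
package main

import (
	"fmt"
	"sort"
)

type node struct {
	name  string
	score int // priority score, higher is better
}

func schedule(pod string, nodes []node,
	fits func(pod string, n node) bool,
	admit func(pod string, n node) bool) (string, error) {
	// 1. Filter: keep only nodes the pod fits on (predicates).
	var feasible []node
	for _, n := range nodes {
		if fits(pod, n) {
			feasible = append(feasible, n)
		}
	}
	// 2. Prioritize: order feasible nodes by score (priorities).
	sort.Slice(feasible, func(i, j int) bool { return feasible[i].score > feasible[j].score })
	// 3./4. Try nodes in order; node-level admission may still reject.
	for _, n := range feasible {
		if admit(pod, n) {
			return n.name, nil
		}
	}
	return "", fmt.Errorf("no node admitted pod %s", pod)
}

func main() {
	nodes := []node{{"worker1", 2}, {"worker2", 5}}
	name, err := schedule("pod-a", nodes,
		func(string, node) bool { return true },
		func(string, node) bool { return true })
	fmt.Println(name, err) // worker2 <nil>
}
```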

Find nodes that fit

For simple scheduling, a node will NOT fit if Allocatable < Requests + Used. Example:

```go
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU,
		podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory,
		podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU,
		podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
}
```

Why do we need modifications?

The existing scheduler only allows constraints like the following in a pod spec:
- Need 4 GPUs

It does NOT allow constraints like the following:
- Need 4 GPUs with minimum memory 12GiB, OR
- Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
- Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)

Solution 1 – Label nodes and use the node selector

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

However, this is not optimal in heterogeneous configurations:
- One machine may have GPUs of several types, some with large amounts of memory and some with small.
- If a label is used, we don't know which GPUs will get assigned, so the node can only be labeled with its minimally performant GPU.
- Even in homogeneous configurations, the kubelet running on each worker node still has to do the bookkeeping of which GPUs are in use.

Solution 2 – Group scheduler

Define a richer syntax on ResourceLists that allows such constraints to be scheduled. For example, instead of:

- NvidiaGPU: 2

use something like the following, where the memory of each GPU is clearly specified:

- Gpu/0/cards: 1
- Gpu/0/memory: 12GiB
- Gpu/1/cards: 1
- Gpu/1/memory: 6GiB

The "cards" resource is present to prevent sharing of GPU cards.

Example – GPUs with NVLink

For 4 GPUs in two groups, the GPUs within each group connected to one another via NVLink (GpuGrp0: Gpu0, Gpu1; GpuGrp1: Gpu2, Gpu3):

- GpuGrp/0/Gpu/0/cards: 1
- GpuGrp/0/Gpu/0/memory: 12GiB
- GpuGrp/0/Gpu/1/cards: 1
- GpuGrp/0/Gpu/1/memory: 12GiB
- GpuGrp/1/Gpu/2/cards: 1
- GpuGrp/1/Gpu/2/memory: 8GiB
- GpuGrp/1/Gpu/3/cards: 1
- GpuGrp/1/Gpu/3/memory: 8GiB

Group scheduler

With all resource lists (allocatable, used, and requests) specified in this manner, scheduling can no longer simply compare values with the same key to check "fit", e.g.:

allocatable["memory"] < used["memory"] + requested["memory"]

Example – Requested (two GPUs with minimum memory 10GiB each; NVLink not required, hence the distinct groups A and B):

- GpuGrp/A/Gpu/0/cards: 1
- GpuGrp/A/Gpu/0/memory: 10GiB
- GpuGrp/B/Gpu/1/cards: 1
- GpuGrp/B/Gpu/1/memory: 10GiB

Allocatable:

- GpuGrp/0/Gpu/0/cards: 1
- GpuGrp/0/Gpu/0/memory: 12GiB
- GpuGrp/0/Gpu/1/cards: 1
- GpuGrp/0/Gpu/1/memory: 12GiB
- GpuGrp/1/Gpu/2/cards: 1
- GpuGrp/1/Gpu/2/memory: 8GiB
- GpuGrp/1/Gpu/3/cards: 1
- GpuGrp/1/Gpu/3/memory: 8GiB

The request keys do not match the allocatable keys, so the scheduler has to search for a valid mapping between the two.

Group scheduler

The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for "fit" and "allocation". An "allocation" is a string-to-string key-value mapping from "Requests" keys to "Allocatable" keys – for the request and allocatable lists above, e.g. GpuGrp/A/Gpu/0/* maps to GpuGrp/0/Gpu/0/* and GpuGrp/B/Gpu/1/* maps to GpuGrp/0/Gpu/1/*, since only the 12GiB GPUs satisfy the 10GiB minimum.

Group Allocation

Allocatable:

- Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1

Requests:

- Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
- Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
- Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
- Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
- Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
- Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1

The two levels of grouping (Gpugrp1, Gpugrp0) in the request tree are matched recursively against the groups in the allocatable tree.
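To make the matching concrete, below is a much-simplified, stand-alone sketch of the allocation step: it flattens the hierarchy to a list of GPUs with memory sizes, ignores group (NVLink) constraints and the scorer plumbing, and greedily maps each request key to a free GPU. All type and function names are illustrative assumptions, not the real grpallocate.go code.

```go
package main

import "fmt"

// gpu is a flattened view of one allocatable GPU on a node.
type gpu struct {
	key    string // e.g. "GpuGrp/0/Gpu/0"
	memGiB int64  // allocatable GPU memory
	used   bool
}

// allocate greedily maps each request key (with a minimum-memory demand)
// to a free GPU, preferring the smallest GPU that fits so larger GPUs stay
// available for more demanding requests (the role a scorer plays in the
// group scheduler). It returns an AllocateFrom-style mapping, or false if
// the node cannot fit the request.
func allocate(requests map[string]int64, avail []gpu) (map[string]string, bool) {
	alloc := make(map[string]string)
	for reqKey, minMem := range requests {
		best := -1
		for i, g := range avail {
			if !g.used && g.memGiB >= minMem &&
				(best < 0 || g.memGiB < avail[best].memGiB) {
				best = i
			}
		}
		if best < 0 {
			return nil, false // no free GPU satisfies this request
		}
		avail[best].used = true
		alloc[reqKey] = avail[best].key
	}
	return alloc, true
}

func main() {
	avail := []gpu{
		{"GpuGrp/0/Gpu/0", 12, false},
		{"GpuGrp/0/Gpu/1", 12, false},
		{"GpuGrp/1/Gpu/2", 8, false},
		{"GpuGrp/1/Gpu/3", 8, false},
	}
	// Two GPUs with at least 10GiB each, NVLink not required.
	requests := map[string]int64{
		"GpuGrp/A/Gpu/0": 10,
		"GpuGrp/B/Gpu/1": 10,
	}
	alloc, ok := allocate(requests, avail)
	fmt.Println(ok, alloc) // true; both requests map to the 12GiB GPUs in GpuGrp/0
}
```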

Main modifications – scheduler side

- Addition of an AllocateFrom field in the pod specification: a list of key-value pairs which specify the mapping from "Requests" to "Allocatable" (pkg/api/types.go) – see the sketch below
- Addition of the group scheduler code (plugin/pkg/scheduler/algorithm/predicates/grpallocate.go, plugin/pkg/scheduler/algorithm/scorer)
- Modification of the scheduler to write the pod update after scheduling and to call the group allocator (plugin/pkg/scheduler/generic_scheduler.go, plugin/pkg/scheduler/scheduler.go)
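A sketch of what the addition to pkg/api/types.go might look like; the exact field and type shape here is an assumption for illustration:

```go
package api

// AllocateFrom maps each raw resource requested by a container (the key,
// e.g. "GpuGrp/A/Gpu/0/cards") to the allocatable resource on the node it
// was bound to (the value, e.g. "GpuGrp/0/Gpu/0/cards").
type AllocateFrom map[string]string

type PodSpec struct {
	// ... existing pod spec fields ...

	// AllocateFrom is written back by the group scheduler after allocation,
	// so the kubelet no longer has to decide which devices to hand out.
	AllocateFrom AllocateFrom
}
```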

Kubelet modifications

The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and it looks at /dev/nvidia* to count devices – both of which are hacks. With the addition of the "AllocateFrom" field, the scheduler decides which GPUs to use and keeps track of which ones are in use.

Main modifications – kubelet side

- Use of AllocateFrom to decide which GPUs to use
- Use of nvidia-docker-plugin to find GPUs (instead of looking at /dev/nvidia*); this is also needed to get richer information such as GPU memory, GPU type, and topology information (i.e. NVLink) – see the sketch below
- Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver)
- Allow specification of a driver when specifying a mount – needed to use the nvidia-docker driver
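nvidia-docker-plugin 1.x exposes a small REST API for exactly this kind of query; the port (3476) and endpoint path (/gpu/info/json) below are its commonly documented defaults, but treat both as assumptions in this sketch:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Query nvidia-docker-plugin instead of globbing /dev/nvidia*.
	resp, err := http.Get("http://localhost:3476/gpu/info/json")
	if err != nil {
		fmt.Println("nvidia-docker-plugin not reachable:", err)
		return
	}
	defer resp.Body.Close()

	// The JSON payload includes per-GPU memory, model, and topology
	// (e.g. NVLink); printed raw here for brevity.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```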

Integration with community

Eventual goal: kubelets know nothing about GPUs. Device plugins (e.g. for GPUs) advertise the resources and report resource usage / docker parameters; the kubelet on each worker (Worker 1 … Worker N) sends its node's "Allocatable" resources to the scheduler; the scheduler keeps track of "Used" and asks a scheduler extender whether a pod (container) spec with its container requests fits. The extender performs the group allocation and writes the update to the pod spec with the allocation.
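A bare-bones sketch of the extender's filter endpoint. The real extender protocol uses the scheduler API's ExtenderArgs / ExtenderFilterResult types; the trimmed-down structs, route, and port below are assumptions kept minimal for illustration:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Trimmed-down stand-ins for the scheduler extender request/response types.
type extenderArgs struct {
	Pod       json.RawMessage `json:"pod"`
	NodeNames []string        `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filter receives candidate nodes for a pod and returns the ones that fit.
// The group allocation would run here; this sketch passes every node through.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderFilterResult{
		NodeNames:   args.NodeNames,
		FailedNodes: map[string]string{},
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	http.ListenAndServe(":9998", nil)
}
```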

Needed in Kubernetes core

We will need a few things in order to achieve separation from the core, which will allow us to directly use the latest Kubernetes binaries:
- Resource Class, scheduled for v1.9, will allow for non-identity mappings between requests and allocatable
- Device plugins and native NVIDIA GPU support are at v1.13 for now

https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw

Other future Kubernetes/scheduler work

- Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training) – for example, networking constraints when scheduling distributed training
- Container networking for faster cross-pod communication (e.g. using RDMA / InfiniBand)