Kubernetes Modifications for GPUs


Sanjeev Mehrotra

Kubernetes resource scheduling

Terminology:
- Allocatable – what is available at a node
- Used – what is already being used from a node (called RequestedResource in the code)
- Requests – what is requested by the container(s) for the pod

The kubelet on each worker (Worker 1 … Worker N) sends its node's "Allocatable" resources to the scheduler, which keeps track of "Used". A scheduling request consists of the pod (container) spec, which carries the container requests.

Resources

All resources (allocatable, used, and requests) are represented as a "ResourceList", which is simply a list of key-value pairs, e.g.

- memory: 64GiB
- cpu: 8
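As a concrete illustration, here is a minimal, stand-alone sketch of a ResourceList and the per-key fit check used in simple scheduling. Kubernetes itself uses api.ResourceList (a map from resource name to resource.Quantity); the plain string keys and int64 quantities below are simplifications for this sketch.

```go
package main

import "fmt"

// ResourceList models a resource list as key-value pairs.
type ResourceList map[string]int64

// fitsOn is the simple per-key fit check: Allocatable >= Used + Requests.
func fitsOn(allocatable, used, requests ResourceList) bool {
	for key, req := range requests {
		if allocatable[key] < used[key]+req {
			return false
		}
	}
	return true
}

func main() {
	allocatable := ResourceList{"memory": 64 << 30, "cpu": 8000} // 64GiB; 8 cores in millicores
	used := ResourceList{"memory": 16 << 30, "cpu": 2000}
	requests := ResourceList{"memory": 8 << 30, "cpu": 1000}
	fmt.Println(fitsOn(allocatable, used, requests)) // true
}
```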

Simple scheduling

1. Find worker nodes that can "fit" the pod spec (plugin/pkg/scheduler/algorithm/predicates)
2. Prioritize the list of nodes (plugin/pkg/scheduler/algorithm/priorities)
3. Try to schedule the pod on a node – the node may have an additional admission policy, so the pod may still fail
4. If it fails, try the next node on the list
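A minimal, stand-alone sketch of the four steps above, with hypothetical fits() and admit() callbacks standing in for the predicate and admission checks and a precomputed score standing in for the priority functions (the real code lives under plugin/pkg/scheduler):

```go
package main

import (
	"fmt"
	"sort"
)

type node struct {
	name  string
	score int // priority score, higher is better
}

func schedule(pod string, nodes []node,
	fits func(pod string, n node) bool,
	admit func(pod string, n node) bool) (string, error) {
	// 1. Filter: keep only nodes the pod fits on (predicates).
	var feasible []node
	for _, n := range nodes {
		if fits(pod, n) {
			feasible = append(feasible, n)
		}
	}
	// 2. Prioritize: order feasible nodes by score (priorities).
	sort.Slice(feasible, func(i, j int) bool { return feasible[i].score > feasible[j].score })
	// 3./4. Try nodes in order; node-level admission may still reject.
	for _, n := range feasible {
		if admit(pod, n) {
			return n.name, nil
		}
	}
	return "", fmt.Errorf("no node admitted pod %s", pod)
}

func main() {
	nodes := []node{{"worker1", 2}, {"worker2", 5}}
	name, err := schedule("pod-a", nodes,
		func(string, node) bool { return true },
		func(string, node) bool { return true })
	fmt.Println(name, err) // worker2 <nil>
}
```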

Find nodes that fit

For simple scheduling, a node will NOT fit if Allocatable < Requests + Used. Example:

```go
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU,
		podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory,
		podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU,
		podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
}
```

Why do we need modifications?

The existing scheduler only allows constraints like the following in a pod spec:
- Need 4 GPUs

It does NOT allow constraints like the following:
- Need 4 GPUs with minimum memory 12GiB, OR
- Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
- Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)

Solution 1 – Label nodes and use the node selector

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

However, this is not optimal in heterogeneous configurations:
- One machine may have GPUs of several types, some with large amounts of memory and some with small.
- If a label is used, we don't know which GPUs will get assigned, so the node can only be labeled with its minimally performant GPU.
- Even in homogeneous configurations, the kubelet running on each worker node still has to do the bookkeeping of which GPUs are in use.

Solution 2 – Group scheduler

Define a richer syntax on ResourceLists that allows such constraints to be scheduled. For example, instead of:

- NvidiaGPU: 2

use something like the following, where the memory of each GPU is clearly specified:

- Gpu/0/cards: 1
- Gpu/0/memory: 12GiB
- Gpu/1/cards: 1
- Gpu/1/memory: 6GiB

The "cards" resource is present to prevent sharing of GPU cards.

Example – GPUs with NVLink

For 4 GPUs in two groups, the GPUs within each group connected to one another via NVLink (GpuGrp0: Gpu0, Gpu1; GpuGrp1: Gpu2, Gpu3):

- GpuGrp/0/Gpu/0/cards: 1
- GpuGrp/0/Gpu/0/memory: 12GiB
- GpuGrp/0/Gpu/1/cards: 1
- GpuGrp/0/Gpu/1/memory: 12GiB
- GpuGrp/1/Gpu/2/cards: 1
- GpuGrp/1/Gpu/2/memory: 8GiB
- GpuGrp/1/Gpu/3/cards: 1
- GpuGrp/1/Gpu/3/memory: 8GiB

Group scheduler

With all resource lists (allocatable, used, and requests) specified in this manner, scheduling can no longer simply compare values with the same key to check "fit", e.g.:

allocatable["memory"] < used["memory"] + requested["memory"]

Example – Requested (two GPUs with minimum memory 10GiB each; NVLink not required, hence the distinct groups A and B):

- GpuGrp/A/Gpu/0/cards: 1
- GpuGrp/A/Gpu/0/memory: 10GiB
- GpuGrp/B/Gpu/1/cards: 1
- GpuGrp/B/Gpu/1/memory: 10GiB

Allocatable:

- GpuGrp/0/Gpu/0/cards: 1
- GpuGrp/0/Gpu/0/memory: 12GiB
- GpuGrp/0/Gpu/1/cards: 1
- GpuGrp/0/Gpu/1/memory: 12GiB
- GpuGrp/1/Gpu/2/cards: 1
- GpuGrp/1/Gpu/2/memory: 8GiB
- GpuGrp/1/Gpu/3/cards: 1
- GpuGrp/1/Gpu/3/memory: 8GiB

The request keys do not match the allocatable keys, so the scheduler has to search for a valid mapping between the two.

Group scheduler

The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for "fit" and "allocation". An "allocation" is a string-to-string key-value mapping from "Requests" keys to "Allocatable" keys – for the request and allocatable lists above, e.g. GpuGrp/A/Gpu/0/* maps to GpuGrp/0/Gpu/0/* and GpuGrp/B/Gpu/1/* maps to GpuGrp/0/Gpu/1/*, since only the 12GiB GPUs satisfy the 10GiB minimum.

Group Allocation

Allocatable:

- Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1

Requests:

- Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
- Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
- Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
- Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
- Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
- Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1

The two levels of grouping (Gpugrp1, Gpugrp0) in the request tree are matched recursively against the groups in the allocatable tree.
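To make the matching concrete, below is a much-simplified, stand-alone sketch of the allocation step: it flattens the hierarchy to a list of GPUs with memory sizes, ignores group (NVLink) constraints and the scorer plumbing, and greedily maps each request key to a free GPU. All type and function names are illustrative assumptions, not the real grpallocate.go code.

```go
package main

import "fmt"

// gpu is a flattened view of one allocatable GPU on a node.
type gpu struct {
	key    string // e.g. "GpuGrp/0/Gpu/0"
	memGiB int64  // allocatable GPU memory
	used   bool
}

// allocate greedily maps each request key (with a minimum-memory demand)
// to a free GPU, preferring the smallest GPU that fits so larger GPUs stay
// available for more demanding requests (the role a scorer plays in the
// group scheduler). It returns an AllocateFrom-style mapping, or false if
// the node cannot fit the request.
func allocate(requests map[string]int64, avail []gpu) (map[string]string, bool) {
	alloc := make(map[string]string)
	for reqKey, minMem := range requests {
		best := -1
		for i, g := range avail {
			if !g.used && g.memGiB >= minMem &&
				(best < 0 || g.memGiB < avail[best].memGiB) {
				best = i
			}
		}
		if best < 0 {
			return nil, false // no free GPU satisfies this request
		}
		avail[best].used = true
		alloc[reqKey] = avail[best].key
	}
	return alloc, true
}

func main() {
	avail := []gpu{
		{"GpuGrp/0/Gpu/0", 12, false},
		{"GpuGrp/0/Gpu/1", 12, false},
		{"GpuGrp/1/Gpu/2", 8, false},
		{"GpuGrp/1/Gpu/3", 8, false},
	}
	// Two GPUs with at least 10GiB each, NVLink not required.
	requests := map[string]int64{
		"GpuGrp/A/Gpu/0": 10,
		"GpuGrp/B/Gpu/1": 10,
	}
	alloc, ok := allocate(requests, avail)
	fmt.Println(ok, alloc) // true; both requests map to the 12GiB GPUs in GpuGrp/0
}
```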

Main modifications – scheduler side

- Addition of an AllocateFrom field in the pod specification: a list of key-value pairs which specify the mapping from "Requests" to "Allocatable" (pkg/api/types.go) – see the sketch below
- Addition of the group scheduler code (plugin/pkg/scheduler/algorithm/predicates/grpallocate.go, plugin/pkg/scheduler/algorithm/scorer)
- Modification of the scheduler to write the pod update after scheduling and to call the group allocator (plugin/pkg/scheduler/generic_scheduler.go, plugin/pkg/scheduler/scheduler.go)
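A sketch of what the addition to pkg/api/types.go might look like; the exact field and type shape here is an assumption for illustration:

```go
package api

// AllocateFrom maps each raw resource requested by a container (the key,
// e.g. "GpuGrp/A/Gpu/0/cards") to the allocatable resource on the node it
// was bound to (the value, e.g. "GpuGrp/0/Gpu/0/cards").
type AllocateFrom map[string]string

type PodSpec struct {
	// ... existing pod spec fields ...

	// AllocateFrom is written back by the group scheduler after allocation,
	// so the kubelet no longer has to decide which devices to hand out.
	AllocateFrom AllocateFrom
}
```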

Kubelet modifications

The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and it looks at /dev/nvidia* to count devices – both of which are hacks. With the addition of the "AllocateFrom" field, the scheduler decides which GPUs to use and keeps track of which ones are in use.

Main modifications – kubelet side

- Use of AllocateFrom to decide which GPUs to use
- Use of nvidia-docker-plugin to find GPUs (instead of looking at /dev/nvidia*); this is also needed to get richer information such as GPU memory, GPU type, and topology information (i.e. NVLink) – see the sketch below
- Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver)
- Allow specification of a driver when specifying a mount – needed to use the nvidia-docker driver
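nvidia-docker-plugin 1.x exposes a small REST API for exactly this kind of query; the port (3476) and endpoint path (/gpu/info/json) below are its commonly documented defaults, but treat both as assumptions in this sketch:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Query nvidia-docker-plugin instead of globbing /dev/nvidia*.
	resp, err := http.Get("http://localhost:3476/gpu/info/json")
	if err != nil {
		fmt.Println("nvidia-docker-plugin not reachable:", err)
		return
	}
	defer resp.Body.Close()

	// The JSON payload includes per-GPU memory, model, and topology
	// (e.g. NVLink); printed raw here for brevity.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```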

Integration with community

Eventual goal: kubelets know nothing about GPUs. Device plugins (e.g. for GPUs) advertise the resources and report resource usage / docker parameters; the kubelet on each worker (Worker 1 … Worker N) sends its node's "Allocatable" resources to the scheduler; the scheduler keeps track of "Used" and asks a scheduler extender whether a pod (container) spec with its container requests fits. The extender performs the group allocation and writes the update to the pod spec with the allocation.
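A bare-bones sketch of the extender's filter endpoint. The real extender protocol uses the scheduler API's ExtenderArgs / ExtenderFilterResult types; the trimmed-down structs, route, and port below are assumptions kept minimal for illustration:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Trimmed-down stand-ins for the scheduler extender request/response types.
type extenderArgs struct {
	Pod       json.RawMessage `json:"pod"`
	NodeNames []string        `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filter receives candidate nodes for a pod and returns the ones that fit.
// The group allocation would run here; this sketch passes every node through.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderFilterResult{
		NodeNames:   args.NodeNames,
		FailedNodes: map[string]string{},
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	http.ListenAndServe(":9998", nil)
}
```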

Needed in Kubernetes core

We will need a few things in order to achieve separation from the core, which will allow us to directly use the latest Kubernetes binaries:
- Resource Class, scheduled for v1.9, will allow for non-identity mappings between requests and allocatable
- Device plugins and native NVIDIA GPU support are at v1.13 for now

https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw

Other future Kubernetes/scheduler work

- Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training) – for example, networking constraints when scheduling distributed training
- Container networking for faster cross-pod communication (e.g. using RDMA / InfiniBand)