GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed

Presentation transcript:

GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
Tanya Amert, Nathan Otterness, Ming Yang, James H. Anderson, F. Donelson Smith
University of North Carolina at Chapel Hill

Motivation
Do we have any guarantees about execution order? Consider an autonomous system running several GPU workloads at once: climate control, steering, turtle detection.

The Challenge
Size, weight, and power (SWaP) constraints require embedded computing platforms, which limit processing power, so we must keep utilization as high as possible.
NVIDIA GPUs are treated as black boxes, yet they are used in safety-critical applications. Such devices must be certified, and certification requires a model of GPU execution that allows concurrent execution.

Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

CUDA Programming Model
A CUDA program's five steps:
1. Allocate GPU memory
2. Copy input data from CPU to GPU
3. Launch kernel
4. Copy results from GPU to CPU
5. Free GPU memory

CUDA Programming Model
A GPU program launches a kernel. A kernel is specified by its number of thread blocks and the number of threads per block.

CUDA Programming Model
Kernels are processed SIMD: each thread executes the same code but acts on different data. A thread determines which data to use from blockDim, blockIdx, and threadIdx. Note: a GPU thread is not an OS thread! We'll call OS threads "tasks."

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
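The index arithmetic in the vecAdd kernel above can be mimicked on the CPU. Below is a minimal Python sketch (illustrative only; real execution happens in CUDA on the GPU) of how each (block, thread) pair maps to one array element. The function name vec_add_sim and the parameter names grid_dim/block_dim are ours, standing in for CUDA's gridDim and blockDim:

```python
def vec_add_sim(A, B, grid_dim, block_dim):
    """Simulate the vecAdd kernel: each (block, thread) pair computes one C[i]."""
    C = [0] * (grid_dim * block_dim)
    for block_idx in range(grid_dim):           # blocks in the grid
        for thread_idx in range(block_dim):     # threads in each block
            i = block_dim * block_idx + thread_idx  # same index math as the kernel
            C[i] = A[i] + B[i]
    return C

A = list(range(8))
B = [10 * x for x in A]
# 2 blocks x 4 threads covers all 8 elements
print(vec_add_sim(A, B, grid_dim=2, block_dim=4))  # [0, 11, 22, 33, 44, 55, 66, 77]
```

On a GPU the two loops run in parallel across thread blocks and threads; the sequential loops here only show the index mapping.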

Ordering of GPU Operations
CUDA operations can be ordered by associating them with a stream: a FIFO queue of operations. But that is essentially all NVIDIA tells us. Questions:
- Can GPU operations in different streams run concurrently? They "may"...
- How are GPU operations from different streams ordered?
- How do streams differ? By default, all operations go to a single stream, the NULL stream.

Non-Goals vs. Goals
We are not trying to: certify GPUs; perform timing analysis of GPU-using systems; or improve utilization by modifying scheduling behavior. Our goal is to discover the rules of GPU scheduling needed to build a model of GPU execution.

Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

Scheduling Rules
Questions:
- Can GPU operations in different streams run concurrently? They "may"...
- How are GPU operations from different streams ordered?
- How do streams differ? By default, all operations go to the NULL stream.
Goal: provide rules governing GPU scheduling behavior. We consider only CPU tasks within one address space, and focus first on user-defined streams.

Experimental Setup – NVIDIA Jetson TX2
Kernels are executed on the execution engine (EE), which is made up of multiple streaming multiprocessors (SMs); on the TX2, two SMs form the EE. GPU programs also submit copy operations to the GPU; copies are performed on a copy engine (CE), of which the TX2 has one.

Experimental Setup – Schedule Visualizations
Each SM has 128 cores and 2048 total GPU threads available.
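The 2048-threads-per-SM limit bounds how many blocks can be resident at once. A quick arithmetic check (Python, for illustration; the constants come from the TX2 slides, and the function name is ours):

```python
THREADS_PER_SM = 2048  # per-SM GPU thread limit on the TX2
NUM_SMS = 2            # the TX2's EE has two SMs

def max_resident_blocks(threads_per_block):
    """How many blocks of this size fit on the GPU at once (thread limit only)."""
    per_sm = THREADS_PER_SM // threads_per_block
    return per_sm * NUM_SMS

print(max_resident_blocks(1024))  # 4: a 6 x 1024 kernel cannot be fully resident
print(max_resident_blocks(512))   # 8
```

Real residency also depends on shared memory and registers (cf. rule R3 later); this sketch only checks the thread budget.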

Experimental Setup – Schedule Visualizations
K1: 1 x 1024 (one block of 1024 threads). Each rectangle is one block, labeled by stream; its left and right edges give the block's start and completion times, and an arrow gives the time the kernel was submitted to the GPU.

Experimental Setup – Schedule Visualizations
K1: 2 x 1024. These simple kernels spin for a configurable amount of time, which guarantees consistent runtimes.

Experimental Setup – Schedule Visualizations
K1: 3 x 1024. We say these blocks are assigned to the GPU; multiple blocks can run on an SM at one time.

Experimental Setup – Schedule Visualizations
K1: 6 x 1024. K1 is dispatched when at least one of its blocks is assigned, and fully dispatched once all of its blocks have been assigned.

Experiment #1: single stream
Kernels in the same stream should execute in FIFO order.

Experiment #1: single stream
K1: 6 x 1024; K2: 2 x 512. Let's try it! Stream S1 holds K1 then K2; the EE queue holds K1. Blocks K1: 0-3 run across SM 0 and SM 1. The dotted line indicates the time of the queue snapshot.

Experiment #1: single stream
K1: 6 x 1024; K2: 2 x 512. Blocks K1: 4-5 run next; K2 remains queued behind K1 in S1.

Experiment #1: single stream
K1: 6 x 1024; K2: 2 x 512. Once K1 completes, K2's blocks 0-1 run: FIFO order within a stream is confirmed.
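The single-stream schedule just shown can be sketched as a toy simulation (Python, illustrative; the function name is ours). It simplifies reality in two ways we flag explicitly: it pools both SMs' threads into one budget, and it assumes each batch of blocks finishes together before the next is assigned:

```python
def fifo_schedule(stream, gpu_threads=4096):
    """Simulate one stream: kernels run in FIFO order, and a kernel's blocks
    are assigned in batches limited by the free GPU threads (TX2: 2 SMs x 2048)."""
    timeline = []
    for name, blocks, threads in stream:       # submission (FIFO) order
        pending = list(range(blocks))
        while pending:
            batch_size = max(1, gpu_threads // threads)
            batch, pending = pending[:batch_size], pending[batch_size:]
            timeline.append((name, batch))     # these blocks run together
    return timeline

# Experiment #1: K1 is 6 blocks x 1024 threads, K2 is 2 blocks x 512 threads.
for entry in fifo_schedule([("K1", 6, 1024), ("K2", 2, 512)]):
    print(entry)
# ('K1', [0, 1, 2, 3])
# ('K1', [4, 5])
# ('K2', [0, 1])
```

The output mirrors the visualization: K1's blocks 0-3, then 4-5, then K2's blocks, never interleaved across kernels of the same stream.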

Experiment #2: multiple streams
What if we submit kernels from multiple streams? The documentation says they may run concurrently...

Experiment #2: multiple streams
K1: 6 x 1024; K2: 2 x 512; K3: 2 x 512. Stream S1 holds K1 then K2; stream S2 is empty so far. Blocks K1: 0-3 are assigned.

Experiment #2: multiple streams
K3 is submitted to stream S2 while K1 runs, and joins the EE queue behind K1.

Experiment #2: multiple streams
Blocks K1: 4-5 and K3: 0-1 run together. K3 is dispatched before K2 because it is earlier in the EE queue.

Experiment #2: multiple streams
K3 must have entered the EE queue while K1 was running; K2 could not, because it was not yet at the head of its stream queue.

Experiment #2: multiple streams
K2 does not enter the EE queue until K1 completes; only then do K2's blocks run.
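The ordering observed here (K1, then K3, then K2) follows from the general rules G1-G4 presented later in the talk. A minimal Python model (function name ours; kernels run one at a time for simplicity, so concurrency is abstracted away and only ordering is captured):

```python
from collections import deque

def run_streams(streams):
    """Model rules G1-G4: per-stream FIFO queues feed one EE queue.

    A kernel enters the EE queue only at the head of its stream queue (G2);
    it leaves its stream queue only when all its blocks complete (G4), so a
    later kernel in the same stream waits even if the GPU has room.
    """
    stream_qs = {s: deque(ks) for s, ks in streams.items()}
    ee_queue, in_ee, order = deque(), set(), []
    while any(stream_qs.values()):
        # G2: heads of stream queues enter the EE queue (in submission order)
        for q in stream_qs.values():
            if q and q[0] not in in_ee:
                ee_queue.append(q[0])
                in_ee.add(q[0])
        k = ee_queue.popleft()   # head of EE queue is dispatched, runs to completion
        order.append(k)
        for q in stream_qs.values():
            if q and q[0] == k:  # G4: finished kernel leaves its stream queue
                q.popleft()
    return order

# S1 submits K1 then K2; S2 submits K3 while K1 runs.
print(run_streams({"S1": ["K1", "K2"], "S2": ["K3"]}))  # ['K1', 'K3', 'K2']
```

K2 cannot enter the EE queue until K1 finishes and leaves S1, so K3 (already queued) runs first, matching the experiment.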

Experiment #3: cut-ahead?
Can a kernel cut ahead of a partially-dispatched kernel?

Experiment #3: cut-ahead?
K1: 6 x 768; K2: 2 x 512; K3: 2 x 512. Blocks K1: 0-3 are assigned, leaving 512 free threads on each SM: K3 fits in that space.

Experiment #3: cut-ahead?
K3 enters the EE queue behind K1. K3 fits in the free space, but isn't dispatched yet...

Experiment #3: cut-ahead?
Blocks K1: 1 and K1: 3 finished before K1: 0 and K1: 2, freeing room for blocks K1: 4-5 and K3: 0-1 to run.

Experiment #3: cut-ahead?
Let's take a step back in time...

Experiment #3: cut-ahead?
While blocks K1: 0-3 run, the EE queue holds K1 (partially dispatched) followed by K3.

Experiment #3: cut-ahead?
Only once all of K1's blocks are assigned do K3's blocks join K1's remaining blocks on the GPU: no cut-ahead occurred.

Experiment #3: cut-ahead?
As before, K2 runs last, after K1 completes.
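The no-cut-ahead behavior is what rule X1 (below) formalizes: free capacity alone is not enough. A small Python sketch of one assignment pass (function name and data layout are ours; only the thread budget of rule R2 is modeled, not shared memory):

```python
def eligible_blocks(ee_queue, free_threads):
    """Rule X1: only blocks of the kernel at the head of the EE queue are
    eligible for assignment, even when a later kernel's blocks would fit."""
    head = ee_queue[0]
    assigned = []
    for sm, free in enumerate(free_threads):
        while head["pending"] > 0 and free >= head["block_threads"]:
            assigned.append((head["name"], sm))   # assign one block of the head kernel
            free -= head["block_threads"]
            head["pending"] -= 1
    # Later kernels in ee_queue (e.g. K3) are ignored here, even if they fit.
    return assigned

# K1: 6 blocks x 768 threads; K3: 2 blocks x 512 threads, behind K1 in the EE queue.
k1 = {"name": "K1", "block_threads": 768, "pending": 6}
k3 = {"name": "K3", "block_threads": 512, "pending": 2}
out = eligible_blocks([k1, k3], free_threads=[2048, 2048])
print(out)            # [('K1', 0), ('K1', 0), ('K1', 1), ('K1', 1)]
print(k1["pending"])  # 2
```

Each SM fits two 768-thread blocks of K1, leaving 512 free threads per SM — enough for a K3 block, yet none is assigned while K1 remains at the head of the EE queue.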

Rules So Far
General (4):
G1: Kernels are enqueued on the associated stream queue.
G2: A kernel is enqueued on the EE queue when it reaches the head of its stream queue.
G3: A kernel at the head of the EE queue is dequeued from the EE queue when it becomes fully dispatched.
G4: A kernel is dequeued from its stream queue once all of its blocks complete execution.
Resource Requirements (4):
X1: Only blocks of the kernel at the head of the EE queue are eligible to be assigned.
R1: A block of the kernel at the head of the EE queue is eligible to be assigned only if its resource constraints are met.
R2: A block of the kernel at the head of the EE queue is eligible to be assigned only if there are sufficient thread resources available on some SM.
R3: A block of the kernel at the head of the EE queue is eligible to be assigned only if there are sufficient shared-memory resources available on some SM.
Copy Operations (4):
C1: A copy operation is enqueued on the CE queue when it reaches the head of its stream queue.
C2: A copy operation at the head of the CE queue is eligible to be assigned to the CE.
C3: A copy operation at the head of the CE queue is dequeued from the CE queue once the copy is assigned to the CE on the GPU.
C4: A copy operation is dequeued from its stream queue once the CE has completed the copy.

Full Experiment (see paper)

Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
- Prioritized streams
- NULL stream
Future Work

Experiment #4: low-priority starvation
Can high-priority kernels starve low-priority kernels?

Experiment #4: low-priority starvation
Low priority: K1: 8 x 1024. High priority: K2: 4 x 1024; K3: 4 x 1024. K1 starts first: blocks K1: 0-3 fill both SMs.

Experiment #4: low-priority starvation
K2 and K3 arrive in the high-priority EE queue. Once blocks K1: 0-3 complete, K2's blocks run instead of K1's remaining blocks.

Experiment #4: low-priority starvation
K3's blocks run next; K1 is still waiting.

Experiment #4: low-priority starvation
Only after both high-priority kernels finish do blocks K1: 4-7 run: K1 is starved by multiple higher-priority streams' kernels.
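This starvation behavior can be sketched with a toy dispatcher (Python, illustrative; the function name and event encoding are ours) that always serves the high-priority EE queue first, in the spirit of the prioritized-stream rules stated later in the talk:

```python
from collections import deque

def dispatch_order(events):
    """Serve the high-priority EE queue whenever it is non-empty; the
    low-priority queue runs only when all higher-priority queues are empty,
    so newly arriving high-priority kernels can starve it indefinitely."""
    high, low = deque(), deque()
    order = []
    for step in events:
        for name, prio, blocks in step:       # kernels submitted at this step
            (high if prio == "high" else low).append([name, blocks])
        q = high if high else low             # higher-priority queue wins
        if q:
            k = q[0]
            order.append(k[0])                # assign one block of the head kernel
            k[1] -= 1
            if k[1] == 0:
                q.popleft()                   # fully dispatched
    return order

# K1 (low, 3 blocks) starts alone; K2 and K3 (high, 2 blocks each) then arrive
# and monopolize the GPU until they finish, starving K1.
steps = [[("K1", "low", 3)], [("K2", "high", 2)], [("K3", "high", 2)], [], [], [], []]
print(dispatch_order(steps))  # ['K1', 'K2', 'K2', 'K3', 'K3', 'K1', 'K1']
```

If high-priority kernels keep arriving, K1's remaining blocks are postponed without bound — the unbounded-response-time concern raised at the end of the talk.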

Experiment #5: NULL stream
How does the NULL stream interact with user-defined streams' kernels?

Experiment #5: NULL stream
User-defined streams and the NULL stream feed into one EE queue.

Experiment #5: NULL stream
K1: 2 x 1024; K3: 2 x 1024; NULL stream: K2: 1 x 1024. K1 is launched first, and its blocks 0-1 run.

Experiment #5: NULL stream
K2 fits in the free space, but isn't dispatched yet, and K3 is not even in the EE queue.

Experiment #5: NULL stream
Only after K1 completes does K2, the NULL-stream kernel, run.

Experiment #5: NULL stream
K3 runs only after K2 completes. K1 and K3 could have executed concurrently, but the NULL-stream kernel launched between them prevented it.
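The serialization in this experiment follows the NULL-stream rules N1/N2 stated later in the talk. A minimal Python sketch (the function name admissible is ours) of which stream-queue heads may enter the EE queue at a given moment:

```python
def admissible(pending):
    """Which pending kernels may enter the EE queue right now, under N1/N2?
    `pending` lists (stream, kernel) pairs in launch order; each entry is the
    head of its stream queue, and finished kernels have been removed."""
    allowed = []
    for i, (stream, kern) in enumerate(pending):
        earlier = pending[:i]                          # unfinished, launched earlier
        if stream == "NULL":
            ok = not earlier                           # N1: wait for all earlier kernels
        else:
            ok = all(s != "NULL" for s, _ in earlier)  # N2: wait for the NULL stream
        if ok:
            allowed.append(kern)
    return allowed

# Launch order: K1 (stream S1), K2 (NULL stream), K3 (stream S2).
print(admissible([("S1", "K1"), ("NULL", "K2"), ("S2", "K3")]))  # ['K1']
print(admissible([("NULL", "K2"), ("S2", "K3")]))                # ['K2']
print(admissible([("S2", "K3")]))                                # ['K3']
# Without the NULL-stream kernel, K1 and K3 would be admissible together:
print(admissible([("S1", "K1"), ("S2", "K3")]))                  # ['K1', 'K3']
```

The last call shows the lost concurrency: remove K2 and both user-stream kernels can enter the EE queue at once.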

Extended Rules
In addition to the general (4), resource-requirement (4), and copy-operation (4) rules above:
NULL Stream (2):
N1: A kernel Kk at the head of the NULL stream queue is enqueued on the EE queue when, for each other stream queue, either that queue is empty or the kernel at its head was launched after Kk.
N2: A kernel Kk at the head of a non-NULL stream queue cannot be enqueued on the EE queue unless the NULL stream queue is either empty or the kernel at its head was launched after Kk.
Prioritized Streams (2):
A1: A kernel can only be enqueued on the EE queue matching the priority of its stream.
A2: A block of a kernel at the head of any EE queue is eligible to be assigned only if all higher-priority EE queues are empty.

Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

Future Work
We plan to extend our rules to include more complex behavior and explore sources of implicit synchronization. For example: why didn't K3 and K4 run in this schedule? An API call caused the GPU to wait for a synchronization point.

Future Work
Our rules will lead to a new model for GPU program execution.

Summary
Contributions: rules for GPU execution; an extended experimentation framework.
Next steps: extend the rules for complex scenarios; investigate synchronization effects.
https://github.com/yalue/cuda_scheduling_examiner_mirror

Experiment #4: prioritized streams
What happens with streams of different priorities?

Experiment #4: prioritized streams
There are multiple EE queues, one per priority level. If a priority is not specified, a stream defaults to low priority.

Experiment #4: prioritized streams
Low priority: K1: 8 x 1024. High priority: K2: 2 x 1024. K1 starts first: blocks K1: 0-3 fill both SMs.

Experiment #4: prioritized streams
When K2 arrives, K1 is preempted (between blocks) by K2: blocks K1: 4-5 and K2: 0-1 run together.

Experiment #4: prioritized streams
After K2 completes, K1's remaining blocks run to completion.

NULL Stream Rules
N1: NULL-stream kernels wait for prior kernels in other streams.
N2: Kernels in non-NULL streams wait for NULL-stream kernels.
Here, "waiting" means not being put on the EE queue. K3 would have fit on the GPU, but K2 is a NULL-stream kernel.

NULL Stream Rules
N1: NULL-stream kernels wait for prior kernels in other streams.
N2: Kernels in non-NULL streams wait for NULL-stream kernels.
K5 would have fit on the GPU, but NULL-stream kernels cannot run concurrently with others.

NULL Stream Rules
N1: A kernel Kk at the head of the NULL stream queue is enqueued on the EE queue when, for each other stream queue, either that queue is empty or the kernel at its head was launched after Kk.
N2: A kernel Kk at the head of a non-NULL stream queue cannot be enqueued on the EE queue unless the NULL stream queue is either empty or the kernel at its head was launched after Kk.

Prioritized Stream Rules
A1: Each EE queue matches a priority level.
A2: The GPU chooses blocks from EE queues by priority.
There are multiple EE queues, one per priority level; K1 is preempted (between blocks) by K2.

Prioritized Stream Rules
A1: Each EE queue matches a priority level.
A2: The GPU chooses blocks from EE queues by priority.
Infinite starvation implies unbounded response times: K1 is starved by multiple higher-priority streams' kernels.
