Presentation transcript: "GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed"

1 GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
Tanya Amert, Nathan Otterness, Ming Yang, James H. Anderson, F. Donelson Smith
University of North Carolina at Chapel Hill

2 Motivation
Do we have any guarantees about execution order?
(Diagram: example workloads — Climate Control, Steering, Turtle Detection.)

3 The Challenge
Size, weight, and power (SWaP) constraints require embedded computing platforms, which limit processing power. We must keep utilization as high as possible.

4 The Challenge
Size, weight, and power (SWaP) constraints require embedded computing platforms, which limit processing power. We must keep utilization as high as possible.
NVIDIA GPUs are treated as black boxes, yet they are used in safety-critical applications. These devices must be certified, so we need a model of GPU execution that allows concurrent execution.

5 Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

6 CUDA Programming Model
A CUDA program's 5 steps:
1. Allocate GPU memory
2. Copy data from CPU to GPU
3. Launch kernel
4. Copy results from GPU to CPU
5. Free GPU memory
(Diagram: input and output data moving between CPU and GPU.)
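
A minimal host-side sketch of these five steps, assuming the vecAdd kernel shown on slide 13; the runVecAdd wrapper, the 256-thread block size, and the requirement that n be a multiple of 256 are our choices, and error checking is omitted:

    #include <cuda_runtime.h>

    __global__ void vecAdd(int *A, int *B, int *C);         // defined on slide 13

    void runVecAdd(int *hA, int *hB, int *hC, int n) {
        int *dA, *dB, *dC;
        size_t bytes = n * sizeof(int);
        cudaMalloc(&dA, bytes);                             // 1. Allocate GPU memory
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // 2. Copy data CPU -> GPU
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        vecAdd<<<n / 256, 256>>>(dA, dB, dC);               // 3. Launch kernel
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // 4. Copy results GPU -> CPU
        cudaFree(dA); cudaFree(dB); cudaFree(dC);           // 5. Free GPU memory
    }

The synchronous cudaMemcpy in step 4 also implicitly waits for the kernel launched in step 3, since both were issued to the same (default) stream.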

7–11 CUDA Programming Model
(Animation frames stepping through the five steps above; their text is identical to slide 6.)

12 CUDA Programming Model
A GPU program launches a kernel. Kernels are specified by the number of thread blocks and the number of threads per block.
(Diagram: a kernel is made up of blocks; blocks are made up of threads.)

13 CUDA Programming Model
Kernels are processed in SIMD fashion: each thread acts on different data. Threads determine their data using blockDim, blockIdx, and threadIdx.
Note: a GPU thread is not an OS thread! We'll call OS threads "tasks".

    __global__ void vecAdd(int *A, int *B, int *C) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

(Diagram: a kernel is made up of blocks; blocks are made up of threads.)
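
A hypothetical launch of this kernel, showing how the block count follows from the problem size (n and threadsPerBlock are our example values; dA, dB, dC are device buffers as on slide 6):

    int n = 1 << 20;
    int threadsPerBlock = 256;
    int blocks = n / threadsPerBlock;                  // n is a multiple of 256 here
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC);   // 4096 blocks x 256 threads

As written on the slide, vecAdd has no bounds check; a production kernel would typically round the block count up with (n + threadsPerBlock - 1) / threadsPerBlock and guard the body with if (i < n).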

14 Ordering of GPU Operations
CUDA operations can be ordered by associating them with a stream. A stream is a FIFO queue of operations. But that's all NVIDIA tells us…
Questions:
- Can GPU operations in different streams run concurrently? They "may"…
- How are GPU operations from different streams ordered?
- How do streams differ? Operations default to a single stream, the NULL stream.
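
A sketch of how operations are associated with a user-defined stream; kernelA, kernelB, the buffers, and the launch dimensions are hypothetical placeholders:

    cudaStream_t s1;
    cudaStreamCreate(&s1);
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s1);
    kernelA<<<grid, block, 0, s1>>>(dA);   // runs after the copy completes
    kernelB<<<grid, block, 0, s1>>>(dA);   // runs after kernelA: FIFO within a stream
    cudaMemcpyAsync(hA, dA, bytes, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s1);             // CPU blocks until the whole sequence finishes
    cudaStreamDestroy(s1);

Within s1 the four operations execute in FIFO order; what the documentation leaves open is how they interleave with operations in other streams.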

15 Non-Goals vs. Goals
We are not trying to:
- certify GPUs.
- perform timing analysis of GPU-using systems.
- improve utilization by modifying scheduling behavior.
Our goal is to:
- discover the rules of GPU scheduling needed to build a model for GPU execution.

16 Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

17 Scheduling Rules
Questions:
- Can GPU operations in different streams run concurrently? They "may"…
- How are GPU operations from different streams ordered?
- How do streams differ? Operations default to a single stream, the NULL stream.
Goal: provide rules governing GPU scheduling behavior.
- Consider only CPU tasks within one address space.
- Focus on user-defined streams.

18 Experimental Setup – NVIDIA Jetson TX2
Kernels are executed on the execution engine (EE), which is made up of multiple streaming multiprocessors (SMs); on the TX2, two SMs form the EE.
GPU programs also submit copy operations to the GPU. Copies are performed on a copy engine (CE); the TX2 has one CE.
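
These engine counts can be queried at runtime with the CUDA runtime API; a small sketch, where the commented values are the ones this deck reports for the TX2:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs (EE width): %d\n", prop.multiProcessorCount);          // 2 on the TX2
        printf("Threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);  // 2048
        printf("Copy engines:   %d\n", prop.asyncEngineCount);             // 1 on the TX2
        return 0;
    }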

19 Experimental Setup – Schedule Visualizations
Each SM has 128 cores and 2048 total GPU threads available.

20 Experimental Setup – Schedule Visualizations
K1: 1 x 1024 (one block of 1024 threads)
(Diagram: each rectangle is one block, labeled by stream and sized by its 1024 threads; an arrow gives the time the kernel was submitted to the GPU, and the rectangle's edges mark block start and block completion.)

21 Experimental Setup – Schedule Visualizations
K1: 2 x 1024
These simple kernels spin for a configurable amount of time, which guarantees consistent runtimes.
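
A sketch of such a spin kernel; the name spin is ours, and cyclesFor is a hypothetical helper that converts a desired runtime to GPU clock cycles (clock64() reads the per-SM cycle counter):

    __global__ void spin(long long cycles) {
        long long start = clock64();          // per-SM cycle counter
        while (clock64() - start < cycles) {
            // busy-wait so every block runs for a predictable duration
        }
    }

    // e.g., the configuration in this visualization:
    // spin<<<2, 1024>>>(cyclesFor(1000 /* ms, hypothetical helper */));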

22 Experimental Setup – Schedule Visualizations
K1: 3 x 1024
We say these blocks are assigned to the GPU. Multiple blocks can run on an SM at one time.

23 Experimental Setup – Schedule Visualizations
K1: 6 x 1024
K1 is dispatched when at least one of its blocks is assigned; it is fully dispatched once all of its blocks have been assigned.

24 Experiment #1: single stream
Kernels in the same stream should execute in FIFO order.

25 Experiment #1: single stream
Kernels in the same stream should execute in FIFO order. Let's try it! K1: 6 x 1024; K2: 2 x 512.
(Diagram: blocks K1:0 through K1:3 fill SM 0 and SM 1; stream S1 holds K1 then K2; the EE queue holds K1. The dotted line indicates the time of the queue snapshot.)

26 Experiment #1: single stream
(Diagram: K1's last blocks, K1:4 and K1:5, execute; K2 still waits behind K1 in stream S1.)

27 Experiment #1: single stream
(Diagram: K2:0 and K2:1 execute after K1 completes; stream S1 now holds only K2.)

28 Experiment #2: multiple streams
What if we submit kernels from multiple streams? The documentation says they may run concurrently…

29 Experiment #2: multiple streams
K1: 6 x 1024; K2: 2 x 512; K3: 2 x 512.
(Diagram: blocks K1:0 through K1:3 fill both SMs; stream S1 holds K1 then K2; stream S2 is still empty; the EE queue holds K1.)

30 Experiment #2: multiple streams
(Diagram: K3 arrives in stream S2 while K1 runs and joins the EE queue behind K1.)

31 Experiment #2: multiple streams
(Diagram: K1:4, K1:5, K3:0, and K3:1 execute together.)
K3 is dispatched before K2 because it is earlier in the EE queue.

32 Experiment #2: multiple streams
K3 must have entered the EE queue while K1 was running.

33 Experiment #2: multiple streams
(Diagram: K2:0 and K2:1 execute last.)
K2 does not enter the EE queue until K1 completes.
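
A hypothetical reproduction of this experiment's submission order, using the spin kernel sketched earlier (longSpin and shortSpin are assumed cycle counts):

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    spin<<<6, 1024, 0, s1>>>(longSpin);   // K1: fills both SMs
    spin<<<2,  512, 0, s1>>>(shortSpin);  // K2: queued behind K1 in stream S1
    spin<<<2,  512, 0, s2>>>(shortSpin);  // K3: at the head of stream S2
    // K3 enters the EE queue immediately, while K2 must wait in S1 until
    // all of K1's blocks complete -- hence K3 executes before K2.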

34 Experiment #3: cut-ahead?
Can a kernel cut ahead of a partially-dispatched kernel?

35 Experiment #3: cut-ahead?
Can a kernel cut ahead of a partially-dispatched kernel? K1: 6 x 768; K2: 2 x 512; K3: 2 x 512.
(Diagram: blocks K1:0 through K1:3 occupy the SMs, with leftover capacity where K3 fits.)

36 Experiment #3: cut-ahead?
(Diagram: K3 joins the EE queue behind K1. K3 fits in the leftover capacity, but isn't dispatched yet…)

37 Experiment #3: cut-ahead?
(Diagram: blocks K1:1 and K1:3 finished before K1:0 and K1:2; K3:0, K3:1, K1:4, and K1:5 now execute.)

38 Experiment #3: cut-ahead?
Let's take a step back in time…

39 Experiment #3: cut-ahead?
(Diagram: the EE queue holds K1 then K3; with K1 at the head and not yet fully dispatched, only K1's blocks are eligible for assignment.)

40 Experiment #3: cut-ahead?
(Diagram: once K1 is fully dispatched, K3's blocks are assigned alongside K1's remaining blocks.)

41 Experiment #3: cut-ahead?
(Diagram: K2:0 and K2:1 execute last, after K1 completes.)

42 Rules So Far
General (4):
G1: Kernels are enqueued on the associated stream queue.
G2: A kernel is enqueued on the EE queue when it reaches the head of its stream queue.
G3: A kernel at the head of the EE queue is dequeued from the EE queue when it becomes fully dispatched.
G4: A kernel is dequeued from its stream queue once all of its blocks complete execution.
Resource Requirements (4):
X1: Only blocks of the kernel at the head of the EE queue are eligible to be assigned.
R1: A block of the kernel at the head of the EE queue is eligible to be assigned only if its resource constraints are met.
R2: A block of the kernel at the head of the EE queue is eligible to be assigned only if there are sufficient thread resources available on some SM.
R3: A block of the kernel at the head of the EE queue is eligible to be assigned only if there are sufficient shared-memory resources available on some SM.
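
These rules can be made concrete with a toy host-side model. This is our sketch of the queue structure, not NVIDIA's code; it reduces R1–R3 to a per-SM thread budget and leaves block completion (and thus G4) to a surrounding simulation loop:

    #include <deque>
    #include <vector>

    struct Kernel { int blocksLeft; int threadsPerBlock; bool inEE = false; };

    struct GpuModel {
        std::vector<std::deque<Kernel*>> streamQ;  // one FIFO per stream (G1)
        std::deque<Kernel*> eeQ;                   // the single EE queue
        std::vector<int> smFree{2048, 2048};       // free threads per SM (TX2)

        void step() {
            for (auto &q : streamQ)                // G2: stream-queue heads join the EE queue
                if (!q.empty() && !q.front()->inEE) {
                    q.front()->inEE = true;
                    eeQ.push_back(q.front());
                }
            if (eeQ.empty()) return;
            Kernel *k = eeQ.front();               // X1: only the EE head's blocks
            for (int &freeThreads : smFree)        // R2: per-SM thread budget
                while (k->blocksLeft > 0 && freeThreads >= k->threadsPerBlock) {
                    freeThreads -= k->threadsPerBlock;
                    --k->blocksLeft;               // assign one block
                }
            if (k->blocksLeft == 0)
                eeQ.pop_front();                   // G3: fully dispatched
            // G4 (dequeue from the stream queue once all blocks complete)
            // would be driven by the simulated completion of blocks.
        }
    };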

43 Rules So Far
General (4) and Resource Requirements (4): as on the previous slide.
Copy Operations (4):
C1: A copy operation is enqueued on the CE queue when it reaches the head of its stream queue.
C2: A copy operation at the head of the CE queue is eligible to be assigned to the CE.
C3: A copy operation at the head of the CE queue is dequeued from the CE queue once the copy is assigned to the CE on the GPU.
C4: A copy operation is dequeued from its stream queue once the CE has completed the copy.

44 Full Experiment (see paper)

45 Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
- Prioritized streams
- NULL stream
Future Work

46 Experiment #4: low-priority starvation
Can high-priority kernels starve low-priority kernels?

47 Experiment #4: low-priority starvation
Low: K1: 8 x 1024. High: K2: 4 x 1024; K3: 4 x 1024.
(Diagram: K1:0 through K1:3 execute; stream S1 feeds K1 into the low-priority EE queue, while streams S2 and S3 feed K2 and K3 into the high-priority EE queue.)

48 Experiment #4: low-priority starvation
(Diagram: K2's four blocks take over both SMs; K3 waits behind K2 in the high-priority EE queue.)

49 Experiment #4: low-priority starvation
(Diagram: K3's four blocks execute next.)

50 Experiment #4: low-priority starvation
(Diagram: K1's remaining blocks, K1:4 through K1:7, finally execute.)
K1 is starved by multiple higher-priority streams' kernels.

51 Experiment #5: NULL stream
How does the NULL stream interact with user-defined kernels?

52 Experiment #5: NULL stream
User-defined streams and the NULL stream all feed into one EE queue.
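
The launch pattern behind this experiment, sketched with the hypothetical spin kernel from earlier (s1 and s2 are user-defined streams created as before, d is an assumed spin duration, and the 1024-thread block size matches the slide's block counts):

    spin<<<2, 1024, 0, s1>>>(d);  // K1 -> user stream S1
    spin<<<1, 1024>>>(d);         // K2 -> NULL stream (no stream argument)
    spin<<<2, 1024, 0, s2>>>(d);  // K3 -> user stream S2
    // K2 waits for K1 (rule N1 below), and K3 waits for K2 (rule N2),
    // so K1 and K3 never overlap even though both SMs had room.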

53 Experiment #5: NULL stream
K1: 2 x 1024; K3: 2 x 1024; NULL: K2: 1 x 1024.
(Diagram: K1:0 and K1:1 execute, one per SM; stream S1 held K1.)

54 Experiment #5: NULL stream
(Diagram: K2 fits in the free capacity, but isn't dispatched yet… and K3 is not in the EE queue.)

55 Experiment #5: NULL stream
(Diagram: K2:0 executes alone after K1 completes.)

56 Experiment #5: NULL stream
(Diagram: K3:0 and K3:1 execute last.)
K1 and K3 could have executed concurrently.

57 Extended Rules
General (4), Resource Requirements (4), Copy Operations (4): as on slide 43.
NULL Stream (2):
N1: A kernel Kk at the head of the NULL stream queue is enqueued on the EE queue when, for each other stream queue, either that queue is empty or the kernel at its head was launched after Kk.
N2: A kernel Kk at the head of a non-NULL stream queue cannot be enqueued on the EE queue unless the NULL stream queue is either empty or the kernel at its head was launched after Kk.
Prioritized Streams (2):
A1: A kernel can only be enqueued on the EE queue matching the priority of its stream.
A2: A block of a kernel at the head of any EE queue is eligible to be assigned only if all higher-priority EE queues are empty.

58 Outline
Motivation
CUDA Fundamentals
GPU Scheduling Rules
Extensions to Rules
Future Work

59 Future Work
We plan to extend our rules to include more complex behavior and explore sources of implicit synchronization.
(Diagram: Why didn't K3 and K4 run? An API call caused the GPU to wait for a synchronization point.)
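
One well-known source of implicit synchronization, shown as a hedged example (process, s1, iters, bytes, grid, and block are hypothetical): device memory allocation and freeing can synchronize the whole device, serializing otherwise-independent streams.

    for (int i = 0; i < iters; ++i) {
        float *tmp;
        cudaMalloc(&tmp, bytes);               // can block until the GPU idles
        process<<<grid, block, 0, s1>>>(tmp);  // hypothetical kernel
        cudaFree(tmp);                         // can synchronize the whole device
    }
    // Allocating once before the loop and reusing the buffer avoids
    // these per-iteration synchronization points.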

60 Future Work
We plan to extend our rules to include more complex behavior and explore sources of implicit synchronization. Our rules will lead to a new model for GPU program execution.

61 Summary
Contributions:
- Rules for GPU execution
- Extended experimentation framework: cuda_scheduling_examiner_mirror
Next steps:
- Extend the rules for complex scenarios
- Investigate synchronization effects

62 (blank slide; backup slides follow)

63 Experiment #4: prioritized streams
What happens with streams of different priorities?

64 Experiment #4: prioritized streams
There are multiple EE queues, one per priority level. If a stream's priority is not specified, it defaults to low.
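
Stream priorities are requested at creation time through the CUDA runtime API; a short sketch (note that numerically lower values mean higher priority):

    int leastPri, greatestPri;
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);
    cudaStream_t lowS, highS;
    cudaStreamCreateWithPriority(&lowS,  cudaStreamNonBlocking, leastPri);
    cudaStreamCreateWithPriority(&highS, cudaStreamNonBlocking, greatestPri);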

65 Experiment #4: prioritized streams
Low: K1: 8 x 1024. High: K2: 2 x 1024.
(Diagram: K1:0 through K1:3 execute; stream S1 feeds K1 into the low-priority EE queue, stream S2 feeds K2 into the high-priority EE queue.)

66 Experiment #4: prioritized streams
(Diagram: K2:0 and K2:1 execute alongside K1:4 and K1:5.)
K1 is preempted (between blocks) by K2.

67 Experiment #4: prioritized streams
(Diagram: K1's remaining blocks, K1:6 and K1:7, execute after K2 completes.)

68 NULL Stream Rules
N1: NULL stream kernels wait for prior kernels in other streams.
N2: Non-NULL stream kernels wait for NULL stream kernels.
"Waiting" here means not being put on the EE queue.
(Diagram: K3 would have fit here, but K2 is a NULL stream kernel.)

69 NULL Stream Rules
N1: NULL stream kernels wait for prior kernels in other streams.
N2: Non-NULL stream kernels wait for NULL stream kernels.
"Waiting" here means not being put on the EE queue.
(Diagram: K5 would have fit here, but NULL stream kernels cannot run concurrently with others.)

70 NULL Stream Rules
N1: A kernel Kk at the head of the NULL stream queue is enqueued on the EE queue when, for each other stream queue, either that queue is empty or the kernel at its head was launched after Kk.
N2: A kernel Kk at the head of a non-NULL stream queue cannot be enqueued on the EE queue unless the NULL stream queue is either empty or the kernel at its head was launched after Kk.

71 Prioritized Stream Rules
A1: The EE queue must match the stream's priority level.
A2: The GPU chooses from the EE queues by priority.
There are multiple EE queues, one per priority level.
(Diagram: K1 is preempted (between blocks) by K2.)

72 Prioritized Stream Rules
A1: The EE queue must match the stream's priority level.
A2: The GPU chooses from the EE queues by priority.
Infinite starvation implies unbounded response times.
(Diagram: K1 is starved by multiple higher-priority streams' kernels.)

73 Extended Rules (identical to slide 57)

