
1 (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted

2 (2) Objectives Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size). Have a basic, high-level understanding of instruction fetch, decode, issue, and memory access.

3 (3) Reading Assignment Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3. CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract. Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version at http://lxkarthi.github.io/cuda-calculator/

4 (4) Recap __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) C_d[i] = A_d[i] + B_d[i]; } __host__ void vecAdd(float *A_d, float *B_d, float *C_d, int n) { dim3 DimGrid(ceil(n/256.0), 1, 1); dim3 DimBlock(256, 1, 1); vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n); } [Figure: the kernel is decomposed into thread blocks Blk 0 … Blk N-1, which are scheduled onto the GPU multiprocessors M0 … Mk and access device RAM.] © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
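For context, a minimal self-contained host-side sketch of the same launch (error checking omitted; the kernel follows the slide, while the allocation and copy code is added here only for completeness):

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

void vecAdd(float *A_h, float *B_h, float *C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc(&A_d, size);  cudaMalloc(&B_d, size);  cudaMalloc(&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    dim3 DimBlock(256, 1, 1);
    dim3 DimGrid((n + 255) / 256, 1, 1);   // integer form of ceil(n/256)
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);

    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);  cudaFree(B_d);  cudaFree(C_d);
}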

5 (5) Kernel Launch [Figure: the host processor issues commands through CUDA streams, which map to hardware queues in the device's Kernel Management Unit.] Commands from the host are issued through streams: kernels in the same stream execute sequentially, while kernels in different streams may execute concurrently. Streams are mapped to hardware queues in the device's kernel management unit (KMU); multiple streams mapped to the same queue serialize some kernels. Kernel launch (dispatch) distributes thread blocks to the SMs.
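As a concrete illustration, a minimal sketch of launching kernels into separate streams; the kernels and sizes here are illustrative, not from the slides:

#include <cuda_runtime.h>

// Two trivial kernels used only to illustrate stream behavior.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

void launch_in_streams(float *x_d, float *y_d, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + 255) / 256);
    // Same stream => sequential; different streams => may run concurrently.
    kernelA<<<grid, block, 0, s0>>>(x_d, n);
    kernelB<<<grid, block, 0, s1>>>(y_d, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}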

6 (6) TBs, Warps, & Scheduling Imagine one TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels. [Figure: thread blocks being distributed across SMX0 … SMX12, each SMX with its own register file, cores, and L1 cache/shared memory.]

7 (7) TBs, Warps, & Utilization One TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels. [Figure: as on slide 6.]

8 (8) TBs, Warps, & Utilization One TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels. [Figure: as on slide 6.]

9 (9) TBs, Warps, & Utilization One TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels. Remaining TBs are queued. [Figure: as on slide 6.]

10 (10) NVIDIA GK110 (Kepler) Thread Block Scheduler Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

11 (11) SMX Organization Multiple warp schedulers; 192 cores (6 clusters of 32 cores each); 64K 32-bit registers. Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

12 (12) Execution Sequencing in a Single SM Cycle-level view of the execution of a GPU kernel. [Figure: SM pipeline — I-Fetch, I-Buffer, Decode, Issue from the pool of pending warps (e.g., Warp 1, Warp 2, Warp 6), register file (RF/PRF), multiple scalar pipelines, D-Cache access (all hit? / miss?), Writeback.] Execution sequencing?

13 (13) Example: VectorAdd on GPU CUDA: __global__ void vector_add(float *a, float *b, float *c, int N) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index < N) c[index] = a[index] + b[index]; } PTX (assembly): setp.lt.s32 %p, %r5, %rd4; //r5 = index, rd4 = N @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; //r6 = &a[index] ld.global.f32 %f2, [%r7]; //r7 = &b[index] add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; //r8 = &c[index] L2: ret;

14 (14) Example: VectorAdd on GPU N = 8, 8 threads, 1 block, warp size = 4; 1 SM, 4 cores. Pipeline: Fetch: one instruction from each warp, round-robin through all warps. Execution: in-order execution within warps, with proper data forwarding; 1 cycle per stage. How many warps? (8 threads / 4 threads per warp = 2 warps.)
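The warp count is just the block size divided by the warp size, rounded up; a tiny sketch of that calculation with this example's numbers (variable names are illustrative):

#include <cstdio>

int main() {
    int threadsPerBlock = 8, warpSize = 4;          // values from this example
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    printf("%d warps per block\n", warpsPerBlock);  // prints: 2 warps per block
    return 0;
}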

15 (15) Execution Sequence [Figure: pipeline with FE and DE stages feeding four scalar pipelines, each with EXE, MEM, and WB stages; two warps, Warp0 and Warp1, execute the PTX below.] setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; Each following frame repeats this listing and diagram and advances execution by one cycle; only the instructions in flight are noted.

16 (16) Execution Sequence (cont.) In flight: setp W0

17 (17) Execution Sequence (cont.) In flight: setp W1, setp W0

18 (18) Execution Sequence (cont.) In flight: bra W0, setp W1, setp W0

19 (19) Execution Sequence (cont.) In flight: @p bra W1, @p bra W0, setp W1, setp W0

20 (20) Execution Sequence (cont.) In flight: bra L2, @p bra W1, bra W0, setp W1, setp W0

21 (21) Execution Sequence (cont.) In flight: bra L2, bra W1, bra W0, setp W1

22 (22) Execution Sequence (cont.) In flight: ld W0, bra W1, bra W0

23 (23) Execution Sequence (cont.) In flight: ld W1, bra W1, ld W0

24 (24) Execution Sequence (cont.) In flight: ld W0, ld W1, ld W0

25 (25) Execution Sequence (cont.) In flight: ld W1, ld W0, ld W1, ld W0

26 (26) Execution Sequence (cont.) In flight: add W0, ld W1, ld W0

27 (27) Execution Sequence (cont.) In flight: add W1, ld W1, ld W0, ld W1, add W0

28 (28) Execution Sequence (cont.) In flight: st W0, ld W0, ld W1, add W1, add W0

29 (29) Execution Sequence (cont.) In flight: st W1, ld W1, add W0, add W1, st W0

30 (30) Execution Sequence (cont.) In flight: ret, add W0, add W1, st W1, st W0

31 (31) Execution Sequence (cont.) In flight: ret, add W1, ret, st W0, st W1

32 (32) Execution Sequence (cont.) In flight: ret, st W0, st W1, ret

33 (33) Execution Sequence (cont.) In flight: st W1, ret

34 (34) Execution Sequence (cont.) In flight: ret

35 (35) Execution Sequence (cont.) In flight: ret

36 (36) Execution Sequence (cont.) Pipeline empty. Note: idealized execution, without memory or special-function delays.

37 (37) Fine-Grained Multithreading First introduced in the Denelcor HEP (1980s). Can eliminate data bypassing in an instruction stream: maintain separation between successive instructions of the same warp. Simplifies dependency handling between successive instructions, e.g., by having only one instruction per warp in the pipeline at a time. What happens when warp size > #lanes? Why have such a relationship?

38 (38) Multi-Cycle Dispatch [Figure: one warp dispatched across eight scalar pipelines over Cycle 1 and Cycle 2.] When the warp is wider than the number of lanes, issue the warp over multiple cycles. NB: fetch bandwidth.

39 (39) SIMD vs. SIMT [Figure: Flynn taxonomy — SISD, SIMD, MISD, MIMD — classified by instruction streams × data streams, with SIMT shown alongside.] MIMD: multiple, loosely synchronized threads (e.g., pthreads). SIMD: a single scalar thread plus a register file of vector operands, synchronous operation (e.g., SSE/AVX). SIMT: multiple threads operating synchronously (e.g., PTX, HSA).

40 (40) Comparison MIMD: multiple independent threads with explicit synchronization; exploits TLP, DLP, and ILP; SPMD style; explicitly parallel. SIMD: a single thread plus vector operations; exploits DLP; easily embedded in sequential code. SIMT: multiple, synchronous threads; exploits TLP, DLP, and ILP (more recently); explicitly parallel.
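To make the contrast concrete, a minimal sketch not taken from the slides: the SIMD version uses x86 AVX intrinsics in a single host thread (compile the host code with AVX enabled), while the SIMT version expresses the same computation as many scalar CUDA threads.

#include <immintrin.h>

// SIMD: one thread issues explicit 8-wide vector instructions.
void simd_add(const float *a, const float *b, float *c, int n) {
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) c[i] = a[i] + b[i];   // scalar tail
}

// SIMT: many scalar threads; the hardware groups them into warps that
// execute the same instruction together.
__global__ void simt_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}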

41 (41) Performance Metrics: Occupancy Performance implications of programming model properties: warps, thread blocks, register usage. Captures which resources can be dynamically shared, and how to reason about the resource demands of a CUDA kernel. Enables device-specific online tuning of kernel parameters.

42 (42) CUDA Occupancy Occupancy = (#Active Warps) / (#Maximum Active Warps): a measure of how well you are using the maximum capacity. Limits on the numerator? Registers/thread, shared memory/thread block, number of scheduling slots (thread blocks or warps). Limits on the denominator? Memory bandwidth, scheduler slots. What is the performance impact of varying kernel resource demands?
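A rough sketch of obtaining this ratio programmatically with the CUDA runtime occupancy API, using the vector-add kernel from earlier (device 0 assumed):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    int blockSize = 256;
    int maxBlocksPerSM = 0;

    // Blocks of this kernel that can be resident on one SM, given its
    // register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  vecAddKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int activeWarps = maxBlocksPerSM * (blockSize / prop.warpSize);
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy = %d/%d = %.2f\n", activeWarps, maxWarps,
           (float)activeWarps / maxWarps);
    return 0;
}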

43 (43) Performance Goals Core utilization, bandwidth (BW) utilization. What are the limiting factors?

44 (44) Resource Limits on Occupancy [Figure: the kernel distributor dispatches thread blocks (e.g., TB 0) to SMs, which access DRAM. Within an SM, the warp schedulers and warp contexts together with the register file limit the number of resident threads, while the thread block control slots and L1/shared memory limit the number of resident thread blocks. SM = Stream Multiprocessor, SP = Stream Processor.]

45 (45) Programming Model Attributes [Figure: Grid 1 composed of Blocks (0,0), (0,1), (1,0), (1,1); Block (1,1) expanded into its threads (0,0,0) … (1,0,3).] What do we need to know about the target processor? #SMs, max threads/block, max threads/dimension, warp size. How do these map to the kernel configuration parameters that we can control?
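A small sketch of querying exactly these attributes at runtime with the CUDA runtime API (device 0 assumed):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("#SMs              : %d\n", prop.multiProcessorCount);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    printf("Max threads/dim   : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Warp size         : %d\n", prop.warpSize);
    printf("Registers/block   : %d\n", prop.regsPerBlock);
    printf("Shared mem/block  : %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}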

46 (46) Impact of Thread Block Size Consider Fermi with 1536 threads/SM. With 512 threads/block, only up to 3 thread blocks can be executing at an instant. With 128 threads/block, 12 thread blocks fit per SM; consider how many instructions can be in flight. Now consider a limit of 8 thread blocks/SM: with 128 threads/block only 1024 threads are active at a time, so occupancy = 1024/1536 ≈ 0.67. To maximize utilization, thread block size should balance demand for thread block slots vs. thread slots.
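A toy calculation of this tradeoff; the 1536-thread and 8-block limits are the Fermi figures quoted above, and the code is only an illustration of the arithmetic:

#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreadsPerSM = 1536;   // Fermi limit quoted on the slide
    const int maxBlocksPerSM  = 8;      // thread-block slots per SM
    const int blockSizes[]    = {512, 256, 128};

    for (int blockSize : blockSizes) {
        int blocksByThreads = maxThreadsPerSM / blockSize;
        int blocks  = std::min(blocksByThreads, maxBlocksPerSM);
        int threads = blocks * blockSize;
        printf("blockSize=%3d -> %d blocks, %4d threads, occupancy %.2f\n",
               blockSize, blocks, threads, (float)threads / maxThreadsPerSM);
    }
    return 0;
}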

47 (47) Impact of #Registers Per Thread Assume 10 registers/thread and a thread block size of 256. Number of registers per SM = 16K. A thread block then requires 2560 registers, for a maximum of 6 thread blocks per SM, which uses all 1536 thread slots. What is the impact of increasing the register count by 2 (to 12 registers/thread)? The granularity of management is a thread block: only 5 blocks now fit (12 × 256 = 3072 registers each), a loss of concurrency of 256 threads!
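The same arithmetic as a small sketch (constants are the slide's figures; the rounding down to whole blocks is the key point):

#include <cstdio>

int main() {
    const int regsPerSM       = 16 * 1024;   // 16K registers per SM
    const int threadsPerBlock = 256;
    const int maxThreadsPerSM = 1536;
    const int regOptions[]    = {10, 12};

    for (int regsPerThread : regOptions) {
        int regsPerBlock = regsPerThread * threadsPerBlock;   // 2560 or 3072
        int blocks  = regsPerSM / regsPerBlock;               // whole blocks only
        int threads = blocks * threadsPerBlock;
        printf("%2d regs/thread -> %d blocks/SM, %4d of %d thread slots used\n",
               regsPerThread, blocks, threads, maxThreadsPerSM);
    }
    return 0;
}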

48 (48) Impact of Shared Memory Shared memory is allocated per thread block, which can limit the number of thread blocks executing concurrently per SM. As we change the gridDim and blockDim parameters, how does the demand change for shared memory, the number of thread slots, and the number of thread block slots?
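A sketch of the shared-memory limit, assuming 48 KB of shared memory per SM and 16 KB requested per block (both are assumed example values, not figures from the slide):

#include <cstdio>

// Dynamically sized shared memory; the per-block size is set at launch time.
__global__ void tileKernel(float *out) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const size_t smemPerSM    = 48 * 1024;   // assumed per-SM capacity
    const size_t smemPerBlock = 16 * 1024;   // assumed per-block request
    printf("Shared memory alone allows %zu blocks per SM\n",
           smemPerSM / smemPerBlock);

    // The third launch parameter is the dynamic shared-memory size per block:
    // tileKernel<<<grid, 256, smemPerBlock>>>(out_d);   // allocation omitted
    return 0;
}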

49 (49) Thread Granularity How fine grained should threads be? Coarser-grain threads: increased register and shared memory pressure, but lower pressure on thread block slots and thread slots. Example: merging thread blocks (or threads) lets the merged threads share row values and reduces memory bandwidth. [Figure: matrices d_M, d_N, d_P, with two adjacent output threads merged.]

50 (50) Balance [Figure: balancing #threads/block, #thread blocks, shared memory/thread block, and #registers/thread.] Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device. Goal: increase occupancy until one or the other is saturated.

51 (51) Performance Tuning Auto-tuning to maximize performance for a device: query device properties to tune kernel parameters prior to launch; tune kernel properties to maximize efficiency of execution. http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
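One way to do this kind of pre-launch tuning is the runtime occupancy API; a sketch using the vector-add kernel from earlier (repeated so the example is self-contained):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void tunedLaunch(const float *A_d, const float *B_d, float *C_d, int n) {
    int minGridSize = 0, blockSize = 0;
    // Suggests a block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       vecAddKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("Launching with blockSize=%d, gridSize=%d\n", blockSize, gridSize);
    vecAddKernel<<<gridSize, blockSize>>>(A_d, B_d, C_d, n);
}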

52 (52) Device Properties

53 (53) Summary Sequence warps through the scalar pipelines. Overlap execution from multiple warps to hide memory latency (or special-function latency). Out-of-order execution is not shown; a scoreboard is needed for correctness. Use the occupancy calculator as a first-order estimate of utilization.

54 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Questions?

