
CUDA Execution Model - II




1 CUDA Execution Model - II
©Sudhakar Yalamanchili unless otherwise noted

2 Objectives Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size): thread blocks and warps; synchronization; mapping to hardware execution units; scheduling and resource management; the programmer's view. Have a basic, high-level understanding of instruction fetch, decode, issue, and memory access.

3 Reading Assignment Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 6.3. CUDA Programming Guide. Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version.

4 Schedule onto multiprocessors
Recap __host__ void vecAdd() { dim3 DimGrid(ceil(n/256.0), 1, 1); dim3 DimBlock(256, 1, 1); vecAddKernel<<<DimGrid,DimBlock>>>(A_d,B_d,C_d,n); } __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if( i<n ) C_d[i] = A_d[i]+B_d[i]; } (Figure: the kernel's thread blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share device RAM.) © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign
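For reference, a minimal host-side sketch of what the recap elides (standard CUDA runtime API; the host arrays A_h, B_h, C_h and the omission of error checking are assumptions added for illustration):

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd(const float *A_h, const float *B_h, float *C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc(&A_d, size); cudaMalloc(&B_d, size); cudaMalloc(&C_d, size);   // device allocations
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
    dim3 DimGrid((n + 255) / 256, 1, 1);     // ceil(n/256) thread blocks
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);   // blocks are scheduled onto SMs by hardware
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}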

5 Kernel Launch
Commands by the host are issued through CUDA streams. Kernels in the same stream are executed sequentially; kernels in different streams may be executed concurrently. Streams are mapped to hardware queues in the device's kernel management unit (KMU) (more later); multiple streams mapped to the same queue serialize some kernels. Kernel launch distributes thread blocks to SMs (a hedged stream sketch follows). (Figure: host processor, CUDA streams, HW queues in the Kernel Management Unit on the device, kernel dispatch to SMs.)
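A hedged sketch of expressing this concurrency with streams (kernelA, kernelB, and the launch dimensions are illustrative assumptions, not part of the slide):

__global__ void kernelA(float *x) { x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *y) { y[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void launchConcurrently(float *x_d, float *y_d) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    // Kernels issued into the same stream run in order; kernels in different
    // streams may overlap, subject to how streams map onto the hardware queues.
    kernelA<<<64, 256, 0, s0>>>(x_d);
    kernelB<<<64, 256, 0, s1>>>(y_d);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}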

6 CUDA Thread Block
All threads in a block execute the same kernel program (SPMD); blocks are also referred to as cooperative thread arrays (CTAs). The programmer declares the block: block size of 1 to 1024 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads. Threads have thread index numbers within the block (Thread Id #: 0 … m). Kernel code uses the thread index and block index to select work and address shared data. Threads in the same block share data and synchronize while doing their share of the work (a sketch follows). (Speaker note: point out the analogy with OpenCL.) Courtesy: John Nickolls, NVIDIA. © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
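As a sketch of intra-block sharing and synchronization (the block-reversal kernel and its fixed 256-thread block size are assumptions for illustration):

__global__ void reverseWithinBlock(const float *d_in, float *d_out) {
    __shared__ float tile[256];                 // visible to all threads in the block
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;        // block and thread index select the work
    tile[t] = d_in[g];
    __syncthreads();                            // barrier: every write above is now visible
    d_out[g] = tile[blockDim.x - 1 - t];        // safely read a value written by another thread
}

// Launch with a block size matching the tile, e.g. reverseWithinBlock<<<numBlocks, 256>>>(in_d, out_d);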

7 Execution Semantics Execution of a thread block is partitioned into warps. (Figure: 2D grid and vector examples showing a thread block, its warps, and a barrier synchronization.)

8 NVIDIA GK110 (Kepler) Thread Block Scheduler
Image from

9 SMX Organization Multiple Warp Schedulers 64K 32-bit registers
192 cores, organized as 6 clusters of 32 cores each (cluster width = warp size). Each thread block is allocated a slice of the SMX resources (e.g., the register file). Image from

10 Execution Sequencing in a Single SM
Cycle-level view of the execution of a thread block. A single instruction stream is fetched (I-Fetch, I-Buffer), decoded, and issued to a set of scalar pipelines (#lanes = warp size); operands come from the register file and predicate register file. Data accesses go to the D-cache ("all hit?"); warps that miss become pending warps until the data returns, and completed instructions write back. (Figure shows warps 1, 4, and 6 resident in the SM.)

11 TBs, Warps, & Scheduling Imagine one TB has 64 threads or 2 warps
On a K20c GK110 GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels. (Figure: thread blocks distributed across SMX0–SMX12, each with a register file, cores, and L1 cache/shared memory.)

12 TBs, Warps, & Utilization
One TB has 64 threads, i.e., 2 warps. Microarchitecture structures track the resident thread blocks and warps. (Figure: same SMX0–SMX12 organization.)

13 TBs, Warps, & Utilization
One TB has 64 threads, i.e., 2 warps. There are limits on the number of thread blocks resident per SM. (Figure: same SMX0–SMX12 organization.)

14 TBs, Warps, & Utilization
One TB has 64 threads, i.e., 2 warps. Thread blocks that cannot be made resident are queued. (Figure: same SMX0–SMX12 organization.)

15 Example: VectorAdd on GPU
CUDA: __global__ void vector_add(float *a, float *b, float *c, int N) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index < N) c[index] = a[index]+b[index]; } PTX (Assembly): setp.lt.s32 %p, %r5, %rd4; //r5 = index, rd4 = N @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; //r6 = &a[index] ld.global.f32 %f2, [%r7]; //r7 = &b[index] add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; //r8 = &c[index] L2: ret;

16 Example: Vector Code PTX (Assembly):
PTX (Assembly): setp.lt.s32 %p, %r5, %rd4; //r5 = index, rd4 = N @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; //r6 = &a[index] ld.global.f32 %f2, [%r7]; //r7 = &b[index] add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; //r8 = &c[index] L2: ret; (Figure: one warp dispatched per cycle across four scalar pipelines.) Caution: warp-synchronous programming.

17 Example: VectorAdd on GPU
N = 8, 8 threads, 1 block, warp size = 4; 1 SM with 4 cores. Pipeline: Fetch – one instruction from each warp, round-robin through all warps. Execution – in-order execution within warps, with proper data forwarding; 1 cycle per stage. How many warps? (8 threads / warp size 4 = 2 warps: Warp0 and Warp1.)

18 Execution Sequence Warp0 Warp1
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; Single instruction stream front-end FE DE EXE EXE EXE EXE Synchronous parallel instruction back-end MEM MEM MEM MEM WB WB WB WB Warp0 Warp1 Note: Assume idealized execution – issues of dependencies/hazards will be addressed later. The point here is to understand the basic sequencing of instructions in SIMT execution.

19 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; setp W0 FE DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

20 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; setp W1 FE setp W0 DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

21 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra W0 FE setp W1 DE setp W0 setp W0 setp W0 setp W0 EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

22 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; @p bra W1 FE @p bra W0 DE setp W1 EXE EXE EXE EXE setp W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

23 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra L2 FE @p bra W1 DE bra W0 EXE EXE EXE EXE setp W1 MEM MEM MEM MEM setp W0 WB WB WB WB Warp0 Warp1

24 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra L2 FE DE bra W1 EXE EXE EXE EXE bra W0 MEM MEM MEM MEM setp W1 WB WB WB WB Warp0 Warp1

25 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W0 FE DE EXE EXE EXE EXE bra W1 MEM MEM MEM MEM bra W0 WB WB WB WB Warp0 Warp1

26 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W1 FE ld W0 DE EXE EXE EXE EXE MEM MEM MEM MEM bra W1 WB WB WB WB Warp0 Warp1

27 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W0 FE ld W1 DE ld W0 EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

28 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W1 FE ld W0 DE ld W1 EXE EXE EXE EXE ld W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

29 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; add W0 FE ld W1 DE ld W0 EXE EXE EXE EXE ld W1 MEM MEM MEM MEM ld W0 WB WB WB WB Warp0 Warp1

30 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; add W1 FE add W0 DE ld W1 EXE EXE EXE EXE ld W0 MEM MEM MEM MEM ld W1 WB WB WB WB Warp0 Warp1

31 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; st W0 FE add W1 DE add W0 EXE EXE EXE EXE ld W1 MEM MEM MEM MEM ld W0 WB WB WB WB Warp0 Warp1

32 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; st W1 FE st W0 DE add W1 EXE EXE EXE EXE add W0 MEM MEM MEM MEM ld W1 WB WB WB WB Warp0 Warp1

33 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ret FE st W1 DE st W0 EXE EXE EXE EXE add W1 MEM MEM MEM MEM add W0 WB WB WB WB Warp0 Warp1

34 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ret FE ret DE st W1 EXE EXE EXE EXE st W0 MEM MEM MEM MEM add W1 WB WB WB WB Warp0 Warp1

35 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE ret DE ret EXE EXE EXE EXE st W1 MEM MEM MEM MEM st W0 WB WB WB WB Warp0 Warp1

36 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE ret EXE EXE EXE EXE ret MEM MEM MEM MEM st W1 WB WB WB WB Warp0 Warp1

37 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE EXE EXE EXE EXE ret MEM MEM MEM MEM ret WB WB WB WB Warp0 Warp1

38 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE EXE EXE EXE EXE MEM MEM MEM MEM ret WB WB WB WB Warp0 Warp1

39 Execution Sequence (cont.)
setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1 Note: Idealized execution without memory or special function delays

40 How thread blocks are partitioned
Partitioning a thread block: thread IDs within a warp are consecutive and increasing, and Warp 0 starts with Thread ID 0. Partitioning is always the same, so you can use this knowledge in control flow; however, the exact size of warps may change from generation to generation. DO NOT rely on any ordering between warps: if there are any dependencies between threads, you must use __syncthreads() to get correct results (see the sketch below). © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
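A small sketch of using the documented partitioning rule without relying on warp ordering (the output encoding is an illustrative assumption):

__global__ void warpLayout(int *out) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;    // consecutive thread IDs fall in the same warp
    int lane   = threadIdx.x % warpSize;    // position within that warp
    // OK: use warpId/lane to assign work or to keep a warp on one control path.
    // NOT OK: assuming warp 0 runs before warp 1; if warps share data,
    // insert __syncthreads() between the producing and consuming phases.
    out[tid] = warpId * 1000 + lane;
}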

41 Execution Sequence (Cont.)
The SM implements zero-overhead warp scheduling. At any time, only 1 or 2 of the warps are executed by the SM. Warps whose next instruction has its operands ready for consumption are eligible for execution, and eligible warps are selected for execution by a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

42 Fine Grained Multithreading
First introduced in the Denelcor HEP (1980s). Fine-grained multithreading can eliminate data bypassing in an instruction stream: maintaining separation between successive instructions from the same warp simplifies dependency handling between them (e.g., have only one instruction from a warp in the pipeline at a time). What happens when warp size > #lanes? Why have such a relationship? All data structures and decisions based on warp size will be affected; for example, schedulers will have more time to make scheduling decisions while I-buffer logic can be reduced.

43 Issue the warp over multiple cycles
Multi-Cycle Dispatch: issue the warp over multiple cycles when the warp is wider than the number of lanes. (Figure: one warp dispatch spread across Cycle 1 and Cycle 2 onto four scalar pipelines.) NB: fetch bandwidth.

44 Potential Use Cases: Multicycle Dispatch
Reliability: dynamically reduce the #lanes due to faults, e.g., mask off 4 lanes in an 8-lane warp; compiled code can still execute correctly with appropriate microarchitecture support. Power: dynamically reduce the number of lanes to reduce power consumption, e.g., power-gate 4 lanes in an 8-lane warp. Make warp-level compiler/algorithm optimizations portable across machines with different warp sizes (analogy with dynamically scheduled ILP machines). No known machines implement this at the moment. The point here is to motivate multi-cycle dispatch from the perspective of how we can use abstractions effectively to decouple hardware and software; grids and TBs do this statically, and most of the motivation comes from use cases that wish to do this dynamically.

45 Control Flow Instructions
The main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths are serialized in current GPUs. The control paths taken by the threads in a warp are traversed one at a time until there are no more. More on control divergence later. High-level divergence example: design-time reorganization of threads into non-divergent groups (a sketch follows). © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
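A hedged before/after sketch of the idea (an even/odd workload split is assumed purely for illustration; both kernels do the same work):

__global__ void divergentKernel(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) d[i] *= 2.0f;    // even and odd lanes of the same warp take
    else            d[i] *= 0.5f;    // different paths, so the warp serializes them
}

__global__ void regroupedKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Reorganized so whole warps handle even elements and whole warps handle odd
    // elements (assuming n/2 is a multiple of the warp size): at most one warp diverges.
    if (i < n / 2) d[2 * i]               *= 2.0f;
    else           d[2 * (i - n / 2) + 1] *= 0.5f;
}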

46 Barrier Synchronization
(Figure: threads 0 through N-1 reach the barrier at different times; none proceed until all arrive.) Thread blocks are launched with no constraints on their execution order (impact on scalability). Resource allocation is on a per-thread-block basis (impact on portability). © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

47 Synchronization All threads must reach the barrier, otherwise indefinite wait; note the challenges with conditional statements. Each barrier statement represents a distinct barrier. Barriers apply only within a thread block; there is no synchronization across thread blocks (it can be enforced via atomics). Thread blocks can execute in any order; loose synchronization can enhance scalability. CUDA 9.0 introduces a more flexible synchronization model: define and synchronize groups of threads (see the sketch below).
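A minimal sketch of the CUDA 9.0 cooperative-groups style of synchronization mentioned above (the staging kernel and its 256-thread block size are illustrative assumptions):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void stagedCopy(float *data) {
    cg::thread_block block = cg::this_thread_block();
    __shared__ float buf[256];
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[g];
    block.sync();                                   // same effect as __syncthreads()
    // A named sub-group (a 32-thread tile) can synchronize independently of the block.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    tile.sync();
    data[g] = buf[blockDim.x - 1 - threadIdx.x];
}

Note that the earlier caution still applies: a block-wide barrier placed inside a divergent conditional that not all threads reach can hang the kernel.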

48 Transparent Scalability
Hardware is free to assign blocks to any processor at any time, so a kernel scales across any number of parallel processors. (Figure: the same 8-block kernel grid runs on one device two blocks at a time and on another device four blocks at a time.) Each block can execute in any order relative to other blocks. © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

49 SIMD vs. SIMT
The Flynn taxonomy classifies by instruction streams and data streams: SISD, SIMD, MISD, MIMD. SIMD (e.g., SSE/AVX): a single scalar thread operating on a wide register file, with synchronous operation across lanes. SIMT (e.g., PTX, HSA): multiple threads, each with its own register file (RF), loosely synchronized. (Multiple independent threads, e.g., pthreads, correspond to MIMD.)

50 Comparison
MIMD: multiple independent threads with explicit synchronization; exploits TLP, DLP, ILP; SPMD style. SIMD: a single thread plus vector operations; exploits DLP; easily embedded in sequential code. SIMT: multiple, synchronous threads; exploits TLP, DLP, and (more recently) ILP; independent control flow per thread. (RF = register file.)

51 Performance Metrics: Occupancy
Performance implications of programming model properties Warps, thread blocks, register usage Capture which resources can be dynamically shared, and how to reason about resource demands of a CUDA kernel Enable device-specific online tuning of kernel parameters

52 CUDA Occupancy Occupancy = (#Active Warps) / (#Maximum Active Warps)
A measure of how well you are using the maximum capacity. Limits on the numerator? Registers/thread, shared memory/thread block, number of scheduling slots (thread blocks or warps). Limits on the denominator? Memory bandwidth, scheduler slots. What is the performance impact of varying kernel resource demands? (A worked example follows.)
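A worked example (the numbers are assumptions for a Fermi-class SM with 1536 thread slots, i.e., 48 maximum active warps): if register or shared-memory usage limits the SM to 4 resident blocks of 256 threads, the SM holds 4 x 256 / 32 = 32 active warps, so occupancy = 32 / 48 ≈ 0.67.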

53 Performance Goals Core Utilization BW Utilization
What are the limiting factors?

54 Resource Limits on Occupancy
(Figure: a kernel distributor and SM scheduler feed multiple SMs, with DRAM behind them; within an SM: warp schedulers, warp contexts, the register file, SPs, thread block control, and L1/shared memory.) These per-SM resources limit the #thread blocks and the #threads that can be resident. SM – Stream Multiprocessor; SP – Stream Processor.

55 Programming Model Attributes
(Figure: Grid 1 contains blocks (0,0), (0,1), (1,0), (1,1); block (1,1) contains a 3D arrangement of threads.) What do we need to know about the target processor? #SMs, max threads/block, max threads/dim, warp size. How do these map to kernel configuration parameters that we can control?

56 Impact of Thread Block Size
Consider Fermi with 1536 threads/SM. With 512 threads/block, only up to 3 thread blocks can be executing at an instant. With 128 threads/block, 12 thread blocks would fit per SM; consider how many instructions can be in flight. But with a limit of 8 thread blocks/SM, only 8 x 128 = 1024 threads are active at a time, so occupancy = 1024/1536 = 0.666. To maximize utilization, the thread block size should balance demand for thread block slots vs. thread slots.

57 Block Granularity Considerations
For matrix multiplication using multiple blocks, should I use 8X8, 16X16, or 32X32 blocks? For 8X8, we have 64 threads per block. Since each SM can take up to 1536 threads, that would require 24 blocks; however, since each SM can only take up to 8 blocks, only 512 threads will go into each SM! For 16X16, we have 256 threads per block. Since each SM can take up to 1536 threads, it can take up to 6 blocks and achieve full capacity unless other resource considerations overrule. For 32X32, we would have 1024 threads per block, and only one block can fit into an SM for Fermi, using only 2/3 of the thread capacity of an SM. Also, this works for CUDA 3.0 and beyond but is too large for some early CUDA versions. © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

58 Impact of #Registers Per Thread
Assume 10 registers/thread and a thread block size of 256. The number of registers per SM = 16K. A thread block requires 2560 registers, allowing a maximum of 6 thread blocks per SM, which uses all 1536 thread slots. What is the impact of increasing the number of registers by 2? The granularity of management is a thread block, so the result is a loss of concurrency of 256 threads! (The arithmetic is worked below.)
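Working the numbers (same assumptions as the slide: 16K registers/SM, 256 threads/block): at 12 registers/thread a block needs 12 x 256 = 3072 registers, so only floor(16384 / 3072) = 5 blocks fit per SM, i.e., 1280 of the 1536 thread slots; the two extra registers per thread cost a whole block (256 threads) of concurrency.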

59 Impact of Shared Memory
Shared memory is allocated per thread block, so it can limit the number of thread blocks executing concurrently per SM. As we change the gridDim and blockDim parameters, how does demand change for shared memory, the number of thread slots, or the number of thread block slots? (A sketch of querying this limit follows.)
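A hedged sketch of letting the runtime report this limit (the kernel, block size, and shared-memory demand are assumptions for illustration):

__global__ void smemKernel(float *d) {
    extern __shared__ float smem[];             // dynamic shared memory, sized at launch
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = d[g];
    __syncthreads();
    d[g] = smem[blockDim.x - 1 - threadIdx.x];
}

void reportResidency() {
    int blocksPerSM = 0;
    int threadsPerBlock = 256;
    size_t smemBytes = threadsPerBlock * sizeof(float);    // per-block dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smemKernel,
                                                  threadsPerBlock, smemBytes);
    // Raising smemBytes (or per-thread register use) lowers blocksPerSM and,
    // with it, the number of resident warps per SM.
}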

60 Thread Granularity
How fine-grained should threads be? Coarser-grain threads increase register pressure and shared memory pressure, but lower the pressure on thread block slots and thread slots. Merging thread blocks lets them share row values and reduces memory bandwidth. (Figure: matrix multiply with d_M, d_N, d_P; callout: "merge these two threads.")

61 Shared memory/Thread block
Balance #threads/block, #thread blocks, shared memory/thread block, and #registers/thread. Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device. Goal: increase occupancy until one or the other is saturated.

62 Performance Tuning Auto-tuning to maximize performance for a device
Query device properties to tune kernel parameters prior to launch Tune kernel properties to maximize efficiency of execution _1/rel/toolkit/docs/online/index.html

63 Querying the Device (Figure: grid/block/thread hierarchy alongside per-SM resources: Register File (128 KB), L1 (16 KB), Shared Memory (48 KB).) Get device properties: #SMs, max threads/block, max threads/dim, warp size. Give an example of when and how you might do this; tune kernel properties to maximize efficiency of execution (a hedged query sketch follows). © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
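A hedged sketch of querying these properties at run time (field names follow the CUDA runtime's cudaDeviceProp; the printed subset is just an example):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                      // device 0
    printf("SMs:               %d\n", prop.multiProcessorCount);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads/dim:   %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Warp size:         %d\n", prop.warpSize);
    printf("Shared mem/block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers/block:   %d\n", prop.regsPerBlock);
    return 0;   // use these values to pick grid/block dimensions before launch
}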

64 Device Properties

65 Summary Sequence warps through the scalar pipelines
Overlap execution from multiple warps to hide memory latency (or special-function latency). Out-of-order execution is not shown; a scoreboard is needed for correctness. Use the occupancy calculator as a first-order estimate of utilization.

66 Questions?




