(1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted

(2) Objectives: Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size). Gain a basic, high-level understanding of instruction fetch, decode, issue, and memory access.

(3) Reading Assignment: Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 6.3; CUDA Programming Guide. Get the CUDA Occupancy Calculator (find it at nvidia.com, or use the web version).

(4) Recap

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd()
{
    dim3 DimGrid(ceil(n / 256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: the kernel's thread blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign
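The recap elides the host-side allocation and transfers that produce A_d, B_d, C_d, and n. A minimal, self-contained sketch of those steps (parameter names such as h_A and the choice of 256 threads per block are illustrative, not prescribed by the slide):

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

__host__ void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc(&A_d, size);
    cudaMalloc(&B_d, size);
    cudaMalloc(&C_d, size);
    cudaMemcpy(A_d, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, h_B, size, cudaMemcpyHostToDevice);

    dim3 DimGrid((n + 255) / 256, 1, 1);   // enough 256-thread blocks to cover n
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);

    cudaMemcpy(h_C, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}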

(5) Kernel Launch
[Figure: the host processor issues commands through CUDA streams, which are mapped onto hardware queues feeding the Kernel Management Unit on the device.]
Commands from the host are issued through streams:
- Kernels in the same stream are executed sequentially
- Kernels in different streams may be executed concurrently
Streams are mapped to hardware queues in the device's kernel management unit (KMU); mapping multiple streams to the same queue can serialize some kernels. A kernel launch distributes its thread blocks to the SMs.
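A minimal sketch of the stream semantics described above: kernels launched into the same stream run in issue order, while kernels in different streams may overlap. The kernels, array sizes, and launch configuration here are placeholders chosen for illustration:

#include <cuda_runtime.h>

// Trivial kernels used only to illustrate stream ordering (placeholders).
__global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
__global__ void kernelB(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }

int main()
{
    const int n = 1 << 20;
    float *d0, *d1;
    cudaMalloc(&d0, n * sizeof(float));
    cudaMalloc(&d1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + 255) / 256);

    // Same stream: kernelA completes before kernelB starts (sequential).
    kernelA<<<grid, block, 0, s0>>>(d0, n);
    kernelB<<<grid, block, 0, s0>>>(d0, n);

    // Different streams: these two launches may execute concurrently,
    // depending on how the streams map onto the hardware queues (KMU).
    kernelA<<<grid, block, 0, s0>>>(d0, n);
    kernelB<<<grid, block, 0, s1>>>(d1, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0); cudaFree(d1);
    return 0;
}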

(6) TBs, Warps, & Scheduling
Imagine one TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels.
[Figure: thread blocks being distributed across SMX0 … SMX12, each SMX with its own register file, cores, and L1 cache/shared memory.]

(7)–(9) TBs, Warps, & Utilization
One TB has 64 threads, i.e., 2 warps. On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels.
[Figure sequence: successive thread blocks are assigned across SMX0 … SMX12 until the per-SMX limits are reached; remaining TBs are queued.]
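A quick back-of-the-envelope check of these numbers; a sketch assuming 32-thread warps and that registers and shared memory are not the binding limit:

#include <stdio.h>

int main(void)
{
    const int warpSize       = 32;
    const int threadsPerTB   = 64;    // from the example above
    const int maxWarpsPerSMX = 64;    // GK110 limit from the slide
    const int numSMX         = 13;    // K20c

    int warpsPerTB  = threadsPerTB / warpSize;       // 2 warps per TB
    int tbsPerSMX   = maxWarpsPerSMX / warpsPerTB;   // up to 32 TBs per SMX
    int residentTBs = tbsPerSMX * numSMX;            // 416 TBs device-wide

    printf("%d warps/TB, up to %d TBs/SMX, %d TBs resident; the rest are queued\n",
           warpsPerTB, tbsPerSMX, residentTBs);
    return 0;
}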

(10) NVIDIA GK110 (Kepler): Thread Block Scheduler [image]

(11) SMX Organization: multiple warp schedulers; 192 cores (6 clusters of 32 cores each); 64K 32-bit registers [image]

(12) Execution Sequencing in a Single SM
A cycle-level view of the execution of a GPU kernel.
[Figure: pending warps feed the SM pipeline: I-Fetch, I-Buffer, Issue, Decode, register file read (RF), the scalar pipelines, D-cache access (hit or miss), and writeback. How are warps sequenced through this pipeline?]

(13) Example: VectorAdd on GPU

CUDA:
__global__ void vector_add(float *a, float *b, float *c, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

PTX (assembly):
        setp.lt.s32  %p, %r5, %rd4;     // r5 = index, rd4 = N
    @%p bra L1;                         // taken when index < N
        bra L2;
L1:     ld.global.f32  %f1, [%r6];      // r6 = &a[index]
        ld.global.f32  %f2, [%r7];      // r7 = &b[index]
        add.f32        %f3, %f1, %f2;
        st.global.f32  [%r8], %f3;      // r8 = &c[index]
L2:     ret;

(14) Example: VectorAdd on GPU
N = 8, 8 threads, 1 block, warp size = 4; 1 SM with 4 cores.
Pipeline:
- Fetch: one instruction from each warp, round-robin through all warps
- Execution: in-order within each warp (with proper data forwarding), 1 cycle per stage
How many warps? 8 threads / 4 threads per warp = 2 warps (Warp 0 and Warp 1).

(15)–(36) Execution Sequence
[Figure sequence: slides 15 through 36 animate the cycle-by-cycle execution of the two warps on the 4-lane SM. Each PTX instruction on the taken path (setp, bra, the two ld.global's, add, st, ret) is fetched alternately for Warp 0 and Warp 1, round-robin, and then flows in order through the FE, DE, EXE, MEM, and WB stages until both warps retire.]
Note (slide 36): Idealized execution without memory or special function delays.
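A toy sketch (not the hardware scheduler) that reproduces the fetch order the animation steps through, assuming the taken path of the PTX above and idealized single-cycle stages:

#include <stdio.h>

int main(void)
{
    // Instruction stream each warp executes on the taken path of the example.
    const char *instrs[] = { "setp", "bra", "ld", "ld", "add", "st", "ret" };
    const int numInstrs = 7, numWarps = 2;

    // One instruction per warp per cycle, round-robin across warps,
    // in order within each warp.
    int cycle = 0;
    for (int i = 0; i < numInstrs; i++)
        for (int w = 0; w < numWarps; w++)
            printf("cycle %2d: fetch %-4s from warp %d\n", cycle++, instrs[i], w);
    return 0;
}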

(37) Fine-Grained Multithreading
First introduced in the Denelcor HEP (1980s). Interleaving warps can eliminate data bypassing within an instruction stream by maintaining separation between successive instructions of the same warp; this simplifies handling of dependencies between successive instructions (e.g., keep only one instruction per warp in the pipeline at a time). What happens when warp size > #lanes? Why have such a relationship?

(38) Multi-Cycle Dispatch
[Figure: one warp dispatched across the scalar pipelines over two cycles.]
When the warp is wider than the number of lanes, the warp is issued over multiple cycles; for example, a 32-thread warp over 8 lanes takes 4 dispatch cycles. Note the benefit for fetch bandwidth: a single fetched instruction feeds all of those dispatch cycles.

(39) SIMD vs. SIMT
[Figure: the Flynn taxonomy (SISD, SIMD, MISD, MIMD, by instruction and data streams) alongside sketches of a single scalar thread, loosely synchronized multiple threads (e.g., pthreads), synchronous SIMD operation on a register file (e.g., SSE/AVX), and SIMT (e.g., PTX, HSA).]

(40) Comparison
- MIMD: multiple independent threads with explicit synchronization; exploits TLP, DLP, and ILP; SPMD style; explicitly parallel.
- SIMD: a single thread plus vector operations; exploits DLP; easily embedded in sequential code.
- SIMT: multiple synchronous threads; exploits TLP, DLP, and (more recently) ILP; explicitly parallel.

(41) Performance Metrics: Occupancy
Occupancy captures the performance implications of programming model properties (warps, thread blocks, register usage): which resources can be dynamically shared, how to reason about the resource demands of a CUDA kernel, and how to enable device-specific online tuning of kernel parameters.

(42) CUDA Occupancy
Occupancy = (#Active Warps) / (#Maximum Active Warps): a measure of how well you are using the maximum capacity of the SM.
Limits on the numerator: registers/thread, shared memory/thread block, number of scheduling slots (thread blocks or warps).
Limits on the denominator: memory bandwidth, scheduler slots.
What is the performance impact of varying kernel resource demands?
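A hedged sketch of computing this ratio at runtime with the CUDA occupancy API; myKernel and the candidate blockSize are placeholders, not from the slide:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void myKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;                       // candidate block size
    int activeBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM,
                                                  myKernel, blockSize, 0);

    int activeWarps = activeBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Occupancy = %d / %d active warps = %.0f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}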

(43) Performance Goals: core utilization and memory bandwidth utilization. What are the limiting factors?

(44) Resource Limits on Occupancy
[Figure: the kernel distributor and SM scheduler dispatch thread blocks to the SMs; within an SM, the register file and warp contexts limit the number of threads, while the thread block control slots and L1/shared memory limit the number of thread blocks. SM = Stream Multiprocessor, SP = Stream Processor.]

(45) Programming Model Attributes
[Figure: a grid of thread blocks and the threads within one block, indexed in up to three dimensions.]
What do we need to know about the target processor? #SMs, max threads/block, max threads per dimension, warp size. How do these map to the kernel configuration parameters that we can control?

(46) Impact of Thread Block Size
Consider Fermi with 1536 threads/SM.
- With 512 threads/block, only up to 3 thread blocks can be executing at an instant.
- With 128 threads/block, 12 thread blocks per SM would be needed to fill the thread slots; consider how many instructions can then be in flight.
- With a limit of 8 thread blocks/SM, 128-thread blocks leave only 1024 active threads at a time: occupancy = 1024/1536 ≈ 67%.
To maximize utilization, the thread block size should balance the demand for thread block slots vs. thread slots.

(47) Impact of #Registers Per Thread
Assume 10 registers/thread, a thread block size of 256, and 16K registers per SM. Each thread block then requires 2560 registers, allowing a maximum of 6 thread blocks per SM, which uses all 1536 thread slots. What is the impact of increasing the register count by 2? A block now needs 12 x 256 = 3072 registers, so only 5 blocks fit; since the granularity of management is a thread block, that is a loss of concurrency of 256 threads!

(48) Impact of Shared Memory
Shared memory is allocated per thread block, so it can limit the number of thread blocks executing concurrently on an SM. As we change the gridDim and blockDim parameters, how does the demand change for shared memory, thread slots, and thread block slots?
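A rough sketch of the resource arithmetic in slides (46)–(48): the number of resident thread blocks per SM is the minimum imposed by thread slots, block slots, registers, and shared memory. The Fermi-era limits and the per-block shared-memory figure are illustrative assumptions, not device queries:

#include <stdio.h>

static int min4(int a, int b, int c, int d)
{
    int m = a < b ? a : b;
    m = m < c ? m : c;
    return m < d ? m : d;
}

int main(void)
{
    // Per-SM limits (illustrative Fermi-class values from the slides)
    const int maxThreadsPerSM = 1536;
    const int maxBlocksPerSM  = 8;
    const int regsPerSM       = 16 * 1024;
    const int smemPerSM       = 48 * 1024;    // bytes (assumed)

    // Kernel demands (example values)
    const int threadsPerBlock = 256;
    const int regsPerThread   = 10;
    const int smemPerBlock    = 4 * 1024;     // bytes (assumed)

    int byThreads = maxThreadsPerSM / threadsPerBlock;              // 6
    int byBlocks  = maxBlocksPerSM;                                 // 8
    int byRegs    = regsPerSM / (regsPerThread * threadsPerBlock);  // 6
    int bySmem    = smemPerSM / smemPerBlock;                       // 12

    int resident = min4(byThreads, byBlocks, byRegs, bySmem);
    printf("Resident blocks/SM = %d (threads %d, blocks %d, regs %d, smem %d)\n",
           resident, byThreads, byBlocks, byRegs, bySmem);
    return 0;
}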

(49) Thread Granularity
How fine-grained should threads be? Coarser-grain threads increase register and shared memory pressure but lower the pressure on thread block slots and thread slots. Example: merge adjacent threads (or thread blocks) in matrix multiply so that they share row values and reduce memory bandwidth.
[Figure: matrix multiply operands d_M and d_N and result d_P, with two adjacent threads merged.]

(50) Balance
[Figure: the interacting knobs: #threads/block, #thread blocks, shared memory/thread block, #registers/thread.]
Navigate the tradeoffs among these to maximize core utilization and memory bandwidth utilization for the target device. Goal: increase occupancy until one or the other is saturated.

(51) Performance Tuning
Auto-tuning to maximize performance for a device: query device properties to tune kernel parameters prior to launch, and tune kernel properties to maximize efficiency of execution.
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html

(52) Device Properties
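A minimal sketch of the device query this tuning step relies on, printing the cudaDeviceProp fields that the preceding slides reason about:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 for illustration

    printf("SMs:                 %d\n", prop.multiProcessorCount);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Max threads/block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads/SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers/block:     %d\n", prop.regsPerBlock);
    printf("Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max block dims:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}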

(53) Summary
- Warps are sequenced through the scalar pipelines.
- Execution from multiple warps is overlapped to hide memory (or special function) latency.
- Out-of-order execution is not shown here; a scoreboard is needed for correctness.
- The occupancy calculator provides a first-order estimate of utilization.

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Questions?