© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 1 GPU Programming with CUDA.

GPU: A Massively Parallel Processor
A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until recently, programmed through graphics API
– GPU in every PC and workstation: massive volume and potential impact
[Figure: GFLOPS chart. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

What is GPGPU?
General Purpose computation using GPU and graphics API in applications other than 3D graphics
– GPU accelerates critical path of application
Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
Applications (see GPGPU.org)
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Previous GPGPU Constraints
Dealing with graphics API
– Working with the corner cases of the graphics API
Addressing modes
– Limited texture size/dimension
Shader capabilities
– Limited outputs
Instruction sets
– Lack of integer and bit ops
Communication limited
– Between pixels
– Scatter: a[i] = p
[Figure: fragment program model. Input Registers, Fragment Program, Output Registers; Constants, Texture, Temp Registers; state per thread, per Shader, per Context; FB Memory]

CUDA
“Compute Unified Device Architecture”
General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data parallel co-processor
Targeted software stack
– Compute oriented drivers, language, and tools
Driver for loading computation programs into GPU
– Standalone driver, optimized for computation
– Interface designed for compute: graphics-free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management

Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
– Available in laptops, desktops, and clusters
GPU parallelism is doubling every year
Programming model scales transparently
Programmable in C with CUDA tools
Multithreaded SPMD model uses application data parallelism and thread parallelism
[Figure: GeForce 8800, Tesla S870, Tesla D870]

CUDA – C with no shader limitations!
Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
    Serial Code (host)
    ...
    Parallel Kernel (device)
    KernelA<<< ... >>>(args);
    Serial Code (host)
    Parallel Kernel (device)
    KernelB<<< ... >>>(args);

CUDA Devices and Threads
A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads
Differences between GPU and CPU threads
– GPU threads are extremely lightweight
    Very little creation overhead
– GPU needs 1000s of threads for full efficiency
    Multi-core CPU needs only a few

G80 – Graphics Mode
The future of GPUs is programmable processing
So: build the architecture around the processor
[Figure: G80 block diagram in graphics mode. Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Thread Processor, repeated SP / L1 / TF units, repeated L2 / FB units]

G80 CUDA mode – A Device Example
Processors execute computing threads
New operating mode/HW interface for computing

Extended C
Declspecs
– global, device, shared, local, constant
Keywords
– threadIdx, blockIdx
Intrinsics
– __syncthreads
Runtime API
– Memory, symbol, execution management
Function launch
    __device__ float filter[N];
    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }
    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void**)&myimage, bytes);
    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

[Figure: compilation flow for Extended C. Integrated source (foo.cu) goes into cudacc (EDG C/C++ frontend, Open64 Global Optimizer), which produces GPU Assembly (foo.s) and CPU Host Code (foo.cpp). OCG turns foo.s into G80 SASS (foo.sass); gcc / cl compiles foo.cpp]
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA API Highlights: Easy and Lightweight
The API is an extension to the ANSI C programming language
– Low learning curve
The hardware is designed to enable lightweight runtime and driver
– High performance

CUDA Thread Block
All threads in a block execute the same kernel program (SPMD)
Programmer declares block:
– Block size 1 to 512 concurrent threads
– Block shape 1D, 2D, or 3D
– Block dimensions in threads
Threads have thread id numbers within block
– Thread program uses thread id to select work and address shared data
Threads in the same block share data and synchronize while doing their share of the work
Threads in different blocks cannot cooperate
– Each block can execute in any order relative to other blocks!
[Figure: CUDA Thread Block with Thread Id #: 0 1 2 3 … m, each running the thread program]
Courtesy: John Nickolls, NVIDIA

Thread Blocks: Scalable Cooperation
Divide monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1, each with threadID 0 through 7, every thread running:]
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

Transparent Scalability
Hardware is free to assign blocks to any processor at any time
– A kernel scales across any number of parallel processors
Each block can execute in any order relative to other blocks.
[Figure: a kernel grid of Block 0 through Block 7 mapped over time onto two different devices]

G80 Example: Executing Thread Blocks
Threads are assigned to Streaming Multiprocessors in block granularity
– Up to 8 blocks to each SM as resource allows
– SM in G80 can take up to 768 threads
    Could be 256 (threads/block) * 3 blocks
    Or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
– SM maintains thread/block id #s
– SM manages/schedules thread execution
Flexible resource allocation
[Figure: SM 0 and SM 1, each with MT IU, SP, and Shared Memory, holding blocks of threads t0 t1 t2 … tm]

G80 Example: Thread Scheduling
Each Block is executed as 32-thread Warps
– An implementation decision, not part of the CUDA programming model
– Warps are scheduling units in SM
If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
– Each Block is divided into 256/32 = 8 Warps
– There are 8 * 3 = 24 Warps
[Figure: Streaming Multiprocessor with Instruction L1, Instruction Fetch/Dispatch, Shared Memory, SP and SFU units, holding Block 1 Warps, Block 2 Warps, and Block 3 Warps, each warp made of threads t0 t1 t2 … t31]
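
The block and warp counts on the two G80 slides above can be reproduced with a few lines of host-side C. This is only a sketch: the function names are made up for illustration, and the constants (768 threads and 8 blocks per SM, 32 threads per warp) are the G80 figures quoted on the slides, not values that hold for other devices.

```c
/* G80 limits as quoted on the slides (assumed; other devices differ). */
enum { MAX_THREADS_PER_SM = 768, MAX_BLOCKS_PER_SM = 8, WARP_SIZE = 32 };

/* Blocks that fit on one SM when only the thread and block limits matter. */
int blocks_per_sm(int threads_per_block)
{
    int by_threads = MAX_THREADS_PER_SM / threads_per_block;
    return by_threads < MAX_BLOCKS_PER_SM ? by_threads : MAX_BLOCKS_PER_SM;
}

/* Warps in one block, rounding up when the block size is not a multiple of 32. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warps resident on one SM for a given number of assigned blocks. */
int warps_per_sm(int blocks, int threads_per_block)
{
    return blocks * warps_per_block(threads_per_block);
}
```

For the slide's case, blocks_per_sm(256) gives 3 and warps_per_sm(3, 256) gives 24. Registers and shared memory can lower the block count further; this sketch ignores them.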

G80 Example: Thread Scheduling
SM implements zero-overhead warp scheduling
– At any time, only one of the warps is executed by SM
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible Warps are selected for execution on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected

Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …

Terminology
Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
– The unit of parallelism in CUDA
Warp: a group of threads executed physically in parallel in G80
Block: a group of threads that are executed together and form the unit of resource assignment
Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

CUDA Memory Model Overview
Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
– Long latency access
We will focus on global memory for now
– Constant and texture memory will come later
[Figure: Host; Grid with Global Memory; Block (0, 0) and Block (1, 0), each with Shared Memory and with Thread (0, 0) and Thread (1, 0) holding Registers]

CUDA Device Memory Allocation
cudaMalloc()
– Allocates object in the device Global Memory
– Requires two parameters
    Address of a pointer to the allocated object
    Size of allocated object
cudaFree()
– Frees object from device Global Memory
    Pointer to freed object
[Figure: same memory model diagram as on the CUDA Memory Model Overview slide]

CUDA Device Memory Allocation (cont.)
Code example:
– Allocate a 64 * 64 single precision float array
– Attach the allocated storage to Md
– “d” is often used to indicate a device data structure
    const int TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
    cudaMalloc((void**)&Md, size);
    cudaFree(Md);

CUDA Host-Device Data Transfer
cudaMemcpy()
– Memory data transfer
– Requires four parameters
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer
    – Host to Host
    – Host to Device
    – Device to Host
    – Device to Device
Asynchronous transfer
[Figure: same memory model diagram as on the CUDA Memory Model Overview slide]

CUDA Host-Device Data Transfer (cont.)
Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

CUDA Function Declarations
                                   Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void  KernelFunc()      device             host
__host__   float HostFunc()        host               host
__global__ defines a kernel function
– Must return void
__device__ and __host__ can be used together

CUDA Function Declarations (cont.)
__device__ functions cannot have their address taken
For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments

Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory
    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);
Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
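
The block and thread counts in the comments of the execution configuration above follow from multiplying out the dimensions. A minimal host-side sketch in plain C is given below; `Dim3` is a hypothetical stand-in for CUDA's `dim3` so that the code compiles without the CUDA headers, and the helper names are illustrative only.

```c
/* Hypothetical stand-in for CUDA's dim3 (unused dimensions are set to 1). */
typedef struct { unsigned x, y, z; } Dim3;

/* Number of thread blocks described by a grid dimension. */
unsigned long blocks_in_grid(Dim3 grid)
{
    return (unsigned long)grid.x * grid.y * grid.z;
}

/* Number of threads described by a block dimension. */
unsigned long threads_in_block(Dim3 block)
{
    return (unsigned long)block.x * block.y * block.z;
}

/* Total threads created by one kernel launch. */
unsigned long threads_in_launch(Dim3 grid, Dim3 block)
{
    return blocks_in_grid(grid) * threads_in_block(block);
}
```

With a grid of (100, 50, 1) and a block of (4, 8, 8), this gives 5000 blocks of 256 threads each.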

G80 Implementation of CUDA Memories
Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
[Figure: Host; Grid with Global Memory and Constant Memory; Block (0, 0) and Block (1, 0), each with Shared Memory and with Thread (0, 0) and Thread (1, 0) holding Registers]

CUDA Variable Type Qualifiers
Variable declaration                        Memory     Scope    Lifetime
__device__ __local__    int LocalVar;       local      thread   thread
__device__ __shared__   int SharedVar;      shared     block    block
__device__              int GlobalVar;      global     grid     application
__device__ __constant__ int ConstantVar;    constant   grid     application
__device__ is optional when used with __local__, __shared__, or __constant__
Automatic variables without any qualifier reside in a register
– Except arrays that reside in local memory

Variable Type Restrictions
Pointers can only point to memory allocated or declared in global memory:
– Allocated in the host and passed to the kernel:
    __global__ void KernelFunc(float* ptr)
– Obtained as the address of a global variable:
    float* ptr = &GlobalVar;

A Common Programming Strategy
Global memory resides in device memory (DRAM): much slower access than shared memory
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
    Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    Copying results from shared memory to global memory

A Common Programming Strategy (Cont.)
Constant memory also resides in device memory (DRAM): much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
Carefully divide data according to access patterns
– R/Only -> constant memory (very fast if in cache)
– R/W shared within Block -> shared memory (very fast)
– R/W within each thread -> registers (very fast)
– R/W inputs/results -> global memory (very slow)
For texture memory usage, see NVIDIA document.

GPU Atomic Integer Operations
Atomic operations on integers in global memory:
– Associative operations on signed/unsigned ints
– add, sub, min, max, ...
– and, or, xor
– Increment, decrement
– Exchange, compare and swap
Requires hardware with compute capability 1.1 and above.

SM Register File
Register File (RF)
– 32 KB (8K entries) for each SM in G80
TEX pipe can also read/write RF
– 2 SMs share 1 TEX
Load/Store pipe can also read/write RF
[Figure: SM datapath. I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Shared Mem, Operand Select, MAD, SFU]

Programmer View of Register File
There are 8192 registers in each SM in G80
– This is an implementation decision, not part of CUDA
– Registers are dynamically partitioned across all blocks assigned to the SM
– Once assigned to a block, the register is NOT accessible by threads in other blocks
– Each thread in the same block can access only the registers assigned to itself
[Figure: the register file partitioned among 4 blocks and among 3 blocks]

Example
If each Block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
– Each Block requires 10*256 = 2560 registers
– 8192 = 3 * 2560 + 512
– So, three Blocks (768 threads) can run on an SM as far as registers are concerned
How about if each thread increases the use of registers by 1?
– Each Block now requires 11*256 = 2816 registers
– 8192 < 2816 * 3 = 8448
– Only two Blocks (512 threads) can run on an SM, a 1/3 reduction of parallelism!
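
The register arithmetic in this example is a single integer division. The sketch below, in plain host-side C, assumes the 8192-register figure quoted for G80 and uses an illustrative function name; it looks at registers only and ignores the separate thread, block, and shared memory limits.

```c
/* Registers per SM on G80, as stated on the slides (assumed for this sketch). */
enum { REGISTERS_PER_SM = 8192 };

/* Whole blocks that fit on one SM when registers are the only constraint. */
int blocks_by_registers(int threads_per_block, int regs_per_thread)
{
    int regs_per_block = threads_per_block * regs_per_thread;
    return REGISTERS_PER_SM / regs_per_block;
}
```

blocks_by_registers(256, 10) returns 3 and blocks_by_registers(256, 11) returns 2, matching the drop from three blocks to two when each thread uses one more register.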

More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers
– One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
    This allows for finer grain threading than traditional CPU threading models
– The compiler can trade off between instruction-level parallelism and thread-level parallelism

ILP vs. TLP Example
Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles
– 3 Blocks can run on each SM
If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
– Only two Blocks can run on each SM
– However, one only needs 200/(8*4) = 6.25, rounded up to 7 Warps, to tolerate the memory latency
– Two Blocks have 16 Warps. The performance can actually be higher!
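
The warp count needed to cover the load latency can be written as a rounded-up division. The sketch below is plain host-side C with an illustrative function name. The factor 4 comes from the slide's own 200/(8*4); treating it as the number of cycles one warp instruction occupies the SM is an assumption of this sketch.

```c
/* Warps needed so that independent work from other warps covers one
   global load. cycles_per_instr is the factor 4 from the slide's formula,
   read here (as an assumption) as cycles per warp instruction. */
int warps_to_hide_latency(int latency_cycles, int independent_instrs,
                          int cycles_per_instr)
{
    int covered_per_warp = independent_instrs * cycles_per_instr;
    /* Round up: a fraction of a warp still needs a whole warp. */
    return (latency_cycles + covered_per_warp - 1) / covered_per_warp;
}
```

With 8 independent instructions this gives 7 warps, fewer than the 16 that two blocks provide. By the same formula the original pattern with 4 independent instructions needs 13 warps, which the 24 warps of three blocks also cover.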

Memory Coalescing
When accessing global memory, peak bandwidth utilization occurs when all threads in a half-warp access contiguous memory locations.
[Figure: matrices Md and Nd of size WIDTH x WIDTH with the access patterns of Thread 1 and Thread 2, one labeled "Not coalesced" and one labeled "coalesced"]

Parallel Memory Architecture
In a parallel machine, many threads access memory
– Therefore, memory is divided into banks
– Essential to achieve high bandwidth
Each bank can service one address per cycle
– A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
– Conflicting accesses are serialized
[Figure: Bank 0 through Bank 15]

Bank Addressing Examples
No Bank Conflicts
– Linear addressing, stride == 1
No Bank Conflicts
– Random 1:1 permutation
[Figure: for each case, Thread 0 through Thread 15 mapped onto Bank 0 through Bank 15]

Bank Addressing Examples
2-way Bank Conflicts
– Linear addressing, stride == 2
8-way Bank Conflicts
– Linear addressing, stride == 8
[Figure: for each case, Thread 0 through Thread 15 mapped onto Bank 0 through Bank 15; in the stride == 8 case the accessed banks are marked x8]

How addresses map to banks on G80
Each bank has a bandwidth of 32 bits per clock cycle
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
– So bank = (32-bit word address) % 16
– Same as the size of a half-warp
    No bank conflicts between different half-warps, only within a single half-warp

Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
– If all threads of a half-warp access different banks, there is no bank conflict
– If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
The slow case:
– Bank Conflict: multiple threads in the same half-warp access the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank

Linear Addressing
Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
This is only bank-conflict-free if s shares no common factors with the number of banks
– 16 on G80, so s must be odd
[Figure: Thread 0 through Thread 15 mapped onto Bank 0 through Bank 15 for s=1 and for s=3]
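
The odd-stride rule can be checked by counting how many threads of a half-warp land on each bank. The sketch below is plain host-side C, assuming 16 banks, a 16-thread half-warp, and non-negative 32-bit word indices; the function names are illustrative. It does not model the broadcast case where all threads read the identical address.

```c
/* G80 figures from the slides (assumed for this sketch). */
enum { NUM_BANKS = 16, HALF_WARP = 16 };

/* Bank that holds a given 32-bit word index of shared memory. */
int bank_of_word(int word_index)
{
    return word_index % NUM_BANKS;
}

/* Largest number of half-warp threads hitting one bank for the access
   shared[base + s * tid]. A result of 1 means conflict-free.
   Broadcast (all threads on the identical address) is not modeled. */
int conflict_degree(int base, int s)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int tid = 0; tid < HALF_WARP; ++tid) {
        int bank = bank_of_word(base + s * tid);
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Strides 1 and 3 give a degree of 1, stride 2 gives 2, and stride 8 gives 8, in line with the bank addressing examples.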

How thread blocks are partitioned
Thread blocks are partitioned into warps
– Thread IDs within a warp are consecutive and increasing
– Warp 0 starts with Thread ID 0
Partitioning is always the same
– Thus you can use this knowledge in control flow
– However, the exact size of warps may change from generation to generation
– (Covered next)
However, DO NOT rely on any ordering between warps
– If there are any dependencies between threads, you must __syncthreads() to get correct results

Control Flow Instructions
Main performance concern with branching is divergence
– Threads within a single warp take different paths
– Different execution paths are serialized in G80
    The control paths taken by the threads in a warp are traversed one at a time until there are no more
A common case: avoid divergence when branch condition is a function of thread ID
– Example with divergence:
    if (threadIdx.x > 2) { }
    This creates two different control paths for threads in a block
    Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
– Example without divergence:
    if (threadIdx.x / WARP_SIZE > 2) { }
    Also creates two different control paths for threads in a block
    Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
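
Whether a thread-ID-based condition diverges can be checked on the host by testing, warp by warp, whether all threads agree. The following is a sketch in plain C with a warp size of 32 assumed; the function names are illustrative, and the two conditions are the ones from the slide.

```c
enum { WARP_SIZE = 32 };

/* The two branch conditions from the slide, as functions of the thread ID. */
int cond_per_thread(int tid) { return tid > 2; }
int cond_per_warp(int tid)   { return tid / WARP_SIZE > 2; }

/* Count the warps of a 1D block in which threads disagree on the condition. */
int divergent_warps(int block_size, int (*cond)(int))
{
    int count = 0;
    for (int first = 0; first < block_size; first += WARP_SIZE) {
        int expected = cond(first);
        for (int tid = first + 1;
             tid < first + WARP_SIZE && tid < block_size; ++tid) {
            if (cond(tid) != expected) {
                ++count;   /* this warp takes both paths */
                break;
            }
        }
    }
    return count;
}
```

For a 256-thread block, the first condition diverges in exactly one warp (the first), and the second diverges in none.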

Parallel Reduction
Given an array of values, “reduce” them to a single value in parallel
Examples
– Sum reduction: sum of all values in the array
– Max reduction: maximum of all values in the array
Typical parallel implementation:
– Recursively halve # threads, add two values per thread
– Takes log(n) steps for n elements, requires n/2 threads

A Vector Reduction Example
Assume an in-place reduction using shared memory
– The original vector is in device global memory
– The shared memory is used to hold a partial sum vector
– Each iteration brings the partial sum vector closer to the final sum
– The final solution will be in element 0

A simple implementation
Assume we have already loaded array into
    __shared__ float partialSum[]
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

Device Runtime Component: Synchronization Function
    void __syncthreads();
Synchronizes all threads in a block
Once all threads have reached this point, execution resumes normally
Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
Allowed in conditional constructs only if the conditional is uniform across the entire thread block

Vector Reduction with Bank Conflicts
[Figure: array elements 0 through 11 over iterations 1, 2, 3. Iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4..7, 8..11; iteration 3 forms 0..7, 8..15]

Vector Reduction with Branch Divergence
[Figure: array elements 0 through 11 over iterations 1, 2, 3, with the additions performed by Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10. Iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4..7, 8..11; iteration 3 forms 0..7, 8..15]

Some Observations
In each iteration, two control flow paths will be sequentially traversed for each warp
– Threads that perform addition and threads that do not
– Threads that do not perform addition may cost extra cycles depending on the implementation of divergence
No more than half of threads will be executing at any time
– All odd index threads are disabled right from the beginning!
– On average, less than ¼ of the threads will be activated for all warps over time
– After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence
    This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire

Shortcomings of the implementation
Assume we have already loaded array into
    __shared__ float partialSum[]
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)   // BAD: Divergence due to interleaved branch decisions
            partialSum[t] += partialSum[t+stride];
    }

A better implementation
Assume we have already loaded array into
    __shared__ float partialSum[]
    unsigned int t = threadIdx.x;
    // stride must run down to 1 so that partialSum[1] is folded into partialSum[0]
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }
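
Both reduction orders can be emulated sequentially on the host to confirm that the total ends up in element 0. The sketch below is plain C, not device code: the inner loop over t stands in for the threads of one block, each outer iteration stands in for the work between two __syncthreads() calls, and the function names are illustrative. It assumes n is a power of two and runs the halving stride all the way down to 1.

```c
/* Halving-stride reduction: active threads are t = 0 .. stride-1. */
float reduce_halving(float *partial, unsigned n)
{
    for (unsigned stride = n / 2; stride > 0; stride /= 2) {
        /* Reads touch [stride, 2*stride), writes touch [0, stride),
           so running the "threads" one after another is safe. */
        for (unsigned t = 0; t < stride; ++t)
            partial[t] += partial[t + stride];
    }
    return partial[0];
}

/* Interleaved reduction: active threads are multiples of 2*stride. */
float reduce_interleaved(float *partial, unsigned n)
{
    for (unsigned stride = 1; stride < n; stride *= 2) {
        for (unsigned t = 0; t + stride < n; t += 2 * stride)
            partial[t] += partial[t + stride];
    }
    return partial[0];
}
```

For the values 1 through 8, both functions return 36; they differ only in which elements are touched in each step, which is what drives divergence on the device.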

Some Observations About the New Implementation
Only the last 5 iterations will have divergence
Entire warps will be shut down as iterations progress
– For a 512-thread block, 4 iterations to shut down all but one warp in each block
– Better resource utilization, will likely retire warps and thus blocks run faster
Recall, no bank conflicts either

Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
– Local, register and shared memory usage
– Thread ID usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

Square Matrix Multiplication
P = M * N of size WIDTH x WIDTH
Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH]

Memory Layout of a Matrix in C
[Figure: a 4 x 4 matrix M with elements M0,0 through M3,3, shown next to the same elements linearized row by row in memory]

Step 1: Matrix Multiplication
A Simple Host Version in C
    // Matrix multiplication on the (CPU) host in single precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }
[Figure: M, N, P of size WIDTH x WIDTH, with row index i, column index j, and inner index k]

Step 2: Input Matrix Data Transfer (Host-side Code)
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        …
        // 1. Allocate and load M, N to device memory
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc((void**)&Pd, size);

Step 3: Output Matrix Data Transfer (Host-side Code)
        // 2. Kernel invocation code, to be shown later
        …
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Step 4: Kernel Function
    // Matrix multiplication kernel: per-thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Step 4: Kernel Function (cont.)
        for (int k = 0; k < Width; ++k)
        {
            float Melement = Md[threadIdx.y*Width+k];
            float Nelement = Nd[k*Width+threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
    }
[Figure: Md, Nd, Pd of size WIDTH x WIDTH, with thread indices tx, ty selecting the Pd element and k running along a row of Md and a column of Nd]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 73 Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 74 Only One Thread Block Used
One Block of threads computes matrix Pd
–Each thread computes one element of Pd
Each thread
–Loads a row of matrix Md
–Loads a column of matrix Nd
–Performs one multiply and one addition for each pair of Md and Nd elements
–Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 containing Block 1; Thread (2, 2) reads a row of Md and a column of Nd to produce one element of Pd; matrices are WIDTH wide]
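
The size limit on this slide can be made concrete with a small calculation. The sketch below is plain C rather than CUDA. The helper name max_single_block_width is hypothetical, and the 512-threads-per-block figure in the usage note is an assumption (it matches G80-class devices but is not stated on this slide).

```c
/* Largest Width such that one Width x Width thread block stays within
 * the per-block thread limit. The limit is a parameter because it
 * depends on the device (assumption: 512 on G80-class hardware). */
int max_single_block_width(int max_threads_per_block)
{
    int width = 0;
    while ((width + 1) * (width + 1) <= max_threads_per_block)
        ++width;
    return width;
}
```

Under that assumption, max_single_block_width(512) returns 22, so the single-block kernel cannot handle matrices larger than 22 x 22.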

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 75 Step 7: Handling Arbitrary Sized Square Matrices
Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
–Each block has (TILE_WIDTH)^2 threads
Generate a 2D Grid of (WIDTH/TILE_WIDTH)^2 blocks
You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)!
[Figure: Md, Nd and Pd of width WIDTH; block (bx, by) and thread (tx, ty) select one element inside a TILE_WIDTH x TILE_WIDTH tile of Pd]
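
The grid sizing on this slide can be sketched as two small helpers in plain C (not CUDA). Both function names are hypothetical. The code assumes WIDTH is a multiple of TILE_WIDTH, and the maximum grid dimension is passed in as a parameter because the slide only gives it as roughly 64K.

```c
/* Number of thread blocks along one side of the 2D grid.
 * Assumes width is a multiple of tile_width. */
int grid_side(int width, int tile_width)
{
    return width / tile_width;
}

/* Number of kernel launches needed along one grid dimension when the
 * required grid side exceeds max_grid_dim (rounds up). */
int launches_per_dimension(int width, int tile_width, int max_grid_dim)
{
    int blocks = grid_side(width, tile_width);
    return (blocks + max_grid_dim - 1) / max_grid_dim;
}
```

For WIDTH = 1024 and TILE_WIDTH = 16, grid_side returns 64, so the grid holds 64 x 64 blocks and a single launch is enough. The loop around the kernel call only matters once launches_per_dimension exceeds 1.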

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 76 A Small Example
TILE_WIDTH = 2
[Figure: 4x4 result matrix P, elements P0,0 through P3,3, partitioned into four 2x2 tiles computed by Block(0,0), Block(1,0), Block(0,1) and Block(1,1)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 77 Revised Matrix Multiplication Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}
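
To see that the Row and Col formulas in this kernel assign every Pd element to exactly one thread, the index arithmetic can be replayed on the CPU. This is a plain C sketch, not device code. The helper count_element_owners is hypothetical, the nested loops stand in for blockIdx and threadIdx, and WIDTH is assumed to be a multiple of the tile width.

```c
/* Replay the Row/Col computation of the multi-block kernel on the CPU
 * and count how many (block, thread) pairs map to each Pd element.
 * hits must hold width * width ints. */
void count_element_owners(int width, int tile_width, int *hits)
{
    int grid = width / tile_width;   /* blocks per grid side */

    for (int i = 0; i < width * width; ++i)
        hits[i] = 0;

    for (int by = 0; by < grid; ++by)                      /* blockIdx.y  */
        for (int bx = 0; bx < grid; ++bx)                  /* blockIdx.x  */
            for (int ty = 0; ty < tile_width; ++ty)        /* threadIdx.y */
                for (int tx = 0; tx < tile_width; ++tx) {  /* threadIdx.x */
                    int row = by * tile_width + ty;
                    int col = bx * tile_width + tx;
                    ++hits[row * width + col];
                }
}
```

For the 4x4, TILE_WIDTH = 2 case, every entry of hits ends up equal to 1: no element is skipped and no two threads write the same element.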

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 78 G80 Block Granularity Considerations
For Matrix Multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks?
–For 8x8, we have 64 threads per Block. Since each SM can take up to 768 threads, the thread limit alone would allow 12 Blocks. However, because each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
–For 16x16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
–For 32x32, we have 1024 threads per Block. Not even one can fit into an SM!
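
The three cases on this slide follow one rule, sketched here in plain C. The struct and function names are hypothetical. Only the two limits quoted on the slide (768 threads and 8 blocks per SM) are modeled; registers and shared memory are deliberately ignored.

```c
/* Resident blocks and threads on one SM, limited only by the thread
 * cap and the block cap. */
typedef struct {
    int blocks;   /* thread blocks resident on the SM */
    int threads;  /* threads resident on the SM */
} SmOccupancy;

SmOccupancy sm_occupancy(int block_side, int max_threads_per_sm,
                         int max_blocks_per_sm)
{
    SmOccupancy occ = {0, 0};
    int threads_per_block = block_side * block_side;

    /* A block that exceeds the thread cap cannot be scheduled at all. */
    if (threads_per_block > max_threads_per_sm)
        return occ;

    occ.blocks = max_threads_per_sm / threads_per_block;
    if (occ.blocks > max_blocks_per_sm)
        occ.blocks = max_blocks_per_sm;
    occ.threads = occ.blocks * threads_per_block;
    return occ;
}
```

Calling sm_occupancy(8, 768, 8) gives 8 blocks and 512 threads, sm_occupancy(16, 768, 8) gives 3 blocks and 768 threads, and sm_occupancy(32, 768, 8) gives 0 blocks, matching the three bullets.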

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 79 How about performance on G80?
All threads access global memory for their input matrix elements
–Two memory accesses (8 bytes) per floating point multiply-add (2 floating point operations)
–4 B/s of memory bandwidth per FLOPS
–4*346.5 = 1386 GB/s required to achieve peak FLOP rating
–86.4 GB/s limits the code at 21.6 GFLOPS
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
[Figure: CUDA memory model with Host, Global Memory and Constant Memory at Grid level; Block (0, 0) and Block (1, 0) each have Shared Memory and per-thread Registers for Thread (0, 0) and Thread (1, 0)]
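
The arithmetic on this slide can be written out as three one-line functions in plain C. The function names are hypothetical; the inputs in the usage note are the figures quoted on the slide, with a multiply-add counted as 2 floating point operations.

```c
/* Bytes of global memory traffic per floating-point operation. */
double bytes_per_flop(double bytes_per_iteration, double flops_per_iteration)
{
    return bytes_per_iteration / flops_per_iteration;
}

/* Highest GFLOPS rate that the memory system can keep fed. */
double bandwidth_bound_gflops(double bandwidth_gb_per_s,
                              double bytes_per_flop_ratio)
{
    return bandwidth_gb_per_s / bytes_per_flop_ratio;
}

/* Bandwidth in GB/s needed to sustain a given GFLOPS rate. */
double required_bandwidth_gb_per_s(double gflops, double bytes_per_flop_ratio)
{
    return gflops * bytes_per_flop_ratio;
}
```

With 8 bytes per multiply-add, bytes_per_flop(8, 2) is 4. Feeding that ratio and 86.4 GB/s into bandwidth_bound_gflops reproduces the 21.6 GFLOPS ceiling, and required_bandwidth_gb_per_s(346.5, 4) reproduces the 1386 GB/s figure.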

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 81 Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        // Wait until every thread in the block has loaded its elements
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        // Wait until all threads are done with this tile before it is overwritten
        __syncthreads();
    }
    Pd[Row*Width+Col] = Pvalue;
}
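
One way to check the tile indexing of this kernel without a GPU is to replay it on the CPU and compare against the plain triple loop. This is a plain C sketch, not device code. The function names are hypothetical, TILE_WIDTH is fixed at 2 only to keep the example small, and Width is assumed to be a multiple of TILE_WIDTH. Each (by, bx) iteration plays the role of one thread block, and finishing the load loops before the compute loops stands in for the barrier.

```c
#define TILE_WIDTH 2   /* small tile so a 4x4 input needs two tile steps */

/* Straightforward triple loop, used as the reference result. */
void matmul_reference(const float *M, const float *N, float *P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[i * width + k] * N[k * width + j];
            P[i * width + j] = sum;
        }
}

/* CPU replay of the tiled kernel; Mds and Nds stand in for shared memory. */
void matmul_tiled_emulated(const float *M, const float *N, float *P, int width)
{
    int tiles = width / TILE_WIDTH;

    for (int by = 0; by < tiles; ++by)
        for (int bx = 0; bx < tiles; ++bx) {
            float Mds[TILE_WIDTH][TILE_WIDTH];
            float Nds[TILE_WIDTH][TILE_WIDTH];
            float Pvalue[TILE_WIDTH][TILE_WIDTH] = {{0.0f}};

            for (int m = 0; m < tiles; ++m) {
                /* Phase 1: every "thread" loads one element of each tile. */
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                        int row = by * TILE_WIDTH + ty;
                        int col = bx * TILE_WIDTH + tx;
                        Mds[ty][tx] = M[row * width + (m * TILE_WIDTH + tx)];
                        Nds[ty][tx] = N[col + (m * TILE_WIDTH + ty) * width];
                    }
                /* Phase 2: every "thread" accumulates its partial dot product. */
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx)
                        for (int k = 0; k < TILE_WIDTH; ++k)
                            Pvalue[ty][tx] += Mds[ty][k] * Nds[k][tx];
            }

            /* Each "thread" writes its finished element back to P. */
            for (int ty = 0; ty < TILE_WIDTH; ++ty)
                for (int tx = 0; tx < TILE_WIDTH; ++tx)
                    P[(by * TILE_WIDTH + ty) * width + bx * TILE_WIDTH + tx] =
                        Pvalue[ty][tx];
        }
}
```

Running both functions on the same 4x4 inputs and comparing the outputs element by element exercises the same index expressions as the kernel, which makes an off-by-one in the tile offsets easy to spot.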

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 82 Tiled Multiply
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
Each thread computes one element of Pdsub
[Figure: Md, Nd and Pd of width WIDTH divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) owns Pdsub and thread (tx, ty), with tx and ty in 0..TILE_WIDTH-1, owns one element of it; m steps over the tiles of Md and Nd, k over the elements within a tile]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 83 G80 Shared Memory and Threading
Each SM in G80 has 16KB shared memory
–Shared memory size is implementation dependent!
–For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
–Can potentially have up to 8 Thread Blocks actively executing
–This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
–The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time
Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
–The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
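
The shared memory budget and the bandwidth estimate on this slide can be reproduced with a few helpers in plain C. The function names are hypothetical. The sketch assumes two float tiles per block, as in the tiled kernel, and that each element loaded from global memory is reused TILE_WIDTH times.

```c
/* Shared memory used by one thread block: two tile x tile float arrays. */
int tile_shared_bytes(int tile_width)
{
    return 2 * tile_width * tile_width * (int)sizeof(float);
}

/* Thread blocks per SM as limited by shared memory alone. */
int blocks_limited_by_shared(int tile_width, int shared_bytes_per_sm)
{
    return shared_bytes_per_sm / tile_shared_bytes(tile_width);
}

/* Bandwidth-bound GFLOPS once global memory traffic drops by tile_width. */
double tiled_bound_gflops(double bandwidth_gb_per_s, double bytes_per_flop,
                          int tile_width)
{
    return bandwidth_gb_per_s / bytes_per_flop * tile_width;
}
```

With 16KB per SM, blocks_limited_by_shared(16, 16 * 1024) returns 8 and blocks_limited_by_shared(32, 16 * 1024) returns 2. This bound covers shared memory only: the 768-thread cap from the block granularity slide still limits 16x16 blocks to 3 per SM.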
