1
GPU Programming with CUDA
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
2
What is GPGPU?
- General-purpose computation using the GPU and graphics API in applications other than 3D graphics; the GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating point (FP) computation
- Applications (see GPGPU.org): game effects (FX), physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Note: Mark Harris of NVIDIA runs the gpgpu.org website.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
3
CUDA “Compute Unified Device Architecture”
- General-purpose programming model: the user kicks off batches of threads on the GPU; the GPU is a dedicated, super-threaded, massively data-parallel co-processor
- Targeted software stack: compute-oriented drivers, language, and tools
- Driver for loading computation programs onto the GPU: standalone driver optimized for computation; interface designed for compute (graphics-free API); data sharing with OpenGL buffer objects; guaranteed maximum download and readback speeds; explicit GPU memory management
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
4
Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications, and are available in laptops, desktops, and clusters
- GPU parallelism is doubling every year, and the programming model scales transparently
- Programmable in C with the CUDA tools
- Multithreaded SPMD model uses both data parallelism and thread parallelism in the application
(Pictured: GeForce 8800, Tesla D870, Tesla S870)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
5
CUDA – C with no shader limitations!
- Integrated host+device C application program: serial or modestly parallel parts run in host C code; highly parallel parts run in device SPMD kernel C code
- Execution alternates between the two:
    Serial Code (host)
    Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
    Serial Code (host)
    Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
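A minimal sketch of this host/device structure, assuming the placeholder names from the figure (KernelA, KernelB, nBlk, nTid) and an invented array size; it is an illustration, not the course's own example:

    #include <cuda_runtime.h>
    #define N 1024

    __global__ void KernelA(float *data) { /* highly parallel part A */ }
    __global__ void KernelB(float *data) { /* highly parallel part B */ }

    int main() {
        const int nBlk = 4, nTid = 256;   // assumed grid and block sizes
        float *d_data;                    // device pointer
        cudaMalloc((void**)&d_data, N * sizeof(float));
        // ... serial host code ...
        KernelA<<<nBlk, nTid>>>(d_data);  // parallel kernel on the device
        // ... serial host code ...
        KernelB<<<nBlk, nTid>>>(d_data);
        cudaFree(d_data);
        return 0;
    }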
6
CUDA Devices and Threads
- A compute device: is a coprocessor to the CPU (the host); has its own DRAM (device memory); runs many threads in parallel; is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels, which run on many threads
- Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead; the GPU needs 1000s of threads for full efficiency, while a multi-core CPU needs only a few
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
7
G80 CUDA mode – A Device Example
- Processors execute computing threads
- New operating mode/HW interface for computing
(Block diagram: Host feeds an Input Assembler and Thread Execution Manager; arrays of processors with Parallel Data Caches and Texture units; Load/store paths to Global Memory.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
8
Extended C
- Declspecs: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, and execution management
- Function launch

    __device__ float filter[N];

    __global__ void convolve (float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
9
Extended C
Compilation flow:
- Integrated source (foo.cu) goes into cudacc (EDG C/C++ frontend, Open64 Global Optimizer)
- cudacc emits GPU assembly (foo.s) and CPU host code (foo.cpp)
- foo.s goes through OCG (Optimized Code Generation) to produce G80 SASS (foo.sass); SASS = SPA assembly language
- foo.cpp is compiled with gcc / cl
Mark Murphy, "NVIDIA's Experience with Open64"
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
10
CUDA API Highlights: Easy and Lightweight
- The API is an extension to the ANSI C programming language: low learning curve
- The hardware is designed to enable a lightweight runtime and driver: high performance
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
11
CUDA Thread Block
- All threads in a block execute the same kernel program (SPMD)
- The programmer declares the block: block size of 1 to 512 concurrent threads; block shape of 1D, 2D, or 3D; block dimensions in threads
- Threads have thread id numbers within the block; the thread program uses the thread id to select work and address shared data
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate; each block can execute in any order relative to other blocks!
(Figure: a CUDA thread block with thread ids 0 … m running the thread program. Courtesy: John Nickolls, NVIDIA)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
12
Thread Blocks: Scalable Cooperation
- Divide a monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
- Threads in different blocks cannot cooperate
(Figure: Thread Block 0 through Thread Block N - 1, each running the same thread program over its own threadIDs 0..7:)
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
13
Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks
(Figure: the same kernel grid of Block 0 … Block 7 runs two blocks at a time on a small device and four blocks at a time on a larger device.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
14
Block IDs and Thread IDs
- Each thread uses its IDs to decide what data to work on: Block ID (1D or 2D) and Thread ID (1D, 2D, or 3D)
- This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, …
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
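For instance, a minimal sketch (the kernel name, image layout, and block dimensions are illustrative assumptions, not from the slides) of a kernel that combines block IDs and thread IDs into a global 2D index for image processing:

    __global__ void scalePixels(float *image, int width, int height, float factor) {
        // Global 2D index built from the block ID and the thread ID
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {       // guard against partial blocks at the edges
            image[y * width + x] *= factor;
        }
    }

    // Launch with a 2D grid of 2D blocks, e.g. 16x16 threads per block:
    // dim3 dimBlock(16, 16);
    // dim3 dimGrid((width + 15) / 16, (height + 15) / 16);
    // scalePixels<<<dimGrid, dimBlock>>>(d_image, width, height, 0.5f);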
15
G80 Example: Executing Thread Blocks
- Threads are assigned to Streaming Multiprocessors (SMs) at block granularity: up to 8 blocks per SM, as resources allow; an SM in G80 can take up to 768 threads (could be 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.)
- Threads run concurrently: the SM maintains thread/block id #s and manages/schedules thread execution
- Flexible resource allocation
(Figure: SM 0 and SM 1, each with an MT IU, SPs, and Shared Memory, holding blocks of threads t0 t1 t2 … tm.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
16
G80 Example: Thread Scheduling
- Each block is executed as 32-thread warps; this is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps
(Figure: Block 1 and Block 2 warps of threads t0 t1 t2 … t31 feeding a Streaming Multiprocessor with an instruction L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
17
G80 Example: Thread Scheduling
- The SM implements zero-overhead warp scheduling
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution using a prioritized scheduling policy
- All threads in a warp execute the same instruction when it is selected
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
18
Terminology
- Thread: concurrent code and its associated state, executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA
- Warp: a group of threads executed physically in parallel in G80
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
19
Memories © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL, University of Illinois, Urbana Champaign
20
CUDA Memory Model Overview
- Global memory is the main means of communicating R/W data between host and device: its contents are visible to all threads, but accesses have long latency
- We will focus on global memory for now; constant and texture memory will come later
(Figure: the host talks to global memory on the device; each block in the grid has its own shared memory, and each thread has its own registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
21
CUDA Device Memory Allocation
- cudaMalloc(): allocates an object in device global memory; requires two parameters, the address of a pointer to the allocated object and the size of the allocated object
- cudaFree(): frees an object from device global memory, given a pointer to the freed object
(Figure: the same memory-model diagram: host, global memory, per-block shared memory, per-thread registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
22
CUDA Device Memory Allocation (cont.)
Code example: allocate a 64 * 64 single-precision float array and attach the allocated storage to Md ("d" is often used to indicate a device data structure):

    int TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    ...
    cudaFree(Md);

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
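A hedged variant of the same allocation with basic error checking; checking the cudaError_t return value with cudaGetErrorString() is a common practice, not something the slide requires:

    #include <stdio.h>

    int main() {
        float* Md;
        int size = 64 * 64 * sizeof(float);

        cudaError_t err = cudaMalloc((void**)&Md, size);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // ... use Md ...
        cudaFree(Md);
        return 0;
    }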
23
CUDA Host-Device Data Transfer
- cudaMemcpy(): memory data transfer; requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device)
- Asynchronous transfer
(Figure: the same memory-model diagram: host, global memory, per-block shared memory, per-thread registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
24
CUDA Host-Device Data Transfer (cont.)
Code example: transfer a 64 * 64 single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants:

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
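Putting the previous two slides together, a minimal host-side sketch; the kernel (scaleKernel), its work, and the launch configuration are illustrative assumptions:

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *Md, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) Md[i] *= 2.0f;            // double every element
    }

    int main() {
        const int n = 64 * 64;
        int size = n * sizeof(float);
        float M[64 * 64];                    // host array
        for (int i = 0; i < n; ++i) M[i] = (float)i;

        float *Md;
        cudaMalloc((void**)&Md, size);                      // allocate device memory
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);    // host -> device
        scaleKernel<<<(n + 255) / 256, 256>>>(Md, n);       // launch the kernel
        cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);    // device -> host
        cudaFree(Md);
        return 0;
    }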
25
CUDA Function Declarations
    Declaration                       Executed on the:   Only callable from the:
    __device__ float DeviceFunc()     device             device
    __global__ void KernelFunc()      device             host
    __host__   float HostFunc()       host               host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
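A small sketch of these qualifiers in use; the function names and bodies are illustrative:

    // Callable from host and device, executed on whichever side calls it
    __host__ __device__ float square(float x) { return x * x; }

    // Executed on the device, callable only from device code (e.g. from a kernel)
    __device__ float DeviceFunc(float x) { return square(x) + 1.0f; }

    // Kernel: executed on the device, launched from the host, must return void
    __global__ void KernelFunc(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = DeviceFunc(data[i]);
    }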
26
CUDA Function Declarations (cont.)
- __device__ functions cannot have their address taken
- For functions executed on the device: no recursion, no static variable declarations inside the function, and no variable number of arguments
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
27
Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
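A sketch of blocking on kernel completion, continuing the configuration above; cudaThreadSynchronize() was the runtime call of this CUDA generation (later superseded by cudaDeviceSynchronize()):

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
    // The launch returns immediately; the host can do other work here.
    cudaThreadSynchronize();   // block the host until the kernel has finished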
28
G80 Implementation of CUDA Memories
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
Global, constant, and texture memory spaces are persistent across kernels called by the same application.
(Figure: host, global memory, and constant memory on the device; each block has shared memory, and each thread has registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
29
CUDA Variable Type Qualifiers
    Variable declaration                        Memory     Scope    Lifetime
    __device__ __local__    int LocalVar;       local      thread   thread
    __device__ __shared__   int SharedVar;      shared     block    block
    __device__              int GlobalVar;      global     grid     application
    __device__ __constant__ int ConstantVar;    constant   grid     application

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
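A sketch showing where each kind of declaration typically appears; the variable names and the kernel's arithmetic are illustrative:

    __device__   float GlobalVar;       // global memory, grid scope, application lifetime
    __constant__ float ConstantVar;     // constant memory, grid scope, application lifetime

    __global__ void qualifierDemo(float *out) {
        __shared__ float SharedVar[64]; // shared memory, block scope, block lifetime
        float LocalVar = ConstantVar;   // automatic scalar: lives in a register
        float LocalArray[8];            // automatic array: resides in local memory
        for (int i = 0; i < 8; ++i) LocalArray[i] = GlobalVar;

        SharedVar[threadIdx.x % 64] = LocalVar + LocalArray[0];
        out[threadIdx.x] = SharedVar[threadIdx.x % 64];
    }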
30
Where to Declare Variables?
Can the host access it?
- Yes → declare outside of any function: global or constant
- No → declare in the kernel: register (automatic), shared, or local
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
31
Variable Type Restrictions
Pointers can only point to memory allocated or declared in global memory:
- Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
- Obtained as the address of a global variable: float* ptr = &GlobalVar;
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
32
A Common Programming Strategy
- Global memory resides in device memory (DRAM), which is much slower to access than shared memory
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
- Partition data into subsets that fit into shared memory
- Handle each data subset with one thread block by: loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism; performing the computation on the subset from shared memory, so each thread can efficiently make multiple passes over any data element; and copying the results from shared memory back to global memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
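A minimal sketch of this load/compute/store tiling pattern, using an invented 1D moving-average kernel; TILE is an assumed tile size, and halo handling across tile boundaries is omitted for brevity:

    #define TILE 256

    __global__ void tiledAverage(const float *in, float *out, int n) {
        __shared__ float tile[TILE];                 // fast on-chip storage

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. Load the subset from global memory into shared memory (one element per thread)
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();                             // wait until the whole tile is loaded

        // 2. Compute from shared memory; each element may be read several times cheaply
        float left   = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right  = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        float result = (left + tile[threadIdx.x] + right) / 3.0f;

        // 3. Copy the result back to global memory
        if (gid < n) out[gid] = result;
    }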
33
A Common Programming Strategy (Cont.)
- Constant memory also resides in device memory (DRAM), with much slower access than shared memory, but it is cached! Highly efficient access for read-only data
- Carefully divide data according to access patterns:
- Read-only → constant memory (very fast if in cache)
- R/W, shared within a block → shared memory (very fast)
- R/W within each thread → registers (very fast)
- R/W inputs/results → global memory (very slow)
- For texture memory usage, see the NVIDIA documentation
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
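A sketch of declaring read-only data in constant memory and filling it from the host with cudaMemcpyToSymbol(); the filter-coefficient use case is an assumption, not from the slides:

    __constant__ float coeffs[16];          // read-only data, cached on chip

    __global__ void applyCoeffs(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeffs[i % 16];   // every thread reads the cached constants
    }

    // Host side: copy the coefficients into constant memory before launching, e.g.
    // float h_coeffs[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));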
34
GPU Atomic Integer Operations
- Atomic operations on integers in global memory:
- Associative operations on signed/unsigned ints: add, sub, min, max, ...
- and, or, xor
- Increment, decrement
- Exchange, compare-and-swap
- Requires hardware with compute capability 1.1 and above
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign
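As an illustration, a minimal histogram sketch using atomicAdd() on 32-bit integers in global memory (the histogram use case is an assumption; the atomic itself is the documented compute-capability-1.1 feature):

    __global__ void histogram(const unsigned char *input, int n, unsigned int *bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Many threads may hit the same bin; atomicAdd serializes those updates safely
            atomicAdd(&bins[input[i]], 1u);
        }
    }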
35
SM Register File
- Register File (RF): 32 KB (8K entries) for each SM in G80
- The TEX pipe can also read/write the RF (2 SMs share 1 TEX)
- The Load/Store pipe can also read/write the RF
(Figure: SM pipeline with I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD, and SFU units.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
36
Programmer View of Register File
- There are 8192 registers in each SM in G80; this is an implementation decision, not part of CUDA
- Registers are dynamically partitioned across all blocks assigned to the SM
- Once assigned to a block, a register is NOT accessible by threads in other blocks
- Each thread in the same block only accesses registers assigned to itself
(Figure: the same register file partitioned across 4 blocks versus 3 blocks.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
37
Example: if each block has 16 X 16 threads and each thread uses 10 registers, how many threads can run on each SM?
- Each block requires 10 * 256 = 2560 registers
- 8192 = 3 * 2560 + change
- So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases its use of registers by 1?
- Each block now requires 11 * 256 = 2816 registers
- 8192 < 2816 * 3
- Only two blocks can run on an SM, a 1/3 reduction of parallelism!
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
38
More on Dynamic Partitioning
- Dynamic partitioning gives more flexibility to compilers and programmers
- One can run a smaller number of threads that each require many registers, or a larger number of threads that each require few registers
- This allows for finer-grained threading than traditional CPU threading models
- The compiler can trade off between instruction-level parallelism and thread-level parallelism
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
39
ILP vs. TLP example: assume a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles.
- 3 blocks can run on each SM
If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load:
- Only two blocks can run on each SM
- However, only 200/(8*4) = 7 warps are needed to tolerate the memory latency, and two blocks have 16 warps
- The performance can actually be higher!
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
40
Memory Coalescing
- When accessing global memory, peak performance utilization occurs when all threads in a half warp access consecutive memory locations
(Figure: access patterns of Thread 1 and Thread 2 over matrices Md and Nd of width WIDTH, contrasting not coalesced versus coalesced accesses.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
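A sketch contrasting the two access patterns; the kernels, array names, and stride parameter are invented for illustration:

    // Coalesced: thread k of a half warp reads element base + k,
    // so the 16 threads touch 16 consecutive 32-bit words.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Not coalesced: neighbouring threads read elements that are `stride` words apart,
    // so the half warp touches widely separated locations.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }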
41
Parallel Memory Architecture
- In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth
- Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict, and the conflicting accesses are serialized
(Figure: Bank 0 through Bank 15.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
42
Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation
(Figure: Thread 0 … Thread 15 mapped onto Bank 0 … Bank 15 in each case.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
43
Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
(Figure: Thread 0 … Thread 15 mapped onto Bank 0 … Bank 15 for each stride.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
44
How addresses map to banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16; 16 is also the size of a half-warp
- No bank conflicts between different half-warps, only within a single half-warp
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
45
Shared memory bank conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case: a bank conflict, where multiple threads in the same half-warp access the same bank; the accesses must be serialized, and the cost = the maximum number of simultaneous accesses to a single bank
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign
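A sketch of the fast and slow cases inside a kernel; the array, its size, and the assumed block size are illustrative:

    __global__ void bankExamples(float *out) {
        __shared__ float buf[512];
        int t = threadIdx.x;                 // assume blockDim.x <= 256
        buf[2 * t]     = (float)t;
        buf[2 * t + 1] = 0.0f;
        __syncthreads();

        float a = buf[t];        // fast: stride 1, each thread of a half-warp hits a different bank
        float b = buf[0];        // fast: all threads read the identical address (broadcast)
        float c = buf[2 * t];    // slow: stride 2, a 2-way bank conflict within each half-warp

        out[t] = a + b + c;
    }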
46
Linear Addressing
Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks: 16 on G80, so s must be odd.
(Figure: Thread 0 … Thread 15 mapped onto Bank 0 … Bank 15, including the s=3 case.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign