1
Intermediate GPGPU Programming in CUDA
Supada Laosooksathit
2
NVIDIA Hardware Architecture
- Host memory
- Terminologies
- Global memory
- Shared memory
- SMs
3
Recall: 5 Steps for CUDA Programming
1. Initialize device
2. Allocate device memory
3. Copy data to device memory
4. Execute kernel
5. Copy data back from device memory
4
Initialize Device Calls
To select the device associated with the host thread: cudaSetDevice(device). This function must be called before any __global__ function is launched; otherwise device 0 is selected automatically.
To get the number of devices: cudaGetDeviceCount(&deviceCount)
To retrieve a device's properties: cudaGetDeviceProperties(&deviceProp, device)
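A minimal sketch of these calls in context (error checking omitted for brevity):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);      // number of CUDA devices present
        printf("%d CUDA device(s) found\n", deviceCount);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);     // properties of device 0
        printf("Device 0: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);

        cudaSetDevice(0);                      // bind this host thread to device 0
        return 0;                              // must precede any kernel launch
    }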
5
Hello World Example: allocate host and device memory
6
Hello World Example: host code
7
Hello World Example: kernel code
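The code from these three slides was not captured with the text. A minimal stand-in that allocates host and device memory, launches a kernel, and copies the result back might look like this (the kernel name hello and the buffer names h_out/d_out are invented for this sketch):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 16

    // Kernel code: each thread writes its global index into the output array.
    __global__ void hello(int* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = i;
    }

    int main(void) {
        int h_out[N];                       // host memory
        int* d_out;                         // device memory
        cudaMalloc((void**)&d_out, N * sizeof(int));

        hello<<<2, N / 2>>>(d_out);         // 2 blocks of N/2 threads each

        // Copy the result back from device memory.
        cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

        printf("Hello World from the GPU:");
        for (int i = 0; i < N; i++) printf(" %d", h_out[i]);
        printf("\n");

        cudaFree(d_out);
        return 0;
    }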
8
To Try CUDA Programming
SSH to the server, then set environment variables in .bashrc in your home directory:
    export PATH=$PATH:/usr/local/cuda/bin
    export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
Compile the following directories:
- NVIDIA_GPU_Computing_SDK/shared/
- NVIDIA_GPU_Computing_SDK/C/common/
The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
9
Demo: Hello World (print out block and thread IDs) and Vector Add (C = A + B).
Speaker note: show some real demos, the one above plus additional ones in the SDK directories.
10
CUDA Language Concepts
- CUDA programming model
- CUDA memory model
11
Some Terminologies
- Device = GPU = set of streaming multiprocessors
- Streaming Multiprocessor (SM) = set of processors & shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
12
CUDA Programming Model
Parallel code (a kernel) is launched and executed on the device by many threads.
Threads are grouped into thread blocks.
Parallel code is written from the perspective of a single thread:

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
13
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks.
A thread block is a group of threads that can:
- Synchronize their execution
- Communicate via a low-latency shared memory
Grid = all thread blocks for a given launch
15
IDs and Dimensions
Threads: 3D IDs, unique within a block. Two threads from two different blocks cannot cooperate.
Blocks: 2D or 3D IDs (depending on the hardware), unique within a grid.
Dimensions are set at launch time and can differ for each launch.
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
16
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device. Each grid is a 2D array of blocks, Block (0, 0) through Block (2, 1); each block, e.g. Block (1, 1) of Grid 2, is a 2D array of threads, Thread (0, 0) through Thread (4, 2).]
19
CUDA Memory Model Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read-only per-grid constant memory
- Read-only per-grid texture memory
The host can R/W the global, constant, and texture memories.
[Figure: the device memory hierarchy: a grid of blocks, each block with its shared memory and per-thread registers and local memory, plus the grid-wide global, constant, and texture memories accessible from the host.]
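For illustration, a sketch of how each of these memory spaces appears in CUDA C (all names are invented; texture memory is omitted since binding textures takes more setup):

    __constant__ float coeff;   // per-grid constant memory: read-only in kernels,
                                // written by the host (cudaMemcpyToSymbol)
    __device__ float bias;      // per-grid global memory: R/W from all threads

    // Assumes blocks of 64 threads; gOut points into global memory.
    __global__ void memSpaces(float* gOut) {
        __shared__ float tile[64];              // per-block shared memory
        float r = threadIdx.x * 2.0f;           // local variable, normally a register
        tile[threadIdx.x] = r + bias + coeff;   // threads in the block share 'tile'
        __syncthreads();
        gOut[blockIdx.x * 64 + threadIdx.x] = tile[threadIdx.x];
    }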
20
Host memory
21
Device DRAM
Global memory
- Main means of communicating R/W data between host and device
- Contents visible to all threads
Texture and constant memories
- Constants initialized by the host
22
CUDA Global Memory Allocation
cudaMalloc(pointer, memsize)
- Allocates an object in device global memory
- pointer = address of a pointer to the allocated object
- memsize = size of the allocated object in bytes
cudaFree(pointer)
- Frees the object from device global memory
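For example, allocating and freeing 256 floats in global memory:

    float* d_A = NULL;
    cudaMalloc((void**)&d_A, 256 * sizeof(float));  // lives in device global memory
    cudaFree(d_A);                                  // release it when done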
23
CUDA Host-Device Data Transfer
cudaMemcpy()
- Memory data transfer
- Requires four parameters, in this order:
  - Pointer to destination
  - Pointer to source
  - Number of bytes to copy
  - Type of transfer: host to host, host to device, device to host, or device to device
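A short sketch of the two common directions, reusing d_A from the previous slide and an assumed host array h_A of the same size:

    // Host to device: destination first, then source.
    cudaMemcpy(d_A, h_A, 256 * sizeof(float), cudaMemcpyHostToDevice);

    // Device to host.
    cudaMemcpy(h_A, d_A, 256 * sizeof(float), cudaMemcpyDeviceToHost);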
24
CUDA Function Declaration
                                   Executed on the:   Only callable from the:
    __device__ float DeviceFunc()  device             device
    __global__ void  KernelFunc()  device             host
    __host__   float HostFunc()    host               host

__global__ defines a kernel function and must return void.
25
CUDA Function Calls Restrictions
__device__ functions cannot have their address taken.
For functions executed on the device:
- No recursion
- No static variable declarations inside the function
- No variable number of arguments
26
Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
- DimGrid = dimension and size of the grid
- DimBlock = dimension and size of each block
- SharedMemBytes = number of bytes of dynamically allocated shared memory (optional)
- Streams = the associated stream (optional)
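For instance (KernelFunc, its arguments, and stream are placeholders; stream would be a previously created cudaStream_t):

    dim3 dimGrid(16, 16);    // 256 blocks arranged 16 x 16
    dim3 dimBlock(8, 8);     // 64 threads per block arranged 8 x 8

    // Minimal form: shared memory bytes and stream default to 0.
    KernelFunc<<<dimGrid, dimBlock>>>(...);

    // Full form: 256 bytes of dynamic shared memory, launched into 'stream'.
    KernelFunc<<<dimGrid, dimBlock, 256, stream>>>(...);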
27
NVIDIA Hardware Architecture
- Host memory
- Terminologies
- Global memory
- Shared memory
- SMs
28
NVIDIA Hardware Architecture
Terminologies
- Compute capability
- Threads in a block are grouped into warps of 32 threads
- Warps execute on the cores of an SM
29
Specifications of a Device
                         Compute Capability 1.3   Compute Capability 2.0
    Warp size            32                        32
    Max threads/block    512                       1024
    Max blocks/grid      65535                     65535
    Shared memory        16 KB/SM                  48 KB/SM

For more details, run deviceQuery in the CUDA SDK or see Appendix F of the Programming Guide 4.0.
30
Demo: deviceQuery, which shows the hardware specifications in detail.
31
Memory Optimizations
Reduce the time of memory transfers between host and device:
- Use asynchronous memory transfers (CUDA streams)
- Use zero copy
Reduce the number of transactions between on-chip and off-chip memory:
- Memory coalescing
- Avoid bank conflicts in shared memory
32
Reduce Time of Host-Device Memory Transfer
Regular memory transfers are synchronous: cudaMemcpy blocks the host until the copy completes.
33
Reduce Time of Host-Device Memory Transfer
CUDA streams allow kernel execution to overlap with memory copies.
34
CUDA Streams Example
35
CUDA Streams Example
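The example code itself was not captured; a minimal two-stream sketch in the spirit of the SDK's simpleStreams sample follows. The kernel vecAdd and the sizes are assumptions; overlap requires page-locked host memory, hence cudaMallocHost:

    #include <cuda_runtime.h>

    #define N (1 << 20)
    #define HALF (N / 2)

    __global__ void vecAdd(float* a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }

    int main(void) {
        float *h_a, *d_a;
        cudaMallocHost((void**)&h_a, N * sizeof(float));  // page-locked host memory
        cudaMalloc((void**)&d_a, N * sizeof(float));

        cudaStream_t stream[2];
        for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

        // Each stream copies its half in, runs the kernel on it, and copies
        // it back; copies in one stream can overlap kernel work in the other.
        for (int s = 0; s < 2; s++) {
            int off = s * HALF;
            cudaMemcpyAsync(d_a + off, h_a + off, HALF * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            vecAdd<<<HALF / 256, 256, 0, stream[s]>>>(d_a + off, HALF);
            cudaMemcpyAsync(h_a + off, d_a + off, HALF * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
        cudaFree(d_a);
        cudaFreeHost(h_a);
        return 0;
    }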
36
GPU Timers
CUDA events
- An API that uses the GPU clock
- Accurate for timing kernel executions
CUDA timer calls
- Libraries implemented in the CUDA SDK
37
CUDA Events Example
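The slide's code was not captured; a typical event-timing sketch looks like this (someKernel and its launch parameters are placeholders):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // enqueue start event in stream 0
    someKernel<<<grid, block>>>(args);       // the work being timed (placeholder)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait for the stop event to complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);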
38
Demo simpleStreams
39
Reduce Time of Host-Device Memory Transfer
Zero copy
- Allows device pointers to access page-locked host memory directly
- Page-locked host memory is allocated with cudaHostAlloc()
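A sketch of the zero-copy setup (someKernel and N are placeholders; the device must report canMapHostMemory, and the flag must be set before any other CUDA call):

    const int N = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede all other CUDA calls

    float *h_data, *d_data;
    cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // The kernel dereferences d_data, which reads/writes the page-locked
    // host buffer directly; no explicit cudaMemcpy is needed.
    someKernel<<<N / 256, 256>>>(d_data);    // someKernel is a placeholder
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);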
40
Demo Zero copy
41
Reduce number of On-chip and Off-chip Memory Transactions
When threads in a warp access global memory, the hardware applies memory coalescing: it copies a batch of words in a single transaction instead of one word at a time.
42
Memory Coalescing
Threads in a warp access global memory in a straightforward way (one 4-byte word per thread).
43
Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential
44
Memory Coalescing Memory addresses are not aligned in the same segment
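To make the contrast concrete, here are illustrative kernels (not from the slides) showing a coalesced pattern versus a strided one:

    // Coalesced: consecutive threads read consecutive 4-byte words, so a
    // warp's 32 accesses fall into one aligned segment.
    __global__ void coalesced(const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads touch words 32 elements apart, spreading
    // one warp's accesses across many segments (many transactions).
    __global__ void strided(const float* in, float* out) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        out[i] = in[i];
    }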
45
Shared Memory
- 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
- Helps in utilizing memory coalescing
- Bank conflicts may occur when two or more threads in a warp access the same bank
- In compute capability 1.x, no broadcast
- In compute capability 2.x, the same data is broadcast to the many threads that request it
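A common illustration of avoiding conflicts (not from the slides) is padding a shared-memory tile by one column, so threads reading down a column hit distinct banks; sketched here for a 32 x 32-thread block and a square matrix whose side is a multiple of 32:

    // Transpose with a padded tile. The extra column shifts each row by one
    // bank, so the column reads in the second phase are conflict-free.
    __global__ void transposeTile(const float* in, float* out) {
        __shared__ float tile[32][33];          // 33, not 32: padding column
        int w = gridDim.x * 32;                 // matrix width

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];
        __syncthreads();

        x = blockIdx.y * 32 + threadIdx.x;      // transposed block position
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * w + x] = tile[threadIdx.x][threadIdx.y];
    }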
46
Bank Conflicts
[Figure: thread-to-bank mappings. Left: no bank conflict, each thread accesses a distinct bank. Right: 2-way bank conflict, two threads access the same bank and their accesses are serialized.]
47
Matrix Multiplication Example
48
Matrix Multiplication Example
Reduces accesses to global memory:
- A is read (B.width/BLOCK_SIZE) times instead of B.width times
- B is read (A.height/BLOCK_SIZE) times instead of A.height times
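A condensed sketch of the tiled kernel, following the pattern in the CUDA Programming Guide (square matrices of width w, assumed to be a multiple of BLOCK_SIZE):

    #define BLOCK_SIZE 16

    __global__ void matMulShared(const float* A, const float* B, float* C, int w) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        // Walk across the tiles of A and B that contribute to C[row][col].
        for (int t = 0; t < w / BLOCK_SIZE; t++) {
            // Each thread stages one element of each tile into shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * w + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * w + col];
            __syncthreads();                     // wait until both tiles are loaded

            for (int k = 0; k < BLOCK_SIZE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                     // wait before overwriting the tiles
        }
        C[row * w + col] = sum;
    }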
49
Demo Matrix Multiplication With and without shared memory
Different block sizes
50
Control Flow
if, switch, do, for, while
Branch divergence in a warp:
- Threads in a warp take different execution paths
- The different execution paths are serialized
- This increases the number of instructions executed by that warp
51
Branch Divergence
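A classic illustration (not from the slides): branching on the thread index diverges within a warp, while branching on the warp index does not:

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the hardware serializes the two branches for every warp.
    __global__ void divergent(float* d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) d[i] *= 2.0f;
        else                      d[i] += 1.0f;
    }

    // Uniform per warp: all 32 threads of a warp share threadIdx.x / 32,
    // so each warp follows a single path and nothing is serialized.
    __global__ void uniformPerWarp(float* d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0) d[i] *= 2.0f;
        else                             d[i] += 1.0f;
    }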
52
Summary
- 5 steps for CUDA programming
- NVIDIA hardware architecture
  - Memory hierarchy: global memory, shared memory, register file
  - Specifications of a device: block, warp, thread, SM
53
Summary
Memory optimization:
- Reduce host-device memory transfer overhead with CUDA streams and zero copy
- Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (with shared memory)
- Try to avoid bank conflicts in shared memory
Control flow:
- Try to avoid branch divergence within a warp
54
References
NVIDIA CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/