1
CUDA and the Memory Model (Part II)
2
Code executed on GPU
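The body of this slide is not captured in the transcript. As a minimal sketch of what "code executed on GPU" means, a __global__ function (a kernel) runs on the device and is called from the host with an execution configuration; the kernel name and sizes below are illustrative:

    // A __global__ function runs on the GPU; each thread executes the same code
    // on its own data element.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            data[i] *= factor;
    }

    // Launched from host code with an execution configuration:
    //   scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);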
3
Variable Qualifiers (GPU Code)
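The qualifier list itself is missing from the transcript; the sketch below shows the standard GPU-code qualifiers (__device__, __constant__, __shared__) in use, with illustrative names, assuming blocks of at most 256 threads:

    __constant__ float coeffs[16];     // __constant__: constant memory, read-only in kernels
    __device__   float bias;           // __device__: global (device) memory, visible to all threads

    __global__ void qualifiers_demo(float *out)
    {
        __shared__ float tile[256];    // __shared__: on-chip memory, one copy per thread block
        int i = threadIdx.x;           // assumes blockDim.x <= 256
        tile[i] = coeffs[i % 16] + bias;
        __syncthreads();
        out[blockIdx.x * blockDim.x + i] = tile[i];
    }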
4
CUDA: Features available to kernels
Standard mathematical functions: sinf, powf, atanf, ceil, etc.
Built-in vector types: float4, int4, uint4, etc. for dimensions 1…4
Texture accesses in kernels:
  texture my_texture;   // declare texture reference
  float4 texel = texfetch(my_texture, u, v);
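An illustrative kernel combining these features is sketched below. Note that texfetch() is the very early CUDA name; later toolkits fetch from a 2D texture reference with tex2D() (and texture references themselves were removed in CUDA 12 in favour of texture objects). The kernel name and indexing are assumptions, not from the slide:

    texture<float4, 2, cudaReadModeElementType> my_texture;     // texture reference (pre-CUDA 12 style)

    __global__ void shade(float4 *out, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        float  s     = sinf(0.1f * x) + powf(0.5f, (float)y);   // standard math functions
        float4 texel = tex2D(my_texture, x + 0.5f, y + 0.5f);   // texture access in a kernel
        out[y * width + x] = make_float4(texel.x * s, texel.y, texel.z, texel.w);
    }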
5
Thread Synchronization function
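Presumably this slide covers __syncthreads(), the barrier across all threads of a block; a minimal sketch, assuming a single block of at most 256 threads:

    __global__ void reverse_in_block(float *data)
    {
        __shared__ float tmp[256];
        int i = threadIdx.x;
        tmp[i] = data[i];
        __syncthreads();                      // wait until every thread in the block has written tmp
        data[i] = tmp[blockDim.x - 1 - i];    // now it is safe to read other threads' elements
    }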
6
Host Synchronization (for Kalin…)
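A sketch of host-side synchronization, assuming a placeholder kernel named my_kernel: kernel launches return immediately, so the host must block explicitly before it depends on the results.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *d) { d[threadIdx.x] += 1.0f; }

    void run(float *d_data)
    {
        my_kernel<<<1, 64>>>(d_data);   // launch is asynchronous: control returns to the host at once
        cudaDeviceSynchronize();        // block the host until all preceding GPU work has finished
                                        // (older toolkits used cudaThreadSynchronize())
    }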
7
Thread Blocks must be independent
9
Example: Increment Array Elements
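The slide body is not in the transcript; this is the usual form of the increment-array-elements kernel from NVIDIA's introductory decks (the names are as commonly shown, not confirmed by the transcript):

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per array element
        if (idx < N)                                       // guard against the final partial block
            a[idx] = a[idx] + b;
    }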
11
Example: Host Code
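Again the body is not captured; a sketch of the matching host code (allocate, copy in, launch, copy out), reusing the increment_gpu kernel above:

    #include <cuda_runtime.h>

    void increment_on_gpu(float *h_a, float b, int N)
    {
        float *d_a = 0;
        size_t bytes = N * sizeof(float);

        cudaMalloc((void **)&d_a, bytes);                       // allocate device memory
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // copy input to the GPU

        int blockSize = 256;
        int numBlocks = (N + blockSize - 1) / blockSize;        // round up
        increment_gpu<<<numBlocks, blockSize>>>(d_a, b, N);

        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // copy result back (blocks until done)
        cudaFree(d_a);
    }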
12
CUDA Error Reporting to CPU
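A sketch of the error-reporting pattern: kernel launches do not return an error directly, so the host queries cudaGetLastError() and turns the code into a message with cudaGetErrorString(). The helper name is illustrative:

    #include <stdio.h>
    #include <cuda_runtime.h>

    void check_cuda(const char *where)
    {
        cudaError_t err = cudaGetLastError();            // returns and clears the last error
        if (err != cudaSuccess)
            printf("CUDA error at %s: %s\n", where, cudaGetErrorString(err));
    }

    // Usage:  my_kernel<<<grid, block>>>(args);
    //         check_cuda("my_kernel launch");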
13
CUDA Event API (someone asked about this…)
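A sketch of timing a kernel with the event API; the kernel and launch configuration are placeholders:

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *d) { d[threadIdx.x] *= 2.0f; }

    float time_kernel(float *d_data)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);                 // record into the default stream
        my_kernel<<<1, 256>>>(d_data);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                // wait for the stop event to complete

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // elapsed time between the two events, in ms

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }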
14
Shared Memory
On-chip
  - Two orders of magnitude lower latency than global memory
  - An order of magnitude higher bandwidth than global memory
16 KB per multiprocessor; NVIDIA GPUs contain up to ~30 multiprocessors
Allocated per thread block
  - Accessible by any thread in the thread block
  - Not accessible to other thread blocks
Several uses:
  - Sharing data among threads in a thread block
  - User-managed cache (reducing global memory accesses)
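As a sketch of the "user-managed cache" use, the kernel below stages a tile of the input in shared memory so each neighbouring read comes from on-chip memory rather than global memory. The adjacent-difference operation and names are illustrative, and the kernel assumes 256-thread blocks:

    #define TILE 256

    __global__ void adjacent_diff(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        if (gid < n)
            tile[threadIdx.x] = in[gid];           // one global load per element
        __syncthreads();                           // every thread reaches the barrier

        if (gid >= n) return;
        if (threadIdx.x == 0)                      // block boundary: left neighbour is outside the tile
            out[gid] = (gid == 0) ? in[0] : in[gid] - in[gid - 1];
        else
            out[gid] = tile[threadIdx.x] - tile[threadIdx.x - 1];   // neighbour read served from shared memory
    }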
15
Using Shared Memory (slide courtesy of NVIDIA: Timo Stich)
16
Using Shared Memory
19
Thread Counts
More threads per block are better for time slicing
  - Minimum: 64; ideal: 192–256
More threads per block means fewer registers per thread
  - Kernel invocation may fail if the kernel compiles to more registers than are available
Threads within a block can be synchronized
  - Important for SIMD efficiency
The grid is limited to roughly 64K blocks per dimension, which bounds the total number of threads per grid
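A launch configuration following these guidelines might look like the fragment below (256 threads per block, enough blocks to cover N elements); it reuses the increment_gpu kernel and the d_a, b, N variables from the earlier example slides:

    int threadsPerBlock = 256;                                    // within the 192-256 "ideal" range
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element is covered
    increment_gpu<<<numBlocks, threadsPerBlock>>>(d_a, b, N);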
20
Block Counts
There should be at least as many blocks as multiprocessors
  - The number of blocks should be at least 100 to scale to future generations
Blocks within a grid cannot be synchronized
Blocks can only be swapped by partitioning registers and shared memory among them