GPU Memory Details, Martin Kruliš (v1.1), 03.11.2016
Overview
[Diagram: GPU memory hierarchy. The GPU chip contains SMPs with cores, registers, and L1 cache; the L2 cache and the off-chip global memory (> 100 GBps) sit on the GPU device. The host CPU and host memory (~ 25 GBps) are connected to the GPU via PCI Express (16/32 GBps).]
Note that details about host memory interconnection are platform specific.
Host-Device Transfers
- PCIe transfers are much slower than internal GPU data transfers.
- They are issued explicitly by the host code: cudaMemcpy(dst, src, size, direction);
  - With one exception: when the GPU memory is mapped into the host memory space.
- The transfer call has significant overhead, so bulk transfers are preferred.
- Overlapping: up to 2 asynchronous transfers may run while the GPU is computing (see the sketch below).
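A minimal sketch of both points, assuming an illustrative kernel name (process) and buffer sizes: the data are processed in chunks, each chunk gets its own stream, so the asynchronous copies of one chunk overlap with the kernel of another. Pinned host memory (cudaMallocHost) is required for the asynchronous copies.

    // Sketch of explicit transfers and copy/compute overlap; names and sizes are illustrative.
    #include <cuda_runtime.h>

    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int N = 1 << 20, CHUNKS = 2, CH = N / CHUNKS;
        float *host, *dev;
        cudaMallocHost(&host, N * sizeof(float));   // pinned host memory enables async copies
        cudaMalloc(&dev, N * sizeof(float));

        cudaStream_t stream[CHUNKS];
        for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&stream[c]);

        // Each chunk: copy in, compute, copy out - the transfers of one chunk
        // overlap with the kernel of the other chunk.
        for (int c = 0; c < CHUNKS; ++c) {
            size_t off = (size_t)c * CH;
            cudaMemcpyAsync(dev + off, host + off, CH * sizeof(float),
                            cudaMemcpyHostToDevice, stream[c]);
            process<<<CH / 256, 256, 0, stream[c]>>>(dev + off, CH);
            cudaMemcpyAsync(host + off, dev + off, CH * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[c]);
        }
        cudaDeviceSynchronize();

        for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(stream[c]);
        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }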
Global Memory
- Properties: off-chip, but on the GPU device.
- High bandwidth and high latency: ~ 100 GBps, 400-600 clock cycles.
- Operated in transactions: continuous aligned segments of 32 B - 128 B.
- The number of transactions depends on the caching model, the GPU architecture, and the memory access pattern.
Global Memory Caching
- Data are cached in the L2 cache, which is relatively small (up to 2 MB on new Maxwell GPUs).
- On CC < 3.0 (Fermi), data are also cached in the L1 cache; configurable by compiler flag:
  - -Xptxas -dlcm=ca (cache always, i.e., also in L1; the default)
  - -Xptxas -dlcm=cg (cache global, i.e., L2 only)
- CC 3.x (Kepler) reserves the L1 cache for local memory caching and register spilling.
- CC 5.x (Maxwell) separates the L1 cache from shared memory and unifies it with the texture cache.
Global Memory
- Coalesced transfers: the number of transactions caused by a global memory access depends on the access pattern.
- Certain access patterns are optimized (see the sketch below):
  - CC 1.x: threads sequentially access an aligned memory block; subsequent threads access subsequent words.
  - CC 2.0 and later: threads access an aligned memory block; accesses within the block can be permuted.
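A sketch contrasting the two extremes, with illustrative kernel and array names: in the coalesced kernel a warp reads 32 consecutive words (served by one or a few transactions), while the strided kernel scatters the reads of a warp across many segments (many transactions).

    // Coalesced vs. non-coalesced global memory access; names are illustrative.
    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];          // thread k reads word k: one transaction per warp segment
    }

    __global__ void strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride]; // a warp touches scattered segments: many transactions
    }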
Global Memory Access Patterns
[Figure: perfectly aligned sequential access]
Global Memory Access Patterns
[Figure: perfectly aligned access with permutation]
Global Memory Access Patterns
[Figure: continuous sequential, but misaligned access]
Global Memory
[Figure: coalesced loads impact]
Shared Memory
- Memory shared by the SM, divided into banks.
  - Each bank can be accessed independently.
  - Consecutive 32-bit words are in consecutive banks.
  - Optionally, a 64-bit word division is used (CC 3.x).
- Bank conflicts are serialized, except for reads of the same address (broadcast).

  Compute capability | Mem. size | # of banks | Latency
  1.x                | 16 kB     | 16         | 32 bits / 2 cycles
  2.x                | 48 kB     | 32         | 32 bits / 2 cycles
  3.x                | 48 kB     | 32         | 64 bits / 1 cycle

Note: in newer architectures (CC 5.x and 6.x), the size of the shared memory may vary a little, but the limit per thread block remains 48 kB.
Shared Memory
- Linear addressing: each thread in a warp accesses a different memory bank.
- No collisions.
Shared Memory
- Linear addressing with stride: each thread accesses the 2*i-th item.
- 2-way conflicts (2x slowdown) on CC < 3.0.
- No collisions on CC 3.x, due to the 64-bit-per-cycle throughput.
Shared Memory
- Linear addressing with stride: each thread accesses the 3*i-th item.
- No collisions, since the number of banks is not divisible by the stride (see the sketch below).
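A sketch of the addressing cases above plus the common padding trick, assuming 32 banks of 32-bit words (CC 2.x); kernel and array names are illustrative.

    // Strided shared memory access and padding; 32 banks of 32-bit words assumed.
    __global__ void strideDemo(float *out)
    {
        __shared__ float data[64];
        for (int i = threadIdx.x; i < 64; i += blockDim.x)
            data[i] = i;                            // linear addressing: no conflicts
        __syncthreads();

        float a = data[(2 * threadIdx.x) % 64];     // stride 2: lanes i and i+16 hit the same bank
                                                    //  -> 2-way conflict on CC < 3.0
        float b = data[(3 * threadIdx.x) % 64];     // stride 3: coprime with 32 banks -> no conflict
        out[threadIdx.x] = a + b;
    }

    // Padding a 2D tile by one column is the usual way to break power-of-two strides
    // (launched with a 32x32 thread block; out holds 1024 floats per block).
    __global__ void paddedTile(float *out)
    {
        __shared__ float tile[32][32 + 1];          // +1 shifts each row by one bank
        tile[threadIdx.y][threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free
    }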
Shared Memory
- Broadcast: one set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20.
- Broadcasts are served independently on CC 1.x, i.e., the situation below causes a 2-way conflict.
- CC 2.x and newer serve broadcasts simultaneously.
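A small sketch of the same situation, assuming one warp of 32 threads and illustrative names: two distinct words are each read by many threads at once.

    // Half of the warp reads a word in bank #12, the other half a word in bank #20.
    __global__ void broadcastDemo(float *out)
    {
        __shared__ float data[32];
        data[threadIdx.x % 32] = threadIdx.x;   // launched with one warp for illustration
        __syncthreads();

        // Two distinct addresses, each read by many threads at once:
        // CC 1.x serves the two broadcasts one after another (2-way conflict),
        // CC 2.x and newer serve both broadcasts simultaneously.
        float v = (threadIdx.x < 16) ? data[12] : data[20];
        out[threadIdx.x] = v;
    }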
Shared Memory vs. L1 Cache
- Shared memory configuration: on CC 2.x and 3.x, shared memory and L1 cache are the same resource.
  - The division can be set for each kernel by cudaFuncSetCacheConfig(kernel, cacheConfig);
  - The cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of the 64 kB for the preferred one).
- Memory bank configuration: some devices (CC 3.x) can configure the memory banks by cudaFuncSetSharedMemConfig(kernel, config);
  - The config selects between 32-bit and 64-bit mode (see the sketch below).
- The 32-bit mode on CC 3.x devices has one peculiar feature: if two threads access different addresses in the same bank, but both addresses lie in an aligned block of 64 words (i.e., 32-bit words), so that the index of the second is the index of the first + 32, the memory can handle the requests without a collision.
- Note that Maxwell (CC 5.x) returned to the previous (Fermi) configuration, i.e., the bank size is not configurable and is fixed at 32-bit words.
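A minimal sketch of the two configuration calls mentioned above; the kernel name is a placeholder.

    // Per-kernel cache and bank configuration; myKernel is an illustrative name.
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... */ }

    void configure()
    {
        // Prefer 48 kB of shared memory over L1 cache for this kernel (CC 2.x / 3.x).
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // On CC 3.x, select 64-bit shared memory banks for this kernel.
        cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeEightByte);
    }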
Registers
- One register pool per multiprocessor: 8-64k of 32-bit registers (depending on CC).
- Register allocation is determined by the compiler.
- As fast as the cores (no extra clock cycles).
- Read-after-write dependency: 24 clock cycles; can be hidden if there are enough active warps.
- The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible; the programmer has no direct control over these conflicts.
Local Memory
- Per-thread global memory, allocated automatically by the compiler.
  - The compiler may report the amount of allocated local memory (use --ptxas-options=-v).
- Large local structures and arrays are placed here instead of the registers (see the sketch below).
- Register pressure: when there are not enough registers to accommodate the data of a thread, the registers are spilled into the local memory.
  - Can be moderated by selecting smaller thread blocks.
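A sketch of a kernel that is likely to end up in local memory: a large per-thread array indexed with a runtime value usually cannot be kept in registers. The kernel name and sizes are illustrative; compiling with --ptxas-options=-v shows the reported local memory and spill counts.

    // A large, dynamically indexed per-thread array is typically placed in local memory.
    __global__ void spillProne(const int *indices, float *out, int n)
    {
        float scratch[256];                            // too large for registers -> local memory
        for (int i = 0; i < 256; ++i)
            scratch[i] = i * 0.5f;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = scratch[indices[tid] & 255];    // dynamic index prevents register promotion
    }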
Constant and Texture Memory
- Constant memory: special 64 KB memory for read-only data; 8 KB is the cache working set per multiprocessor.
  - CC 2.x introduces the LDU (LoaD Uniform) instruction, which the compiler uses to force loading of read-only, thread-independent variables into the cache (see the sketch below).
- Texture memory: the texture cache is optimized for 2D spatial locality.
  - Additional functionality like fast data interpolation, a normalized coordinate system, or handling of boundary cases.
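A minimal constant memory sketch, with illustrative names: the filter coefficients are identical for all threads, so in each loop iteration the whole warp reads the same address and the read is served from the constant cache.

    // Constant memory usage; names and the 16-tap filter are illustrative.
    #include <cuda_runtime.h>

    __constant__ float coeffs[16];

    __global__ void convolve(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + 16 <= n) {
            float sum = 0.0f;
            for (int k = 0; k < 16; ++k)
                sum += in[i + k] * coeffs[k];   // all threads read the same coefficient
            out[i] = sum;
        }
    }

    void upload(const float *hostCoeffs)
    {
        // Constant memory is filled from the host by copying to the symbol.
        cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
    }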
Memory Allocation
- Global memory: cudaMalloc(), cudaFree().
  - Dynamic in-kernel allocation: malloc() and free() called from a kernel; the heap size is set by cudaDeviceSetLimit(cudaLimitMallocHeapSize, size).
- Shared memory:
  - Statically (e.g., __shared__ int foo[16];)
  - Dynamically (by a kernel launch parameter):
    extern __shared__ float bar[];
    float *bar1 = &(bar[0]);
    float *bar2 = &(bar[size_of_bar1]);
  (A fuller sketch follows below.)
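A sketch putting these allocation mechanisms together; the kernel name, sizes, and partitioning are illustrative assumptions.

    #include <cuda_runtime.h>

    __global__ void kernel(float *data, int size1)
    {
        __shared__ int foo[16];                      // static shared memory allocation
        extern __shared__ float bar[];               // dynamic: size given as a launch parameter
        float *bar1 = &bar[0];                       // manual partitioning of the dynamic block
        float *bar2 = &bar[size1];

        int *tmp = (int *)malloc(sizeof(int));       // in-kernel heap allocation (needs the heap limit)
        if (tmp) { *tmp = threadIdx.x; free(tmp); }

        foo[threadIdx.x % 16] = threadIdx.x;
        bar1[threadIdx.x % size1] = data[threadIdx.x];
        bar2[threadIdx.x % size1] = bar1[threadIdx.x % size1];
    }

    int main()
    {
        float *devData;
        cudaMalloc(&devData, 1024 * sizeof(float));              // global memory allocation

        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);    // heap for in-kernel malloc()

        int size1 = 128, size2 = 128;
        size_t dynShared = (size1 + size2) * sizeof(float);
        kernel<<<4, 256, dynShared>>>(devData, size1);           // 3rd launch parameter = dynamic shared bytes

        cudaDeviceSynchronize();
        cudaFree(devData);
        return 0;
    }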
Implications and Guidelines
- Global memory:
  - Data should be accessed in a coalesced manner.
  - Hot data should be manually cached in shared memory.
- Shared memory:
  - Bank conflicts need to be avoided, by redesigning data structures in a column-wise manner or by using strides that are not divisible by the number of banks.
- Registers and local memory:
  - Use as few registers as possible; avoid register spilling.
Implications and Guidelines
- Memory caching:
  - Data structures should be designed to utilize the caches in the best way possible; the working set of the active blocks should fit into the L2 cache.
  - Provide maximum information to the compiler: use const for constant data and __restrict__ to indicate that no pointer aliasing will occur (see the sketch below).
- Data alignment:
  - Operate on 32-bit/64-bit values only.
  - Align data structures to suitable powers of 2.
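A minimal sketch of the const / __restrict__ guideline; the kernel name is illustrative. With both qualifiers the compiler knows the input is read-only and unaliased, which enables more aggressive caching of the loads.

    // Qualifiers tell the compiler that x is read-only and that x and y never alias.
    __global__ void saxpy(const float * __restrict__ x,
                          float * __restrict__ y,
                          float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }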
Maxwell Architecture
- What is new in Maxwell:
  - The L1 cache merges with the texture cache; data are cached in L1 the same way as in Fermi.
  - Shared memory is independent: 64 kB or 96 kB, not shared with L1.
  - Shared memory uses 32-bit banks: a revert to the Fermi-like style, keeping the aggregated bandwidth.
  - Faster shared memory atomic operations (see the sketch below).
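A sketch of the kind of code that benefits from faster shared memory atomics: a per-block histogram accumulated with atomicAdd in shared memory and flushed to global memory at the end. Names and the 256-bin size are illustrative.

    // Per-block histogram using shared memory atomics.
    __global__ void histogram(const unsigned char *data, unsigned int *bins, int n)
    {
        __shared__ unsigned int localBins[256];
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            localBins[i] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            atomicAdd(&localBins[data[i]], 1u);       // shared memory atomic (accelerated on Maxwell)
        __syncthreads();

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&bins[i], localBins[i]);        // flush the block's counts to global memory
    }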
Discussion