Shared Memory Accesses
© Sudhakar Yalamanchili unless otherwise noted
Objectives
- Understand how memory access patterns in a warp can affect performance
- Develop an intuition about how to avoid performance-degrading access patterns to shared memory
Reading
- CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
- CUDA: Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz3kOab4Hrx
Avoiding Shared Memory Bank Conflicts
- Be cognizant of addressing patterns in a warp
- Accesses by threads of a warp to different addresses in the same bank are serialized
- Accesses to the same address in the same bank are broadcast (compute capability 5.0 and 6.0)
- Check the behavior of shared memory as a function of compute capability: CUDA Programming Guide, Section H, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
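A minimal CUDA sketch of these rules (the kernel name and sizes are illustrative, not from the slides, assuming the usual 32 banks of 4-byte words):

    // Launch as bank_demo<<<1, 32>>>(out): a single warp.
    __global__ void bank_demo(float *out) {
        __shared__ float s[64];
        int t = threadIdx.x;
        s[t] = (float)t;                 // stride-1 store: 32 distinct banks
        s[t + 32] = (float)t;
        __syncthreads();
        float a = s[t];                  // stride-1 load: conflict free
        // Stride-2 load: threads t and t+16 read words 2t and 2t+32, which
        // fall in the same bank: a 2-way conflict, serialized into 2 passes.
        float b = s[2 * t];
        out[t] = a + b;
    }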
Bank Conflicts
[Figure: four example warp access patterns across shared memory banks: three conflict-free accesses and one 2-way conflicting access]
Interleaved Memory Organization
[Figure: timing diagram of two successive memory accesses, each taking the full memory cycle time τ to read its output, and words 1-7 interleaved across Bank 0 through Bank 3]
- Memory is organized into multiple, concurrent banks
- Word-level interleaving across banks
- A single address generates multiple, concurrent accesses
- Well matched to cache-line access patterns
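A small host-side sketch of the word-interleaved mapping (M = 4 matches the Bank 0 through Bank 3 figure; the snippet is illustrative): consecutive word addresses rotate across the banks, so one cache-line-sized request touches each bank once.

    #include <stdio.h>
    #define M 4                          // number of banks, as in the figure
    int main(void) {
        // bank = word address mod M, offset within bank = word address / M
        for (int w = 0; w < 8; ++w)
            printf("word %d -> bank %d, offset %d\n", w, w % M, w / M);
        return 0;
    }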
Sequential Bank Operation
[Figure: address divided into m lower-order bits and n-m higher-order bits to select among banks 1 through m-1; timing diagram of memory accesses 1 and 2 serialized, each taking the full memory cycle time τ to read its output]
Concurrent Bank Operation
[Figure: timing diagram of overlapping memory bank accesses, each of latency τ]
- Each bank can be addressed independently
- Multiple sources of addresses (e.g., the threads of a warp)
- Differences from interleaved memory:
  - Flexibility in addressing
  - Requires greater address bandwidth
  - Separate controllers and memory buses
- Supports non-blocking caches with multiple outstanding misses
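To make the warp connection concrete, here is a host-side sketch (names are hypothetical) that counts how many serialized passes one warp access would take, assuming 32 independently addressed banks and the broadcast rule from the earlier slide:

    #include <stdio.h>
    #define WARP  32
    #define BANKS 32
    // Passes = the maximum number of distinct words any one bank must
    // deliver; identical addresses within a bank broadcast in one pass.
    int passes(const unsigned addr[WARP]) {
        int worst = 1;
        for (unsigned b = 0; b < BANKS; ++b) {
            unsigned seen[WARP]; int n = 0;
            for (int t = 0; t < WARP; ++t) {
                if (addr[t] % BANKS != b) continue;
                int dup = 0;
                for (int k = 0; k < n; ++k) if (seen[k] == addr[t]) dup = 1;
                if (!dup) seen[n++] = addr[t];
            }
            if (n > worst) worst = n;
        }
        return worst;
    }
    int main(void) {
        unsigned unit[WARP], strided[WARP];
        for (int t = 0; t < WARP; ++t) { unit[t] = t; strided[t] = 2 * t; }
        printf("stride 1: %d pass(es)\n", passes(unit));    // 1: conflict free
        printf("stride 2: %d pass(es)\n", passes(strided)); // 2: 2-way conflict
        return 0;
    }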
Data Skewing for Concurrent Access
- How can we guarantee that data can be accessed in parallel? Avoid bank conflicts
- Storage scheme: a set of rules that determines, for each array element, the address of the module and the location within the module
- Design a storage scheme to ensure concurrent access
- d-ordered n vector: the ith element is in module (d·i + C) mod M
[Figure: a 3-ordered 8-vector with C = 2]
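A sketch of the storage scheme applied to the slide's example (the module count M = 4 is an assumption for illustration, since the figure does not survive extraction):

    #include <stdio.h>
    int main(void) {
        int d = 3, C = 2, M = 4, N = 8;  // a 3-ordered 8-vector with C = 2
        for (int i = 0; i < N; ++i)
            printf("element %d -> module %d\n", i, (d * i + C) % M);
        return 0;
    }

Because gcd(M, d) = gcd(4, 3) = 1, any M consecutive elements land in M distinct modules, which is the case the next slide characterizes in general.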
Conflict Free Access
- Conflict-free access to the elements of the vector if M >= N · gcd(M, d)
- Multi-dimensional arrays are treated as arrays of 1-D vectors
- Conflict-free access for various patterns in a matrix requires
  - M >= N · gcd(M, δ1) for columns
  - M >= N · gcd(M, δ2) for rows
  - M >= N · gcd(M, δ1 + δ2) for forward diagonals
  - M >= N · gcd(M, δ1 - δ2) for backward diagonals
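A sketch checking the first condition (M = 8 is an arbitrary example; the gcd helper is written out): M >= N · gcd(M, d) is equivalent to N <= M / gcd(M, d).

    #include <stdio.h>
    int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    int main(void) {
        int M = 8;                       // number of modules (banks)
        for (int d = 1; d <= 4; ++d)     // access stride
            printf("d = %d: gcd(M,d) = %d, conflict free for N <= %d\n",
                   d, gcd(M, d), M / gcd(M, d));
        return 0;
    }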
Conflict Free Access
- Implications for M = N = an even number? Any even stride d has gcd(M, d) >= 2, so M >= N · gcd(M, d) cannot hold and accesses conflict
- For non-power-of-two values of M, indexing and address computation must be efficient
- The vectors that are accessed come out scrambled, and unscrambling them is a non-trivial performance issue
- Data dependencies can still reduce bandwidth far below O(M)
Avoiding Bank Conflicts
- Many banks

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

- Even with 128 banks there are conflicts: the inner loop strides through memory 512 words at a time, and since 512 is a multiple of 128, successive word accesses fall in the same bank
- Solutions:
  - Software: loop interchange
  - Software: adjust the array size to a prime number ("array padding")
  - Hardware: a prime number of banks (e.g., 17)
  - Data skewing
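In CUDA shared memory the padding solution is the familiar "+1 column" idiom; a sketch (the kernel is illustrative, not from the slides, assuming 32 banks of 4-byte words):

    // Launch as transpose32<<<1, dim3(32, 32)>>>(in, out).
    __global__ void transpose32(const float *in, float *out) {
        __shared__ float tile[32][33];   // 33 = 32 + 1 padding column
        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 32 + x];     // row-wise load: conflict free
        __syncthreads();
        // Column-wise read: with a row length of 33 words, the addresses
        // y, y + 33, y + 66, ... spread over 32 distinct banks; with a
        // row length of 32 they would all hit bank y (a 32-way conflict).
        out[y * 32 + x] = tile[x][y];
    }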
Questions?