1
Shared Memory Accesses
©Sudhakar Yalamanchili unless otherwise noted
2
Objectives
Understand how memory access patterns in a warp can affect performance
Develop an intuition about how to avoid performance-degrading access patterns to shared memory
3
Reading
CUDA Programming Guide
CUDA: Best Practices
4
Avoiding Shared Memory Bank Conflicts
Be cognizant of addressing patterns in a warp
Accesses by different threads to different addresses in the same bank are serialized
Accesses to the same address in the same bank are broadcast (compute capability 5.0 and 6.0)
Check the behavior of shared memory as a function of compute capability
CUDA Programming Guide, Section H
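The serialization and broadcast rules above can be sketched as a small host-side C model. This is only an illustration, not the hardware's actual arbitration logic: the function name `conflict_degree` is hypothetical, and the bank count and bank width are passed in as parameters (32 banks of 4-byte words is assumed in the usage note; check the Programming Guide for a given compute capability).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the largest number of DISTINCT words that one warp-wide request
 * sends to any single bank. A result of 1 means conflict-free; k means the
 * request is modeled as k serialized passes. Threads that read the same
 * word are counted once, which models the broadcast case.
 * Assumption: banks are word-interleaved, bank = word index mod num_banks. */
int conflict_degree(const unsigned *byte_addr, size_t num_threads,
                    unsigned num_banks, unsigned bank_width_bytes)
{
    assert(num_banks > 0 && bank_width_bytes > 0);
    int worst = 0;
    for (unsigned b = 0; b < num_banks; b++) {
        int distinct = 0;
        for (size_t t = 0; t < num_threads; t++) {
            unsigned word = byte_addr[t] / bank_width_bytes;
            if (word % num_banks != b)
                continue;
            /* Count this word only if no earlier thread asked for it. */
            int seen = 0;
            for (size_t u = 0; u < t; u++) {
                if (byte_addr[u] / bank_width_bytes == word) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                distinct++;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

With 32 threads, 32 banks, and 4-byte words, byte addresses `4*t` give a degree of 1, addresses `8*t` give 2, and 32 threads all reading address 0 give 1 because the access is treated as a broadcast.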
5
Bank Conflicts
[Figure: four warp access patterns, labeled conflict-free access, conflict-free access, 2-way conflict access, and conflict-free access]
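For the common case where thread t accesses word t × stride, the conflict degree has a closed form. The sketch below is an assumption-laden model (word-interleaved banks, a hypothetical helper named `strided_conflict_degree`), and it does not claim to reproduce the exact patterns drawn in the figure.

```c
#include <assert.h>

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Thread t touches bank (t * stride_words) mod num_banks. Only
 * num_banks / gcd(stride, num_banks) distinct banks are ever used, so the
 * busiest bank receives ceil(num_threads / banks_used) distinct words. */
unsigned strided_conflict_degree(unsigned stride_words, unsigned num_threads,
                                 unsigned num_banks)
{
    assert(stride_words > 0 && num_banks > 0);
    unsigned banks_used = num_banks / gcd_u(stride_words, num_banks);
    return (num_threads + banks_used - 1) / banks_used;
}
```

Assuming 32 threads and 32 banks, a stride of 1 or 3 words yields degree 1 (conflict-free), a stride of 2 yields a 2-way conflict, and a stride of 32 puts every thread on the same bank.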
6
Interleaved Memory Organization
[Figure: timing diagram for a memory module showing memory access 1 and memory access 2, each taking access time τ followed by the time to read the output of memory; words 1 to 7 are word-interleaved across Bank 0 to Bank 3]
Memory is organized into multiple, concurrent banks
Word-level interleaving across banks
Single address generates multiple, concurrent accesses
Well matched to cache line access patterns
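Word-level interleaving can be sketched as a simple address split. The type `bank_loc` and the function `interleave` are hypothetical names, and the mapping (bank = address mod number of banks) is the usual low-order interleaving, assumed here rather than taken from the figure.

```c
#include <assert.h>

typedef struct {
    unsigned bank;   /* which bank holds the word      */
    unsigned offset; /* word position inside that bank */
} bank_loc;

/* Consecutive word addresses rotate through the banks, so a run of
 * num_banks consecutive words can be fetched from all banks at once. */
bank_loc interleave(unsigned word_addr, unsigned num_banks)
{
    assert(num_banks > 0);
    bank_loc loc = { word_addr % num_banks, word_addr / num_banks };
    return loc;
}
```

With four banks, word 5 maps to bank 1 at offset 1, and words 4 through 7 occupy banks 0 through 3 at the same offset, which is why one cache-line-sized request can keep all banks busy.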
7
Sequential Bank Operation
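[Figure: an n-bit address split into m lower-order bits and n-m higher-order bits, selecting among banks up to bank m-1; timing diagram for the memory module showing memory access 1 and memory access 2 performed one after the other, each taking access time τ plus the time to read the output of memory]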
8
Concurrent Bank Operation
[Figure: timing diagram for a memory module showing overlapping memory bank accesses, each taking time τ]
Each bank can be addressed independently
Multiple sources of addresses (a warp?)
Difference with interleaved memory
  Flexibility in addressing
  Requires greater address bandwidth
  Separate controllers and memory buses
Support for non-blocking caches with multiple outstanding misses
9
Data Skewing for Concurrent Access
[Figure: a 3-ordered 8 vector with C = 2]
How can we guarantee that data can be accessed in parallel?
  Avoid bank conflicts
Storage scheme: a set of rules that determine, for each array element, the address of the module and the location within the module
Design a storage scheme to ensure concurrent access
d-ordered n vector: the ith element is in module (d·i + C) mod M
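The storage rule above can be tried out directly. In this sketch the function names are hypothetical, and the module count M = 8 used in the usage note is an assumption, since the slide gives d = 3, n = 8, and C = 2 but does not state M.

```c
#include <assert.h>

/* Module that holds element i of a d-ordered vector: (d*i + c) mod M. */
unsigned skewed_module(unsigned i, unsigned d, unsigned c, unsigned num_modules)
{
    assert(num_modules > 0);
    return (d * i + c) % num_modules;
}

/* Returns 1 if the first n elements all land in different modules,
 * so that they can be fetched in a single parallel access. */
int skew_is_conflict_free(unsigned n, unsigned d, unsigned c,
                          unsigned num_modules)
{
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < i; j++)
            if (skewed_module(i, d, c, num_modules) ==
                skewed_module(j, d, c, num_modules))
                return 0;
    return 1;
}
```

Under the M = 8 assumption, elements 0 through 7 of the 3-ordered vector with C = 2 land in modules 2, 5, 0, 3, 6, 1, 4, 7, one element per module. Changing d to 2 makes elements 0 and 4 collide in module 2.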
10
Conflict Free Access
Conflict-free access to the elements of the vector if
  M >= N
  M >= N · gcd(M, d)
Multi-dimensional arrays are treated as arrays of 1-d vectors
Conflict-free access for various patterns in a matrix requires
  M >= N · gcd(M, δ1) for columns
  M >= N · gcd(M, δ2) for rows
  M >= N · gcd(M, δ1 + δ2) for forward diagonals
  M >= N · gcd(M, δ1 - δ2) for backward diagonals
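The gcd conditions are easy to check mechanically. The following is a minimal sketch with hypothetical function names; it takes the absolute value of the skew so that δ1 - δ2 may be negative, and it relies on gcd(M, 0) = M for the case δ1 = δ2.

```c
#include <assert.h>
#include <stdlib.h>

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Tests M >= N * gcd(M, |d|) for one access pattern with skew d. */
int vector_conflict_free(unsigned m, unsigned n, int d)
{
    assert(m > 0);
    return m >= n * gcd_u(m, (unsigned)abs(d));
}

/* Columns, rows, forward diagonals, and backward diagonals together. */
int all_patterns_conflict_free(unsigned m, unsigned n, int d1, int d2)
{
    return vector_conflict_free(m, n, d1) &&
           vector_conflict_free(m, n, d2) &&
           vector_conflict_free(m, n, d1 + d2) &&
           vector_conflict_free(m, n, d1 - d2);
}
```

For example, M = N = 5 with δ1 = 1 and δ2 = 2 satisfies all four conditions. With M = N = 4, δ1 = 1, and δ2 = 3, rows and columns pass but both diagonal conditions fail, because δ1 + δ2 and δ1 - δ2 are even and share a factor with M.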
11
Conflict Free Access
Implications for M = N = even number?
For non-power-of-two values of M, indexing and address computation must be efficient
Vectors that are accessed are scrambled
Unscrambling of vectors is a non-trivial performance issue
Data dependencies can still reduce bandwidth far below O(M)
12
Avoiding Bank Conflicts
Many banks
    int x[256][512];
    int i, j;
    /* Inner loop walks down a column: successive accesses are 512 words apart. */
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
Even with 128 banks, conflicts occur on word accesses: 512 is a multiple of 128, so x[i][j] and x[i+1][j] map to the same bank
Solutions:
  Software: loop interchange
  Software: adjust array size to a prime # ("array padding")
  Hardware: prime number of banks (e.g. 17)
  Data skewing
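The two software fixes can be sketched in C. This is an illustrative model only: `scale_interchanged` and `banks_touched_by_column` are hypothetical names, and the bank mapping (word address mod number of banks, row-major layout) is an assumption consistent with the word-interleaved organization described earlier.

```c
#include <assert.h>

enum { ROWS = 256, COLS = 512 };

/* Loop interchange: the inner loop now walks j, so consecutive accesses
 * touch consecutive words and therefore consecutive banks. */
void scale_interchanged(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Counts how many distinct banks a walk down one column touches when each
 * row is row_words long. More distinct banks means fewer conflicts. */
int banks_touched_by_column(int rows, int row_words, int num_banks)
{
    assert(num_banks > 0 && num_banks <= 1024);
    unsigned char seen[1024] = {0};
    int distinct = 0;
    for (int i = 0; i < rows; i++) {
        int bank = (int)(((long)i * row_words) % num_banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumptions, a column walk with 512-word rows and 128 banks touches only 1 bank. Padding each row to 521 words (a prime) spreads the same walk over all 128 banks, and keeping 512-word rows with 17 banks uses all 17.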
13
Questions?