Shared Memory Accesses


Shared Memory Accesses © Sudhakar Yalamanchili unless otherwise noted

Objectives
Understand how memory access patterns within a warp can affect performance. Develop an intuition for how to avoid performance-degrading access patterns to shared memory.

Reading
CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
CUDA: Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz3kOab4Hrx

Avoiding Shared Memory Bank Conflicts
Be cognizant of addressing patterns within a warp. Accesses by different threads to different words in the same bank are serialized; accesses to the same address in the same bank are broadcast (compute capability 5.0 and 6.0). Check the behavior of shared memory as a function of compute capability in the CUDA Programming Guide, Section H: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Bank Conflicts
[Figure: four warp access patterns, three conflict free and one with a 2-way bank conflict.]
A CUDA sketch contrasting a conflict-free pattern with a conflicting one follows.
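The sketch below shows both patterns in a tile-transpose kernel; the kernel name, TILE_DIM, and the square-grid setting are our assumptions, not from the slides. On devices with 32 four-byte-wide banks, a warp reading 32 consecutive words touches 32 distinct banks, while reading down a column of a 32×32 shared array keeps all 32 threads in one bank.

    #define TILE_DIM 32  // matches the 32 shared memory banks assumed here

    __global__ void transposeTile(const float *in, float *out, int width)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];
        // __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // padding removes the conflict

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Row access: consecutive threads hit consecutive banks -- conflict free.
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;

        // Column access: consecutive threads are TILE_DIM words apart, all in
        // one bank of the unpadded tile -- a 32-way conflict, serialized.
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }

Padding the tile to TILE_DIM + 1 columns shifts each row by one bank, restoring conflict-free column reads.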

Interleaved Memory Organization
[Figure: two back-to-back accesses to one memory module, each taking the access time τ plus the time to read the output; words 0-7 interleaved across Banks 0-3.]
Memory is organized into multiple, concurrent banks, with word-level interleaving across the banks. A single address generates multiple, concurrent accesses, which is well matched to cache line access patterns.
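A minimal sketch of that word-interleaved mapping (the function name, the 4-byte word size, and the choice of M = 32 banks are our assumptions):

    __host__ __device__ void bankOf(unsigned byteAddr,
                                    unsigned *bank, unsigned *row)
    {
        const unsigned M = 32;         // number of banks (assumed)
        unsigned word = byteAddr / 4;  // 4-byte word index
        *bank = word % M;              // low-order bits select the bank
        *row  = word / M;              // remaining bits select the row in the bank
    }

Because successive words land in successive banks, a warp touching 32 consecutive words engages all 32 banks at once.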

Sequential Bank Operation
[Figure: banks 0 through m-1 selected by splitting the n address bits into m lower-order bits and n-m higher-order bits; back-to-back accesses to the same module serialize, each taking the access time τ plus the time to read the output.]

Concurrent Bank Operation
[Figure: accesses to independent memory modules overlapped in time, each taking τ.]
Each bank can be addressed independently, with multiple sources of addresses (for example, the threads of a warp). Differences from interleaved memory: greater flexibility in addressing, but also a need for greater address bandwidth and separate controllers and memory buses. This organization supports non-blocking caches with multiple outstanding misses.

Data Skewing for Concurrent Access
How can we guarantee that data can be accessed in parallel, i.e., avoid bank conflicts? A storage scheme is a set of rules that determines, for each array element, the module address and the location within that module; the task is to design a storage scheme that ensures concurrent access. A d-ordered n-vector is one whose ith element resides in module (d·i + C) mod M.
[Figure: a 3-ordered 8-vector with C = 2 laid out across the modules.]
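A sketch of that storage rule as code (the helper name is ours):

    __host__ __device__ unsigned moduleOf(unsigned i, unsigned d,
                                          unsigned C, unsigned M)
    {
        return (d * i + C) % M;  // module holding element i of a d-ordered vector
    }

For the figure's 3-ordered 8-vector with C = 2 and M = 8 modules, elements 0 through 7 land in modules 2, 5, 0, 3, 6, 1, 4, 7: all eight are distinct, so the whole vector can be fetched in parallel.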

Conflict Free Access
Access to the N elements of a d-ordered vector is conflict free if M >= N·gcd(M, d). Multi-dimensional arrays are treated as arrays of 1-D vectors. Conflict-free access to the common patterns in an N×N matrix, with strides δ1 and δ2 along the two dimensions, requires (a checking sketch follows the list):
M >= N·gcd(M, δ1) for columns
M >= N·gcd(M, δ2) for rows
M >= N·gcd(M, δ1 + δ2) for forward diagonals
M >= N·gcd(M, δ1 − δ2) for backward diagonals
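A host-side sketch that evaluates the basic condition (both helpers are ours):

    unsigned gcd(unsigned a, unsigned b)
    {
        while (b != 0) { unsigned t = a % b; a = b; b = t; }
        return a;
    }

    // Returns 1 if N stride-d accesses across M modules are conflict free.
    int conflictFree(unsigned M, unsigned N, unsigned d)
    {
        return M >= N * gcd(M, d);
    }

For example, with M = N = 8, stride d = 3 passes (gcd(8, 3) = 1) while d = 2 fails (8 < 8·2), which previews the next slide's question about even values of M = N.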

Conflict Free Access
What are the implications when M = N is an even number? For non-power-of-two values of M, indexing and address computation must be efficient. The vectors that are accessed come back scrambled, and unscrambling them is a non-trivial performance issue. Data dependencies can still reduce bandwidth to far below O(M).

Avoiding Bank Conflicts
Consider a machine with many banks:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

Even with 128 banks, since 512 is a multiple of 128, every access in the inner loop falls in the same bank and the word accesses conflict. Solutions (two are sketched below):
Software: loop interchange
Software: adjust the array size to a prime number ("array padding")
Hardware: a prime number of banks (e.g., 17)
Data skewing
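Sketches of the two software fixes (the pad to 513 columns is our choice; any row length coprime to the bank count works):

    /* Fix 1, loop interchange: make j the inner loop so successive
       accesses are one word apart and sweep across the banks. */
    for (i = 0; i < 256; i = i + 1)
        for (j = 0; j < 512; j = j + 1)
            x[i][j] = 2 * x[i][j];

    /* Fix 2, array padding: declare int x[256][513]. The row length of
       513 words shares no factor with 128 banks (513 mod 128 = 1), so the
       original column-order traversal steps through successive banks. */

Either change turns the serialized single-bank stream into accesses spread across all 128 banks.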

Questions?