Shared Memory Accesses


Shared Memory Accesses © Sudhakar Yalamanchili unless otherwise noted

Objectives
Understand how memory access patterns within a warp can affect performance. Develop an intuition for how to avoid performance-degrading access patterns to shared memory.

Reading
CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
CUDA: Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz3kOab4Hrx

Avoiding Shared Memory Bank Conflicts
Be cognizant of addressing patterns within a warp. Accesses by different threads to different words in the same bank are serialized; accesses to the same address in the same bank are broadcast (compute capability 5.0 and 6.0). Check the behavior of shared memory as a function of compute capability in the CUDA Programming Guide, Section H: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Bank Conflicts
[Figure: four warp access patterns, three conflict free and one with a 2-way bank conflict.]
A CUDA sketch contrasting a conflict-free pattern with a conflicting one follows.
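The sketch below shows both patterns in a tile-transpose kernel; the kernel name, TILE_DIM, and the square-grid setting are our assumptions, not from the slides. On devices with 32 four-byte-wide banks, a warp reading 32 consecutive words touches 32 distinct banks, while reading down a column of a 32×32 shared array keeps all 32 threads in one bank.

    #define TILE_DIM 32  // matches the 32 shared memory banks assumed here

    __global__ void transposeTile(const float *in, float *out, int width)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];
        // __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // padding removes the conflict

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Row access: consecutive threads hit consecutive banks -- conflict free.
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;

        // Column access: consecutive threads are TILE_DIM words apart, all in
        // one bank of the unpadded tile -- a 32-way conflict, serialized.
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }

Padding the tile to TILE_DIM + 1 columns shifts each row by one bank, restoring conflict-free column reads.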

Interleaved Memory Organization
[Figure: two back-to-back accesses to one memory module, each taking the access time τ plus the time to read the output; words 0-7 interleaved across Banks 0-3.]
Memory is organized into multiple, concurrent banks, with word-level interleaving across the banks. A single address generates multiple, concurrent accesses, which is well matched to cache line access patterns.
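A minimal sketch of that word-interleaved mapping (the function name, the 4-byte word size, and the choice of M = 32 banks are our assumptions):

    __host__ __device__ void bankOf(unsigned byteAddr,
                                    unsigned *bank, unsigned *row)
    {
        const unsigned M = 32;         // number of banks (assumed)
        unsigned word = byteAddr / 4;  // 4-byte word index
        *bank = word % M;              // low-order bits select the bank
        *row  = word / M;              // remaining bits select the row in the bank
    }

Because successive words land in successive banks, a warp touching 32 consecutive words engages all 32 banks at once.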

Sequential Bank Operation
[Figure: banks 0 through m-1 selected by splitting the n address bits into m lower-order bits and n-m higher-order bits; back-to-back accesses to the same module serialize, each taking the access time τ plus the time to read the output.]

Concurrent Bank Operation
[Figure: accesses to independent memory modules overlapped in time, each taking τ.]
Each bank can be addressed independently, with multiple sources of addresses (for example, the threads of a warp). Differences from interleaved memory: greater flexibility in addressing, but also a need for greater address bandwidth and separate controllers and memory buses. This organization supports non-blocking caches with multiple outstanding misses.

Data Skewing for Concurrent Access
How can we guarantee that data can be accessed in parallel, i.e., avoid bank conflicts? A storage scheme is a set of rules that determines, for each array element, the module address and the location within that module; the task is to design a storage scheme that ensures concurrent access. A d-ordered n-vector is one whose ith element resides in module (d·i + C) mod M.
[Figure: a 3-ordered 8-vector with C = 2 laid out across the modules.]
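A sketch of that storage rule as code (the helper name is ours):

    __host__ __device__ unsigned moduleOf(unsigned i, unsigned d,
                                          unsigned C, unsigned M)
    {
        return (d * i + C) % M;  // module holding element i of a d-ordered vector
    }

For the figure's 3-ordered 8-vector with C = 2 and M = 8 modules, elements 0 through 7 land in modules 2, 5, 0, 3, 6, 1, 4, 7: all eight are distinct, so the whole vector can be fetched in parallel.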

Conflict Free Access
Access to the N elements of a d-ordered vector is conflict free if M >= N·gcd(M, d). Multi-dimensional arrays are treated as arrays of 1-D vectors. Conflict-free access to the common patterns in an N×N matrix, with strides δ1 and δ2 along the two dimensions, requires (a checking sketch follows the list):
M >= N·gcd(M, δ1) for columns
M >= N·gcd(M, δ2) for rows
M >= N·gcd(M, δ1 + δ2) for forward diagonals
M >= N·gcd(M, δ1 − δ2) for backward diagonals
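A host-side sketch that evaluates the basic condition (both helpers are ours):

    unsigned gcd(unsigned a, unsigned b)
    {
        while (b != 0) { unsigned t = a % b; a = b; b = t; }
        return a;
    }

    // Returns 1 if N stride-d accesses across M modules are conflict free.
    int conflictFree(unsigned M, unsigned N, unsigned d)
    {
        return M >= N * gcd(M, d);
    }

For example, with M = N = 8, stride d = 3 passes (gcd(8, 3) = 1) while d = 2 fails (8 < 8·2), which previews the next slide's question about even values of M = N.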

Conflict Free Access
What are the implications when M = N is an even number? For non-power-of-two values of M, indexing and address computation must be efficient. The vectors that are accessed come back scrambled, and unscrambling them is a non-trivial performance issue. Data dependencies can still reduce bandwidth to far below O(M).

Avoiding Bank Conflicts
Consider a machine with many banks:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

Even with 128 banks, since 512 is a multiple of 128, every access in the inner loop falls in the same bank and the word accesses conflict. Solutions (two are sketched below):
Software: loop interchange
Software: adjust the array size to a prime number ("array padding")
Hardware: a prime number of banks (e.g., 17)
Data skewing
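Sketches of the two software fixes (the pad to 513 columns is our choice; any row length coprime to the bank count works):

    /* Fix 1, loop interchange: make j the inner loop so successive
       accesses are one word apart and sweep across the banks. */
    for (i = 0; i < 256; i = i + 1)
        for (j = 0; j < 512; j = j + 1)
            x[i][j] = 2 * x[i][j];

    /* Fix 2, array padding: declare int x[256][513]. The row length of
       513 words shares no factor with 128 banks (513 mod 128 = 1), so the
       original column-order traversal steps through successive banks. */

Either change turns the serialized single-bank stream into accesses spread across all 128 banks.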

Questions?