GPU Memory Details
Martin Kruliš (v1.1), 03.11.2016

Overview
[Diagram: memory hierarchy. The host CPU accesses host memory at roughly 25 GBps. The host is connected to the GPU device over PCI Express (16/32 GBps). On the GPU chip, each SMP contains cores, registers, and an L1 cache; the SMPs share an L2 cache, which is backed by off-chip global memory with a bandwidth of more than 100 GBps.]
Note that details of the host memory interconnection are platform specific.

Host-Device Transfers
PCIe Transfers
- Much slower than internal GPU data transfers
- Issued explicitly by host code: cudaMemcpy(dst, src, size, direction);
  - With one exception: when GPU memory is mapped into the host memory space
- A transfer call has significant overhead, so bulk transfers are preferred
Overlapping
- Up to 2 asynchronous transfers can run while the GPU is computing (see the sketch below)
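A minimal sketch (not from the slides) of an asynchronous transfer overlapped with computation; the kernel process and the buffer sizes are hypothetical, and the host buffers are assumed to be pinned (allocated with cudaMallocHost):

    #include <cuda_runtime.h>

    // Placeholder computation so the example is self-contained.
    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void transfer_overlap(const float *h_src, float *h_dst, int n) {
        float *d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Asynchronous copies require page-locked (pinned) host memory.
        cudaMemcpyAsync(d_buf, h_src, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_dst, d_buf, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
    }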

Global Memory
Global Memory Properties
- Off-chip, but on the GPU device
- High bandwidth and high latency (~100 GBps, 400-600 clock cycles)
- Operated in transactions: contiguous aligned segments of 32 B to 128 B
- The number of transactions depends on the caching model, the GPU architecture, and the memory access pattern

Global Memory
Global Memory Caching
- Data are cached in the L2 cache
  - Relatively small (up to 2 MB on new Maxwell GPUs)
- On CC 2.x (Fermi), data are also cached in the L1 cache
  - Configurable by compiler flag:
    -Xptxas -dlcm=ca (cache always, i.e., also in L1; the default)
    -Xptxas -dlcm=cg (cache global, i.e., L2 only)
- CC 3.x (Kepler) reserves L1 for local memory caching and register spilling
- CC 5.x (Maxwell) separates the L1 cache from shared memory and unifies it with the texture cache

Global Memory
Coalesced Transfers
- The number of transactions caused by a global memory access depends on the access pattern
- Certain access patterns are optimized:
  - CC 1.x: threads sequentially access an aligned memory block; subsequent threads access subsequent words
  - CC 2.0 and later: threads access an aligned memory block; accesses within the block can be permuted
(see the kernel sketch below)
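A hedged sketch (hypothetical kernels, not from the slides) contrasting a coalesced access pattern with a strided one:

    // Coalesced: consecutive threads of a warp read consecutive 4-byte words,
    // so each warp request maps to a few aligned 32-128 B transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads read words that are far apart, so a single
    // warp touches many segments and generates many more transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }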

Global Memory
Access Patterns
[Figure: perfectly aligned sequential access]

Global Memory
Access Patterns
[Figure: perfectly aligned access with a permutation]

Global Memory
Access Patterns
[Figure: contiguous sequential access, but misaligned]

Global Memory
Coalesced Loads
[Figure: performance impact of coalesced loads]

Shared Memory
Memory Shared by the SM
- Divided into banks
  - Each bank can be accessed independently
  - Consecutive 32-bit words are in consecutive banks
  - Optionally, a 64-bit word division is used (CC 3.x)
- Bank conflicts are serialized
  - Except for reading the same address (broadcast)

    Compute capability | Mem. size | # of banks | Latency
    1.x                | 16 kB     | 16         | 32 bits / 2 cycles
    2.x                | 48 kB     | 32         | 32 bits / 2 cycles
    3.x                | 48 kB     | 32         | 64 bits / 1 cycle

In newer architectures (CC 5.x and 6.x), the size of the shared memory may vary a little, but the limit per thread block remains 48 kB.
(a bank-mapping sketch follows below)
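A small sketch (hypothetical kernel, assuming a 32x32 thread block) of how 32-bit words map to the 32 banks, and why row-wise warp accesses are conflict free while column-wise ones are not:

    // Bank of element s[r][c] is (r * 32 + c) % 32 == c.
    __global__ void bank_demo(float *out) {
        __shared__ float s[32][32];
        int r = threadIdx.y, c = threadIdx.x;
        s[r][c] = (float)(r * 32 + c);
        __syncthreads();
        // Row-wise read: the 32 threads of a warp hit 32 different banks.
        float row_read = s[r][c];
        // Column-wise read: all 32 threads hit bank r, a 32-way conflict.
        float col_read = s[c][r];
        out[r * 32 + c] = row_read + col_read;
    }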

Shared Memory
Linear Addressing
- Each thread in a warp accesses a different memory bank
- No collisions

Shared Memory
Linear Addressing with a Stride
- Each thread accesses the 2*i-th item
- 2-way conflicts (2x slowdown) on CC < 3.0
- No collisions on CC 3.x, thanks to the 64 bits per cycle throughput

Shared Memory
Linear Addressing with a Stride
- Each thread accesses the 3*i-th item
- No collisions, since the number of banks is not divisible by the stride
(a padding sketch follows below)
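The same idea is often applied the other way around: when the natural access pattern is column-wise, the shared array is padded so that the effective stride is not divisible by the number of banks. A common sketch (hypothetical tile transpose, assuming 32x32 thread blocks):

    #define TILE 32

    // Padding each row to 33 floats shifts consecutive rows by one bank,
    // so even column-wise accesses hit 32 different banks.
    __global__ void transpose_padded(const float *in, float *out, int n) {
        __shared__ float tile[TILE][TILE + 1];   // +1 column avoids conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        // Read the tile column-wise; conflict free only thanks to the padding.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }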

Shared Memory
Broadcast
- One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20
- Broadcasts are served independently on CC 1.x, i.e., this example causes a 2-way conflict
- CC 2.x and newer serve multiple broadcasts simultaneously

Shared Memory
Shared Memory vs. L1 Cache
- On CC 2.x and 3.x, they are carved from the same on-chip resource
- The division can be set for each kernel by cudaFuncSetCacheConfig(kernel, cacheConfig);
- The cache configuration can prefer either L1 or shared memory (i.e., selecting which one gets 48 kB of the 64 kB)
Shared Memory Bank Configuration
- Some devices (CC 3.x) can configure the memory banks: cudaFuncSetSharedMemConfig(kernel, config);
- The config selects between a 32-bit and a 64-bit mode
- The 32-bit mode on CC 3.x devices has one peculiar feature: if two threads access different addresses in the same bank, but both addresses lie in the same aligned block of 64 32-bit words (i.e., the index of the second is the index of the first + 32), the memory can handle both requests without a conflict.
- Note that Maxwell (CC 5.x) returned to the previous (Fermi) configuration, i.e., the bank size is not configurable and is fixed at 32-bit words.
(a configuration sketch follows below)
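A brief sketch of how these configurations might be set from host code (myKernel is a hypothetical kernel):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) {
        data[threadIdx.x] += 1.0f;   // placeholder body
    }

    void configure() {
        // On CC 2.x/3.x: prefer 48 kB of shared memory (leaving 16 kB for L1)
        // whenever myKernel is launched.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // On CC 3.x: switch the shared memory banks to the 64-bit mode.
        cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeEightByte);
    }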

Registers
Registers
- One register pool per multiprocessor
  - 8-64k of 32-bit registers (depending on CC)
  - Register allocation is decided by the compiler
- As fast as the cores (no extra clock cycles)
- A read-after-write dependency takes ~24 clock cycles
  - Can be hidden if there are enough active warps
- The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
  - The programmer has no direct control over these conflicts

Local Memory
Per-thread Global Memory
- Allocated automatically by the compiler
  - The compiler may report the amount of allocated local memory (use --ptxas-options=-v)
- Large local structures and arrays are placed here instead of in registers
Register Pressure
- There are not enough registers to accommodate the data of the thread
- The registers are spilled into local memory
- Can be mitigated by selecting smaller thread blocks
(an illustration follows below)
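A hedged illustration (hypothetical kernel) of data that typically ends up in local memory; compiling with --ptxas-options=-v should report the per-thread local memory usage:

    // A per-thread array indexed with a run-time value usually cannot live in
    // registers (they are not indexable), so it is placed in local memory.
    __global__ void histogram_local(const int *in, int *out, int n) {
        int local_hist[64];                     // likely spilled to local memory
        for (int k = 0; k < 64; ++k) local_hist[k] = 0;

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            ++local_hist[in[i] & 63];           // dynamic index forces local memory

        for (int k = 0; k < 64; ++k)
            atomicAdd(&out[k], local_hist[k]);
    }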

Constant and Texture Memory
Constant Memory
- A special 64 KB memory space for read-only data
  - 8 KB is the cache working set per multiprocessor
- CC 2.x introduces the LDU (LoaD Uniform) instruction
  - The compiler uses it to load read-only, thread-independent variables through the constant cache
Texture Memory
- The texture cache is optimized for 2D spatial locality
- Additional functionality such as fast data interpolation, a normalized coordinate system, or handling of boundary cases
(a constant memory sketch follows below)
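A minimal sketch (hypothetical filter, not from the slides) of using constant memory; all threads of a warp read the same coefficient in each iteration, which is the access pattern the constant cache broadcasts efficiently:

    #include <cuda_runtime.h>

    __constant__ float c_coeffs[16];   // hypothetical filter coefficients

    __global__ void apply_filter(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k) {
            int j = min(i + k, n - 1);         // clamp at the boundary
            acc += in[j] * c_coeffs[k];        // uniform read, served by the cache
        }
        out[i] = acc;
    }

    void upload_coeffs(const float *host_coeffs) {
        // Constant memory is written from the host via cudaMemcpyToSymbol().
        cudaMemcpyToSymbol(c_coeffs, host_coeffs, 16 * sizeof(float));
    }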

Memory Allocation
Global Memory
- cudaMalloc(), cudaFree()
- Dynamic in-kernel allocation
  - malloc() and free() called from a kernel
  - The heap size is set by cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)
Shared Memory
- Statically (e.g., __shared__ int foo[16];)
- Dynamically (by a kernel launch parameter)
  extern __shared__ float bar[];
  float *bar1 = &(bar[0]);
  float *bar2 = &(bar[size_of_bar1]);
(a complete sketch follows below)
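A short sketch (hypothetical kernel, assuming n1 + n2 values fit into one block) of how the dynamic shared memory size is passed as the third launch configuration parameter and then split into two arrays:

    __global__ void dyn_shared(const float *in, float *out, int n1, int n2) {
        extern __shared__ float bar[];
        float *bar1 = bar;          // first n1 floats
        float *bar2 = bar + n1;     // next n2 floats

        int i = threadIdx.x;
        if (i < n1) bar1[i] = in[i];
        if (i < n2) bar2[i] = in[n1 + i];
        __syncthreads();

        if (i < n1) out[i] = bar1[i];
        if (i < n2) out[n1 + i] = bar2[i];
    }

    void launch(const float *d_in, float *d_out, int n1, int n2) {
        size_t sharedBytes = (n1 + n2) * sizeof(float);   // bytes, not elements
        dyn_shared<<<1, 256, sharedBytes>>>(d_in, d_out, n1, n2);
    }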

Implications and Guidelines
Global Memory
- Data should be accessed in a coalesced manner
- Hot data should be manually cached in shared memory
Shared Memory
- Bank conflicts need to be avoided
  - Redesigning data structures in a column-wise manner
  - Using strides that are not divisible by the number of banks
Registers and Local Memory
- Use as few registers as possible, avoid register spilling

Implications and Guidelines
Memory Caching
- Structures should be designed to utilize the caches in the best possible way
  - The working set of the active blocks should fit into the L2 cache
- Provide maximum information to the compiler
  - Use const for constant data
  - Use __restrict__ to indicate that no pointer aliasing will occur (see the sketch below)
Data Alignment
- Operate on 32-bit/64-bit values only
- Align data structures to suitable powers of 2
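A small sketch (hypothetical SAXPY kernel) of the const and __restrict__ hints:

    // const plus __restrict__ promises the compiler that the inputs are
    // read-only and the pointers do not alias, allowing more aggressive
    // load caching and reordering.
    __global__ void saxpy(const float * __restrict__ x,
                          const float * __restrict__ y,
                          float * __restrict__ out,
                          float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * x[i] + y[i];
    }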

Maxwell Architecture
What is new in Maxwell
- L1 is merged with the texture cache
  - Data are cached in L1 the same way as in Fermi
- Shared memory is an independent resource
  - 64 kB or 96 kB, no longer shared with L1
- Shared memory uses 32-bit banks
  - A revert to the Fermi-like style, keeping the aggregated bandwidth
- Faster shared memory atomic operations

Discussion