GPU Architectures and CUDA in More Detail


GPU Architectures and CUDA in More Detail. Martin Kruliš (v1.0), 15. 4. 2019

SIMT Execution (Revision). Single Instruction Multiple Threads: all cores execute the same instruction, while each core has its own set of registers. The instruction decoder and warp scheduler are shared by the cores of a multiprocessor.

Thread-Core Mapping (Revision). How threads are assigned to streaming multiprocessors (SMs): the grid and all its blocks run the same kernel; a block is assigned to an SM as a whole; a warp runs simultaneously on the SM cores; threads in a warp run in lockstep. A warp is formed by 32 consecutive threads of the block according to their thread ID (i.e., threads 0-31, 32-63, ...); the warp size is 32 for all compute capabilities. The thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions. Multiple kernels may run simultaneously on the GPU (since Fermi), and multiple blocks may be assigned to the same SM simultaneously (if its registers, schedulers, and shared memory can accommodate them).
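As a quick illustration, the following minimal kernel sketch computes the linearized thread ID from the formula above and derives the warp index within the block; the kernel name and the way the values are stored are illustrative assumptions, not part of the original slides.

__global__ void showWarpMapping(int *warpIds)
{
    // Linear thread ID inside the block (3D case; the 1D and 2D formulas are special cases).
    int linearId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;

    int warpId = linearId / warpSize;   // warpSize is a built-in constant (32 on current GPUs)
    int lane   = linearId % warpSize;   // position of the thread within its warp

    if (lane == 0)                      // one representative thread per warp records its warp index
        warpIds[warpId] = warpId;
}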

Instruction Schedulers. Decomposition: each block assigned to the SMP is divided into warps, and the warps are assigned to schedulers. The schedulers select a warp that is ready to execute at every instruction cycle. The SMP instruction throughput depends on the compute capability (CC): 1.x - 1 instruction per 4 cycles, 1 scheduler; 2.0 - 1 instruction per 2 cycles, 2 schedulers; 2.1 - 2 instructions per 2 cycles, 2 schedulers; 3.x and 5.x - 2 instructions per cycle, 4 schedulers. The most common reason why a warp is not ready to execute is that the operands of its next instruction are not available yet (e.g., they are still being loaded from global memory or are subject to a read-after-write dependency). Maxwell (CC 5.0) also has 4 dual-issue schedulers (like Kepler), but only 128 cores (32 cores per scheduler), and the latency of math instructions has been improved.

Hiding Latency. Fast context switch: when a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp.

SIMT and Branches. Masking instructions: in case of data-driven branches (if-else conditions, while loops, ...), all branches are traversed and threads mask their execution in the branches they do not take. For example:

if (threadIdx.x % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
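A minimal, self-contained sketch of this effect (the kernel and array names are illustrative assumptions): every warp that contains a mix of even and odd threads executes both branches below, with the inactive half masked out in each one.

__global__ void divergentKernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads of the same warp take different paths, so the warp executes both branches serially.
    if (threadIdx.x % 2 == 0) {
        out[i] = 2 * i;      // even threads
    } else {
        out[i] = 2 * i + 1;  // odd threads
    }
}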

Reducing Thread Divergence. Work reorganization: in case the workload is imbalanced, cheap balancing can lead to better occupancy. Example: a matrix with dimensions not divisible by the warp size, where item (i,j) has linear index i*width + j; a sketch of one possible reorganization follows.
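One way to realize such a reorganization (a sketch under the assumption that the elements are processed independently; kernel names are illustrative): instead of a 2D mapping, where the tail of every row produces a partially active warp, threads can be mapped to the linearized index, so at most the last warp of the whole grid is partially active.

// 2D mapping: partially active (divergent) warps at the end of each row when width % 32 != 0.
__global__ void process2D(float *data, int width, int height)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < height && j < width)
        data[i * width + j] *= 2.0f;
}

// 1D mapping over the linear index: only the very last warp of the grid may be partially active.
__global__ void process1D(float *data, int size)   // size == width * height
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        data[idx] *= 2.0f;
}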

Block-wise Synchronization. Memory fences: __threadfence_block(), __threadfence(), __threadfence_system(). Barrier: synchronization between warps of a block with __syncthreads(), __syncthreads_count(predicate), __syncthreads_and(predicate), __syncthreads_or(predicate). See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions. __threadfence_block() is the weakest fence and __threadfence_system() is the strongest; __syncthreads() is stronger than the thread fences.
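A typical use of the block barrier (a minimal sketch; the kernel name and the tile-reversal task are illustrative assumptions): all threads must reach __syncthreads() before any of them reads the values written by others.

#define BLOCK_SIZE 256

// Reverses each block-sized chunk of the input in shared memory.
// Launch with blockDim.x == BLOCK_SIZE.
__global__ void reverseChunk(const int *in, int *out)
{
    __shared__ int tile[BLOCK_SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];

    __syncthreads();  // make sure the whole tile is filled before anyone reads it

    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}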

GPU Memory (Revision). Figure: the memory hierarchy spans host and device. Host memory is connected to the CPU at roughly 25 GBps and to the GPU device over PCI Express (16/32 GBps); on the GPU chip, each SMP has its registers and an L1 cache, all SMPs share the L2 cache in front of the global memory, and global memory bandwidth exceeds 100 GBps. Note that details about the host memory interconnection are platform specific.

Global Memory Access Patterns. Perfectly aligned sequential access.

Global Memory Access Patterns. Perfectly aligned access with a permutation.

Global Memory Access Patterns. Continuous sequential access, but misaligned.

Global Memory. Impact of coalesced loads.
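A small sketch of the difference (illustrative kernels, assuming the input array is large enough for the given offset): the first kernel reads consecutive, properly aligned words per warp and coalesces into few transactions, while the shifted access in the second one may straddle segment boundaries.

// Coalesced: thread i reads word i, so each warp touches one aligned memory segment.
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Misaligned: the same sequential pattern shifted by 'offset' words can cross segment
// boundaries, so each warp may need an extra memory transaction.
__global__ void copyMisaligned(const float *in, float *out, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + offset];
}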

Shared Memory. Memory shared by the SM, divided into banks; each bank can be accessed independently. Consecutive 32-bit words are placed in consecutive banks; optionally, a 64-bit word division is used (CC 3.x). Bank conflicts are serialized, except for reading the same address (broadcast). In newer architectures (CC 5.x and 6.x), the size of the shared memory may vary a little, but the limit per thread block remains 48 kB.

Compute capability   Mem. size   # of banks   Bank bandwidth
1.x                  16 kB       16           32 bits / 2 cycles
2.x                  48 kB       32           32 bits / 2 cycles
3.x                  48 kB       32           64 bits / 1 cycle

Shared Memory: Linear Addressing. Each thread in the warp accesses a different memory bank; no collisions.

Shared Memory: Linear Addressing with Stride 2. Each thread accesses the 2*i-th item: 2-way conflicts (2x slowdown) on CC < 3.0, but no collisions on CC 3.x thanks to the 64 bits per cycle bank throughput.

Shared Memory: Linear Addressing with Stride 3. Each thread accesses the 3*i-th item: no collisions, since the number of banks is not divisible by the stride.

Shared Memory: Broadcast. One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20. Broadcasts are served one at a time on CC 1.x (i.e., the depicted pattern causes a 2-way conflict), while CC 2.x and newer serve multiple broadcasts simultaneously.
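A minimal sketch of the strided patterns discussed above (illustrative kernel; on 32-bank hardware the stride-2 read produces 2-way conflicts, the stride-3 read none):

__global__ void strideAccess(const float *in, float *out)
{
    __shared__ float buf[96];                   // 3 * 32 words spread across the 32 banks

    int lane = threadIdx.x % 32;                // lane index within the warp
    if (threadIdx.x < 96)
        buf[threadIdx.x] = in[threadIdx.x];     // fill the tile (conflict-free, stride 1)
    __syncthreads();

    float a = buf[2 * lane];  // stride 2: only 16 distinct banks are hit -> 2-way conflict (CC < 3.0)
    float b = buf[3 * lane];  // stride 3: 3 is coprime with 32 -> each lane hits a different bank
    out[threadIdx.x] = a + b;
}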

Registers. One register pool per multiprocessor with 8-64k 32-bit registers (depending on the CC); register allocation is determined by the compiler, and all blocks assigned to the SM share the register pool. Register pressure (heavy utilization) may limit the number of blocks running simultaneously and may also cause register spilling. Registers are as fast as the cores (no extra clock cycles), but a read-after-write dependency takes about 24 clock cycles, which can be hidden if there are enough active warps.

Local Memory. Per-thread global memory allocated automatically by the compiler; the compiler may report the amount of allocated local memory (use --ptxas-options=-v). Large local structures and arrays are placed here instead of in registers. Under register pressure, registers are spilled into the local memory.
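As an illustration (a hypothetical kernel, not from the slides), a large, dynamically indexed per-thread array like the one below is typically placed in local memory rather than registers, which compiling with --ptxas-options=-v will report:

__global__ void localMemExample(float *out)
{
    // Too large and dynamically indexed to live in registers -> placed in per-thread local memory.
    float scratch[256];

    for (int k = 0; k < 256; ++k)
        scratch[k] = k * 0.5f;

    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[threadIdx.x % 256];
}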

Memory Allocation. Global memory: cudaMalloc(), cudaFree(); dynamic in-kernel allocation with malloc() and free() called from the kernel (the heap size is set by cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)). Shared memory: allocated statically (e.g., __shared__ int foo[16];) or dynamically (sized by the 3rd kernel launch parameter):

extern __shared__ float bar[];
float *bar1 = &(bar[0]);
float *bar2 = &(bar[size_of_bar1]);
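A minimal sketch of a dynamic shared memory launch (the kernel and the sizes are illustrative assumptions): the third launch parameter gives the number of bytes backing the extern __shared__ array.

__global__ void dynSharedKernel(const float *in, float *out)
{
    extern __shared__ float tile[];           // sized at launch time
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}

// Host side: 256 threads per block, 256 floats of dynamic shared memory per block.
// dynSharedKernel<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out);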

Page-locked Memory. Page-locked (pinned) host memory is host memory that is prevented from being swapped out. It is created/released by cudaHostAlloc()/cudaFreeHost(), or existing allocations can be pinned with cudaHostRegister()/cudaHostUnregister(), optionally with the flags cudaHostAllocWriteCombined, cudaHostAllocMapped, and cudaHostAllocPortable. Copies between pinned host memory and the device can be performed asynchronously. Pinned memory is a scarce resource. Write-combined memory is optimized for writing and not cached on the CPU: on systems with a front-side bus, bandwidth between host memory and device memory is higher if the host memory is allocated as page-locked, and even higher if it is in addition allocated as write-combining. Write-combining memory frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application; in addition, it is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%.
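A small sketch of pinned allocation and an asynchronous copy (buffer size and names are illustrative; error checking omitted):

float *h_buf = nullptr;
float *d_buf = nullptr;
size_t bytes = 1 << 20;

cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned host buffer
cudaMalloc(&d_buf, bytes);

// The copy can only proceed asynchronously because h_buf is page-locked.
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
cudaDeviceSynchronize();

cudaFree(d_buf);
cudaFreeHost(h_buf);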

Memory Mapping. Device memory mapping allows the GPU to access portions of the host memory directly (i.e., without explicit copy operations), for both reading and writing. The memory must be allocated/registered with the cudaHostAllocMapped flag, and the context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags()). The function cudaHostGetDevicePointer() takes the host pointer and returns the corresponding device pointer. Note: atomic operations on mapped memory are not atomic from the perspective of the host (or other devices).
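A minimal sketch of such zero-copy mapping (variable names are illustrative; error checking omitted):

cudaSetDeviceFlags(cudaDeviceMapHost);        // must be set before the device context is created

int *h_data = nullptr;
int *d_data = nullptr;
cudaHostAlloc(&h_data, 1024 * sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_data, h_data, 0); // device-visible alias of the host buffer

// A kernel can now read/write h_data through d_data without an explicit cudaMemcpy:
// kernel<<<grid, block>>>(d_data);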

Heterogeneous Programming. The GPU is an “independent” device, controlled by the host and used for “offloading” work. The host code needs to be designed so that it utilizes the GPU(s) efficiently, keeps the CPU busy while the GPU is working, and avoids the CPU and the GPU waiting for each other. Remember Amdahl's law.

Heterogeneous Programming: a bad example. The CPU blocks on the synchronous calls, so the CPU and the GPU mostly take turns instead of working concurrently (in the timeline figure, the device is doing something useful only for short intervals):

cudaMemcpy(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
Kernel2<<<...>>>(...);

Overlapping Work. Overlapping CPU and GPU work: kernels are started asynchronously and can be waited for with cudaDeviceSynchronize(); a little more can be done with streams. Memory transfers: cudaMemcpy() is synchronous and blocking; alternatively, cudaMemcpyAsync() starts the transfer and returns immediately, and can be synchronized in the same way as a kernel.

Overlapping Work: using asynchronous transfers. The CPU queues the GPU work and keeps computing until the final synchronization; note that workload balance between the CPU and the GPU then becomes an issue:

cudaMemcpyAsync(HostToDevice);
Kernel1<<<...>>>(...);
cudaMemcpyAsync(DeviceToHost);
...
do_something_on_cpu();
cudaDeviceSynchronize();

Streams. A stream is an in-order GPU command queue: asynchronous GPU operations (kernel executions, memory data transfers) are registered in the queue, commands in different streams may overlap, and streams provide means for explicit and implicit synchronization. The default stream (stream 0) is always present, does not have to be created, and has global synchronization capabilities.

Streams. Stream creation:

cudaStream_t stream;
cudaStreamCreate(&stream);

Stream usage:

cudaMemcpyAsync(dst, src, size, kind, stream);
kernel<<<grid, block, sharedMem, stream>>>(...);

Stream destruction:

cudaStreamDestroy(stream);

Pipelining. Making good use of overlapping: split the work into smaller fragments and create a pipeline effect (load, process, store); a sketch follows.
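A minimal sketch of such a pipeline with two streams (the chunk count, the buffers N, h_in, h_out, d_in, d_out, and the kernel processChunk are illustrative assumptions; the host buffers are expected to be pinned so the copies really overlap):

const int chunks = 8;
const size_t chunkSize = N / chunks;

cudaStream_t streams[2];
for (int s = 0; s < 2; ++s)
    cudaStreamCreate(&streams[s]);

for (int c = 0; c < chunks; ++c) {
    cudaStream_t s = streams[c % 2];
    size_t off = c * chunkSize;

    // Load, process, and store of different chunks overlap across the two streams.
    cudaMemcpyAsync(d_in + off, h_in + off, chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    processChunk<<<chunkSize / 256, 256, 0, s>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunkSize * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
}

for (int s = 0; s < 2; ++s)
    cudaStreamSynchronize(streams[s]);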

Instruction Set. The GPU instruction set is oriented towards rendering and geometry calculations and offers a rich set of mathematical functions, many of which are implemented as instructions. There are separate functions for doubles and floats, e.g., sqrtf(float) and sqrt(double). Instruction behavior depends on compiler options: -use_fast_math (fast but lower precision), -ftz=bool (flush denormals to zero), -prec-div=bool (precise float divisions), -prec-sqrt=bool (precise float square roots), -fmad=bool (use fused multiply-add instructions, e.g., FFMA); these options affect single-precision floats only.
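A tiny illustration of the float/double split (hypothetical kernel; with -use_fast_math the sqrtf call may be replaced by a faster, less precise instruction):

__global__ void precisionDemo(const float *xf, const double *xd, float *of, double *od)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    of[i] = sqrtf(xf[i]);   // single-precision square root
    od[i] = sqrt(xd[i]);    // double-precision square root
}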

Atomic Instructions. Atomic instructions perform a read-modify-write operation on one 32-bit or 64-bit word in global or shared memory. They require CC 1.1 or higher (1.2 for 64-bit global atomics and 2.0 for 64-bit shared memory atomics). They operate on integers, except for atomicExch() and atomicAdd(), which also work on 32-bit floats. Atomic operations on mapped memory are atomic only from the perspective of the device, since they are usually performed in the L2 cache.

Atomic Instructions: overview. atomicAdd(&p,v), atomicSub(&p,v) atomically add/subtract v to/from p and return the old value; atomicInc(&p,v), atomicDec(&p,v) are atomic increment/decrement computed modulo v; atomicMin(&p,v), atomicMax(&p,v); atomicExch(&p,v) atomically swaps a value; atomicCAS(&p,cmp,v) is the classical compare-and-swap; atomicAnd(&p,v), atomicOr(&p,v), atomicXor(&p,v) are atomic bitwise operations.
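A typical use is a histogram (a minimal sketch; the kernel name and the 256-bin layout are illustrative assumptions):

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    // bins is assumed to hold 256 counters, one per possible byte value.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // concurrent increments of the same bin are serialized safely
}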

Warp Functions. Voting instructions: intrinsics that allow the whole warp to perform a reduction and broadcast in one step; only active threads participate. __all(predicate): all active threads evaluate the predicate, and the call returns non-zero if ALL predicates returned non-zero. __any(predicate): like __all, but the results are combined by logical OR. __ballot(predicate): returns a bitmask where each bit represents the predicate result of the corresponding thread.
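A small sketch (hypothetical kernel; recent CUDA versions use the *_sync variants with an explicit participation mask, which is what the sketch assumes, along with a fully active warp):

__global__ void voteDemo(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (in[i] > 0);

    unsigned mask = 0xffffffffu;                    // all 32 lanes participate
    int everyone  = __all_sync(mask, pred);         // non-zero if all lanes have a positive value
    int anyone    = __any_sync(mask, pred);         // non-zero if at least one lane does
    unsigned bits = __ballot_sync(mask, pred);      // bit k is set iff lane k's predicate was non-zero

    out[i] = everyone + anyone + (int)__popc(bits); // combine the results just to keep them live
}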

Warp Functions. Warp shuffle instructions: fast variable exchange within the warp, available on architectures with CC 3.0 or newer. Intrinsics: __shfl_sync() - direct copy from a given lane; __shfl_up_sync() - copy from a lane with a lower relative ID; __shfl_down_sync() - copy from a lane with a higher relative ID; __shfl_xor_sync() - copy from a lane whose ID is computed as XOR of the caller's ID. All functions have an optional width parameter that allows dividing the warp into smaller segments.
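The classic use is a warp-level reduction (a minimal sketch, assuming the full warp is active; the function name is illustrative):

__device__ float warpReduceSum(float val)
{
    unsigned mask = 0xffffffffu;
    // Each step halves the number of distinct partial sums; after 5 steps lane 0 holds the total.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(mask, val, offset);
    return val;   // the full sum is valid in lane 0; other lanes hold partial results
}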

Summary. Right amount of threads: saturate the SMPs, but avoid register spilling. SIMT: avoid warp divergence; synchronization within a block is cheap. Memory: overlap host-device transfers, use coalesced global memory transactions, and mind shared memory banking.

Discussion