Overview of GPGPU Architecture and Programming Paradigm
Outline
- GPGPU Architecture Overview
- Core Architecture
- Memory Hierarchy
- Interconnect
- CPU-GPU Interfacing
- Programming Paradigm
Basic Blocks
- Several shader cores / streaming multiprocessors (SM)
- Interconnection network
- On-chip memory controllers
- On-chip caches (level 1/2)
- Off-chip DRAM
Basic Blocks (continued)
[Figure: texture processor clusters 0..M, each containing SMs, connected through a high-bandwidth on-chip interconnect to the L2 cache and the memory controllers MC0..MCL, which drive the off-chip DRAM array. Each SM shows a thread scheduler, instruction cache, decoder, constant cache, texture cache, register file, shared memory, and a row of SPs.]
- A traditional GPU is a GP throughput architecture: a simple processor architecture, but a larger number of processors
- Complex interconnect
- Has a hardware scheduler too
- Large scratchpad memory (software-managed cache)
- Large register file (context switches are cheap)
- A single set of instructions is used by all threads
- Lightweight threads, grouped into thread blocks
- Thread batch: the hardware unit of thread execution (warp on Nvidia, wavefront on ATI)
  - Hardware thread scheduling
  - Threads have dedicated registers
  - Shared memory is shared among the threads of a thread block
  - Same PC for all threads in a warp
  - Separate ALU and memory pipelines
- GPU kernels are launched from the host and compiled with the CUDA compiler to PTX assembly:
  matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
  mov.s32 %r14, 15;
  and.b32 %r15, %r13, %r14;
  add.s32 %r16, %r15, %r12;
  shr.s32 %r17, %r16, 4;
  ...
Streaming Multiprocessor
- Multi-thread unit
- Instruction cache and decoder
- Several streaming processors (SP)
- Load-store and SFU units
- Large register file
- Shared memory
- Shared texture caches
- Constant cache
Examples: G80 and GT200
- MT unit: the (global block) scheduler
- TPC: texture processor cluster (a group of SMs that share the same texture unit)
- Two GPU generations are shown: G80 and GT200
Examples: GT300 (Fermi)
- Fermi's 16 SMs are positioned around a common L2 cache
- Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache)
Fermi Streaming Multiprocessor (SM)
Source: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Comparison G80 vs. GT200 vs. GT300 http://www.dvhardware.net/article38173.html
Example: GK110 (Kepler Architecture)
- More power-efficient than Fermi
- New SM architecture (SMX)
- Revamped memory architecture
- Hardware support for new programming models
- Capable of Dynamic Parallelism
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
Basic GPGPU Processor Pipeline
- Simple in-order execution in SIMT (single instruction, multiple threads) fashion
- The scheduler chooses one of several warps (PCs)
- Fetches 1 instruction from the I$ per warp
- Decodes the instruction, reads registers, and dispatches
- A scoreboard maintains dependencies
- A multi-ported register file provides data for all lanes
- Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
- Writeback updates the register file
SIMT
A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
Core Architecture
Inside Streaming Multiprocessor
Streaming Multiprocessor (G80)
- 8 Streaming Processors (SP)
- 2 Special Function Units (SFU)
- Multi-threaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers the latency of texture/memory loads
- 16 KB shared memory
Register File
- 8192 registers in each SM in G80
  - This is an implementation decision, not part of the programming abstraction
- Registers are dynamically partitioned across all blocks assigned to the SM
- Once assigned to a block, a register is NOT accessible by threads in other blocks
- Each thread accesses only the registers assigned to it
Thread Dispatch Policy
- Hierarchy: a grid of blocks of threads
- Blocks are serially distributed to the SMs
  - Potentially more than 1 block per SM
- Each SM launches warps (32 threads)
  - 2 levels of parallelism
- Round-robin, ready-to-execute scheduling policy
Figure source: Nvidia CUDA Programming Guide 2.3
Block Execution : Software View
SM block execution (G80)
- Assignment is at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads: 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
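The block-assignment arithmetic above can be sketched in plain C. This is only an illustrative model under stated assumptions: it uses the three G80 limits quoted in these slides (8 blocks, 768 threads, 8192 registers per SM) and ignores shared memory and any register allocation granularity. The name g80_blocks_per_sm is hypothetical, not a CUDA API.

```c
#include <assert.h>

/* G80 per-SM limits as quoted in the slides. */
enum {
    G80_MAX_BLOCKS_PER_SM  = 8,
    G80_MAX_THREADS_PER_SM = 768,
    G80_REGISTERS_PER_SM   = 8192
};

/* Hypothetical helper: number of blocks that can be resident on one SM,
 * taking the tightest of the block, thread, and register limits.
 * Shared memory usage is deliberately left out of this sketch. */
static int g80_blocks_per_sm(int threads_per_block, int regs_per_thread)
{
    assert(threads_per_block > 0 && regs_per_thread > 0);

    int by_threads = G80_MAX_THREADS_PER_SM / threads_per_block;
    int by_regs    = G80_REGISTERS_PER_SM / (threads_per_block * regs_per_thread);

    int blocks = G80_MAX_BLOCKS_PER_SM;
    if (by_threads < blocks) blocks = by_threads;
    if (by_regs < blocks)    blocks = by_regs;
    return blocks;
}
```

With 256 threads per block and 10 registers per thread the model gives 3 blocks; raising the register count to 11 drops it to 2, because a third block no longer fits in the 8192-entry register file.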
Block Execution : Software View Automatic Scalability
Block Execution : Hardware View
- Blocks are divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- With 3 blocks per SM and 256 threads per block, each block is divided into 256/32 = 8 warps, for a total of 8 * 3 = 24 warps
- Only 1 of the 24 warps is selected for instruction fetch and execution at a time
Warp Scheduling I
- Zero-overhead context switching
- Next warp: one whose next instruction has its operands ready (eligible warps)
- Prioritized scheduling policy (no details available)
- Threads in a warp execute the same instruction
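The warp count on this slide is a ceiling division by the warp size. A minimal C sketch follows; the helper names are hypothetical, and the rounding-up case for block sizes that are not a multiple of 32 is an addition the slide does not spell out.

```c
#include <assert.h>

enum { THREADS_PER_WARP = 32 };

/* Warps needed for one block; a partially filled last warp still
 * occupies a full warp slot, hence the ceiling division. */
static int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + THREADS_PER_WARP - 1) / THREADS_PER_WARP;
}

/* Warps resident on an SM that holds blocks_per_sm identical blocks. */
static int resident_warps(int blocks_per_sm, int threads_per_block)
{
    assert(blocks_per_sm >= 0);
    return blocks_per_sm * warps_per_block(threads_per_block);
}
```

For the slide's example, resident_warps(3, 256) reproduces the 24 warps among which the scheduler picks.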
Warp Scheduling II
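Latency Hiding
- In G80, 4 clock cycles are needed to dispatch the same instruction for all threads in a warp
- If one global memory access is needed for every 4 instructions, each warp supplies 4 * 4 = 16 cycles of work between memory accesses
- Hiding a 200-cycle DRAM latency therefore needs 200 / 16 = 12.5, i.e. 13 warps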
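The 13-warp figure can be reproduced with a small calculation. The sketch below assumes the simple model used on the slide: every warp alternates a fixed number of instructions with one memory access, and every instruction takes the same number of dispatch cycles. The function name is hypothetical.

```c
#include <assert.h>

/* Number of warps needed so that, while one warp waits on memory,
 * the others supply enough instruction cycles to cover the latency.
 * Rounds up, since a fraction of a warp cannot be scheduled. */
static int warps_to_hide_latency(int latency_cycles,
                                 int instrs_per_mem_access,
                                 int cycles_per_warp_instr)
{
    assert(latency_cycles >= 0);
    assert(instrs_per_mem_access > 0 && cycles_per_warp_instr > 0);

    int busy_cycles_per_warp = instrs_per_mem_access * cycles_per_warp_instr;
    return (latency_cycles + busy_cycles_per_warp - 1) / busy_cycles_per_warp;
}
```

With the slide's numbers (200-cycle latency, 4 instructions per access, 4 cycles per instruction) this gives 13.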
G80 Pipeline
- About 30 stages: fetch, decode, gather and write-back act on whole warps, with a throughput of 1 warp per slow clock
- Execute acts on groups of 8 threads (only 8 SPs per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks (see next slide)
- Fetch/decode stages have a higher throughput in order to feed both the MAD and the SFU/MUL units; the peak rate is 8 MAD + 8 MUL per (fast) clock cycle
- 8 blocks/SM, 512 threads/block, 32 threads/warp, 8 SPs
Hide read-after-write latency
- 1 memory access per 2 instructions
- 32 cycles per memory access
- 200 / 32 is about 6 warps
Note: these pipeline cycle counts were taken from the Nvidia forum; no document is available to verify the figures.
Execute Stage (no memory access)
[Figure: timeline of two consecutive instructions of one 32-thread warp on SP 1 to SP 8. Inst#1 runs in four passes (threads 1-8, 9-16, 17-24, 25-32), with thread t of each pass on SP ((t - 1) mod 8) + 1; Inst#2 then follows in the same pattern.]
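The thread-to-SP assignment shown in the execute-stage figure can be written as two one-line functions. This is a sketch of the figure's pattern only, using its 1-based thread and SP numbering; the function names are hypothetical.

```c
#include <assert.h>

enum { SPS_PER_SM = 8, WARP_THREADS = 32 };

/* SP (1..8) that executes the given warp thread (1..32). */
static int sp_for_thread(int thread)
{
    assert(thread >= 1 && thread <= WARP_THREADS);
    return (thread - 1) % SPS_PER_SM + 1;
}

/* Pass (0..3) in which the thread executes; each pass covers 8 threads,
 * so one warp instruction needs 32 / 8 = 4 passes. */
static int pass_for_thread(int thread)
{
    assert(thread >= 1 && thread <= WARP_THREADS);
    return (thread - 1) / SPS_PER_SM;
}
```

Threads 1, 9, 17 and 25 all land on SP 1, in passes 0 to 3, which is why one warp instruction occupies the SPs for 4 fast clocks.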
Instruction Buffer
- Fetch 1 warp instruction per cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue 1 "ready-to-go" warp instruction per cycle
  - From any warp / instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin / age of warp among the "ready-to-go" warps
- The SM broadcasts the same instruction to the 32 threads of a warp
Scoreboarding
- Register operands in the instruction buffer are scoreboarded
  - An instruction becomes ready when the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Instructions continue to issue until scoreboarding prevents it
  - Memory/processor ops can proceed in the shadow of other waiting memory/processor ops
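A scoreboard of this kind can be modelled as one "pending write" flag per register. The C sketch below is a simplified illustration, not the G80 implementation: the register count, the two-source instruction format, and all names (scoreboard_t, sb_can_issue, and so on) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { SB_NUM_REGS = 64 };  /* assumed register count for this sketch */

/* One flag per register: true while an issued instruction still has to
 * write that register. */
typedef struct { bool pending[SB_NUM_REGS]; } scoreboard_t;

/* Simplified warp instruction: one destination and up to two sources;
 * -1 marks an unused operand. */
typedef struct { int dst, src0, src1; } warp_instr_t;

static void sb_init(scoreboard_t *sb) { memset(sb, 0, sizeof *sb); }

static bool sb_reg_busy(const scoreboard_t *sb, int reg)
{
    assert(reg < SB_NUM_REGS);
    return reg >= 0 && sb->pending[reg];
}

/* Ready when no source is awaiting a write (read-after-write) and the
 * destination is not awaiting an earlier write (write-after-write). */
static bool sb_can_issue(const scoreboard_t *sb, const warp_instr_t *in)
{
    return !sb_reg_busy(sb, in->src0) &&
           !sb_reg_busy(sb, in->src1) &&
           !sb_reg_busy(sb, in->dst);
}

/* Issuing marks the destination as pending until writeback. */
static void sb_issue(scoreboard_t *sb, const warp_instr_t *in)
{
    assert(sb_can_issue(sb, in));
    if (in->dst >= 0) sb->pending[in->dst] = true;
}

/* Writeback deposits the value and clears the pending flag. */
static void sb_writeback(scoreboard_t *sb, int reg)
{
    assert(reg >= 0 && reg < SB_NUM_REGS);
    sb->pending[reg] = false;
}
```

After a load into r1 has been issued, an add that reads r1 is held back, while an unrelated multiply can still issue; that is the "proceed in the shadow" behaviour described above.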
Pathology: Warp Divergence
- Conditional branches split 32-thread warps
- Diverged threads are serialized
- Some SPs are idle while a serialized warp executes
- This lowers performance significantly
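The cost of a single if/else can be estimated from the set of threads that take the branch. The sketch below is a host-side C model under two assumptions that are not in the slides: both sides of the branch take equally long, and there is no nested divergence. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { WARP_LANES = 32 };

/* Count the set bits of a 32-bit mask (Kernighan's method). */
static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

/* Bit i of taken_mask is set when thread i takes the 'if' side.
 * A uniform warp runs one side only; a mixed warp runs both sides
 * one after the other, with the non-participating lanes idle. */
static int divergent_passes(uint32_t taken_mask)
{
    int taken = popcount32(taken_mask);
    return (taken == 0 || taken == WARP_LANES) ? 1 : 2;
}

/* Share of lane slots doing useful work, in percent. Under the
 * equal-length assumption every thread is useful in exactly one of
 * the passes, so utilization is 100 / passes. */
static int lane_utilization_percent(uint32_t taken_mask)
{
    return 100 / divergent_passes(taken_mask);
}
```

Even a single thread leaving the common path (mask 0x1) halves the utilization in this model, which is why branch conditions that are uniform across a warp are preferred.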
Memory Hierarchy
Memory Hierarchy
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories using the copy functions.
Figure source: Nvidia CUDA Programming Guide 2.3
Constant Memory
- Immediate address constants
- Indexed address constants
- Constants are stored in DRAM and cached on chip
  - L1 per SM
- Constants are broadcast to all threads in a warp
  - An efficient way of accessing a value that is common to all threads in a block
[Figure: SM datapath with I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD and SFU units.]
Shared Memory
- Each SM has 16 KB of shared memory
  - 16 banks of 32-bit words
- CUDA uses shared memory as shared storage visible to all threads in a thread block
  - Fast read and write access
- Not used explicitly for pixel shader programs: we dislike pixels talking to each other
[Figure: SM datapath with I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD and SFU units.]
Memory Banking: Parallel Memory Architecture
- Memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to the same bank result in a bank conflict
  - Conflicting accesses are serialized
[Figure: banks 0 to 15.]
No Bank Conflicts
- Linear addressing, stride == 1
- Random 1:1 permutation
[Figure: in both cases threads 0 to 15 map one-to-one onto banks 0 to 15.]
Bank Conflicts
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Figure: with stride 2, pairs of threads share a bank; with stride 8, eight threads (x8) fall on each of two banks.]
Memory Banks in G80 (shared memory bank conflicts)
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = (32-bit word address) % 16
  - 16 is also the size of a half-warp
- Memory access scheduling is half-warp based
  - There are no bank conflicts between different half-warps, only within a single half-warp
Memory Access Optimizations
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
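The cost rule above can be checked with a small host-side model. The C sketch below assumes 16 banks of 32-bit words (the G80 figures from these slides) and counts, per bank, how many distinct words a half-warp requests. Treating threads that read the same word as a single access generalizes the slide's all-threads-same-address broadcast case; real hardware may behave differently for partial overlaps, so this is a simplification. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { SMEM_BANKS = 16, HALF_WARP_THREADS = 16 };

/* word_addr[t] is the 32-bit word index accessed by thread t of a
 * half-warp. Returns the serialization cost: the largest number of
 * distinct words requested from any single bank (1 = conflict-free). */
static int half_warp_conflict_cost(const uint32_t word_addr[HALF_WARP_THREADS])
{
    int worst = 1;
    for (uint32_t bank = 0; bank < SMEM_BANKS; ++bank) {
        uint32_t seen[HALF_WARP_THREADS];
        int distinct = 0;
        for (int t = 0; t < HALF_WARP_THREADS; ++t) {
            if (word_addr[t] % SMEM_BANKS != bank) continue;
            int duplicate = 0;
            for (int k = 0; k < distinct; ++k)
                if (seen[k] == word_addr[t]) duplicate = 1;
            if (!duplicate) seen[distinct++] = word_addr[t];
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}
```

Feeding it word indices 0, 2, 4, ... (stride 2) yields a cost of 2, matching the 2-way conflict case, while sixteen copies of the same index yield 1.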
Memory Access Optimizations (continued)
Given:
  __shared__ float shared[256];
  float foo = shared[baseIndex + s * threadIdx.x];
- This is bank-conflict-free only if s shares no common factors with the number of banks
  - There are 16 banks on G80, so s must be odd
[Figure: for s = 1 and for s = 3, threads 0 to 15 each map to a distinct bank among banks 0 to 15.]
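For the strided pattern above, the degree of conflict equals the greatest common divisor of s and the number of banks, which is why odd strides are conflict-free with 16 banks. The C sketch below shows that shortcut next to a brute-force count for cross-checking. It assumes 16 banks and a 16-thread half-warp as on G80, and non-negative indices; the function names are hypothetical.

```c
#include <assert.h>

enum { G80_BANKS = 16, G80_HALF_WARP = 16 };

static int gcd_int(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Conflict degree for shared[baseIndex + s * threadIdx.x]:
 * 1 means conflict-free, n means an n-way conflict.
 * s == 0 is the broadcast case (every thread reads the same word). */
static int stride_conflict_degree(int s)
{
    assert(s >= 0);
    if (s == 0) return 1;
    return gcd_int(s, G80_BANKS);
}

/* Brute-force cross-check: count how many threads of a half-warp hit
 * each bank and return the maximum. */
static int stride_conflict_brute(int base_index, int s)
{
    assert(base_index >= 0 && s >= 0);
    if (s == 0) return 1;
    int hits[G80_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < G80_HALF_WARP; ++t) {
        int bank = (base_index + s * t) % G80_BANKS;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

Both functions agree that s = 1 and s = 3 are conflict-free, s = 2 is a 2-way conflict, and s = 8 is an 8-way conflict; the base index only rotates the banks and does not change the degree.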
Global Memory vs. Shared Memory
- Global memory: use coalesced reads. The global memory interface is 64 bytes wide, so the 16 32-bit accesses of a half-warp can be coalesced into a single transfer.
- Alternatively, read through a texture with good spatial locality within each warp.
- Shared memory: use the multiple (16) banks for independent accesses from a half-warp.
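Whether a half-warp's 32-bit loads fit one 64-byte transfer can be tested on the host. The C sketch below assumes the strict rule commonly described for first-generation CUDA devices: the first address is 64-byte aligned and thread k reads the k-th consecutive 4-byte word. That rule is an assumption of this example rather than something stated on the slide, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { COALESCE_THREADS = 16, WORD_BYTES = 4, SEGMENT_BYTES = 64 };

/* byte_addr[k] is the byte address read by thread k of a half-warp.
 * Returns 1 when the 16 reads cover exactly one aligned 64-byte
 * segment in thread order, 0 otherwise. */
static int is_coalesced_half_warp(const uint64_t byte_addr[COALESCE_THREADS])
{
    if (byte_addr[0] % SEGMENT_BYTES != 0) return 0;
    for (int k = 0; k < COALESCE_THREADS; ++k)
        if (byte_addr[k] != byte_addr[0] + (uint64_t)k * WORD_BYTES) return 0;
    return 1;
}
```

A float array indexed by threadIdx.x from an aligned base passes this check; the same array shifted by one element, or indexed with stride 2, does not.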
Interconnect
GPGPU μarch : Interconnect Topology GPU Interconnect DRAM controller is on-chip DRAM is off-chip
CPU-GPU Interfacing
Device Memory Management
- cudaMalloc()
  - Allocates an object in the device global memory
  - Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
- cudaFree()
  - Frees an object from the device global memory
  - Takes a pointer to the freed object
[Figure: host and device memory model: a grid with blocks (0, 0) and (1, 0), each with its own shared memory and per-thread registers, above the device global memory that the host accesses.]
Figure source: Nvidia CUDA Programming Guide 2.3
Device Memory Management: code example
- Allocate a 64 * 64 single-precision float array
- Attach the allocated storage to Md
- "d" is often used to indicate a device data structure
  const int TILE_WIDTH = 64;
  float* Md;
  int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
  cudaMalloc((void**)&Md, size);
  cudaFree(Md);
Host-Device Communication
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters: pointer to destination, pointer to source, number of bytes copied, type of transfer
  - Types of transfer: Host to Host, Host to Device, Device to Host, Device to Device
- Asynchronous transfers are also available (cudaMemcpyAsync())
[Figure: host and device memory model: a grid with blocks (0, 0) and (1, 0), each with its own shared memory and per-thread registers, above the device global memory that the host accesses.]
Figure source: Nvidia CUDA Programming Guide 2.3
Host-Device Communication: code example
- Transfer a 64 * 64 single-precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
Programming Paradigm
Programming Abstraction I
- Threads are grouped as grids of blocks of threads
- 3D indexing for each thread
- Kernel executions are serialized
Figure source: Nvidia CUDA Programming Guide 2.3
Programming Paradigm: Compute Unified Device Architecture
- Hierarchical thread organization; each level has a special purpose
  - Instruction sharing (top)
  - Resource sharing
  - Execution unit sharing (bottom)
- The kernel defines the behavior of a single thread
  - The thread identifier distinguishes the data each thread works on
- Compiler optimization is difficult
  - Hardware resource sharing varies between the layers
Typical Throughput Kernel Launch Example
- CUDA, OpenCL, etc. implement a programming model for throughput computing
- Threads are organized hierarchically, at different levels of sharing abstraction:
  - Kernel -> single-thread instructions are defined (SIMT): instruction sharing
  - Block -> resource sharing (shared memory, register file, etc.)
  - Warp -> ALU sharing
Programming Abstraction II Source: David Kirk & Wen-mei Hwu lectures (http://courses.ece.illinois.edu/ece498/al/Syllabus.html)
CUDA – Square Matrix Multiplication
- P = M * N, each of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory
CUDA – Square Matrix Multiplication
Memory layout of a matrix in C (row-major)
  M0,0 M1,0 M2,0 M3,0
  M0,1 M1,1 M2,1 M3,1
  M0,2 M1,2 M2,2 M3,2
  M0,3 M1,3 M2,3 M3,3
M, linearized in memory:
  M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
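The linearized order above corresponds to the usual row-major index computation in C. A minimal sketch follows, with a hypothetical helper name; note that the figure writes elements as M(column, row), so the first subscript is the one that varies fastest in memory.

```c
#include <assert.h>

/* Offset of element (row, col) in a width x width matrix stored
 * row by row, as C does for arrays. */
static int row_major_index(int row, int col, int width)
{
    assert(width > 0);
    assert(row >= 0 && row < width && col >= 0 && col < width);
    return row * width + col;
}
```

In the figure's notation M2,1 is column 2 of row 1, so it sits at offset row_major_index(1, 2, 4), the seventh position of the linearized list.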
CUDA – Square Matrix Multiplication
CPU version in C

  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
      for (int i = 0; i < Width; ++i)
          for (int j = 0; j < Width; ++j) {
              double sum = 0;
              // DOT-product loop
              for (int k = 0; k < Width; ++k) {
                  double a = M[i * Width + k];
                  double b = N[k * Width + j];
                  sum += a * b;
              }
              P[i * Width + j] = sum;
          }
  }

CUDA – Square Matrix Multiplication
GPU version in CUDA (host side)

  void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
  {
      int size = Width * Width * sizeof(float);
      float *Md, *Nd, *Pd;   // each pointer needs its own '*'

      // 1. Allocate and load M, N to device memory (input matrix data transfer)
      cudaMalloc((void**)&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc((void**)&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

      // Allocate P on the device
      cudaMalloc((void**)&Pd, size);

      // 2. Kernel invocation code (to be shown later)

      // 3. Read P from the device (output matrix data transfer)
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

      // Free device matrices
      cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
  }

CUDA – Square Matrix Multiplication
Matrix multiplication kernel: per-thread code

  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      // Pvalue accumulates the element of Pd computed by this thread
      float Pvalue = 0;

      // DOT-product loop
      for (int k = 0; k < Width; ++k) {
          float Melement = Md[threadIdx.y * Width + k];
          float Nelement = Nd[k * Width + threadIdx.x];
          Pvalue += Melement * Nelement;
      }

      Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
  }

[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd along k, and writes element (tx, ty) of Pd; all three matrices are WIDTH x WIDTH.]
CUDA – Square Matrix Multiplication
- One block of threads computes the matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one MAD for each pair of Md and Nd elements
  - Has a compute to off-chip memory access ratio close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block (see next slide)
[Figure: Grid 1 with a single Block 1; thread (2, 2) reads a row of Md and a column of Nd to produce one element of the WIDTH x WIDTH matrix Pd.]
CUDA – Square Matrix Multiplication
- Each 2D thread block computes a (BLOCK_SIZE)^2 sub-matrix (tile) of the result matrix
  - Blocks have (BLOCK_SIZE)^2 threads
- Generate a 2D grid of (WIDTH/BLOCK_SIZE)^2 blocks
- Note: you still need to put a loop around the kernel call for cases where WIDTH/BLOCK_SIZE is greater than the max grid size (64K)!
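The tiling arithmetic can be tried out on the host before writing the kernel. The C sketch below computes the grid dimension and the result element owned by a given thread; bx/by and tx/ty stand in for blockIdx and threadIdx. Rounding the grid dimension up and adding a bounds check for widths that are not a multiple of BLOCK_SIZE are extensions beyond the slide, which assumes an exact division. All names are hypothetical.

```c
#include <assert.h>

typedef struct { int row; int col; } element_t;

/* Blocks needed along one grid dimension (rounded up). */
static int grid_dim(int width, int block_size)
{
    assert(width > 0 && block_size > 0);
    return (width + block_size - 1) / block_size;
}

/* Element of the result matrix computed by thread (tx, ty) of block
 * (bx, by): the block selects the tile, the thread selects the
 * position inside the tile. */
static element_t element_for_thread(int bx, int by, int tx, int ty,
                                    int block_size)
{
    assert(tx >= 0 && tx < block_size && ty >= 0 && ty < block_size);
    element_t e = { by * block_size + ty, bx * block_size + tx };
    return e;
}

/* Threads of edge tiles may fall outside the matrix and must skip
 * their store when width is not a multiple of block_size. */
static int in_bounds(element_t e, int width)
{
    return e.row < width && e.col < width;
}
```

For WIDTH = 1024 and BLOCK_SIZE = 16 this gives a 64 x 64 grid, and thread (5, 7) of block (2, 3) owns row 55, column 37 of the result.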
Software Stack & Compilation Trajectory
Figure source: "Analyzing CUDA Workloads Using a Detailed GPU Simulator", Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt, ISPASS 2009
PTX ISA
- Virtual ISA for Nvidia GPUs
- Stable ISA that spans multiple GPU generations
- Machine-independent ISA for C/C++ and other compilers to target
- Code distribution ISA for application and middleware developers
- Facilitates hand-coding of libraries, performance kernels, and architecture tests
Benefits of Parallel Thread eXecution ISA
- Adaptable: virtual ISA for Nvidia GPUs
- Stable: ISA that spans multiple GPU generations
- Scalable and machine-independent: ISA for C/C++ and other compilers to target
- Code distribution ISA for application and middleware developers
- Facilitates hand-coding of libraries, performance kernels, and architecture tests