
1 Overview of GPGPU Architecture and Programming Paradigm

2 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

3 Basic Blocks
Several shader cores / streaming multiprocessors (SMs), an interconnection network, on-chip memory controllers, on-chip caches (level 1/2), and off-chip DRAM.

4 Basic Blocks (continued)
Hardware thread scheduling.
Figure: the GPU consists of texture processor clusters (Texture Processor Cluster 0 … M) of SMs connected through a high-bandwidth on-chip interconnect to memory controllers (MC0 … MCL) with L2 cache slices and off-chip DRAM arrays. Each streaming multiprocessor contains a thread scheduler, instruction cache and decoder, constant and texture caches, shared memory, a register file, and an array of SPs. A kernel launch such as matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB); is compiled with the CUDA compiler into PTX assembly (e.g. mov.s32 %r14, 15; and.b32 %r15, %r13, %r14; add.s32 %r16, %r15, %r12; shr.s32 %r17, %r16, 4; ...), and the same instruction stream is used by all threads. The hardware unit of thread execution is the thread batch, called a warp on Nvidia and a wavefront on ATI.
Notes: the traditional GPU is a general-purpose throughput architecture: a simple processor architecture but a larger number of processors, a complex interconnect, a hardware thread scheduler, a large scratchpad memory (software-managed cache), and a large register file that makes context switching cheap. Lightweight threads are grouped into thread blocks; each thread has dedicated registers, shared memory is shared within a thread block, all threads in a warp share the same PC, and the ALU and memory pipelines are separate.

5 Streaming Multiprocessor
Multithreaded instruction unit, instruction cache/decoder, several streaming processors (SPs), load-store/SFU units, a large register file, shared memory, shared texture caches, and a constant cache.

6 Examples: G80 and GT200
MT unit: global (block) scheduler. TPC (texture processor cluster): a group of SMs sharing the same texture unit. Two GPU generations, G80 and GT200, are shown.

7 Example: GT300 (Fermi)
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache). Figure: Fermi Streaming Multiprocessor (SM).

8 Comparison G80 vs. GT200 vs. GT300

9 Example: GK110 (Kepler Architecture)
More power efficient than Fermi. New SM architecture (SMX). Revamped memory architecture. Hardware support for new programming models. Capable of Dynamic Parallelism. Source:

10 Basic GPGPU Processor Pipeline
Simple in-order execution in SIMT (single instruction, multiple threads). The scheduler chooses one of several warps (PCs) and fetches one instruction from the I-cache per warp. The instruction is decoded, registers are read, and the instruction is dispatched; a scoreboard maintains dependencies. A multi-ported register file provides data for all lanes, and numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds). Writeback updates the register file.
SIMT note: a key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
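A minimal SIMT sketch (hypothetical kernel, not from the slides): each thread runs ordinary scalar code, and the hardware issues each instruction warp by warp across the lanes described above.

// Every thread executes this scalar code independently; the scheduler
// issues the instructions one 32-thread warp at a time.
__global__ void scaleAdd(float* y, const float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique scalar thread index
    if (i < n)                                      // threads past n simply do nothing
        y[i] = a * x[i] + y[i];
}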

11 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

12 Inside Streaming Multiprocessor
Streaming Multiprocessor (G80): 8 Streaming Processors (SP), 2 Special Function Units (SFU). Multi-threaded instruction dispatch: 1 to 512 threads active, shared instruction fetch per 32 threads, covers latency of texture/memory loads. 16 KB shared memory.

13 Register File 8192 registers in each SM in G80
Implementation decision, not part of the programming abstraction. Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is NOT accessible by threads in other blocks, and each thread only accesses the registers assigned to it. For example, if each thread uses 10 registers and a block has 256 threads, at most three such blocks (3 x 256 x 10 = 7680 registers) fit in the 8192-register file of an SM at once.

14 Thread Dispatch Policy
Hierarchy of grids of blocks of threads. Blocks are serially distributed to SMs, potentially more than one block per SM. The SM launches warps (32 threads), giving two levels of parallelism. Round-robin, ready-to-execute scheduling policy. Figure source: Nvidia CUDA Programming Guide 2.3

15 Block Execution : Software View
SM block execution (G80) Assignment in block granularity Up to 8 blocks/SM as resource allows SM in G80 can take up to 768 threads 256 (threads/block) * 3 blocks Or 128 (threads/block) * 6 blocks, etc. Threads run concurrently

16 Block Execution : Software View
Automatic Scalability

17 Block Execution : Hardware View
Blocks are divided into 32-thread warps. This is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. If there are 3 blocks per SM and each block has 256 threads, each block is divided into 256/32 = 8 warps, for a total of 8 * 3 = 24 warps. At any time, only 1 of the 24 warps is selected for instruction fetch and execution.

18 Warp Scheduling I
Zero-overhead context switching. The next warp is one whose next instruction has its operands ready (an eligible warp); among eligible warps a prioritized scheduling policy is used (no details available). All threads in a warp execute the same instruction.

19 Warp Scheduling II

20 Latency Hiding
In G80, 4 clock cycles are needed to dispatch the same instruction for all threads in a warp. If one global memory access is needed for every 4 instructions, each warp performs 4 x 4 = 16 cycles of work between memory accesses, so about 200 / 16 ≈ 13 warps are needed to hide a 200-cycle DRAM latency.

21 G80 Pipeline
~30 stages: fetch, decode, gather, and write-back act on whole warps, with a throughput of 1 warp per slow clock. Execute acts on a group of 8 threads (only 8 SPs per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks (see next slide). The fetch/decode stages have a higher throughput to feed both the MAD and the SFU/MUL units, giving a peak rate of 8 MAD + 8 MUL per (fast) clock cycle.
Figure annotations: 8 blocks/SM, 512 threads/block, 32-thread warps over 8 SPs.
Hiding read-after-write latency: with 1 memory access per 2 instructions and 32 cycles per memory access, 200 / 32 ≈ 6 warps are needed.
These pipeline cycle counts were retrieved from the Nvidia forum; there is no available documentation to verify the figures.

22 Execute Stage (no memory access)
Figure: timeline of the execute stage for one warp on SP1-SP8. Instruction #1 is issued to SP1-SP8 for threads 1-8, then for threads 9-16, 17-24, and 25-32 on successive fast clocks; instruction #2 then repeats the same pattern. Each instruction of a 32-thread warp therefore occupies the 8 SPs for 4 fast clock cycles.

23 Instruction Buffer Fetch 1 warp instruction/cycle
From the instruction L1 cache into any instruction buffer slot. Issue one "ready-to-go" warp instruction per cycle, from any warp / instruction buffer slot. Operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin / age of warp among the "ready-to-go" warps. The SM broadcasts the same instruction to the 32 threads of a warp.

24 Scoreboarding
Register operands in the instruction buffer are scoreboarded. An instruction becomes ready when the needed values are deposited; this prevents hazards, and cleared instructions are eligible for issue. The memory and processor pipelines are decoupled: the SM continues to issue instructions until scoreboarding prevents further issue, which allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops.

25 Pathology: Warp Divergence
Conditional branches can split a 32-thread warp into divergent paths. The diverged paths are serialized, so some SPs sit idle while each path executes, which lowers performance significantly.
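A hedged illustration (hypothetical kernel, not from the slides): a branch that depends on the thread index forces the two paths to run one after the other within a warp.

// Threads in the same 32-thread warp take different paths,
// so the hardware serializes the two paths.
__global__ void divergent(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)             // even and odd lanes of a warp diverge
        out[i] = in[i] * 2.0f;  // half the SPs idle while this path runs
    else
        out[i] = in[i] + 1.0f;  // the other half idle during this path
}
// Branching on warp-aligned ranges (e.g. on blockIdx.x, or on i / 32) instead
// keeps all threads of a warp on the same path and avoids the serialization.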

26 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

27 Memory Hierarchy Each thread can: R/W per-thread registers
R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read-only per-grid constant memory; read-only per-grid texture memory. The host can read/write global, constant, and texture memory using the copy functions (cudaMemcpy). Figure source: Nvidia CUDA Programming Guide 2.3

28 Constant Memory
Immediate-address constants and indexed-address constants. Constants are stored in DRAM and cached on chip (per-SM L1 constant cache). Constants are broadcast to all threads in a warp, which is an efficient way of accessing a value that is common to all threads in a block.
Figure: SM datapath (I-cache, multithreaded instruction buffer, register file, constant cache, shared memory, operand select, MAD and SFU units).
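A small sketch of the pattern described above (the array name and size are illustrative): constants are written from the host with cudaMemcpyToSymbol and then read with the same index by every thread, so one constant-cache access is broadcast to the whole warp.

__constant__ float filter[16];          // cached per SM, read-only in kernels

__global__ void applyFilter(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k)    // every thread reads filter[k] with the
            acc += filter[k] * in[i];   // same k, so the cached value is broadcast
        out[i] = acc;
    }
}

// Host side, before launching the kernel:
// cudaMemcpyToSymbol(filter, hostFilter, 16 * sizeof(float));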

29 Shared Memory
Each SM has 16 KB of shared memory, organized as 16 banks of 32-bit words. CUDA uses shared memory as shared storage visible to all threads in a thread block, with fast read and write access. It is not used explicitly by pixel shader programs ("we dislike pixels talking to each other").
Figure: the same SM datapath diagram as the previous slide, with the shared memory highlighted.
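A hedged sketch of the usage pattern implied above (kernel name and block size are illustrative): a block stages data in shared memory, synchronizes, and then threads read values written by other threads of the same block.

#define BLOCK 128                          // assumed block size for this sketch

__global__ void reverseInBlock(float* d)   // assumes gridDim.x * BLOCK elements
{
    __shared__ float s[BLOCK];             // per-block tile in the 16 KB shared memory
    int i = blockIdx.x * BLOCK + threadIdx.x;
    s[threadIdx.x] = d[i];                 // each thread writes one element
    __syncthreads();                       // make the writes visible block-wide
    d[i] = s[BLOCK - 1 - threadIdx.x];     // read an element written by another thread
}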

30 Memory Banking Parallel Memory Architecture
Memory is divided into banks; this is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to the same bank result in a bank conflict, and conflicting accesses are serialized. Figure: shared memory drawn as banks 0 through 15.

31 No Bank Conflicts
Linear addressing with stride 1, and a random 1:1 permutation of threads to banks, both cause no bank conflicts.
Figure: threads 0-15 mapped one-to-one onto banks 0-15 in both cases.

32 Bank Conflicts
Linear addressing with stride 2 causes 2-way bank conflicts; linear addressing with stride 8 causes 8-way bank conflicts.
Figure: with stride 2, pairs of threads collide on the even-numbered banks; with stride 8, eight threads collide on each of bank 0 and bank 8.

33 Memory Banks in G80
Each bank has a bandwidth of 32 bits per clock cycle, and successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = (32-bit word address) % 16, which is the same as the size of a half-warp. Memory access scheduling is half-warp based: there are no bank conflicts between different half-warps, only within a single half-warp.

34 Memory Access Optimizations
Shared memory is as fast as registers if there are no bank conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast). The slow case (bank conflict): multiple threads in the same half-warp access the same bank, the accesses must be serialized, and the cost is the maximum number of simultaneous accesses to a single bank.

35 Memory Access Optimizations (continued)
Given:
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
This is only bank-conflict-free if s shares no common factors with the number of banks (16 on G80), so s must be odd.
Figure: for s = 1, thread i maps to bank i; for s = 3, the threads map to a permutation of the 16 banks, so both cases are conflict-free.
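A common remedy, sketched here under the 16-bank G80 assumption (tile name and size are illustrative), is to pad a 2D shared-memory tile by one column so that column-wise accesses no longer hit the same bank.

// Without padding, tile[threadIdx.x][c] for a fixed c puts all 16 threads of a
// half-warp on the same bank (stride 16 words). Padding each row by one word
// makes the effective stride 17, which is odd, so the accesses spread over all banks.
__shared__ float tile[16][16 + 1];     // +1 column of padding avoids bank conflicts
float v = tile[threadIdx.x][c];        // conflict-free column read (c fixed per half-warp)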

36 Global Memory vs. Shared Memory
Use coalesced reads: the global memory interface is 64 bytes wide, so it can coalesce 16 x 32-bit words from a half-warp into one transaction. Alternatively, read through a texture with good spatial locality within each warp. For shared memory, use the 16 banks for independent accesses from a half-warp.
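A hedged illustration of the coalescing rule (hypothetical kernels): indexing global memory with the thread index gives a half-warp the 16 consecutive 32-bit words it needs for one coalesced transaction, while a strided index breaks it into separate transactions.

__global__ void copyCoalesced(float* dst, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];          // half-warp reads 16 consecutive words: coalesced
}

__global__ void copyStrided(float* dst, const float* src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];          // neighbours touch far-apart words: not coalesced
}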

37 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

38 GPGPU μarch : Interconnect Topology
GPU Interconnect DRAM controller is on-chip DRAM is off-chip

39 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

40 Device Memory Management
cudaMalloc(): allocates an object in device global memory; requires two parameters, the address of a pointer to the allocated object and the size of the allocated object. cudaFree(): frees an object from device global memory, given a pointer to the freed object.
Figure: CUDA memory model (host, grid of blocks with per-block shared memory and per-thread registers, device global memory). Figure source: Nvidia CUDA Programming Guide 2.3

41 Device Memory Management
Code example: allocate a 64 * 64 single-precision float array and attach the allocated storage to Md ("d" is often used to indicate a device data structure).
int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
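cudaMalloc and cudaFree return a cudaError_t, so a more careful version of the snippet above would check the status (a sketch, reusing the same variable names; host code, assumes <stdio.h>).

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));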

42 Host-Device Communication
cudaMemcpy(): memory data transfer; requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device). Asynchronous transfers are also available (sketched after the next slide's example).
Figure: CUDA memory model (host, grid, shared memory, registers, global memory). Figure source: Nvidia CUDA Programming Guide 2.3

43 Host - Device Communication
Code example: transfer a 64 * 64 single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants.
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
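The asynchronous variant mentioned on the previous slide is cudaMemcpyAsync, which returns before the copy completes and is typically paired with a stream (a sketch reusing the names above; the stream is an added assumption).

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(Md, M, size, cudaMemcpyHostToDevice, stream);  // returns immediately
// ... queue kernels on the same stream here ...
cudaStreamSynchronize(stream);   // wait for the copy and queued work to finish
cudaStreamDestroy(stream);
// For real copy/compute overlap, M should be pinned host memory (cudaMallocHost).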

44 Outline: GPGPU Architecture Overview (Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing); Programming Paradigm

45 Programming Abstraction I
Threads are grouped as grids of blocks of threads, with 3D indexing for each thread. Kernel executions are serialized. Figure source: Nvidia CUDA Programming Guide 2.3
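A small sketch of how the per-thread indices combine into a global position (2D case; kernel and parameter names are illustrative).

__global__ void fill2D(float* a, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the grid
    if (x < width && y < height)
        a[y * width + x] = (float)(y * width + x);   // each thread owns one element
}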

46 Programming Paradigm: Compute Unified Device Architecture
Hierarchical thread organization, where each level has a special purpose: instruction sharing at the top, resource sharing in the middle, and execution-unit sharing at the bottom. The programmer defines single-thread behavior, and the thread identifier distinguishes the data each thread works on. Compiler optimization is difficult because hardware resource sharing varies across the layers. Figure: a typical throughput kernel launch.
Notes: CUDA, OpenCL, etc. implement a programming model for throughput computing; threads are organized hierarchically at different levels of sharing abstraction. Kernel: single-thread instructions are defined (SIMT), instruction sharing. Block: resource sharing (shared memory, register file, etc.). Warp: ALU sharing.

47 Programming Abstraction II
Source: David Kirk & Wen-mei Hwu lectures (

48 CUDA – Square Matrix Multiplication
P = M * N Each of size WIDTH x WIDTH Without tiling: One thread calculates one element of P M and N are loaded WIDTH times from global memory

49 CUDA – Square Matrix Multiplication
Memory layout of a matrix in C: the 4 x 4 matrix M is stored in row-major order, so the 2D array is linearized as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 in consecutive memory locations.

50 CUDA – Square Matrix Multiplication
CPU version in C:
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {   // dot-product loop
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

51 CUDA – Square Matrix Multiplication
GPU version in CUDA:
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory (input matrix data transfer)
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&Pd, size);

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device (output matrix data transfer)
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

52 CUDA – Square Matrix Multiplication
// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue stores the element of Pd computed by this thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {            // dot-product loop
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
Figure: thread (tx, ty) walks across row ty of Md and down column tx of Nd to produce element (tx, ty) of Pd.

53 CUDA – Square Matrix Multiplication
One block of threads computes the matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md and a column of matrix Nd and performs one MAD for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high), and the size of the matrix is limited by the number of threads allowed in a thread block (see next slide).
Figure: grid 1 contains a single block 1; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd.
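The kernel invocation deferred in step 2 of MatrixMulOnDevice could, for this single-block version, look roughly like the following sketch.

// Single-block launch: Width * Width threads compute all of Pd,
// so Width * Width must stay within the 512-threads-per-block limit of G80.
dim3 dimBlock(Width, Width);
dim3 dimGrid(1, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);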

54 CUDA – Square Matrix Multiplication
Each 2D thread block computes a BLOCK_SIZE x BLOCK_SIZE sub-matrix (tile) of the result matrix, so blocks have BLOCK_SIZE^2 threads. Generate a 2D grid of (WIDTH/BLOCK_SIZE)^2 blocks. Note: you still need to put a loop around the kernel call for cases where WIDTH/BLOCK_SIZE is greater than the max grid size (64K per dimension)! A possible launch configuration is sketched below.
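A sketch of the tiled launch configuration, using the names from the slide; it assumes the kernel also folds blockIdx into the row/column calculation (as on the indexing slide earlier), and that Width is a multiple of BLOCK_SIZE.

// Tiled launch: each block computes one BLOCK_SIZE x BLOCK_SIZE tile of Pd.
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);   // assumes Width % BLOCK_SIZE == 0
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
// If Width / BLOCK_SIZE exceeds the 64K-per-dimension grid limit,
// wrap the launch in a loop over sub-grids as the slide notes.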

55 Software Stack & Compilation Trajectory
Figure source: "Analyzing CUDA Workloads Using a Detailed GPU Simulator", Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt, ISPASS 2009.

56 PTX ISA Virtual ISA for Nvidia GPUs Stable ISA that spans multiple GPU generations Machine-independent ISA for C/C++ and other compilers to target Code distribution ISA for application and middleware developers Facilitate hand-coding of libraries, performance kernels, and architecture tests

57 Benefits of Parallel Thread eXecution ISA
Adaptable Virtual ISA for Nvidia GPUs Stable ISA that spans multiple GPU generations Scalable and Machine-independent ISA for C/C++ and other compilers to target Code distribution ISA for application and middleware developers Facilitate hand-coding of libraries, performance kernels, and architecture tests

