Overview of GPGPU Architecture and Programming Paradigm


Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

Basic Blocks
- Several shader cores / streaming multiprocessors (SMs)
- Interconnection network
- On-chip memory controllers
- On-chip caches (level 1/2)
- Off-chip DRAM

Basic Blocks (block diagram)
[Figure: GPU block diagram. Texture Processor Clusters 0..M each group several SMs; a high-bandwidth on-chip interconnect links them to memory controllers MC0..MCL, an L2 cache, and the off-chip DRAM array. Inside an SM: thread scheduler, instruction cache, decoder, constant cache, texture cache, shared memory, register file, and an array of SPs. A kernel launch such as matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB); is compiled with the CUDA compiler into PTX assembly, e.g. mov.s32 %r14, 15; and.b32 %r15, %r13, %r14; add.s32 %r16, %r15, %r12; shr.s32 %r17, %r16, 4; ...]
Notes: The traditional GPU is a general-purpose throughput architecture: a simple processor architecture but a larger number of processors, a complex interconnect, a hardware thread scheduler, a large scratch-pad memory (software-managed cache), and a large register file (context switching is cheap). A single instruction set is used by all threads. Threads are lightweight and grouped into thread blocks; the thread batch is the hardware unit of thread execution (a warp on Nvidia, a wavefront on ATI). Threads have dedicated registers, shared memory is shared within a thread block, all threads in a warp share the same PC, and the ALU and memory pipelines are separate.

Streaming Multiprocessor
- Multi-thread unit
- Instruction cache / decoder
- Several streaming processors (SPs)
- Load-store / SFU units
- Large register file
- Shared memory
- Shared texture cache
- Constant cache

Examples: G80 and GT200
- MT unit: global block scheduler
- TPC (texture processor cluster): a group of SMs sharing the same texture unit
- Two GPU generations, G80 and GT200, are shown

Example: GT300 (Fermi)
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache). Fermi Streaming Multiprocessor (SM). http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Comparison G80 vs. GT200 vs. GT300 http://www.dvhardware.net/article38173.html

Example: GK110 (Kepler architecture)
- More power efficient than Fermi
- New SM architecture (SMX)
- Revamped memory architecture
- Hardware support for new programming models
- Capable of dynamic parallelism
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

Basic GPGPU Processor Pipeline
- Simple in-order execution in SIMT (single instruction, multiple threads)
- The scheduler chooses one of several warps (PCs) and fetches 1 instruction from the I-cache per warp
- The instruction is decoded, registers are read, and the instruction is dispatched; a scoreboard maintains dependencies
- A multi-ported register file provides data for all lanes
- Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
- Writeback updates the register file
SIMT vs. SIMD: a key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
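As a minimal sketch of what "scalar per-thread code" means in practice (my own illustration, not from the original slides; the kernel and parameter names are made up):

// Each thread is written as an independent scalar program; the hardware
// groups 32 such threads into a warp and issues one instruction for all of them.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread index
    if (i < n)                                       // ordinary scalar control flow
        y[i] = a * x[i] + y[i];
}

Note that the SIMD width (the warp size) never appears in the source; it only matters for performance.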

Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

Inside the Streaming Multiprocessor (G80)
- 8 Streaming Processors (SPs)
- 2 Special Function Units (SFUs)
- Multi-threaded instruction dispatch: 1 to 512 threads active
- Shared instruction fetch per 32 threads
- Covers the latency of texture/memory loads
- 16 KB shared memory

Register File
- 8192 registers in each SM in G80
- This is an implementation decision, not part of the programming abstraction
- Registers are dynamically partitioned across all blocks assigned to the SM
- Once assigned to a block, a register is NOT accessible by threads in other blocks
- Each thread accesses only the registers assigned to itself
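A worked illustration of this partitioning (my own numbers, not from the slides): if a kernel uses 10 registers per thread and blocks contain 256 threads, each block needs 2560 registers; with 8192 registers per SM, at most 3 such blocks (7680 registers) can be resident on an SM at once, since a 4th block would need 10240 registers and would not fit.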

Thread Dispatch Policy
- Hierarchy: a grid of blocks of threads
- Blocks are serially distributed to the SMs, potentially more than 1 block per SM
- The SM launches warps (32 threads): 2 levels of parallelism
- Round-robin, ready-to-execute scheduling policy
Figure source: Nvidia CUDA Programming Guide 2.3

Block Execution: Software View
- SM block execution (G80): assignment in block granularity
- Up to 8 blocks per SM, as resources allow
- An SM in G80 can take up to 768 threads: 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.
- Threads run concurrently

Block Execution: Software View (Automatic Scalability)

Block Execution: Hardware View
- Blocks are divided into 32-thread warps (an implementation decision, not part of the CUDA programming model)
- Warps are the scheduling units in the SM
- With 3 blocks per SM and 256 threads per block, each block is divided into 256/32 = 8 warps, for a total of 8 * 3 = 24 warps
- Only 1 of the 24 warps is selected for instruction fetch and execution at any time

Warp Scheduling I
- Zero-overhead context switching
- The next warp chosen is one whose next instruction has its operands ready (an eligible warp)
- Prioritized scheduling policy (no details available)
- Threads in a warp execute the same instruction

Warp Scheduling II

Latency Hiding
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
- If one global memory access is needed for every 4 instructions, 13 warps hide a 200-cycle DRAM latency (each warp provides 4 instructions * 4 cycles = 16 cycles of work; 200 / 16 = 12.5, rounded up to 13)

G80 Pipeline
- ~30 stages: fetch, decode, gather, and write-back act on whole warps, with a throughput of 1 warp per slow clock
- Execute acts on a group of 8 threads (only 8 SPs per SM); throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks (see next slide)
- Fetch/decode stages have a higher throughput to feed both the MAD and the SFU/MUL units; peak rate of 8 MAD + 8 MUL per (fast) clock cycle
- Limits per SM: 8 blocks/SM, 512 threads/block, 32-thread warps executed on 8 SPs
- Hiding read-after-write latency: with 1 memory access per 2 instructions and 32 cycles per memory access, 200 / 32 ≈ 6 warps are needed
Note: these pipeline cycle counts are taken from the Nvidia forums; there is no available document to verify the figures.

Execute Stage (no memory access)
[Figure: a 32-thread warp executing on SP1..SP8. Instruction #1 is issued for threads 1-8, 9-16, 17-24, and 25-32 on successive fast clocks, followed by instruction #2 for the same thread groups, so each warp instruction occupies the execute stage for 4 fast clocks.]

Instruction Buffer
- Fetch 1 warp instruction per cycle, from the instruction L1 cache, into any instruction buffer slot
- Issue 1 "ready-to-go" warp instruction per cycle, from any warp / instruction buffer slot
- Operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin / age of warp among the "ready-to-go" warps
- The SM broadcasts the same instruction to the 32 threads of a warp

Scoreboarding
- Register operands in the instruction buffer are scoreboarded
- An instruction becomes ready when the values it needs have been deposited; this prevents hazards
- Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines: the SM continues to issue instructions until scoreboarding prevents further issue, which allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

Pathology: Warp Divergence
- Conditional branches can split a 32-thread warp
- Diverged paths are serialized
- Some SPs sit idle while a serialized warp executes
- This lowers performance significantly
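A small sketch of divergent versus warp-uniform branching (my own illustration, not from the original slides; kernel names are made up):

// Diverges: threads of the same 32-thread warp take different paths,
// so the two paths execute one after the other while the other lanes idle.
__global__ void divergent(float* data) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) data[tid] *= 2.0f;   // even lanes
    else              data[tid] += 1.0f;   // odd lanes
}

// Does not diverge: the condition is uniform within each warp, because all
// 32 threads of a warp fall on the same side of the split.
__global__ void uniform(float* data) {
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0) data[tid] *= 2.0f;
    else                     data[tid] += 1.0f;
}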

Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

Memory Hierarchy
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memory using copy functions.
Figure source: Nvidia CUDA Programming Guide 2.3
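As a hedged illustration of where variables land in this hierarchy (not part of the original slides; the kernel and buffer names are made up):

__global__ void hierarchyDemo(const float* g_in, float* g_out) {
    __shared__ float s_buf[256];                        // per-block shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float r = g_in[i];                                  // 'r' lives in a per-thread register;
                                                        // g_in / g_out are per-grid global memory
    s_buf[tid] = r * 2.0f;
    __syncthreads();                                    // make the block's shared writes visible
    g_out[i] = s_buf[tid];
}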

Constant Memory
- Immediate address constants and indexed address constants
- Constants are stored in DRAM and cached on chip (L1 per SM)
- Constants are broadcast to all threads in a warp
- An efficient way of accessing a value that is common to all threads in a block!
[Figure: SM datapath with instruction L1 cache, multithreaded instruction buffer, register file, constant L1 cache, shared memory, operand select, MAD and SFU units.]
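A small sketch of using constant memory this way (illustrative only, not from the deck; the coefficient array is hypothetical):

__constant__ float c_coeffs[16];                 // cached on chip, broadcast to all threads in a warp

__global__ void scaleAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= c_coeffs[0];           // every thread reads the same address: one broadcast
}

// Host side: copy the table into constant memory before launching, e.g.
// cudaMemcpyToSymbol(c_coeffs, h_coeffs, 16 * sizeof(float));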

Shared Memory
- Each SM has 16 KB of shared memory, organized as 16 banks of 32-bit words
- CUDA uses shared memory as shared storage visible to all threads in a thread block
- Fast read and write access
- Not used explicitly in pixel shader programs (we dislike pixels talking to each other)
[Figure: SM datapath with instruction L1 cache, multithreaded instruction buffer, register file, constant L1 cache, shared memory, operand select, MAD and SFU units.]

Memory Banking: Parallel Memory Architecture
- Memory is divided into banks, which is essential to achieve high bandwidth
- Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to the same bank result in a bank conflict; conflicting accesses are serialized
[Figure: banks 0 through 15.]

No Bank Conflicts
[Figure: two conflict-free cases with threads 0-15 mapped onto banks 0-15. Left: linear addressing with stride == 1 (thread i accesses bank i). Right: a random 1:1 permutation of threads to banks.]

Bank Conflicts
[Figure: two conflicting cases. Left: linear addressing with stride == 2 gives 2-way bank conflicts (two threads hit each even bank). Right: linear addressing with stride == 8 gives 8-way bank conflicts (threads pile up on banks 0 and 8).]

Memory Banks in G80 (shared memory bank conflicts)
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16, which is the same as the size of a half-warp
- Memory access scheduling is half-warp based: there are no bank conflicts between different half-warps, only within a single half-warp

Memory Access Optimizations
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case: a bank conflict occurs when multiple threads in the same half-warp access the same bank; the accesses must be serialized, and the cost is the maximum number of simultaneous accesses to a single bank

Memory Access Optimizations (continued)
Given:
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
This is bank-conflict-free only if s shares no common factors with the number of banks (16 on G80), so s must be odd.
[Figure: thread-to-bank mappings for s = 1 and s = 3; both are conflict-free because each permutes the 16 threads over the 16 banks.]
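One common way to restore conflict-free access when the natural stride is a multiple of 16 is to pad the shared array; a minimal sketch of that technique (a general pattern, not taken from these slides; it transposes a single 16x16 tile):

#define TILE 16
// Without the +1 padding, reading a column of 'tile' (stride 16) would be a
// 16-way bank conflict on G80; with padding the stride becomes 17 (odd),
// so consecutive threads of a half-warp fall in different banks.
__global__ void transpose16(const float* in, float* out) {
    __shared__ float tile[TILE][TILE + 1];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];        // write a row: successive banks
    __syncthreads();
    out[y * TILE + x] = tile[x][y];       // read a column: conflict-free thanks to padding
}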

Global Memory vs. Shared Memory
- Use coalesced reads: the global memory interface is 64 bytes wide, so the 16 32-bit accesses of a half-warp can be coalesced into a single transaction
- Or read through a texture with good spatial locality within each warp
- For shared memory, use the multiple (16) banks for independent accesses from a half-warp
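A minimal sketch contrasting a coalesced with a strided global-memory access pattern (illustrative only, not from the original slides):

// Coalesced: consecutive threads of a half-warp read consecutive 32-bit words,
// so the hardware can combine them into one wide transaction.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read words 'stride' apart, so the accesses of a
// half-warp cannot be combined and each one costs a separate transaction.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}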

Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

GPGPU μarch: Interconnect Topology
- GPU interconnect
- The DRAM controllers are on-chip; the DRAM itself is off-chip

Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

Device Memory Management
- cudaMalloc(): allocates an object in device global memory; requires two parameters, the address of a pointer to the allocated object and the size of the allocated object
- cudaFree(): frees an object from device global memory; takes a pointer to the freed object
[Figure: CUDA memory model with a grid of blocks, per-block shared memory, per-thread registers, and host access to global memory. Figure source: Nvidia CUDA Programming Guide 2.3]

Device Memory Management: code example
Allocate a 64 * 64 single-precision float array and attach the allocated storage to Md ("d" is often used to indicate a device data structure):
int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
...
cudaFree(Md);
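Real code would usually check the return codes as well; a small sketch of that pattern (not part of the original example, assumes <stdio.h> is included and the snippet sits inside a host function):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return;   // or handle the error as the application requires
}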

Host-Device Communication
- cudaMemcpy(): memory data transfer; requires four parameters: a pointer to the destination, a pointer to the source, the number of bytes copied, and the type of transfer (host to host, host to device, device to host, or device to device)
- Asynchronous transfers are also possible (cudaMemcpyAsync)
[Figure: CUDA memory model with a grid of blocks, per-block shared memory, per-thread registers, and host access to global memory. Figure source: Nvidia CUDA Programming Guide 2.3]

Host-Device Communication: code example
Transfer a 64 * 64 single-precision float array; M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants:
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
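Since the previous slide mentions asynchronous transfers, here is a small sketch using a stream and pinned host memory (my own illustration, not from the original deck):

// Asynchronous copies need page-locked (pinned) host memory and a stream
// to overlap with other work.
float* M_pinned;
cudaMallocHost((void**)&M_pinned, size);          // pinned host allocation
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(Md, M_pinned, size, cudaMemcpyHostToDevice, stream);
// ... launch kernels in the same stream here ...
cudaMemcpyAsync(M_pinned, Md, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                    // wait for the copies to finish
cudaStreamDestroy(stream);
cudaFreeHost(M_pinned);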

Outline: GPGPU Architecture Overview, Core Architecture, Memory Hierarchy, Interconnect, CPU-GPU Interfacing, Programming Paradigm

Programming Abstraction I
- Threads are grouped as grids of blocks of threads
- 3D indexing for each thread
- Kernel executions are serialized
Figure source: Nvidia CUDA Programming Guide 2.3
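A common way to turn this hierarchy into array indices, shown as a sketch (not from the slides; the kernel name and launch sizes are hypothetical):

__global__ void indexing2D(float* data, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    data[y * width + x] = (float)(y * width + x);
}

// Host side, assuming width and height are multiples of 16:
// dim3 block(16, 16);
// dim3 grid(width / 16, height / 16);
// indexing2D<<<grid, block>>>(d_data, width);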

Programming Paradigm: Compute Unified Device Architecture
- Hierarchical thread organization; each level has a special purpose: instruction sharing (top), resource sharing, execution-unit sharing (bottom)
- The kernel defines single-thread behavior; the thread identifier distinguishes the data
- Compiler optimization is difficult because hardware resource sharing varies across the layers
- Typical throughput kernel launch example
Notes: CUDA, OpenCL, etc. implement a programming model for throughput computing. Threads are organized hierarchically at different levels of sharing abstraction: kernel -> single-thread instructions are defined (SIMT), i.e. instruction sharing; block -> resource sharing (shared memory, register file, etc.); warp -> ALU sharing.

Programming Abstraction II Source: David Kirk & Wen-mei Hwu lectures (http://courses.ece.illinois.edu/ece498/al/Syllabus.html)

CUDA: Square Matrix Multiplication
- P = M * N, each of size WIDTH x WIDTH
- Without tiling: one thread calculates one element of P; M and N are loaded WIDTH times from global memory

CUDA: Square Matrix Multiplication
Memory layout of a matrix in C is linear: a 4 x 4 matrix M is stored row by row as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3, so element Mx,y (column x, row y) is found at M[y * WIDTH + x].

CUDA: Square Matrix Multiplication, CPU version in C
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {      // DOT-product loop
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

CUDA: Square Matrix Multiplication, GPU host code in CUDA
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and load M, N to the device (input matrix data transfer)
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&Pd, size);

    // 2. Kernel invocation code - to be shown later

    // 3. Read P from the device (output matrix data transfer)
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

CUDA: Square Matrix Multiplication, kernel (per-thread code)
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue stores the element of Pd computed by this thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {              // DOT-product loop
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
[Figure: row ty of Md and column tx of Nd combine into element (tx, ty) of Pd; all matrices are WIDTH x WIDTH.]

CUDA: Square Matrix Multiplication
- One block of threads computes the matrix Pd; each thread computes one element of Pd
- Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one MAD for each pair of Md and Nd elements
- The compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block (see next slide)
[Figure: Grid 1 contains Block 1; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd.]
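The kernel invocation deferred in step 2 of the host code would, for this single-block version, look something like the following (my reconstruction, consistent with the kernel above; the original transcript does not show it):

// One block of Width x Width threads; only valid while Width * Width stays
// within the per-block thread limit (512 threads on G80).
dim3 dimBlock(Width, Width);
dim3 dimGrid(1, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);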

CUDA: Square Matrix Multiplication, tiled version
- Each 2D thread block computes a BLOCK_SIZE x BLOCK_SIZE sub-matrix (tile) of the result matrix
- Blocks have BLOCK_SIZE * BLOCK_SIZE threads
- Generate a 2D grid of (WIDTH/BLOCK_SIZE) x (WIDTH/BLOCK_SIZE) blocks
- Note: you still need to put a loop around the kernel call for cases where WIDTH/BLOCK_SIZE is greater than the maximum grid size (64K)!
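A sketch of what the launch configuration and indexing change might look like for this tiled grid (my own illustration under the assumptions above; the original slides do not show this code):

// Launch a (WIDTH/BLOCK_SIZE) x (WIDTH/BLOCK_SIZE) grid of BLOCK_SIZE x BLOCK_SIZE blocks.
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Inside the kernel, each thread would then combine its block and thread indices:
// int row = blockIdx.y * blockDim.y + threadIdx.y;
// int col = blockIdx.x * blockDim.x + threadIdx.x;
// and write Pd[row * Width + col].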

Software Stack & Compilation Trajectory Figure source - Analyzing CUDA Workloads Using a Detailed GPU Simulator Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt, ISPASS-2009

PTX ISA
- A virtual ISA for Nvidia GPUs
- A stable ISA that spans multiple GPU generations
- A machine-independent ISA for C/C++ and other compilers to target
- A code-distribution ISA for application and middleware developers
- Facilitates hand-coding of libraries, performance kernels, and architecture tests

Benefits of the Parallel Thread eXecution (PTX) ISA
- Adaptable: a virtual ISA for Nvidia GPUs
- Stable: an ISA that spans multiple GPU generations
- Scalable and machine-independent: an ISA for C/C++ and other compilers to target
- A code-distribution ISA for application and middleware developers
- Facilitates hand-coding of libraries, performance kernels, and architecture tests