General Purpose Graphics Processing Units (GPGPUs)

General Purpose Graphics Processing Units (GPGPUs) Lecture notes from MKP, J. Wang, and S. Yalamanchili

Overview & Reading
Understand the multi-threaded execution model of modern general-purpose graphics processing units (GPUs).
Understand the basic architectural organization, so we can identify the sources of performance and energy efficiency.
Reading: Section 6.6

What is a GPGPU?
Graphics Processing Unit (GPU): NVIDIA / AMD / Intel
Many-core architecture
Massively data-parallel processor (compared with a CPU)
Highly multi-threaded
GPGPU: General-Purpose computing on a GPU, used for high-performance computing
Became popular with the CUDA and OpenCL programming languages

Motivation: high throughput and memory bandwidth

Discrete GPUs in the System

Fused GPUs: AMD & Intel
Not as powerful as discrete GPUs
Integrated on-chip with the CPU, sharing the cache

Core Count: NVIDIA
1536 cores at 1 GHz
All cores are not created equal
Need to understand the programming model

GPU Architecture (NVIDIA Tesla)
Streaming multiprocessor (SM) = 8 × streaming processors

NVIDIA GK110 Architecture

CUDA Programming Model
NVIDIA Compute Unified Device Architecture (CUDA)
Kernel: a C-like function executed on the GPU
Execution model: Single Instruction Multiple Data / Single Instruction Multiple Threads (SIMD / SIMT)
All threads execute the same instruction in lock step, but each thread operates on its own data
[Figure: threads 0-7 executing Inst 0 and then Inst 1 in lock step, each on its own data element]
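
To make the SIMT idea concrete, here is a minimal sketch (my own example, not from the slides; the kernel name scale and the launch configuration are assumptions): every thread runs the same multiply instruction, each on the element selected by its own block and thread IDs.

// Each thread runs the same code path; the only difference between threads
// is the value of blockIdx.x / threadIdx.x, which selects the data element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread data index
    if (i < n)
        data[i] = data[i] * factor;                 // same instruction, different data
}

// Hypothetical host-side launch: 2 blocks of 4 threads cover n = 8 elements.
// scale<<<2, 4>>>(d_data, 2.0f, 8);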

CUDA Thread Hierarchy
Each thread uses its IDs to decide what data to work on
Three-level hierarchy: Thread, Block, Grid (thread and block IDs can each be up to 3-dimensional)
[Figure: each kernel (Kernel 0, Kernel 1, Kernel 2) launches a grid; a grid consists of blocks indexed in up to three dimensions; each block consists of threads indexed in up to three dimensions]
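
As a hedged illustration of how a thread turns its IDs into a data index (my own example; the 2D framing and all names are assumptions, not from the slides), a kernel typically combines blockIdx, blockDim, and threadIdx like this:

// Each thread computes a unique (row, col) coordinate from its block and
// thread indices; together the grid of blocks tiles the whole 2D domain.
__global__ void fill_coords(int *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position within the grid
    if (col < width && row < height)
        out[row * width + col] = row * width + col;   // linearized per-thread index
}

// Hypothetical launch: 16x16 threads per block, enough blocks to cover the domain.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// fill_coords<<<grid, block>>>(d_out, width, height);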

Vector Addition
Sequential loop: for (int index = 0; index < N; ++index) { c[index] = a[index] + b[index]; }
Let's assume N = 16 and blockDim = 4 → 4 blocks:
blockIdx.x = 0, blockDim.x = 4, threadIdx.x = 0,1,2,3 → Idx = 0,1,2,3
blockIdx.x = 1, blockDim.x = 4, threadIdx.x = 0,1,2,3 → Idx = 4,5,6,7
blockIdx.x = 2, blockDim.x = 4, threadIdx.x = 0,1,2,3 → Idx = 8,9,10,11
blockIdx.x = 3, blockDim.x = 4, threadIdx.x = 0,1,2,3 → Idx = 12,13,14,15

Vector Addition: CPU Program vs. GPU Program

CPU program:
void vector_add(float *a, float *b, float *c, int N) {
  for (int index = 0; index < N; ++index)
    c[index] = a[index] + b[index];
}
int main() {
  vector_add(a, b, c, N);
}

GPU program (kernel):
__global__ void vector_add(float *a, float *b, float *c, int N) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < N)
    c[index] = a[index] + b[index];
}
int main() {
  dim3 dimBlock(blocksize);
  dim3 dimGrid(N / dimBlock.x);
  vector_add<<<dimGrid, dimBlock>>>(a, b, c, N);
}
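
The slide's GPU main() omits device memory management. As a hedged sketch (not part of the original lecture; the problem size and the 256-thread block size are assumptions), a complete host program using the standard CUDA runtime API would allocate device buffers, copy inputs over, launch, and copy the result back, roughly as follows:

#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel from the slide above.
__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) c[index] = a[index] + b[index];
}

int main() {
    const int N = 1 << 20;                      // assumed problem size
    size_t bytes = N * sizeof(float);

    // Host buffers and input data.
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Device copies of the three vectors.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int blocksize = 256;                             // assumed block size
    int gridsize = (N + blocksize - 1) / blocksize;  // round up to cover all N elements
    vector_add<<<gridsize, blocksize>>>(d_a, d_b, d_c, N);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}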

GPU Architecture Basics
[Figure: an SM contains a shared front end (PC, I-cache, fetch, decoder) driving a set of simple in-order CUDA cores (INT and FP units with EX/MEM/WB stages), connected through memory controllers to memory; the single shared front end is "the SI in SIMT"]

Execution of a CUDA Program
Blocks are scheduled and executed independently on the SMs
All blocks share the global memory
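
Because blocks execute independently and communicate only through global memory, barriers such as __syncthreads() synchronize within a single block only. The sketch below is my own illustration (names and the 256-thread block size are assumptions): each block produces one partial sum, and combining the per-block results would need a second kernel or atomics.

// Each block reduces its own slice of the input into one partial sum.
// __syncthreads() only coordinates the threads of this block; there is no
// equivalent barrier across blocks during a kernel launch.
__global__ void block_partial_sum(const float *in, float *block_sums, int n) {
    __shared__ float buf[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // block-local barrier

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = buf[0];       // one result per independent block
}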

Executing a Block of Threads
Execution unit: the warp, a group of threads (32 for NVIDIA GPUs)
Blocks are partitioned into warps with consecutive thread IDs
[Figure: Block 0 and Block 1, 128 threads each, are each partitioned into Warp 0 through Warp 3 on an SM]
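
For a 1D block the warp partitioning can be written down directly from the thread index; the following is a hedged sketch (kernel and array names are mine), assuming the NVIDIA warp size of 32.

// With a warp size of 32, threads 0-31 of a block form warp 0,
// threads 32-63 form warp 1, and so on.
__global__ void identify_warp(int *warp_id, int *lane_id) {
    const int WARP_SIZE = 32;                       // NVIDIA warp size
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_id[tid] = threadIdx.x / WARP_SIZE;         // which warp of the block
    lane_id[tid] = threadIdx.x % WARP_SIZE;         // position within that warp
}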

Warp Execution
A warp executes one common instruction at a time
The threads of a warp are mapped onto the CUDA cores
Warps are switched in and out and executed on the SM
[Figure: successive instructions (Inst 1, Inst 2, Inst 3), each executed by one warp of threads across the SM's cores]

Handling Branches
CUDA code:
if (…)  … (true for some threads)
else    … (true for the others)
What if the threads of a warp take different branches? Branch divergence!

Branch Divergence
Occurs within a warp
The different branch paths are serialized, and all of them are executed by the warp
Performance issue: low warp utilization
if (…) { … } else { … } — while one path executes, the threads that took the other path sit idle
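
As a hedged illustration (my own example, assuming a warp size of 32 and a block size that is a multiple of 32), the first kernel below diverges inside every warp because even and odd lanes take different paths, while the second branches on a value that is uniform across each warp and therefore avoids divergence.

// Divergent: within any warp, even lanes take the if-path and odd lanes the
// else-path, so the two paths are serialized and half the lanes idle each time.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

// Non-divergent: the condition has the same value for all 32 lanes of a warp,
// so every warp follows a single path and stays fully utilized.
__global__ void uniform_branch(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)      // uniform across each warp of 32 threads
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}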

Vector Addition with N = 60: 64 threads, 1 block
Q: Is there any branch divergence? In which warp?
__global__ void vector_add(float *a, float *b, float *c, int N) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < N)
    c[index] = a[index] + b[index];
}

Example: VectorAdd on GPU

CUDA:
__global__ void vector_add(float *a, float *b, float *c, int N) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < N)
    c[index] = a[index] + b[index];
}

PTX (assembly):
    setp.lt.s32 %p, %r5, %rd4;      // r5 = index, rd4 = N
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];       // r6 = &a[index]
    ld.global.f32 %f2, [%r7];       // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;       // r8 = &c[index]
L2: ret;

Example: VectorAdd on GPU
N = 8: 8 threads, 1 block, warp size = 4; 1 SM with 4 cores
Pipeline:
Fetch: one instruction per cycle, round-robin through all warps
Execution: in-order execution within each warp, with proper data forwarding; 1 cycle per stage
How many warps?
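
A quick sketch of the warp-count arithmetic (a hypothetical helper, not from the slides): with 8 threads per block and a warp size of 4 there are 2 warps, which are Warp0 and Warp1 in the execution sequence that follows.

// Number of warps needed for a block of the given size (rounding up).
// warps_per_block(8, 4) == 2.
__host__ __device__ int warps_per_block(int threads_per_block, int warp_size) {
    return (threads_per_block + warp_size - 1) / warp_size;
}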

Execution Sequence (Warp0 and Warp1)

PTX executed by both warps:
    setp.lt.s32 %p, %r5, %rd4;
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];
    ld.global.f32 %f2, [%r7];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;
L2: ret;

The original slides animate this one cycle at a time, showing which instruction occupies the FE, DE, EXE, MEM, and WB stages (EXE, MEM, and WB are four lanes wide, one per core, and a warp's instruction fills all four lanes). Reconstructed from those frames, with each instruction advancing one stage per cycle and fetch alternating round-robin between the two warps, the sequence is:

Cycle  FE             DE             EXE            MEM            WB
1      setp W0        -              -              -              -
2      setp W1        setp W0        -              -              -
3      @p bra W0      setp W1        setp W0        -              -
4      @p bra W1      @p bra W0      setp W1        setp W0        -
5      bra L2 (W0)    @p bra W1      @p bra W0      setp W1        setp W0
6      bra L2 (W1)    -              @p bra W1      @p bra W0      setp W1
7      ld(a) W0       -              -              @p bra W1      @p bra W0
8      ld(a) W1       ld(a) W0       -              -              @p bra W1
9      ld(b) W0       ld(a) W1       ld(a) W0       -              -
10     ld(b) W1       ld(b) W0       ld(a) W1       ld(a) W0       -
11     add W0         ld(b) W1       ld(b) W0       ld(a) W1       ld(a) W0
12     add W1         add W0         ld(b) W1       ld(b) W0       ld(a) W1
13     st W0          add W1         add W0         ld(b) W1       ld(b) W0
14     st W1          st W0          add W1         add W0         ld(b) W1
15     ret W0         st W1          st W0          add W1         add W0
16     ret W1         ret W0         st W1          st W0          add W1
17     -              ret W1         ret W0         st W1          st W0
18     -              -              ret W1         ret W0         st W1
19     -              -              -              ret W1         ret W0
20     -              -              -              -              ret W1

Since every thread satisfies index < N, the @p bra to L1 is taken by both warps; the sequentially fetched bra L2 instructions are squashed when the branch resolves, producing the bubbles in cycles 6-8 before fetch resumes at L1. After cycle 20 the pipeline has drained completely.

Study Guide
Be able to define the terms thread block, warp, and SIMT, with examples
Understand the vector addition example in enough detail to know what operations are in each core at any cycle
Given the number of pipeline stages in each core, know how many warps are required to fill the pipeline and how many instructions are executed in total
Know the key differences between fused and discrete GPUs

Glossary
Branch divergence
CUDA
Kernel
OpenCL
Streaming Multiprocessor (SM)
Thread block
Warp