General Purpose Graphics Processing Units (GPGPUs)


1 General Purpose Graphics Processing Units (GPGPUs)
Lecture notes from MKP, J. Wang, and S. Yalamanchili

2 Overview & Reading
Understand the multi-threaded execution model of modern general-purpose graphics processing units (GPUs)
Basic architectural organization, so we can understand the sources of performance and energy efficiency
Reading: Section 6.6

3 What is a GPGPU?
Graphics Processing Unit (GPU): NVIDIA / AMD / Intel
Many-core architecture
Massively data-parallel processor (compared with a CPU)
Highly multi-threaded
GPGPU: General-Purpose GPU, used for high-performance computing
Became popular with the CUDA and OpenCL programming models

4 Motivation
High throughput and memory bandwidth

5 Discrete GPUs in the System

6 Fused GPUs: AMD & Intel
Not as powerful as discrete GPUs
On-chip, sharing the cache with the CPU

7 Core Count: NVIDIA
1536 cores at 1 GHz
All cores are not created equal
Need to understand the programming model

8 GPU Architectures (NVIDIA Tesla)
Streaming multiprocessor (SM): 8 × streaming processors (SPs)

9 NVIDIA GK110 Architecture

10 CUDA Programming Model
NVIDIA Compute Unified Device Architecture (CUDA)
Kernel: a C-like function executed on the GPU
SIMD/SIMT: Single Instruction Multiple Data / Single Instruction Multiple Thread
All threads execute the same instruction, but each on its own data, in lock step
[Figure: threads 0-7 executing Inst 0 and then Inst 1 in lock step, each on its own data]
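To make the lock-step model concrete, here is a minimal sketch (not from the slides; the kernel name and parameters are illustrative): every thread executes the same multiply instruction, but the thread ID selects a different data element for each.

__global__ void scale(float *data, float factor) {
    // assumes the launch exactly covers the array (no bounds check)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = data[tid] * factor;   // same instruction, per-thread data
}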

11 CUDA Thread Hierarchy
Each thread uses its IDs to decide what data to work on
Three-level hierarchy: Thread, Block, Grid
[Figure: threads grouped into 3-D blocks; blocks grouped into a grid; each kernel launch (Kernel 0, 1, 2) gets its own grid]
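As an illustration of the three ID dimensions (a hedged sketch; the kernel name and flattening scheme are assumptions, not from the slides), each thread can combine threadIdx, blockIdx, blockDim, and gridDim into a unique global coordinate:

__global__ void whoami(int *out) {
    // global coordinate in each dimension
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // flatten to one linear ID across the whole grid
    int nx = gridDim.x * blockDim.x;
    int ny = gridDim.y * blockDim.y;
    int id = (z * ny + y) * nx + x;
    out[id] = id;
}

A launch such as whoami<<<dim3(2,2,2), dim3(4,4,4)>>>(d_out) would give 8 blocks of 64 threads each: 512 unique IDs in total.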

12 Vector Addition
Let's assume N = 16 and blockDim = 4, giving 4 blocks:

for (int index = 0; index < N; ++index) {
    c[index] = a[index] + b[index];
}

Each thread computes Idx = blockIdx.x * blockDim.x + threadIdx.x:

blockIdx.x = 0, blockDim.x = 4, threadIdx.x = 0,1,2,3  ->  Idx = 0,1,2,3
blockIdx.x = 1, blockDim.x = 4, threadIdx.x = 0,1,2,3  ->  Idx = 4,5,6,7
blockIdx.x = 2, blockDim.x = 4, threadIdx.x = 0,1,2,3  ->  Idx = 8,9,10,11
blockIdx.x = 3, blockDim.x = 4, threadIdx.x = 0,1,2,3  ->  Idx = 12,13,14,15

13 Vector Addition: CPU Program vs. GPU Program
CPU program:

void vector_add(float *a, float *b, float *c, int N) {
    for (int index = 0; index < N; ++index)
        c[index] = a[index] + b[index];
}

int main() {
    vector_add(a, b, c, N);
}

GPU program (kernel):

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

int main() {
    int blocksize = 4;
    dim3 dimBlock(blocksize);
    dim3 dimGrid(N / dimBlock.x);
    vector_add<<<dimGrid, dimBlock>>>(a, b, c, N);
}
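The main() above omits device memory management. A minimal complete host program might look like the following sketch (array contents and blocksize are assumptions; error checking omitted):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

int main() {
    const int N = 16;
    const int blocksize = 4;
    size_t bytes = N * sizeof(float);

    float h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    // allocate device copies and ship the inputs over
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 4 blocks of 4 threads, as in the running example
    vector_add<<<N / blocksize, blocksize>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    for (int i = 0; i < N; ++i) printf("%g ", h_c[i]);
    printf("\n");
    return 0;
}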

14 GPU Architecture Basics
[Figure: an SM contains in-order CUDA cores (each with INT and FP units and EX/MEM/WB stages) that share a single PC, I-cache, fetch stage, and decoder; this shared front end is the "SI" in SIMT. SMs connect to memory through memory controllers.]

15 Execution of a CUDA Program
Blocks are scheduled and executed independently on SMs
All blocks share (global) memory

16 Executing a Block of Threads
Execution unit: the warp, a group of threads (32 for NVIDIA GPUs)
Blocks are partitioned into warps with consecutive thread IDs
[Figure: Block 0 and Block 1, 128 threads each, are each partitioned into Warps 0-3 and scheduled onto an SM]
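A small sketch of the partitioning rule (illustrative; the kernel name and output arrays are assumptions): with consecutive thread IDs, a thread's warp and lane fall out of a division by the warp size (warpSize is a CUDA built-in, 32 on NVIDIA GPUs).

__global__ void warp_info(int *warp_of, int *lane_of) {
    int tid = threadIdx.x;
    warp_of[tid] = tid / warpSize;   // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    lane_of[tid] = tid % warpSize;   // position within the warp
}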

17 Warp Execution
A warp executes one common instruction at a time
Threads in a warp are mapped to CUDA cores
Warps are switched and executed on the SM
[Figure: three warps issuing Inst 1, Inst 2, Inst 3; the threads of one warp map onto the SM's cores under a single PC]

18 Handling Branches
CUDA code:
if (…) …  (true for some threads)
else …  (true for others)
What if threads take different branches? Branch divergence!
[Figure: a warp splits at a branch into taken and not-taken paths]

19 Branch Divergence
Occurs within a warp
Both paths of the branch are serialized and executed by the warp
Performance issue: low warp utilization

if (…) { … } else { … }

Threads not on the currently executing path sit idle
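A hedged illustration (kernel names and the even/odd split are assumptions, not from the slides): in the first kernel, even and odd lanes of the same warp disagree at the branch, so both paths run serially with half the lanes masked off each time; in the second, the condition is constant across each warp, so no warp diverges.

__global__ void divergent(float *x) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)                // lanes within one warp disagree -> divergence
        x[tid] *= 2.0f;
    else
        x[tid] += 1.0f;
}

__global__ void warp_aligned(float *x) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0)   // all lanes of a warp agree -> no divergence
        x[tid] *= 2.0f;
    else
        x[tid] += 1.0f;
}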

20 Vector Addition
N = 60, 64 threads, 1 block
Q: Is there any branch divergence? In which warp?

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

21 Example: VectorAdd on GPU
CUDA:

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

PTX (assembly):

    setp.lt.s32 %p, %r5, %rd4;    // r5 = index, rd4 = N
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];     // r6 = &a[index]
    ld.global.f32 %f2, [%r7];     // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;     // r8 = &c[index]
L2: ret;
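For reference, PTX like the above can be regenerated from the CUDA source with nvcc's -ptx flag (file names here are assumptions):

nvcc -ptx vector_add.cu -o vector_add.ptx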

22 Example: VectorAdd on GPU
N = 8, 8 threads, 1 block, warp size = 4
1 SM, 4 cores
Pipeline:
Fetch: one instruction from each warp, round-robin through all warps
Execution: in-order execution within warps, with proper data forwarding
1 cycle per stage
How many warps? 8 threads / 4 threads per warp = 2 warps (Warp0 and Warp1)

23-44 Execution Sequence
Both warps (Warp0 = W0, Warp1 = W1) execute:

    setp.lt.s32 %p, %r5, %rd4;
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];
    ld.global.f32 %f2, [%r7];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;
L2: ret;

Cycle-by-cycle pipeline occupancy (FE = fetch, DE = decode, EXE = the 4 cores in lock step, MEM, WB). ld1/ld2 denote the first and second ld.global.f32. Since every thread has index < N, @p bra is taken by both warps, so the fall-through bra L2 fetched after it is squashed (the empty DE slots in cycles 6-7):

Cycle | FE         | DE         | EXE (x4)   | MEM        | WB
  1   | setp W0    |            |            |            |
  2   | setp W1    | setp W0    |            |            |
  3   | @p bra W0  | setp W1    | setp W0    |            |
  4   | @p bra W1  | @p bra W0  | setp W1    | setp W0    |
  5   | bra L2 W0  | @p bra W1  | @p bra W0  | setp W1    | setp W0
  6   | bra L2 W1  | (squashed) | @p bra W1  | @p bra W0  | setp W1
  7   | ld1 W0     | (squashed) |            | @p bra W1  | @p bra W0
  8   | ld1 W1     | ld1 W0     |            |            | @p bra W1
  9   | ld2 W0     | ld1 W1     | ld1 W0     |            |
 10   | ld2 W1     | ld2 W0     | ld1 W1     | ld1 W0     |
 11   | add W0     | ld2 W1     | ld2 W0     | ld1 W1     | ld1 W0
 12   | add W1     | add W0     | ld2 W1     | ld2 W0     | ld1 W1
 13   | st W0      | add W1     | add W0     | ld2 W1     | ld2 W0
 14   | st W1      | st W0      | add W1     | add W0     | ld2 W1
 15   | ret W0     | st W1      | st W0      | add W1     | add W0
 16   | ret W1     | ret W0     | st W1      | st W0      | add W1
 17   |            | ret W1     | ret W0     | st W1      | st W0
 18   |            |            | ret W1     | ret W0     | st W1
 19   |            |            |            | ret W1     | ret W0
 20   |            |            |            |            | ret W1

The two warps drain the pipeline after 20 cycles.

45 Study Guide
Be able to define the terms thread block, warp, and SIMT, with examples
Understand the vector addition example in enough detail to:
Know what operations are in each core at any cycle
Given the number of pipeline stages in each core, know how many warps are required to fill the pipelines
Know how many instructions are executed in total
Key differences between fused and discrete GPUs

46 Glossary
CUDA
Branch divergence
Kernel
OpenCL
Streaming Multiprocessor (SM)
Thread block
Warp

