ECE 8823A GPU Architectures
Module 3: CUDA Execution Model - I
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
Objective
A more detailed look at kernel execution:
- Data-to-thread assignment
- The organization and scheduling of threads
- Resource assignment at the block level
- Scheduling at the warp level
- Basics of SIMT execution
Reading Assignment
- Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 4
- Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3
- Reference: CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
A Multi-Dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1, a 2x2 arrangement of blocks, and then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its threads, indexed Thread (z,y,x) from (0,0,0) through (1,0,3).]
Built-In Variables
1D-3D grid of thread blocks
- Built-in: gridDim (gridDim.x, gridDim.y, gridDim.z)
- Built-in: blockDim (blockDim.x, blockDim.y, blockDim.z)
Examples:
dim3 dimGrid(32,2,2) - 3D grid of thread blocks
dim3 dimGrid(2,2,1) - 2D grid of thread blocks
dim3 dimGrid(32,1,1) - 1D grid of thread blocks
dim3 dimGrid(ceil(n/256.0),1,1) - 1D grid of thread blocks
my_kernel<<<dimGrid, dimBlock>>>(..)
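The last example above rounds n/256 up so that every element is covered even when n is not a multiple of 256. A minimal plain-C sketch of that host-side computation (the block size 256 and the integer form are illustrative; the integer expression gives the same result as ceil(n/256.0) without the float round-trip):

```c
/* Number of 256-thread blocks needed to cover n elements.
   (n + 255) / 256 is the integer equivalent of ceil(n / 256.0). */
unsigned int blocks_for(unsigned int n) {
    return (n + 255u) / 256u;
}
```

For n = 1000 this yields 4 blocks (1024 threads); the kernel's bounds check must mask off the 24 extra threads in the last block.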
Built-In Variables (2)
1D-3D arrangement of threads in a thread block
- Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
- Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- All blocks have the same thread configuration
Examples:
dim3 dimBlock(4,2,2) - 3D arrangement of threads in a block
dim3 dimBlock(2,2,1) - 2D arrangement of threads in a block
dim3 dimBlock(32,1,1) - 1D arrangement of threads in a block
my_kernel<<<dimGrid, dimBlock>>>(..)
Built-In Variables (3)
1D-3D arrangement of threads in a thread block
- Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
- Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- Initialized by the runtime on each kernel launch
- Ranges are fixed by the compute capability of the target device
- You can query the device (later)
2D Examples
Processing a Picture with a 2D Grid
- 16×16 blocks covering a 76×62-pixel picture
- Show the example calling code from the text and identify the thread configuration
- Explain the sizing of blocks and grid dimensions
Row-Major Layout in C/C++
Row*Width + Col = 2*4 + 1 = 9
[Figure: a 4×4 matrix stored linearly as M0 through M15; the element at row 2, column 1 maps to linear offset 9.]
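The slide's arithmetic can be captured in a one-line helper (plain C; the function name is illustrative):

```c
/* Linear offset of element (row, col) in a row-major matrix of
   the given width. Slide's example: row 2, col 1, width 4 -> 9. */
int row_major_index(int row, int col, int width) {
    return row * width + col;
}
```

Consecutive values of col map to consecutive memory addresses, which is what makes the kernels below favor having adjacent threads vary in Col.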
Source Code of the Picture Kernel

__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}
Approach Summary
- Determine the storage layout of the data
- Assign a unique ID to each thread
- Map IDs to data (for access)
Figure 4.5: Covering a 76×62 picture with 16×16 blocks.
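The grid sizing behind Figure 4.5 can be sketched on the host in plain C (the struct and function names are illustrative, not from the text):

```c
/* Grid dimensions needed to cover a picture with square blocks.
   The "+ tile - 1" rounds up so that edge pixels are covered. */
struct dims { int x, y; };

struct dims picture_grid(int width_px, int height_px, int tile) {
    struct dims d;
    d.x = (width_px  + tile - 1) / tile;  /* blocks across */
    d.y = (height_px + tile - 1) / tile;  /* blocks down   */
    return d;
}
```

For the 76×62 picture with 16×16 blocks this gives a 5×4 grid (20 blocks, 5120 threads); PictureKernel's (Row < m) && (Col < n) test masks off the threads that fall outside the picture.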
A Simple Running Example: Matrix Multiplication
A simple illustration of the basic features of memory and thread management in CUDA programs:
- Thread index usage
- Memory layout
- Register usage
Assume square matrices for simplicity; leave shared memory usage until later.
Square Matrix-Matrix Multiplication
P = M * N, each of size WIDTH x WIDTH
- Each thread calculates one element of P
- Each row of M is loaded WIDTH times from global memory
- Each column of N is loaded WIDTH times from global memory
[Figure: M, N, and P as WIDTH x WIDTH matrices; a row of M and a column of N combine to produce one element of P.]
Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: element P(i,j) is the dot product of row i of M and column j of N. Note that i and j will become global thread indices.]
Kernel Version: Functional Description

for (int k = 0; k < Width; ++k)
  Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

Which thread is at coordinate (Row, Col)? Threads self-allocate: each thread derives its own (Row, Col) from its block and thread indices.
[Figure: the thread at (Row, Col) computes the dot product of row Row of M and column Col of N; blockDim.x and blockDim.y tile the WIDTH x WIDTH result matrix.]
Kernel Function - A Small Example
- Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix; each block has (TILE_WIDTH)² threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
WIDTH = 4; TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks.
[Figure: the 4x4 result matrix P0,0 through P3,3 partitioned into four 2x2 tiles, one per block: Block(0,0), Block(0,1), Block(1,0), Block(1,1).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13
A Slightly Bigger Example
WIDTH = 8; TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 4, so use 4*4 = 16 blocks.
[Figure: the 8x8 result matrix P partitioned into sixteen 2x2 tiles.]
A Slightly Bigger Example (cont.)
WIDTH = 8; TILE_WIDTH = 4. Each block has 4*4 = 16 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks.
[Figure: the same 8x8 result matrix P partitioned into four 4x4 tiles.]
Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Kernel Function

// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
Col = blockIdx.x * 2 + threadIdx.x = 0 * 2 + threadIdx.x, giving Col = 0, 1
Row = blockIdx.y * 2 + threadIdx.y = 0 * 2 + threadIdx.y, giving Row = 0, 1
[Figure: with blockDim.x = blockDim.y = 2, Block (0,0) computes P0,0, P0,1, P1,0, P1,1 from rows 0-1 of M and columns 0-1 of N.]
Work for Block (0,1)
Col = 1 * 2 + threadIdx.x, giving Col = 2, 3
Row = 0 * 2 + threadIdx.y, giving Row = 0, 1
[Figure: Block (0,1) computes P0,2, P0,3, P1,2, P1,3 from rows 0-1 of M and columns 2-3 of N.]
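The index arithmetic from the two worked examples can be checked with a plain-C stand-in for the kernel's first two lines (the function name and out-parameters are illustrative):

```c
/* Global (Row, Col) of a thread, as computed in the kernel:
   Row = blockIdx.y * blockDim.y + threadIdx.y
   Col = blockIdx.x * blockDim.x + threadIdx.x
   Here blockDim.x = blockDim.y = tile (TILE_WIDTH). */
void thread_coords(int bx, int by, int tx, int ty, int tile,
                   int *row, int *col) {
    *row = by * tile + ty;
    *col = bx * tile + tx;
}
```

For Block (0,1) (blockIdx.x = 1, blockIdx.y = 0) and thread (0,0) with TILE_WIDTH = 2, this gives Row = 0, Col = 2: the top-left element of that block's tile, matching the figure.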
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}

Note the memory access patterns across threads in a block (inner loop).
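The body of the kernel above can be replayed sequentially on the host: a plain-C sketch of the work one thread does for one (Row, Col), with the grid replaced by loops (the function name is illustrative):

```c
/* The work of one kernel thread: the dot product of row Row of M
   and column Col of N, written to P[Row*Width + Col].
   Matrices are row-major, Width x Width, as in the kernel. */
void matmul_one_element(const float *M, const float *N, float *P,
                        int Width, int Row, int Col) {
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        P[Row * Width + Col] = Pvalue;
    }
}
```

Looping over all (Row, Col) pairs reproduces sequentially what the grid does in parallel, which is a convenient way to validate a kernel against a host reference.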
Questions? © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign