CUDA Grids, Blocks, and Threads

Slides:

Advertisements

Similar presentations

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 DeviceRoutines.pptx Device Routines and device variables These notes will introduce:

Advertisements

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.

1 ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One.

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

More on threads, shared memory, synchronization

CS 791v Fall # a simple makefile for building the sample program. # I use multiple versions of gcc, but cuda only supports # gcc 4.4 or lower. The.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 14, 2011 Streams.pptx CUDA Streams These notes will introduce the use of multiple CUDA.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 20, 2011 CUDA Programming Model These notes will introduce: Basic GPU programming model.

Tutorial on Distributed High Performance Computing 14:30 – 19:00 (2:30 pm – 7:00 pm) Wednesday November 17, 2010 Jornadas Chilenas de Computación 2010.

CUDA Programming continued ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson Revised.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 5, 2011, 3-DBlocks.ppt Addressing 2-D grids with 3-D blocks Class Discussion Notes.

CUDA Grids, Blocks, and Threads

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing.ppt Measuring Performance These notes will introduce: Timing Program.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013, 3-DBlocks.ppt Addressing 2-D grids with 3-D blocks Class Discussion Notes.

CUDA C/C++ BASICS NVIDIA Corporation © NVIDIA 2013.

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

GPU Programming EPCC The University of Edinburgh.

An Introduction to Programming with CUDA Paul Richmond

More CUDA Examples. Different Levels of parallelism Thread parallelism – each thread is an independent thread of execution Data parallelism – across threads.

First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

1 ITCS 4/5010 GPU Programming, UNC-Charlotte, B. Wilkinson, Jan 14, 2013 CUDAProgModel.ppt CUDA Programming Model These notes will introduce: Basic GPU.

CIS 565 Fall 2011 Qing Sun

CUDA programming (continue) Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013 MemCoalescing.ppt Memory Coalescing These notes will demonstrate the effects.

Today’s lecture 2-Dimensional indexing Color Format Thread Synchronization within for- loops Shared Memory Tiling Review example programs Using Printf.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013 Streams.pptx Page-Locked Memory and CUDA Streams These notes introduce the use.

Lecture 6: Shared-memory Computing with GPU. Free download NVIDIA CUDA a-downloads CUDA programming on visual studio.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.

1 SC12 The International Conference for High Performance Computing, Networking, Storage and Analysis Salt Lake City, Utah. Workshop 119: An Educator's.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

1 ITCS 4/5010 GPU Programming, B. Wilkinson, Jan 21, CUDATiming.ppt Measuring Performance These notes introduce: Timing Program Execution How to.

CUDA Memory Types All material not from online sources/textbook copyright © Travis Desell, 2012.

1 GPU programming Dr. Bernhard Kainz. 2 Dr Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages.

Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

1 ITCS 5/4010 Parallel computing, B. Wilkinson, Jan 14, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One dimensional.

GPU Performance Optimisation Alan Gray EPCC The University of Edinburgh.

1 ITCS 4/5145GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013 CUDAProgModel.ppt CUDA Programming Model These notes will introduce: Basic GPU programming.

Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David.

Matrix Multiplication in CUDA Kyeo-Reh Park Kyeo-Reh Park Nuclear & Quantum EngineeringNuclear & Quantum Engineering.

1 Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming SIGCSE The 42 nd ACM Technical.

CUDA C/C++ Basics Part 2 - Blocks and Threads

CUDA Programming Model

GPU Memories These notes will introduce:

Device Routines and device variables

ECE408 Fall 2015 Applied Parallel Programming Lecture 21: Application Case Study – Molecular Dynamics.

Some things are naturally parallel

ECE408 / CS483 Applied Parallel Programming Lecture 23: Application Case Study – Electrostatic Potential Calculation.

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

Memory Coalescing These notes will demonstrate the effects of memory coalescing Use of matrix transpose to improve matrix multiplication performance B.

Using Shared memory These notes will demonstrate the improvements achieved by using shared memory, with code and results running on coit-grid06.uncc.edu.

Device Routines and device variables

Memory and Data Locality

Measuring Performance

ECE 8823A GPU Architectures Module 3: CUDA Execution Model -I

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

CUDA Execution Model – III Streams and Events

ECE 8823A GPU Architectures Module 2: Introduction to CUDA C

CUDA Grids, Blocks, and Threads

CUDA Programming Model

GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.

Measuring Performance

CUDA Programming Model

Chapter 4:Parallel Programming in CUDA C

Quiz Questions CUDA ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2013, QuizCUDA.ppt Nov 12, 2014.

6- General Purpose GPU Programming

Parallel Computing 18: CUDA - I

Presentation transcript:

CUDA Grids, Blocks, and Threads These notes will introduce: One dimensional and multidimensional grids and blocks How the grid and block structures are defined in CUDA Predefined CUDA variables Adding vectors using one-dimensional structures Adding/multiplying arrays using 2-dimensional structures ITCS 5/4145 Parallel computing, B. Wilkinson, April 17, 2014. CUDAMultiDimBlocks.ppt

Grids, Blocks, and Threads NVIDIA GPUs consist of an array of execution cores, each of which can support a large number of threads, many more than number of cores. Threads grouped into “blocks” Blocks can be 1, 2, or 3 dimensional Each kernel call uses a “grid” of blocks Grids can be 1, 2, or 3 dimensional (3-D available for recent GPUs) Programmer needs to specify grid/block organization on each kernel call (which can be different each time), within limits set by the GPU

CUDA SIMT Thread Structure Allows flexibility and efficiency in processing 1D, 2-D, and 3-D data on GPU. Linked to internal organization Threads in one block execute together. Can be 1, 2, or 3 dimensions (compute capability => 2 see next) Can be 1, 2 or 3 dimensions CUDA C programming guide, v 3.2, 2010, NVIDIA

Device characteristics -- some limitations NVIDIA defines “compute capabilities”, 1.0, 1.1, … with limits and features supported. Compute capability 1.0 (min) 2.x* 3.0/3.5 Grid: Max dimensionality 2 3 3 Max size of each dimension (x, y, z) 65535 65535 231 – 1 (no of blocks in each dimension) (2,147,483,647) Blocks: Max dimensionality 3 3 3 Max sizes of x- and y- dimension 512 1024 1024 Max size of z- dimension 64 64 64 Max number of threads per block overall 512 1024 1024 coit-grid06 and cci-grid07 have C2050s, compute capability 2.0. cci-grid08.uncc.edu has a K20, compute capability 3.5. Most recent in 2013.

myKernel<<< B, T >>>(arg1, … ); Defining Grid/Block Structure Need to provide each kernel call with values for: Number of blocks in each dimension Threads per block in each dimension myKernel<<< B, T >>>(arg1, … ); B – a structure that defines number of blocks in grid in each dimension (1D, 2D, or 3D). T – a structure that defines number of threads in a block in each dimension (1D, 2D, or 3D).

1-D grid and/or 1-D blocks If want a 1-D structure, can use a integer for B and T in: myKernel<<< B, T >>>(arg1, … ); B – An integer would define a 1D grid of that size T –An integer would define a 1D block of that size Example myKernel<<< 1, 100 >>>(arg1, … );

CUDA Built-in Variables for a 1-D grid and 1-D block threadIdx.x -- “thread index” within block in “x” dimension blockIdx.x -- “block index” within grid in “x” dimension blockDim.x -- “block dimension” in “x” dimension (i.e. number of threads in block in x dimension) Full global thread ID in x dimension can be computed by: x = blockIdx.x * blockDim.x + threadIdx.x;

4 blocks, each having 8 threads Example -- x direction A 1-D grid and 1-D block 4 blocks, each having 8 threads Global ID 26 threadIdx.x threadIdx.x threadIdx.x threadIdx.x 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3 gridDim = 4 x 1 blockDim = 8 x 1 Global thread ID = blockIdx.x * blockDim.x + threadIdx.x = 3 * 8 + 2 = thread 26 with linear global addressing Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.

Code example with a 1-D grid and blocks Vector addition #define N 2048 // size of vectors #define T 256 // number of threads per block __global__ void vecAdd(int *a, int *b, int *c) { int i = blockIdx.x*blockDim.x + threadIdx.x; c[i] = a[i] + b[i]; } int main (int argc, char **argv ) { … vecAdd<<<N/T, T>>>(devA, devB, devC); // assumes N/T is an integer return (0); Note: __global__ CUDA function qualifier. __ is two underscores __global__ must return a void Number of blocks to map each vector across grid, one element of each vector per thread

If T/N not necessarily an integer: #define N 2000 // size of vectors #define T 256 // number of threads per block __global__ void vecAdd(int *a, int *b, int *c) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < N) c[i] = a[i] + b[i]; // allows for more threads than vector elements // some unused } int main (int argc, char **argv ) { int blocks = (N + T - 1) / T; // efficient way of rounding to next integer … vecAdd<<<blocks, T>>>(devA, devB, devC); return (0);

Questions How many threads are created? How many threads are unused? What is the maximum number of threads that can be created in a GPU on coit-grid06/7? On coit-grid08?

Higher dimensional grids/blocks 1-D grid and 1-D block suitable for processing one dimensional data Higher dimensional grids and blocks convenient for higher dimensional data. Processing 2-D arrays might use a two dimensional grid and two dimensional block Might need higher dimensions because of limitation on sizes of block in each dimension CUDA provided with built-in variables and structures to define number of blocks in grid in each dimension and number of threads in a block in each dimension.

Built-in CUDA data types and structures CUDA Vector Types/Structures unit3 and dim3 – can be considered essentially as CUDA-defined structures of unsigned integers: x, y, z, i.e. struct unit3 { x; y; z; }; struct dim3 { x; y; z; }; Used to define grid of blocks and threads, see next. Unassigned structure components automatically set to 1. There are other CUDA vector types.

Built-in Variables for Grid/Block Sizes dim3 gridDim -- Grid dimensions, x, y, z. Number of blocks in grid = gridDim.x * gridDim.y * gridDim.z dim3 blockDim -- Size of block dimensions x, y, and z. Number of threads in a block = blockDim.x * blockDim.y * blockDim.z

Example Initializing Values To set values in each dimensions, use for example: dim3 grid(16, 16); // Grid -- 16 x 16 blocks dim3 block(32, 32); // Block -- 32 x 32 threads … myKernel<<<grid, block>>>(...); which sets: gridDim.x = 16 gridDim.y = 16 gridDim.z = 1 blockDim.x = 32 blockDim.y = 32 blockDim.z = 1 when kernel called

CUDA Built-in Variables for Grid/Block Indices uint3 blockIdx -- block index within grid: blockIdx.x, blockIdx.y, blockIdx.z uint3 threadIdx -- thread index within block: threadIdx.x, threadIdx.y, threadIdx.z 2-D: Full global thread ID in x and y dimensions can be computed by: x = blockIdx.x * blockDim.x + threadIdx.x; y = blockIdx.y * blockDim.y + threadIdx.y; CUDA structures

2-D Grids and 2-D blocks blockIdx.y * blockDim.y + threadIdx.y threadID.y blockIdx.x * blockDim.x + threadIdx.x Thread

Flattening arrays onto linear memory Generally memory allocated dynamically on device (GPU) and we cannot not use two-dimensional indices (e.g. a[row][column]) to access array as we might otherwise. (Why?) We will need to know how the array is laid out in memory and then compute the distance from the beginning of the array. C uses row-major order --- rows are stored one after the other in memory, i.e. row 0 then row 1 etc.

a[row][column] = a[offset] Flattening an array Number of columns, N Array element a[row][column] = a[offset] offset = column + row * N where N is number of column in array column N-1 row Note: Another way to flatten array is: offset = row + column * N We will come back to this later as it does have very significant consequences on performance. row * number of columns

Using CUDA variables int col = blockIdx.x*blockDim.x+threadIdx.x; int row = blockIdx.y*blockDim.y+threadIdx.y; int index = col + row * N; a[index] = …

Example using 2-D grid and 2-D blocks Adding two arrays Corresponding elements of each array added together to form element of third array

CUDA version using 2-D grid and 2-D blocks Adding two arrays #define N 2048 // size of arrays __global__void addMatrix (int *a, int *b, int *c) { int col = blockIdx.x*blockDim.x+threadIdx.x; int row =blockIdx.y*blockDim.y+threadIdx.y; int index = col + row * N; if ( col < N && row < N) c[index]= a[index] + b[index]; } int main() { ... dim3 block (16,16); dim3 grid (N/block.x, N/block.y); addMatrix<<<grid, block>>>(devA, devB, devC); …

Matrix Multiplication Matrix multiplication is an important application in HPC and appears in many applications C = A * B where A, B, and C are matrices (two-dimensional arrays. A restricted case is when B has only one column -- matrix-vector multiplication, which appears in representation of linear equations and partial differential equations

Matrix multiplication, C = A x B

Implementing Matrix Multiplication Sequential Code Assume matrices square (N x N matrices). for (i = 0; i < N; i++) for (j = 0; j < N; j++) { c[i][j] = 0; for (k = 0; k < N; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j]; } Requires n3 multiplications and n3 additions Sequential time complexity of O(n3). Very easy to parallelize.

for multiplying two arrays CUDA Kernel for multiplying two arrays __global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) { int k, sum = 0; int col = threadIdx.x + blockDim.x * blockIdx.x; int row = threadIdx.y + blockDim.y * blockIdx.y; if (col < N && row < N) { for (k = 0; k < N; k++) sum += a[row * N + k] * b[k * N + col]; c[row * N + col] = sum; } In this example, one thread computes one C element and the number of threads must equal or greater than the number of elements

Sequential version with flattened arrays for comparison void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) { int i, j, k, sum; for (row =0; row < N; row++) // row of a for (col =0; col < N; col++) { // column of b sum = 0; for(k = 0; k < N; k++) sum += cpu_a[row * N + k] * cpu_b[k * N + col]; cpu_c[row * N + col] = sum; }

Matrix mapped on 2-D Grids and 2-D blocks A[][column] blockIdx.y * blockDim.y + threadIdx.y Grid Arrays mapped onto structure, one element per thread Block threadID.x threadID.y Array A[row][] Thread blockIdx.x * blockDim.x + threadIdx.x Basically array divided into “tiles” and one tile mapped onto one block

Complete Program (several slides) // Matrix addition program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010. #include <stdio.h> #include <cuda.h> #include <stdlib.h> __global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) { … } void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) { int main(int argc, char *argv[]) { int i, j; // loop counters int Grid_Dim_x=1, Grid_Dim_y=1; //Grid structure values int Block_Dim_x=1, Block_Dim_y=1; //Block structure values int noThreads_x, noThreads_y; // number of threads available in device, each dimension int noThreads_block; // number of threads in a block int N = 10; // size of array in each dimension int *a,*b,*c,*d; int *dev_a, *dev_b, *dev_c; int size; // number of bytes in arrays cudaEvent_t start, stop; // using cuda events to measure time float elapsed_time_ms; // which is applicable for asynchronous code also cudaEventCreate(&start); cudaEventCreate(&stop); /* --------------------ENTER INPUT PARAMETERS AND ALLOCATE DATA -----------------------*/ … // keyboard input dim3 Grid(Grid_Dim_x, Grid_Dim_x); //Grid structure dim3 Block(Block_Dim_x,Block_Dim_y); //Block structure, threads/block limited by specific device size = N * N * sizeof(int); // number of bytes in total in arrays a = (int*) malloc(size); //dynamically allocated memory for arrays on host b = (int*) malloc(size); c = (int*) malloc(size); // results from GPU d = (int*) malloc(size); // results from CPU … // load arrays with some numbers

cudaMalloc((void**)&dev_a, size); // allocate memory on device /* ------------- COMPUTATION DONE ON GPU ----------------------------*/ cudaMalloc((void**)&dev_a, size); // allocate memory on device cudaMalloc((void**)&dev_b, size); cudaMalloc((void**)&dev_c, size); cudaMemcpy(dev_a, a , size ,cudaMemcpyHostToDevice); cudaMemcpy(dev_b, b , size ,cudaMemcpyHostToDevice); cudaEventRecord(start, 0); // here start time, after memcpy gpu_matrixmult<<<Grid,Block>>>(dev_a,dev_b,dev_c,N); cudaMemcpy(c, dev_c, size , cudaMemcpyDeviceToHost); cudaEventRecord(stop, 0); // measuse end time cudaEventSynchronize(stop); cudaEventElapsedTime(&elapsed_time_ms, start, stop ); printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms); Where you measure time will make a big difference See later about using CUDA events to measure time.

/* ------------- COMPUTATION DONE ON HOST CPU ----------------------------*/ cudaEventRecord(start, 0); // use same timing* cpu_matrixmult(a,b,d,N); // do calculation on host cudaEventRecord(stop, 0); // measure end time cudaEventSynchronize(stop); cudaEventElapsedTime(&elapsed_time_ms, start, stop ); printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms); // exe. time /* ------------------- check device creates correct results -----------------*/ … /* --------------------- repeat program ----------------------------------------*/ … // while loop to repeat calc with different parameters /* -------------- clean up ---------------------------------------*/ free(a); free(b); free(c); cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c); cudaEventDestroy(start); cudaEventDestroy(stop); return 0; }

Effects of First Launch Some Preliminaries Effects of First Launch Program is written so that can repeat with different parameters without stopping program – to eliminate effect of first kernel launch Also might take advantage of caching – seems not significant as first launch

Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 1.65, First time Random numbers 0- 9 Answer Check both CPU and GPU same answers

Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 2.12 Second time

Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 2.16 Third time Subsequently can vary 2.12 – 2.18

Some results 256 x 256 array 8 blocks of 32 x 32 threads Speedup = 151.86

Some results 1024 x 1024 array 32 blocks of 32 x 32 threads Speedup = 860.9

Questions