CUDA C/C++ Basics Part 2 - Blocks and Threads


CUDA C/C++ Basics Part 2 - Blocks and Threads
REU 2016, by Bhaskar Bhattacharya
Adapted from Cyril Zeller, NVIDIA Corporation
© NVIDIA Corporation 2011

What did we learn?
- CUDA is a parallel computing platform and programming model for the GPU
- Device vs. host code
- The __global__ keyword
- Review of memory management, i.e., pointers

Parallel Programming in CUDA C/C++

Addition on the Device
A simple kernel to add two integers:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

As before, __global__ is a CUDA C/C++ keyword meaning:
- add() will execute on the device
- add() will be called from the host

Addition on the Device
Note that we use pointers for the variables:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

add() runs on the device, so a, b, and c must point to device memory. We need to allocate memory on the GPU.

Memory Management
Host and device memory are separate entities:
- Device pointers point to GPU memory. They may be passed to/from host code, but may not be dereferenced in host code.
- Host pointers point to CPU memory. They may be passed to/from device code, but may not be dereferenced in device code.
Simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy(), similar to the C equivalents malloc(), free(), memcpy().
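A minimal sketch (not from the slides) of how those three calls fit together; the four-element array and its contents are illustrative assumptions, and the cudaMalloc return code is checked since every CUDA API call returns a cudaError_t:

    #include <stdio.h>

    int main(void) {
        int h[4] = {1, 2, 3, 4};            // host data
        int *d = NULL;                      // device pointer
        size_t size = 4 * sizeof(int);

        if (cudaMalloc((void **)&d, size) != cudaSuccess) return 1;
        cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);   // host -> device
        cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);   // device -> host
        cudaFree(d);                                      // release device memory
        return 0;
    }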

Addition on the Device: add()
Returning to our add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

Let's take a look at main()…

Addition on the Device: main()

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = 2;
        b = 7;

Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
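A small sanity check worth adding just before the cleanup (our suggestion, not on the slide): print the result, which should be 2 + 7 = 9.

        printf("2 + 7 = %d\n", c);    // expected: 9 (requires #include <stdio.h> at the top)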

Moving to Parallel
GPU computing is about massive parallelism. So how do we run code in parallel on the device?

    add<<< 1, 1 >>>();    // one block, one thread
    add<<< N, 1 >>>();    // N blocks, one thread each

Instead of executing add() once, execute it N times in parallel.

Vector Addition on the Device
With add() running in parallel we can do vector addition. Terminology:
- Each parallel invocation of add() is referred to as a block.
- The set of blocks is referred to as a grid.
- Each invocation can refer to its block index using blockIdx.x.

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

By using blockIdx.x to index into the array, each block handles a different element of the array.

A little more on grids, blocks, and threads
The grid/block/thread hierarchy is a software layer between you and the GPU cores:
- Each kernel launch creates a grid.
- Each grid is made up of multiple blocks.
- Each block is made up of multiple threads.
The GPU handles which core each thread executes on. (See the sketch below.)
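A minimal sketch of how the hierarchy shows up in code; the kernel name and launch configuration here are illustrative assumptions, not from the slides:

    #include <stdio.h>

    // Each thread reports where it sits in the hierarchy.
    __global__ void whereami(void) {
        printf("block %d, thread %d of %d\n",
               blockIdx.x, threadIdx.x, blockDim.x);
    }

    int main(void) {
        // <<<blocks in the grid, threads per block>>>
        whereami<<<4, 8>>>();        // a grid of 4 blocks x 8 threads = 32 threads
        cudaDeviceSynchronize();     // wait for the kernel so the output appears
        return 0;
    }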

Vector Addition on the Device

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

On the device, each block can execute in parallel:

    Block 0: c[0] = a[0] + b[0];
    Block 1: c[1] = a[1] + b[1];
    Block 2: c[2] = a[2] + b[2];
    Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
Returning to our parallelized add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

Let's take a look at main()…

Vector Addition on the Device: main()

    #define N 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);
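random_ints() is used on the slides but never defined. A minimal sketch of one possible implementation (assumed; the range of values is arbitrary):

    #include <stdlib.h>

    // Fill an array with small random integers.
    void random_ints(int *p, int n) {
        for (int i = 0; i < n; ++i)
            p[i] = rand() % 100;
    }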

Vector Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Vector addition using threads? Check out the first program we ran. A sketch of that variant follows.
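That thread-only variant is not reproduced on this slide; as a sketch consistent with the "one block with several threads" case named on the next slide, it swaps blockIdx.x for threadIdx.x and launches a single block:

    __global__ void add(int *a, int *b, int *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

    // launched as: add<<<1, N>>>(d_a, d_b, d_c);   // one block of N threads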

Combining Blocks and Threads
We've seen parallel vector addition using:
- Several blocks with one thread each
- One block with several threads
Let's adapt vector addition to use both blocks and threads. Why? We'll come to that… First let's discuss data indexing…

Indexing Arrays with Blocks and Threads
No longer as simple as using blockIdx.x and threadIdx.x. Consider indexing an array with one element per thread (8 threads/block):

    threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
    blockIdx.x:          0        |        1        |        2        |        3

With M threads per block, a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
Which thread will operate on the highlighted element, index 21 of a 32-element array?

    M = 8 (threads per block)
    threadIdx.x = 5, blockIdx.x = 2

    int index = threadIdx.x + blockIdx.x * M;
              = 5 + 2 * 8;
              = 21;
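A runnable check of the formula (our sketch, not on the slides): every thread computes its global index and records it, so the output array holds 0, 1, 2, …, 31 and element 21 is indeed written by thread 5 of block 2.

    #include <stdio.h>

    __global__ void write_index(int *out) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;   // the formula above
        out[index] = index;
    }

    int main(void) {
        int h[32], *d;
        cudaMalloc((void **)&d, sizeof(h));
        write_index<<<4, 8>>>(d);             // 4 blocks x 8 threads, as in the example
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("h[21] = %d\n", h[21]);        // prints 21, matching the slide
        cudaFree(d);
        return 0;
    }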

Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for threads per block:

    int index = threadIdx.x + blockIdx.x * blockDim.x;

Combined version of add() that uses both parallel threads and parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        c[index] = a[index] + b[index];
    }

What changes need to be made in main()?

Addition with Blocks and Threads: main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

Addition with Blocks and Threads: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Handling Arbitrary Vector Sizes
Typical problems are not friendly multiples of blockDim.x. Avoid accessing beyond the end of the arrays:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Update the kernel launch (M is the number of threads per block):

    add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
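Why (N + M - 1) / M: integer division truncates, so adding M - 1 first makes it round up, guaranteeing enough blocks to cover all N elements while the (index < n) guard masks off the excess threads. With assumed numbers:

    // Illustrative sizes: N = 1000 elements, M = 256 threads per block.
    // (1000 + 255) / 256 = 4 blocks -> 4 * 256 = 1024 threads launched.
    // Threads with index 1000..1023 fail (index < n) and simply do nothing.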

Why Bother with Threads?
Threads seem unnecessary: they add a level of complexity. What do we gain?
Unlike parallel blocks, threads have mechanisms to efficiently:
- Communicate
- Synchronize
Thread creation and recycling is also extremely fast (within a few clock cycles).
To look closer, we need a new example… next lecture!
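As a teaser for that next lecture (Part 3 covers shared memory and synchronization), a minimal sketch of both mechanisms; the kernel and its chunk-reversal logic are illustrative assumptions, not from this deck:

    #define BLOCK 64

    // Threads in a block communicate through __shared__ memory and
    // synchronize with __syncthreads(). Each block reverses its own
    // BLOCK-element chunk of the array.
    __global__ void reverse_chunk(int *d) {
        __shared__ int s[BLOCK];          // visible to all threads in this block
        int t = threadIdx.x;
        int base = blockIdx.x * BLOCK;

        s[t] = d[base + t];               // each thread stages one element
        __syncthreads();                  // wait until the whole chunk is staged
        d[base + t] = s[BLOCK - 1 - t];   // safely read another thread's element
    }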

Questions?

Assignment
- Create another parallel program to find the dot product of two vectors of length 10, 100, 1000, and 10000.
- Do the same on the CPU and time both, accurate to ms or ns (google how to time things; a timing sketch follows the table below).
- Create a quick chart reporting any speed differences between CPU and GPU, broken down by time required for transfer and time taken for execution.
- You can use the default program we saw at the beginning of the lecture as a starting point.

Assignment
Report one row per run:

    Vector Size | CPU Time | Total GPU Time | GPU kernel execution time | GPU data transfer time
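One possible way to fill the GPU columns (an assumed sketch, not part of the handout) is CUDA events, which time device work in milliseconds; CPU time can come from clock_gettime() or std::chrono. The placeholder kernel here exists only so the sketch compiles standalone:

    #include <stdio.h>

    __global__ void my_kernel(void) { }       // placeholder; time your dot product instead

    int main(void) {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        my_kernel<<<1, 1>>>();                // the work being timed
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);           // wait for the kernel to finish
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }

Transfer time can be measured the same way by bracketing the cudaMemcpy calls with a pair of events.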

Advanced assignment
- Using the previous program, try different combinations of thread and block counts for vectors of length 10,000.
- Make a graph of the time taken for the various combinations of threads and blocks.