CUDA All material not from online sources/textbook copyright © Travis Desell, 2012

Overview
1. GPU Architecture
2. Data Parallelism
3. CUDA Program Structure
4. Vector Addition
5. Device Global Memory and Data Transfer
6. Kernel Functions and Threading

GPU Architecture

Some general terminology: A device refers to a GPU in the computer. A host refers to a CPU in the computer.
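As a small host-side illustration (our own sketch, not from the slides), the CUDA runtime lets the host query the devices it can launch work on:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);               // how many CUDA devices (GPUs) are visible
    printf("CUDA devices found: %d\n", count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);    // fill in the properties of device d
        printf("device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);
    }
    return 0;
}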

GPUs vs CPUs
[Diagram: a CPU die dominated by control logic and cache with a few ALUs over DRAM, next to a GPU die made of many small control/cache units each driving many ALUs over DRAM.]
CPUs are designed to minimize the execution time of single threads. GPUs are designed to maximize the throughput of many identical threads working on different data.

CUDA Capable GPU Architecture
[Diagram: the host feeds an input assembler and thread execution manager, which dispatch work to an array of streaming multiprocessors (SMs); each SM contains streaming processors (SPs), load/store units, and a texture cache, and all SMs share global memory.]
In a modern CUDA capable GPU, multiple streaming multiprocessors (SMs) each contain multiple streaming processors (SPs) that share the same control logic and instruction cache.

CUDA Capable GPU Architecture
[Same diagram as the previous slide.]
CUDA programs define kernels, which are what get loaded into the instruction cache on each SM. A kernel allows hundreds or thousands of threads to perform the same instructions over different data.

Data Parallelism

Task Parallelism vs Data Parallelism Vector processing units and GPUs take care of data parallelism, where the same operation or set of operations needs to be performed over a large set of data. Most parallel programs utilize data parallelism to achieve performance improvements and scalability.

Task Parallelism vs Data Parallelism Task parallelism, however, is when a large set of independent (or mostly independent) tasks need to be performed. The tasks are generally much more complex than the operations performed by data parallel programs.

Task Parallelism vs Data Parallelism It is worth mentioning that the two can be combined: the tasks can contain data parallel elements. Further, systems are increasingly able to do both, as GPU hardware and programming models like CUDA allow more operations to be performed in parallel within the data parallel tasks.

Data Processing Example
Vector addition, for example, is data parallel: each element of the result can be computed independently, C[i] = A[i] + B[i].
[Diagram: vectors A and B of length N added element by element to produce vector C.]

CUDA Program Structure

CUDA programs are C/C++ programs containing code that uses CUDA extensions. The CUDA compiler (nvcc) is essentially a wrapper around another C/C++ compiler (gcc, g++, clang). nvcc compiles the parts that run on the GPU and hands the rest to the regular compiler, which compiles for the CPU:
[Diagram: integrated C programs with CUDA extensions are split by the NVCC compiler into host code, which goes through the host C preprocessor, compiler, and linker, and device code, which goes through the device just-in-time compiler, producing a program for a heterogeneous computing platform with CPUs and GPUs.]
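For example, a single-file program can be built by invoking nvcc directly (the file and executable names here are placeholders of our own):

nvcc -o vecadd vecadd.cu
./vecadd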

CUDA Program Structure When CUDA programs run, they alternate between CPU serial code and GPU parallel kernels (which execute many threads in parallel). The host then waits for the kernel to finish executing all of its parallel threads at the next synchronization point, such as copying the results back to the CPU.
[Diagram: CPU serial code, then a block of threads launched as a GPU parallel kernel with KernelA<<<nBlk, nTid>>>(args), then more CPU serial code, then another kernel launch.]
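A minimal sketch of that alternation (our own example with a trivial kernel, not from the slides); cudaDeviceSynchronize() is the explicit way to make the host block until every thread of the kernel has finished:

#include <cuda_runtime.h>

// A trivial kernel: each thread writes its own global index into the output array.
__global__ void KernelA(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const int nBlk = 4, nTid = 64;
    int* d_out;
    cudaMalloc((void**) &d_out, nBlk * nTid * sizeof(int));   // device allocation, covered in detail later

    // ... CPU serial code ...
    KernelA<<<nBlk, nTid>>>(d_out);   // launch the parallel kernel
    cudaDeviceSynchronize();          // host waits here for all nBlk * nTid threads to finish
    // ... CPU serial code resumes; results can now be copied back safely ...

    cudaFree(d_out);
    return 0;
}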

Vector Addition

Vector Addition in C

// Compute vector sum h_C = h_A + h_B
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    for (int i = 0; i < n; i++) {
        h_C[i] = h_A[i] + h_B[i];
    }
}

int main() {
    float* h_A = new float[100];
    float* h_B = new float[100];
    float* h_C = new float[100];
    // ... assign elements into h_A, h_B ...
    vecAdd(h_A, h_B, h_C, 100);
}

The above is a simple C/C++ program which performs vector addition. The arrays h_A and h_B are added, with the results stored into h_C.

Skeleton Vector Add in CUDA
It’s important to note that GPUs and CPUs do not share the same memory, so in a CUDA program you have to move data back and forth between them.
[Diagram: the host (CPU) with its host memory (RAM), connected to the device (GPU) with its global memory.]

Skeleton Vector Add in CUDA

// We want to parallelize this!
// Compute vector sum h_C = h_A + h_B
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    for (int i = 0; i < n; i++) {
        h_C[i] = h_A[i] + h_B[i];
    }
}

int main() {
    float* h_A = new float[100];
    float* h_B = new float[100];
    float* h_C = new float[100];
    // ... assign elements into h_A, h_B ...

    // allocate enough memory for h_A, h_B, h_C on the GPU
    // copy the memory of h_A and h_B onto the GPU
    vecAddCUDA(h_A, h_B, h_C);  // perform this on the GPU!
    // copy the memory of h_C on the GPU back into h_C on the CPU
}

Device Global Memory and Data Transfer

Memory Operations in CUDA

cudaMalloc();  // allocates memory on the GPU (like malloc)
cudaFree();    // frees memory on the GPU (like free)
cudaMemcpy();  // copies memory between the host and the device (like memcpy)

CUDA provides these three functions for allocating memory on GPUs and moving memory to and from them. All three return a cudaError_t, which can be used to test for error conditions, e.g.:

cudaError_t err = cudaMalloc((void**) &d_A, size);

Memory Operations in CUDA

All three functions return a cudaError_t, which can be used to test for error conditions, e.g.:

cudaError_t err = cudaMalloc((void**) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in file %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
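Because this check is repeated after nearly every CUDA call, it is common to wrap it in a small helper macro. The macro below is our own sketch of that pattern (the name CUDA_CHECK is not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Evaluate a CUDA runtime call and abort with a readable message if it failed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("%s in file %s at line %d\n",                      \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

Used as CUDA_CHECK(cudaMalloc((void**) &d_A, size));, it keeps the error handling from burying the actual logic.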

Skeleton Vector Add in CUDA

int main() {
    int n = 100;
    int size = n * sizeof(float);
    float* h_A = new float[n];
    float* h_B = new float[n];
    float* h_C = new float[n];
    // ... assign elements into h_A, h_B ...

    // allocate enough memory for h_A, h_B, h_C as d_A, d_B, d_C on the GPU
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, size);
    cudaMalloc((void**) &d_B, size);
    cudaMalloc((void**) &d_C, size);

    // copy the memory of h_A and h_B onto the GPU
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAddCUDA(d_A, d_B, d_C, n);  // perform this on the GPU!

    // copy the memory of d_C on the GPU back into h_C on the CPU
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // free the memory on the device
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Kernel Functions and Threading

CUDA Grids
CUDA assigns groups of threads to blocks, and groups of blocks to grids. For now, we’ll just worry about blocks of threads. Using the built-in blockIdx, blockDim, and threadIdx variables, you can determine which thread is being run in the kernel, and from which block. For our GPU vector add, every block runs the same two lines, with each of its 256 threads computing a different global index i:

i = (blockIdx.x * blockDim.x) + threadIdx.x;
d_C[i] = d_A[i] + d_B[i];

[Diagram: blocks 0 through N-1, each with 256 threads, all executing these two lines over different elements.]

Vector Add Kernel

// Computes the vector sum on the GPU.
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + (blockDim.x * blockIdx.x);
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    // ...
    // this will create ceil(n/256.0) blocks, each with 256 threads,
    // that will each run vecAddKernel.
    vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);
    // ...
}
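Kernel launches themselves do not return a cudaError_t, so a common pattern (our own addition, not from the slides) is to call cudaGetLastError() right after the launch to catch configuration problems, and cudaDeviceSynchronize() to wait for the kernel and surface any errors raised while it ran:

vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

// the launch is asynchronous; this reports errors in the launch configuration itself
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("launch failed: %s\n", cudaGetErrorString(err));

// wait for the kernel to finish and report any errors that occurred while it ran
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("kernel failed: %s\n", cudaGetErrorString(err));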

CUDA keywords
By default, all functions are __host__ functions, so you typically do not see that qualifier frequently in CUDA code. __device__ functions are for use within kernel functions; they can only be called from and executed on the GPU. __global__ functions are callable from the CPU, however they are executed on the GPU.

                                 Executed on the:    Only callable from the:
__device__ float deviceFunc()    device              device
__global__ void KernelFunc()     device              host
__host__ float HostFunc()        host                host
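A small illustration of the three qualifiers together (our own example, not from the slides): a __device__ helper can only be called from device code such as a __global__ kernel, while the kernel itself is launched from ordinary __host__ code.

#include <cuda_runtime.h>

// __device__: runs on the GPU, callable only from other device or global functions.
__device__ float square(float x) {
    return x * x;
}

// __global__: runs on the GPU, launched from host code.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// __host__ (the default): runs on the CPU.
int main() {
    int n = 256;
    float* d_data;
    cudaMalloc((void**) &d_data, n * sizeof(float));
    squareKernel<<<1, n>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}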

Conclusions