ITCS 4/5145 GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013. CUDAProgModel.ppt. CUDA Programming Model. These notes will introduce: Basic GPU programming.

Presentation transcript:

1 ITCS 4/5145 GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013. CUDAProgModel.ppt
CUDA Programming Model
These notes will introduce:
    Basic GPU programming model
    CUDA kernel
    Simple CUDA program to add two vectors together
    Compiling the code on a Linux system

2 CUDA kernel routine
To write a CUDA program, one writes a code sequence that every thread on the GPU will execute. In CUDA, this code sequence is called a kernel routine. Kernel code is regular C, except that one typically uses the thread ID in expressions to ensure that each thread accesses different data. Example:
    …
    index = ThreadID;
    c[index] = a[index] + b[index];
All threads do this, but each with its own thread ID.

3 CPU and GPU memory
A program, once compiled, has code executed on the CPU and (kernel) code executed on the GPU. The CPU and the GPU have separate memories: CPU main memory and GPU global memory. One needs to *:
    Explicitly transfer data from CPU to GPU for GPU computation, and
    Explicitly transfer results in GPU memory back to CPU memory.
[Figure: CPU with CPU main memory and GPU with GPU global memory, with arrows labeled "Copy from CPU to GPU" and "Copy from GPU to CPU"]
* CUDA version 3. Version 4 (May 2011) can eliminate the need for explicit transfers.
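As a minimal sketch of what "separate memories" means in practice, the program below copies an array into GPU global memory and straight back, then compares the two host copies. The function name roundTrip and the test data are assumptions made for illustration and are not part of the original notes; only cudaMalloc, cudaMemcpy and cudaFree come from the CUDA runtime.

```cuda
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

#define N 256

// Hypothetical helper (not from the notes): copies a host array to GPU
// global memory and back. Returns 0 if the data survived the round trip,
// 1 if device allocation failed, 2 if the copies differ.
int roundTrip(void) {
    int src[N], dst[N];
    int *devBuf = NULL;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; i++) src[i] = i * i;   // arbitrary test data
    memset(dst, 0, size);

    if (cudaMalloc((void **)&devBuf, size) != cudaSuccess) return 1;

    cudaMemcpy(devBuf, src, size, cudaMemcpyHostToDevice);  // CPU -> GPU
    cudaMemcpy(dst, devBuf, size, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(devBuf);

    return memcmp(src, dst, size) == 0 ? 0 : 2;
}

int main(void) {
    printf("round trip %s\n", roundTrip() == 0 ? "ok" : "FAILED");
    return 0;
}
```

The host pointers src and dst are never passed to device code, and devBuf is never dereferenced on the host; each pointer is only meaningful in its own memory space.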

4 Basic CUDA program structure
int main (int argc, char **argv ) {
    1. Allocate memory space in device (GPU) for data
    2. Allocate memory space in host (CPU) for data
    3. Copy data to GPU
    4. Call "kernel" routine to execute on GPU (with CUDA syntax that defines the number of threads and their physical structure)
    5. Transfer results from GPU to CPU
    6. Free memory space in device (GPU)
    7. Free memory space in host (CPU)
    return 0;
}

5 1. Allocating memory space in "device" (GPU) for data
Use CUDA malloc routines:
    int size = N * sizeof(int);        // space for N integers
    int *devA, *devB, *devC;           // devA, devB, devC ptrs
    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
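The allocation calls above ignore their return values. cudaMalloc returns a cudaError_t, so a hedged sketch of checked allocation might look like the following. The helper names checkCuda and allocateVectors are hypothetical and introduced only for illustration; cudaGetErrorString is the runtime routine that turns an error code into text.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256

// Hypothetical helper (not from the notes): prints a message if a CUDA
// runtime call failed. Returns 1 on failure, 0 on success.
static int checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

// Allocates three device vectors of N ints and frees them again.
// Returns the number of allocations that failed.
int allocateVectors(void) {
    size_t size = N * sizeof(int);
    int *devA = NULL, *devB = NULL, *devC = NULL;
    int failures = 0;

    failures += checkCuda(cudaMalloc((void **)&devA, size), "cudaMalloc devA");
    failures += checkCuda(cudaMalloc((void **)&devB, size), "cudaMalloc devB");
    failures += checkCuda(cudaMalloc((void **)&devC, size), "cudaMalloc devC");

    // Freeing a NULL device pointer performs no operation, so this is safe
    // even if one of the allocations above failed.
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return failures;
}

int main(void) {
    printf("%d allocation(s) failed\n", allocateVectors());
    return 0;
}
```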

6 2. Allocating memory space in "host" (CPU) for data
Use regular C malloc routines:
    int *a, *b, *c;
    …
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);
or statically declare variables:
    #define N 256
    …
    int a[N], b[N], c[N];

7 3. Transferring data from host (CPU) to device (GPU)
Use CUDA routine cudaMemcpy. The first argument is the destination and the second is the source:
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
where:
    devA and devB are pointers to the destination in the device
    a and b are pointers to host data

8 4. Declaring "kernel" routine to execute on device (GPU)
CUDA introduces a syntax addition to C: triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:
    myKernel<<< n, m >>>(arg1, … );
n and m define the organization of thread blocks and of threads in a block. For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
arg1, … are the arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
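To make the two launch parameters concrete, here is a small sketch in which each thread simply records its own thread ID. The kernel name recordThreadId and the wrapper launchOneBlock are assumptions made for this illustration; cudaGetLastError is the runtime call that reports whether the launch itself was accepted.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256

// Hypothetical kernel (not from the notes): thread i writes i into out[i].
__global__ void recordThreadId(int *out) {
    int i = threadIdx.x;   // thread ID within the block
    out[i] = i;
}

// Launches one block of N threads and checks the result on the host.
// Returns the number of wrong elements, or -1 on a CUDA error.
int launchOneBlock(void) {
    int host[N];
    int *dev = NULL;
    size_t size = N * sizeof(int);

    if (cudaMalloc((void **)&dev, size) != cudaSuccess) return -1;

    recordThreadId<<<1, N>>>(dev);            // n = 1 block, m = N threads
    if (cudaGetLastError() != cudaSuccess) {  // was the launch rejected?
        cudaFree(dev);
        return -1;
    }

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(host, dev, size, cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(dev);
        return -1;
    }
    cudaFree(dev);

    int mismatches = 0;
    for (int i = 0; i < N; i++)
        if (host[i] != i) mismatches++;
    return mismatches;
}

int main(void) {
    printf("mismatches: %d\n", launchOneBlock());
    return 0;
}
```

A single block cannot hold an arbitrary number of threads; the limit depends on the GPU, so with n = 1 the value of N has to stay within it. N = 256 is well inside the limit on any CUDA-capable device.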

9 Example: Adding two vectors A and B
Declaring a kernel routine:
    #define N 256
    __global__ void vecAdd(int *a, int *b, int *c) {   // Kernel definition
        int i = threadIdx.x;
        c[i] = a[i] + b[i];
    }
    int main() {
        // allocate device memory &
        // copy data to device
        // device mem. ptrs devA,devB,devC
        vecAdd<<<1, N>>>(devA,devB,devC);   // Grid of one block, N threads in block
        …
    }
Loosely derived from CUDA C programming guide, v 3.2, 2010, NVIDIA
A kernel is defined using the CUDA specifier __global__ (two underscores each side). threadIdx is the CUDA structure that provides the thread ID in the block.
Each of the N threads performs one pair-wise addition:
    Thread 0:   devC[0] = devA[0] + devB[0];
    Thread 1:   devC[1] = devA[1] + devB[1];
    …
    Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];

10 5. Transferring data from device (GPU) to host (CPU)
Use CUDA routine cudaMemcpy. The first argument is the destination and the second is the source:
    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer in the device and c is a pointer in the host.

11 6. Free memory space in "device" (GPU)
Use CUDA cudaFree routine:
    cudaFree( devA);
    cudaFree( devB);
    cudaFree( devC);

12 7. Free memory space in "host" (CPU) (if CPU memory was allocated with malloc)
Use the regular C free routine to deallocate memory previously allocated with malloc:
    free( a );
    free( b );
    free( c );

13 Complete CUDA program
Adding two vectors, A and B. N elements in A and B, and N threads (without code to load arrays with data).
    #define N 256
    __global__ void vecAdd(int *A, int *B, int *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
    int main (int argc, char **argv ) {
        int size = N * sizeof(int);
        int a[N], b[N], c[N], *devA, *devB, *devC;
        cudaMalloc( (void**)&devA, size );
        cudaMalloc( (void**)&devB, size );
        cudaMalloc( (void**)&devC, size );
        cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
        vecAdd<<<1, N>>>(devA, devB, devC);
        cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
        cudaFree( devA);
        cudaFree( devB);
        cudaFree( devC);
        return (0);
    }
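The program on this slide leaves out the code that loads the arrays. As one possible way to fill that gap, the sketch below initializes a and b with values chosen here purely for illustration (a[i] = i, b[i] = 2i) and checks the result on the host. The wrapper name runVecAdd and the verification loop are assumptions, not part of the original notes.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256

__global__ void vecAdd(const int *A, const int *B, int *C) {
    int i = threadIdx.x;   // one thread per element
    C[i] = A[i] + B[i];
}

// Hypothetical wrapper (not from the notes): runs the vector addition and
// returns the number of elements of c that are wrong.
int runVecAdd(void) {
    int a[N], b[N], c[N];
    int *devA = NULL, *devB = NULL, *devC = NULL;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; i++) {   // illustrative input data
        a[i] = i;
        b[i] = 2 * i;
        c[i] = -1;                  // sentinel, overwritten on success
    }

    cudaMalloc((void **)&devA, size);
    cudaMalloc((void **)&devB, size);
    cudaMalloc((void **)&devC, size);

    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);   // one block of N threads

    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);

    int errors = 0;
    for (int i = 0; i < N; i++)
        if (c[i] != 3 * i) errors++;      // expect i + 2i
    return errors;
}

int main(void) {
    int errors = runVecAdd();
    printf("%s (%d wrong elements)\n", errors == 0 ? "PASSED" : "FAILED", errors);
    return 0;
}
```

Saved as a .cu file (for example a hypothetical vecadd.cu), this should build with nvcc and run like any other executable. If no usable GPU is present, the sentinel values in c remain and the check reports failures rather than silently passing.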

14 Compiling CUDA programs: "nvcc"
NVIDIA provides nvcc, the NVIDIA CUDA "compiler driver".
    It separates out the code for the host and for the device.
    A regular C/C++ compiler is used for the host code (it needs to be available).
    The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system.
    Command line options include options for GPU features.

15 Compiling code - Linux
Command line:
    nvcc -O3 -o <prog> <prog.cu> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart
where:
    -O3: optimization level, if you want optimized code
    -I: directories for #include files
    -L: directories for libraries
    -lcuda -lcudart: (dynamic) CUDA libraries to be linked (core and runtime, both needed)
A CUDA source file that includes device code has the extension .cu
A regular C compiler needs to be installed for the CPU code.
A make file is convenient; see next.
See "The CUDA Compiler Driver NVCC" from NVIDIA for more details.

16 Very simple sample Make file
    NVCC = /usr/local/cuda/bin/nvcc
    CUDAPATH = /usr/local/cuda
    NVCCFLAGS = -I$(CUDAPATH)/include
    LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
    X11FLAGS = -L/usr/X11R6/lib -lX11

    # A regular C program
    prog1: prog1.c
    	cc -o prog1 prog1.c -lm

    # A C program with X11 graphics
    prog2: prog2.c
    	cc -o prog2 prog2.c -lm $(X11FLAGS)

    # A CUDA program
    prog3: prog3.cu
    	$(NVCC) $(NVCCFLAGS) -o prog3 prog3.cu $(LFLAGS)

    # A CUDA program with X11 graphics
    prog4: prog4.cu
    	$(NVCC) $(NVCCFLAGS) -o prog4 prog4.cu $(LFLAGS) $(X11FLAGS)

17 Compilation process
    nvcc -o prog prog.cu -I/includepath -L/libpath
The nvcc "wrapper" divides the code into host and device parts. The host part is compiled by the regular C compiler (gcc). The device part is compiled by the NVIDIA "ptxas" assembler. The two compiled parts are combined into one executable.
[Figure: nvcc passes the host part to gcc and the device part to ptxas; the resulting object files are combined into one executable]
The executable file is a "fat" binary with both host and device code.

18 Executing Program
Simply type the name of the executable created by nvcc:
    ./prog1
The file includes all the code for the host and for the device in a "fat binary" file. Host code starts running. When the first device kernel is encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU. I am told by NVIDIA that present NVIDIA GPUs do not have instruction caches, so this process is repeated for each call. I am told by NVIDIA that the overhead is very small.
* Correction from previous slides.

19 Compiling and executing on a Windows system
One can use Microsoft Visual Studio and a PC with an NVIDIA GPU card. The basic setup is described in "Configuring Microsoft Visual Studio 2008 for CUDA Toolkit Version 3.2," B. Wilkinson and Brian Nacey, Feb 24, 2012 (VSforCUDA.pdf), but NVIDIA now provides a fully configured NVIDIA Nsight Visual Studio Edition, as well as an Eclipse version.

20 NVIDIA Nsight Visual Studio Edition

Questions