
CS 791v Fall 2001

# a simple makefile for building the sample program.
# I use multiple versions of gcc, but cuda only supports
# gcc 4.4 or lower.  The -ccbin option tells nvcc to use
# gcc 4.4 as its regular (non-gpu) compiler.
#
# the uncommented line should do the trick for you.
# (note that recipe lines in a makefile must be indented with a tab.)

all: hicuda

hicuda: hellocuda.cu add.cu
	#nvcc -ccbin=/home/richard/bin/ $^ -o $@
	nvcc $^ -o $@

clean:
	rm -f *.o
	rm -f *~
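To build, run make in the directory containing hellocuda.cu, add.cu, and add.h; the hicuda rule above produces an executable named hicuda, which you then run directly (./hicuda).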

/*
  This program demonstrates the basics of working with cuda.  We use
  the GPU to add two arrays.  We also introduce cuda's approach to
  error handling and timing using cuda Events.

  This is the main program.  You should also look at the header add.h
  for the important declarations, and then look at add.cu to see how
  to define functions that execute on the GPU.
 */
#include <iostream>
#include <cstdlib>   // for exit

#include "add.h"

int main() {
  // Arrays on the host (CPU)
  int a[N], b[N], c[N];

  /*
    These will point to memory on the GPU - notice the correspondence
    between these pointers and the arrays declared above.
   */
  int *dev_a, *dev_b, *dev_c;

  /*
    These calls allocate memory on the GPU (also called the device).
    This is similar to C's malloc, except that instead of directly
    returning a pointer to the allocated memory, cudaMalloc returns
    the pointer through its first argument, which must be a void**.
    The second argument is the number of bytes we want to allocate.

    NB: the return value of cudaMalloc (like most cuda functions) is
    an error code.  Strictly speaking, we should check this value and
    perform error handling if anything went wrong.  We do this for the
    first call to cudaMalloc so you can see what it looks like, but
    for all other function calls we just point out that you should do
    error checking.

    Actually, a good idea would be to wrap this error checking in a
    function or macro, which is what the Cuda By Example book does.
   */
  cudaError_t err = cudaMalloc( (void**) &dev_a, N * sizeof(int));
  if (err != cudaSuccess) {
    std::cerr << "Error: " << cudaGetErrorString(err) << std::endl;
    exit(1);
  }
  cudaMalloc( (void**) &dev_b, N * sizeof(int));
  cudaMalloc( (void**) &dev_c, N * sizeof(int));
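As a sketch of that wrapper idea (the macro name CUDA_CHECK and its exact form are illustrative here, not taken from the course code or from the Cuda By Example book), the repeated checks can be centralized like this:

// Hypothetical error-checking macro: runs a cuda call and, on
// failure, prints where it happened and the error string, then exits.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::cerr << __FILE__ << ":" << __LINE__ << ": "            \
                << cudaGetErrorString(err_) << std::endl;         \
      exit(1);                                                    \
    }                                                             \
  } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc( (void**) &dev_b, N * sizeof(int)));
//   CUDA_CHECK(cudaMalloc( (void**) &dev_c, N * sizeof(int)));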

  // These lines just fill the host arrays with some data so we can do
  // something interesting.  Well, so we can add two arrays.
  for (int i = 0; i < N; ++i) {
    a[i] = i;
    b[i] = i;
  }

  /*
    The following code is responsible for handling timing for code
    that executes on the GPU.  The cuda approach to this problem uses
    events.  For timing purposes, an event is essentially a point in
    time.  We create events for the beginning and end points of the
    process we want to time.  When we want to start timing, we call
    cudaEventRecord.  In this case, we want to record the time it
    takes to transfer data to the GPU, perform some computations, and
    transfer data back.
   */
  cudaEvent_t start, end;
  cudaEventCreate(&start);
  cudaEventCreate(&end);
  cudaEventRecord( start, 0 );

  /*
    Once we have host arrays containing data and we have allocated
    memory on the GPU, we have to transfer data from the host to the
    device.  Again, notice the similarity to C's memcpy function.

    The first argument is the destination of the copy - in this case a
    pointer to memory allocated on the device.  The second argument is
    the source of the copy.  The third argument is the number of bytes
    we want to copy.  The last argument is a constant that tells
    cudaMemcpy the direction of the transfer.
   */
  cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_c, c, N * sizeof(int), cudaMemcpyHostToDevice);

  /*
    FINALLY we get to run some code on the GPU.  At this point, if you
    haven't looked at add.cu (in this folder), you should.  The
    comments in that file explain what the add function does, so here
    let's focus on how add is being called.

    The first thing to notice is the <<<N, 1>>>, which you should
    recognize as _not_ being standard C.  This syntactic extension
    tells nvidia's cuda compiler how to parallelize the execution of
    the function.  We'll get into details as the course progresses,
    but for now we'll say that <<<N, 1>>> is creating N _blocks_ of 1
    _thread_ each.  Each of these threads is executing add with a
    different data element (details of the indexing are in add.cu).

    In larger programs, you will typically have many more blocks, and
    each block will have many threads.  Each thread will handle a
    different piece of data, and many threads can execute at the same
    time.  This is how cuda can get such large speedups.
   */
  add<<<N, 1>>>(dev_a, dev_b, dev_c);
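As a sketch of that more typical configuration (the 256-thread block size and the numBlocks arithmetic are illustrative choices, not part of this program), a launch with many threads per block would look like the following; the add kernel would then have to compute its index from both blockIdx and threadIdx, as sketched after add.cu below:

  // Illustrative only: 256 threads per block, and enough blocks to
  // cover all N elements (the division rounds up).
  int threadsPerBlock = 256;
  int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
  add<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);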

  /*
    Unfortunately, the GPU is to some extent a black box.  In order to
    print the results of our call to add, we have to transfer the data
    back to the host.  We do that with a call to cudaMemcpy, which is
    just like the cudaMemcpy calls above, except that the direction of
    the transfer (given by the last argument) is reversed.  In a real
    program we would want to check the error code returned by this
    function.
   */
  cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

  /*
    This is the other end of the timing process.  We record an event,
    synchronize on it, and then figure out the difference in time
    between the start and the stop.  We have to call
    cudaEventSynchronize before we can safely _read_ the value of the
    stop event.  This is because the GPU may not have actually written
    to the event until all other work has finished.
   */
  cudaEventRecord( end, 0 );
  cudaEventSynchronize( end );
  float elapsedTime;
  cudaEventElapsedTime( &elapsedTime, start, end );

  /*
    Let's check that the results are what we expect.
   */
  for (int i = 0; i < N; ++i) {
    if (c[i] != a[i] + b[i]) {
      std::cerr << "Oh no! Something went wrong. You should check your cuda install and your GPU. :(" << std::endl;

      // clean up events - we should check for error codes here.
      cudaEventDestroy( start );
      cudaEventDestroy( end );

      // clean up device pointers - just like free in C. We don't have
      // to check error codes for this one.
      cudaFree(dev_a);
      cudaFree(dev_b);
      cudaFree(dev_c);
      exit(1);
    }
  }

  /*
    Let's let the user know that everything is ok and then display
    some information about the times we recorded above.
   */
  std::cout << "Yay! Your program's results are correct." << std::endl;
  std::cout << "Your program took: " << elapsedTime << " ms." << std::endl;

  // Cleanup in the event of success.
  cudaEventDestroy( start );
  cudaEventDestroy( end );
  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);
}

/*
  This header demonstrates how we build cuda programs spanning
  multiple files.
 */
#ifndef ADD_H_
#define ADD_H_

// This is the number of elements we want to process.
#define N 1024

// This is the declaration of the function that will execute on the GPU.
__global__ void add(int*, int*, int*);

#endif  // ADD_H_

#include "add.h" /* This is the function that each thread will execute on the GPU. The fact that it executes on the device is indicated by the __global__ modifier in front of the return type of the function. After that, the signature of the function isn't special - in particular, the pointers we pass in should point to memory on the device, but this is not indicated by the function's signature. */ __global__ void add(int *a, int *b, int *c) {

  /*
    Each thread knows its identity in the system.  This identity is
    made available in code via the indices blockIdx and threadIdx.  We
    write blockIdx.x because block indices are multidimensional.  In
    this case, we have linear arrays of data, so we only need one
    dimension.  If this doesn't make sense, don't worry - the
    important thing is that the first step in the function is
    converting the thread's identity into an index into the data.
   */
  int thread_id = blockIdx.x;

  /*
    We make sure that the thread_id isn't too large, and then we
    assign c = a + b using the index we calculated above.  The big
    picture is that each thread is responsible for adding one element
    from a and one element from b.  Each thread is able to run in
    parallel, so we get speedup.
   */
  if (thread_id < N) {
    c[thread_id] = a[thread_id] + b[thread_id];
  }
}
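For the hypothetical multi-thread-per-block launch sketched earlier, the index calculation changes so that every thread in every block gets a unique element. This variant (named add_multi here purely for illustration) is a sketch, not part of the course code:

/*
  Each thread combines its block index, the number of threads per
  block (blockDim.x), and its thread index within the block to get a
  unique global index into the arrays.
 */
__global__ void add_multi(int *a, int *b, int *c) {
  int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
  if (thread_id < N) {
    c[thread_id] = a[thread_id] + b[thread_id];
  }
}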