ME964 High Performance Computing for Engineering Applications
The CUDA API
February 08, 2011
© Dan Negrut, 2011, ME964 UW-Madison
"Most of the time I don't have much fun. The rest of the time I don't have any fun at all." – Woody Allen

Before We Get Started…
Last time:
- Andrew: wrapped up building CUDA apps in Visual Studio 2008
- Andrew: running apps through the HPC scheduler on Newton
- Very high-level overview of the CUDA programming model
- Discussed indexing issues in the context of the "execution configuration" and how the index of a thread translates into a thread ID
- Brief discussion of the memory spaces in relation to GPU computing
Today:
- Discussion of the CUDA API
- One-on-one with Andrew if you have compile/build issues in CUDA: 3-5 PM in room 2042ME
HW:
- HW2: due date was 02/08, now pushed to 02/10
- HW3 has been posted; due date: 02/15
  - Small matrix-vector multiplication
  - Matrix addition – requires use of multiple blocks

Putting Things in Perspective…
- CUDA programming model and execution configuration: basic concepts and data types – just finished this…
- CUDA application programming interface – working on it next
- Simple example to illustrate basic concepts and functionality – coming up shortly
- Performance features will be covered later

The CUDA API

What is an API?
- Application Programming Interface (API): a set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
- Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- A cooked-up analogy (for the mechanical engineer): think of a car – it has a certain "Device Operating Interface" (DOI): a series of pedals, gauges, a steering wheel, etc.
- In this context, CUDA provides an API that enables you to tap into the computational resources of NVIDIA's GPUs
- This is what replaced the old GPGPU way of programming the hardware

On the CUDA API
- Reading the CUDA Programming Guide you'll run into numerous references to the CUDA Runtime API and the CUDA Driver API
  - Many times they talk about the "CUDA runtime" and the "CUDA driver"; what they mean is the CUDA Runtime API and the CUDA Driver API
- CUDA Runtime API: the friendly face that you can choose to see when interacting with the GPU; this is what gets identified with "C CUDA"
  - Needs the nvcc compiler to generate an executable
- CUDA Driver API: more like how it was back in the day – a low-level way of interacting with the GPU
  - You have significantly more control over the host-device interaction
  - A significantly clunkier way to dialogue with the GPU; typically needs only a C compiler
- I don't anticipate any reason to use the CUDA Driver API
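
To make the contrast concrete, here is a minimal sketch (mine, not from the lecture) that allocates and frees one device buffer through each API; the driver-API entry points shown (cuInit, cuDeviceGet, cuCtxCreate, cuMemAlloc) are the standard ones, but the program is purely illustrative:

    #include <cuda.h>          // driver API
    #include <cuda_runtime.h>  // runtime API

    int main()
    {
        // Runtime API: context creation is implicit, one call allocates.
        float* dA;
        cudaMalloc((void**)&dA, 1024 * sizeof(float));
        cudaFree(dA);

        // Driver API: everything is explicit.
        cuInit(0);                      // must precede any other driver call
        CUdevice dev;
        cuDeviceGet(&dev, 0);           // grab device 0
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);      // create a context and make it current
        CUdeviceptr dB;
        cuMemAlloc(&dB, 1024 * sizeof(float));
        cuMemFree(dB);
        cuCtxDestroy(ctx);
        return 0;
    }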

Talking About the API: The C CUDA Software Stack
[Figure: the C CUDA software stack; a thick red line marks the API layer, indicating where the API fits between the application and the hardware]
- NOTE: any CUDA runtime function has a name that starts with "cuda"
  - Examples: cudaMalloc, cudaFree, cudaMemcpy, etc.
- Examples of CUDA libraries: CUFFT, CUBLAS, CUSP, Thrust, etc.
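
To give a flavor of those libraries, below is a small illustrative Thrust example (not from the slides; Thrust ships with later CUDA toolkits, while in the CUDA 3.2 era referenced in this course it was a separate download). It sorts a million integers on the GPU without a single hand-written kernel:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        // Fill a host vector with random integers.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        thrust::device_vector<int> d = h;             // copies host -> device
        thrust::sort(d.begin(), d.end());             // runs on the GPU
        thrust::copy(d.begin(), d.end(), h.begin());  // copies device -> host
        return 0;
    }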

CUDA Function Declarations

                                       Executed on the:   Only callable from the:
    __device__ float DeviceFunc()      device             device
    __global__ void  KernelFunc()      device             host
    __host__   float HostFunc()        host               host

- __global__ defines a kernel function; must return void
- __device__ and __host__ can be used together
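
To see the three qualifiers side by side, here is a short sketch of my own (the function names are made up for illustration):

    // Runs on the GPU; callable only from device or kernel code.
    __device__ float square(float x) { return x * x; }

    // Compiled twice: once for the host, once for the device.
    __host__ __device__ float twice(float x) { return 2.0f * x; }

    // Kernel: runs on the GPU, launched from the host, must return void.
    __global__ void scaleKernel(float* data, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) data[i] = twice(square(data[i]));
    }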

CUDA Function Declarations (cont.)
- __device__ functions can't have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
  - Something like printf would not work…
HK-UIUC

Compiling CUDA
- Any source file containing CUDA language extensions must be compiled with nvcc
  - You spot such a file by its .cu suffix
- nvcc is a compile driver
  - Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
- nvcc can output:
  - C code, which must then be compiled with the rest of the application using another tool
  - PTX code (CUDA's ISA)
  - Or directly object code (cubin)
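
For reference, typical invocations might look like the lines below (illustrative file names; the exact CUDA library path varies by installation):

    # Compile and link a single-file CUDA program into an executable.
    nvcc -o matmul matmul.cu

    # Or compile the .cu file to an object file and link it into a larger
    # C++ application (may also need -L<cuda install>/lib64).
    nvcc -c kernels.cu -o kernels.o
    g++ main.cpp kernels.o -o app -lcudart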

Compiling CUDA (cont.)
- nvcc: compile driver; invokes cudacc, gcc, cl, etc.
- PTX (Parallel Thread eXecution): like assembly language; NVIDIA's ISA
[Figure: compilation flow – a C/C++ CUDA application goes into NVCC, which emits CPU code plus PTX code; a PTX-to-target compiler then turns the PTX into target code for a specific GPU (G80, …)]
- Sample PTX:

    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;

Compiling CUDA Extended C
[Figure-only slide in the original deck]

The nvcc Compiler – Suffix Info

    File suffix        How the nvcc compiler interprets the file
    .cu                CUDA source file, containing host and device code
    .cup               Preprocessed CUDA source file, containing host code and device functions
    .c                 C source file
    .cc, .cxx, .cpp    C++ source file
    .gpu               GPU intermediate file (device code only)
    .ptx               PTX intermediate assembly file (device code only)
    .cubin             CUDA device-only binary file

CUDA API: Device Memory Allocation
[Figure: the CUDA memory hierarchy – the host next to a device grid of blocks; each block has shared memory plus per-thread registers and local memory, and the grid shares global, constant, and texture memory]
- cudaMalloc(): allocates an object in the device global memory; requires two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object
- cudaFree(): frees an object from device global memory; takes the pointer to the freed object
HK-UIUC

Example Use: A Matrix Data Type
- NOT part of the CUDA API; it will be used frequently in many code examples
- A 2D matrix of single precision float elements, width * height elements in total
- Matrix entries are attached to the pointer-to-float member called "elements"
- The matrix is stored row-wise

    typedef struct {
        int width;
        int height;
        float* elements;
    } Matrix;

CUDA Device Memory Allocation (cont.)
Code example: allocate a 64 * 64 single precision float array and attach the allocated storage to Md.elements ("d" is often used to indicate a device data structure):

    const int BLOCK_SIZE = 64;
    Matrix Md;
    int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

    cudaMalloc((void**)&Md.elements, size);
    … // use it for what you need, then free the device memory
    cudaFree(Md.elements);

All the details are spelled out in the CUDA Programming Guide 3.2 (see the resources section of the class website).
HK-UIUC

CUDA Host-Device Data Transfer
- cudaMemcpy(): memory data transfer; requires four parameters:
  - Pointer to destination (note: the destination comes first in the argument list)
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: host to host, host to device, device to host, or device to device
HK-UIUC

CUDA Host-Device Data Transfer (cont.)
Code example: transfer a 64 * 64 single precision float array; M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants:

    cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);

HK-UIUC

Assignment 2 Pseudocode [short detour, helpful with the assignment]
Problem 2 can be implemented in four steps (see the sketch below):
- Step 1: Allocate memory on the device (see cudaMalloc)
- Step 2: Invoke the kernel with one block of four threads (see the vector add example for passing the device pointer to the kernel); NOTE: each of the four threads populates the allocated device memory with the result it computes
- Step 3: Copy the data in the device array back to the host (see cudaMemcpy)
- Step 4: Free the memory allocated on the device (see cudaFree)
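
Since the assignment statement itself is not reproduced here, the following is only a skeleton of those four steps under an assumed setup (four threads, each writing one int); the kernel body is a placeholder:

    #include <cstdio>

    // Placeholder kernel: each of the 4 threads writes its own result.
    __global__ void fillKernel(int* out)
    {
        out[threadIdx.x] = threadIdx.x;  // replace with the real computation
    }

    int main()
    {
        const int N = 4;
        int hostA[N];
        int* devA;
        cudaMalloc((void**)&devA, N * sizeof(int));  // Step 1
        fillKernel<<<1, N>>>(devA);                  // Step 2: one block, four threads
        cudaMemcpy(hostA, devA, N * sizeof(int),
                   cudaMemcpyDeviceToHost);          // Step 3
        cudaFree(devA);                              // Step 4
        for (int i = 0; i < N; ++i) printf("%d\n", hostA[i]);
        return 0;
    }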

Simple Example: Matrix Multiplication
- A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
- Leave shared memory usage until later
- For now, concentrate on:
  - Local variable and register usage
  - Thread ID usage
  - Memory data transfer API between host and device
HK-UIUC

Square Matrix Multiplication Example
- Compute P = M * N; the matrices P, M, N are of size WIDTH x WIDTH
- Software design decisions:
  - One thread handles one element of P
  - Each thread accesses all the entries in one row of M and one column of N:
    - 2 * WIDTH read accesses to global memory
    - One write access to global memory
[Figure: M, N, and P laid out as WIDTH x WIDTH squares]

Multiply Using One Thread Block
- One block of threads computes matrix P; each thread computes one element of P
- Each thread:
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Performs one multiply and one addition for each pair of M and N elements
- Compute to off-chip memory access ratio close to 1:1 (roughly one floating point operation per global memory load); not that good, but acceptable for now…
- Size of the matrix is limited by the number of threads allowed in a thread block (e.g., on hardware with a 512-thread-per-block limit, WIDTH can be at most 22, since 22 * 22 = 484 fits but 23 * 23 = 529 does not)
[Figure: Grid 1 containing Block 1; Thread (2, 2) computes one element of P from a row of M and a column of N]
HK-UIUC

Step 1: Matrix Multiplication – A Simple Host Code in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
    {
        for (int i = 0; i < M.height; ++i) {
            for (int j = 0; j < N.width; ++j) {
                double sum = 0;
                for (int k = 0; k < M.width; ++k) {
                    double a = M.elements[i * M.width + k];  // you'll see a lot of this…
                    double b = N.elements[k * N.width + j];  // and of this as well…
                    sum += a * b;
                }
                P.elements[i * N.width + j] = sum;
            }
        }
    }

HK-UIUC

Step 2: Matrix Multiplication – Host-Side Main Program Code

    int main(void)
    {
        // Allocate and initialize the matrices.
        // The last argument in AllocateMatrix: should an initialization with
        // random numbers be done? Yes: 1. No: 0 (everything is set to zero).
        Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
        Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
        Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

        // M * N on the device
        MatrixMulOnDevice(M, N, P);

        // Free matrices
        FreeMatrix(M);
        FreeMatrix(N);
        FreeMatrix(P);
        return 0;
    }

HK-UIUC
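
AllocateMatrix itself is not shown anywhere in these slides; a minimal host-side version consistent with how it is called above might look like this (my assumption, not the course's actual helper):

    #include <cstdlib>

    // Allocate a host matrix; init == 1 fills it with random values in [0, 1],
    // init == 0 leaves everything zeroed.
    Matrix AllocateMatrix(int width, int height, int init)
    {
        Matrix M;
        M.width = width;
        M.height = height;
        M.elements = (float*)calloc(width * height, sizeof(float));
        if (init) {
            for (int i = 0; i < width * height; ++i)
                M.elements[i] = (float)rand() / RAND_MAX;
        }
        return M;
    }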

Step 3: Matrix Multiplication – Host-Side Code

    // Matrix multiplication on the device
    void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
    {
        // Load M and N to the device
        Matrix Md = AllocateDeviceMatrix(M);
        CopyToDeviceMatrix(Md, M);
        Matrix Nd = AllocateDeviceMatrix(N);
        CopyToDeviceMatrix(Nd, N);

        // Allocate P on the device
        Matrix Pd = AllocateDeviceMatrix(P);

        // Setup the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(WIDTH, WIDTH);

        // Launch the kernel on the device
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

        // Read P from the device
        CopyFromDeviceMatrix(P, Pd);

        // Free device matrices
        FreeDeviceMatrix(Md);
        FreeDeviceMatrix(Nd);
        FreeDeviceMatrix(Pd);
    }

HK-UIUC

Step 4: Matrix Multiplication – Device-Side Kernel Function

    // Matrix multiplication kernel – thread specification
    __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
    {
        // 2D thread index; this thread is in the business of computing P[ty][tx]
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue will end up storing the value of P[ty][tx]; that is,
        // P.elements[ty * P.width + tx] = Pvalue
        float Pvalue = 0;
        for (int k = 0; k < M.width; ++k) {
            float Melement = M.elements[ty * M.width + k];
            float Nelement = N.elements[k * N.width + tx];
            Pvalue += Melement * Nelement;
        }

        // Write the result to device memory; each thread writes one element
        P.elements[ty * P.width + tx] = Pvalue;
    }

[Figure: M, N, and P with the (tx, ty) thread's row of M and column of N highlighted]

Step 5: Some Loose Ends

    // Allocate a device matrix of same size as M.
    Matrix AllocateDeviceMatrix(const Matrix M)
    {
        Matrix Mdevice = M;
        int size = M.width * M.height * sizeof(float);
        cudaMalloc((void**)&Mdevice.elements, size);
        return Mdevice;
    }

    // Copy a host matrix to a device matrix.
    void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
    {
        int size = Mhost.width * Mhost.height * sizeof(float);
        cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
    }

    // Copy a device matrix to a host matrix.
    void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
    {
        int size = Mdevice.width * Mdevice.height * sizeof(float);
        cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
    }

    // Free a device matrix.
    void FreeDeviceMatrix(Matrix M)
    {
        cudaFree(M.elements);
    }

    // Free a host matrix.
    void FreeMatrix(Matrix M)
    {
        free(M.elements);
    }

HK-UIUC

The Common Pattern to CUDA Programming
- Phase 1: Allocate memory on the device and copy to the device the data required to carry out the computation on the GPU
- Phase 2: Let the GPU crunch the numbers, based on the kernel that you defined
- Phase 3: Bring the results back from the GPU; free the memory on the device (clean up…); you're done

Timing Your Application
- Timing support is part of the API; you pick it up as soon as you include the CUDA runtime header
- Why it is good to use:
  - Provides cross-platform compatibility
  - Deals with the asynchronous nature of device calls by relying on events and forced synchronization
- Reports elapsed time in milliseconds
- From the NVIDIA CUDA Library Documentation: computes the elapsed time between two events (in milliseconds, with a resolution of around 0.5 microseconds). If either event has not been recorded yet, this function returns cudaErrorInvalidValue. If either event has been recorded with a non-zero stream, the result is undefined.

Timing Example
Timing a query of device 0 properties:

    #include <cstdio>
    #include <iostream>

    int main()
    {
        cudaEvent_t startEvent, stopEvent;
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);

        cudaEventRecord(startEvent, 0);

        cudaDeviceProp deviceProp;
        const int currentDevice = 0;
        if (cudaGetDeviceProperties(&deviceProp, currentDevice) == cudaSuccess)
            printf("Device %d: %s\n", currentDevice, deviceProp.name);

        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);

        float elapsedTime;
        cudaEventElapsedTime(&elapsedTime, startEvent, stopEvent);
        std::cout << "Time to get device properties: " << elapsedTime << " ms\n";

        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
        return 0;
    }