CUDA (Compute Unified Device Architecture) Supercomputing for the Masses by Peter Zalutski

What is CUDA? CUDA is a set of development tools for creating applications that execute on the GPU (Graphics Processing Unit). The CUDA compiler uses a variation of C, with future support for C++. CUDA was developed by NVIDIA and as such can only run on NVIDIA GPUs of the G8x series and up. CUDA was released on February 15, 2007 for the PC, and a beta version for Mac OS X followed on August 19, 2008.

Why CUDA? CUDA provides the ability to use a high-level language such as C to develop applications that take advantage of the high performance and scalability that the GPU architecture offers. GPUs allow the creation of a very large number of concurrently executing threads at very low system resource cost. CUDA also exposes fast shared memory (16 KB) that can be shared between threads. Full support for integer and bitwise operations. Compiled code runs directly on the GPU.

CUDA limitations No support for recursive functions; any recursive function must be converted into a loop (see the sketch below). Many deviations from the floating-point standard (IEEE 754). No texture rendering. Bus bandwidth and latency between GPU and CPU are a bottleneck for many applications. Threads should be run in groups of 32 or more for best performance. Only supported on NVIDIA GPUs.
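As a hedged illustration of the recursion limitation (this sketch is not from the original slides; the function names are hypothetical), a recursive array sum can be rewritten as a loop so that it is legal in device code:

    //recursive form - NOT usable in device code on these GPUs:
    //float sumRecursive(float* a, int n)
    //{
    //    return (n == 0) ? 0.0f : a[n - 1] + sumRecursive(a, n - 1);
    //}

    //equivalent iterative form - usable as a __device__ helper
    __device__ float sumIterative(float* a, int n)
    {
        float total = 0.0f;
        for(int i = 0; i < n; i++)
        {
            total = total + a[i];
        }
        return total;
    }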

GPU vs CPU GPUs contain a much larger number of dedicated ALUs than CPUs. GPUs also contain extensive support for the stream processing paradigm, which is related to SIMD (Single Instruction, Multiple Data) processing. Each processing unit on the GPU contains local memory that improves data manipulation and reduces fetch time.
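As a hedged sketch of this contrast (the function names are illustrative only), the same element-wise operation looks like a serial loop on the CPU and like a single data-parallel kernel on the GPU, where each thread handles one element:

    //CPU, serial: one thread walks the whole array
    void scaleOnHost(float* a, int n, float s)
    {
        for(int i = 0; i < n; i++)
        {
            a[i] = a[i] * s;
        }
    }

    //GPU, stream/SIMD style: many threads each handle one element
    __global__ void scaleOnDevice(float* a, int n, float s)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if(idx < n)
        {
            a[idx] = a[idx] * s;
        }
    }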

CUDA Toolkit content The nvcc C compiler. CUDA FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subprograms) libraries for the GPU. A profiler. An alpha version of the gdb debugger for the GPU. The CUDA runtime driver. The CUDA programming manual.
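A typical use of the toolkit's compiler (the file name example1.cu is only an assumption): nvcc example1.cu -o example1 compiles a CUDA source file into an ordinary executable, which is then run like any other program.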

CUDA Example 1

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define COUNT 10

    int main(void)
    {
        float* pDataCPU = 0;
        float* pDataGPU = 0;
        int i = 0;

        //allocate memory on host
        pDataCPU = (float*)malloc(sizeof(float) * COUNT);

CUDA Example 1 (continued)

        //allocate memory on GPU
        cudaMalloc((void**) &pDataGPU, sizeof(float) * COUNT);

        //initialize host data
        for(i = 0; i < COUNT; i++)
        {
            pDataCPU[i] = i;
        }

        //copy data from host to GPU
        cudaMemcpy(pDataGPU, pDataCPU, sizeof(float) * COUNT, cudaMemcpyHostToDevice);

CUDA Example 1 (continued)

        //do something on GPU (Example 2 adds here)

        //copy result data back to host
        cudaMemcpy(pDataCPU, pDataGPU, sizeof(float) * COUNT, cudaMemcpyDeviceToHost);

        //release memory
        free(pDataCPU);
        cudaFree(pDataGPU);

        return 0;
    }

CUDA Example 1 (notes) This example does the following:
o Allocates memory on the host and on the device (GPU).
o Initializes data on the host.
o Copies the data from host to device.
o After some arbitrary processing, copies the result data from device back to host.
o Frees memory on both host and device.
cudaMemcpy() is the function that performs the basic data-move operation. Its direction is selected by the flag passed as the last argument:
o cudaMemcpyHostToDevice - copy from CPU->GPU.
o cudaMemcpyDeviceToHost - copy from GPU->CPU.
o cudaMemcpyDeviceToDevice - copy data between allocated memory buffers on the device.

CUDA Example 1 (notes, continued) Memory allocation is done with cudaMalloc() and deallocation with cudaFree(). The maximum amount of memory that can be allocated is device specific. Source files must have the extension ".cu".
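Because the allocation limit is device specific, one way to check it at runtime is to query the device properties. A minimal sketch, assuming device 0 is the CUDA device in use:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;

        //query the properties of device 0 (assumed here)
        cudaGetDeviceProperties(&prop, 0);

        //totalGlobalMem is the global memory available to cudaMalloc(), in bytes
        printf("%s: %lu bytes of global memory\n", prop.name, (unsigned long)prop.totalGlobalMem);

        return 0;
    }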

CUDA Example 2 (notes) For many operations CUDA uses kernel functions. These functions are called from the host (CPU) and executed on the device (GPU) simultaneously by many threads in parallel. CUDA provides several extensions to the C language. "__global__" declares a kernel function that will be executed on the CUDA device. The return type of all such functions is void. We define these functions ourselves. Example 2 features the incrementArrayOnDevice kernel function. Its purpose is to increment the value of each element of an array. All elements are incremented by this single kernel, at the same time, using parallel execution across multiple threads.

CUDA Example 2 We will modify Example 1 by adding code between the memory copy from host to device and the copy from device to host. We will also define the following kernel function:

    __global__ void incrementArrayOnDevice(float* a, int size)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if(idx < size)
        {
            a[idx] = a[idx] + 1;
        }
    }

An explanation of this function follows the code.

CUDA Example 2 (continued)

        //inserting code to perform operations on GPU
        int nBlockSize = 4;
        int nBlocks = COUNT / nBlockSize + (COUNT % nBlockSize == 0 ? 0 : 1);

        //calling kernel function
        incrementArrayOnDevice<<<nBlocks, nBlockSize>>>(pDataGPU, COUNT);

        //rest of the code

CUDA Example 2 (notes) When we call a kernel function, we provide configuration values for that call. Those values are enclosed in the "<<< >>>" brackets. To understand the nBlocks and nBlockSize configuration values, we must first examine what a thread block is. A thread block is an organization of processing units that can communicate and synchronize with each other. A higher number of threads per block implies a higher hardware cost, since blocks are executed on physical units of the GPU.

Example 2 (notes, continued) The grid abstraction was introduced to solve the problem of different hardware having a different number of threads per block. In Example 2, nBlockSize specifies the number of threads per block. We then use this value to compute the number of blocks needed for the kernel call, based on the number of elements in the array; the computed value is nBlocks. Several built-in variables are available inside a kernel:
o blockIdx - block index within the grid.
o threadIdx - thread index within the block.
o blockDim - number of threads in a block.
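To make the index mapping concrete, here is a worked enumeration using the values from Example 2 (COUNT = 10, nBlockSize = 4, so nBlocks = 3); this breakdown is an added illustration, not part of the original slides:

    //idx = blockIdx.x * blockDim.x + threadIdx.x
    //
    //block 0: threads 0..3 -> idx 0, 1, 2, 3    (all update an element)
    //block 1: threads 0..3 -> idx 4, 5, 6, 7    (all update an element)
    //block 2: threads 0..3 -> idx 8, 9, 10, 11  (idx 10 and 11 fail the "idx < size" test and do nothing)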

Example 2 (notes, continued) Diagram of block breakdown and thread assignment for our array. (Rob Farber, "CUDA, Supercomputing for the Masses: Part 2", Dr. Dobb's.)

CUDA - Code execution flow At the start of execution, CUDA-compiled code runs like any other application: its primary execution happens on the CPU. When a kernel call is made, the application continues executing non-kernel code on the CPU while the kernel executes on the GPU, giving parallel processing between CPU and GPU. Memory movement between host and device is the primary bottleneck in application execution; execution on both sides is halted until that operation completes.
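A minimal sketch of this overlap, reusing the names from Example 2; doHostWork() is a hypothetical placeholder for independent CPU-side work:

    //the kernel launch returns immediately, so the CPU can keep working
    incrementArrayOnDevice<<<nBlocks, nBlockSize>>>(pDataGPU, COUNT);

    //hypothetical CPU-side work that overlaps with the kernel
    doHostWork();

    //block until the kernel has finished before using its results
    cudaThreadSynchronize();

    //the copy itself stalls both sides until it completes
    cudaMemcpy(pDataCPU, pDataGPU, sizeof(float) * COUNT, cudaMemcpyDeviceToHost);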

CUDA - Error Handling For non-kernel CUDA calls, a return value of type cudaError_t is provided to the caller. A human-readable description can be obtained with char* cudaGetErrorString(cudaError_t code). CUDA also provides cudaGetLastError(), which retrieves the last error from any previous runtime call. There are some considerations:
o Use cudaThreadSynchronize() to block until all kernel calls complete. This call returns an error code if one occurred. We must use it because the asynchronous execution of kernels would otherwise prevent us from getting an accurate result.

CUDA - Error Handling (continued)
o cudaGetLastError() only returns the last error reported, so the developer must take care to request the error code at the right point.
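Putting these pieces together, a hedged sketch of checking both a runtime call and a kernel launch, reusing the names from Examples 1 and 2 (printf requires stdio.h):

    cudaError_t err;

    //non-kernel calls report errors through their return value
    err = cudaMalloc((void**) &pDataGPU, sizeof(float) * COUNT);
    if(err != cudaSuccess)
    {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    }

    //kernel launches do not return an error directly...
    incrementArrayOnDevice<<<nBlocks, nBlockSize>>>(pDataGPU, COUNT);

    //...so wait for completion, then ask for the last reported error
    cudaThreadSynchronize();
    err = cudaGetLastError();
    if(err != cudaSuccess)
    {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    }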

CUDA - Memory Model Diagram depicting the memory organization. (Rob Farber, "CUDA, Supercomputing for the Masses: Part 4", Dr. Dobb's.)

CUDA - Memory Model (continued) Each block contains the following:
o A set of local registers per thread.
o A parallel data cache, or shared memory, that is shared by all the threads.
o A read-only constant cache that is shared by all the threads and speeds up reads from the constant memory space.
o A read-only texture cache that is shared by all the processors and speeds up reads from the texture memory space.
Local memory is in the scope of each thread. It is allocated by the compiler from global memory but is logically treated as an independent unit.

CUDA - Memory Units Description Registers:
o Fastest.
o Only accessible by a thread.
o Lifetime of a thread.
Shared memory:
o Can be as fast as registers if there are no bank conflicts or when reading from the same address.
o Accessible by any thread within the block where it was created.
o Lifetime of a block.

CUDA - Memory Units Description (continued) Global memory:
o Up to 150x slower than registers or shared memory.
o Accessible from either host or device.
o Lifetime of the application.
Local memory:
o Resides in global memory; can be 150x slower than registers and shared memory.
o Accessible only by a thread.
o Lifetime of a thread.
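As an added illustration of these scopes (not part of the original example), a kernel can stage data in fast per-block shared memory before operating on it. The tile size of 4 below assumes the block size of 4 used in Example 2:

    __global__ void incrementViaSharedMemory(float* a, int size)
    {
        //one shared slot per thread in the block; block-wide scope and lifetime
        //(the size 4 assumes nBlockSize = 4 from Example 2)
        __shared__ float tile[4];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if(idx < size)
        {
            //stage the element from slow global memory into fast shared memory
            tile[threadIdx.x] = a[idx];
        }

        //wait until every thread in the block has finished its load
        __syncthreads();

        if(idx < size)
        {
            //operate on the shared copy and write the result back to global memory
            a[idx] = tile[threadIdx.x] + 1;
        }
    }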

CUDA - Uses CUDA has provided benefits for many applications. Here is a list of some:
o Seismic database - 66x to 100x speedup.
o Molecular dynamics - 21x to 100x speedup.
o MRI processing - 245x to 415x speedup.
o Atmospheric cloud simulation - 50x speedup.

CUDA - Resources & References
o CUDA, Supercomputing for the Masses by Rob Farber, Dr. Dobb's.
o CUDA, Wikipedia.
o CUDA for developers, NVIDIA.
o CUDA manual and binary downloads, NVIDIA.