1
CUDA - 101 Basics
2
Overview
– What is CUDA?
– Data Parallelism
– Host-Device model
– Thread execution
– Matrix multiplication
3
GPU revisited!
4
What is CUDA?
– Compute Unified Device Architecture
– Programming interface to the GPU
– Supports C/C++ and Fortran natively; third-party wrappers for Python, Java, MATLAB, etc.
– Various libraries available: cuBLAS, cuFFT and many more
– https://developer.nvidia.com/gpu-accelerated-libraries
5
CUDA computing stack
9
Data Parallel programming
– Inputs i1, i2, i3, …, iN are each processed by the same kernel, producing outputs o1, o2, o3, …, oN
10
Data parallel algorithm
– Example: dot product C = A · B
– Each pair (Ai, Bi) is multiplied by the same kernel; the partial products C1, C2, C3, …, CN are then added to form the result
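A minimal sketch of what such a kernel could look like in CUDA C; the kernel name and the one-thread-per-element, single-block layout are illustrative assumptions:

```
// Each thread multiplies one pair (A[i], B[i]); the partial products
// would then be summed (e.g. by a reduction) to finish the dot product.
__global__ void pairwiseMul(const float *A, const float *B, float *C, int N)
{
    int i = threadIdx.x;      // one thread per element, single block assumed
    if (i < N)
        C[i] = A[i] * B[i];
}
```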
11
Host-Device model
– The CPU acts as the Host, the GPU as the Device
12
Threads
– A thread is an instance of the kernel program
– Threads are independent in a data-parallel model and can execute on different cores
– The host tells the device to run a kernel program, and how many threads to launch (as sketched below)
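A minimal sketch of such a launch, assuming d_A, d_B, d_C are device pointers that have already been set up and pairwiseMul is the illustrative kernel above:

```
// Launch one block of N threads; each thread executes one instance of the kernel.
pairwiseMul<<<1, N>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();   // wait until the device has finished
```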
13
Matrix-Multiplication
14
CPU-only Matrix Multiplication
– Execute this code for all elements of P (see the C sketch below)
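For reference, a plain C sketch of that loop nest; the matrix names M, N, P and the flat row-major layout follow the deck, while the function name is illustrative:

```
// CPU-only matrix multiplication: P = M * N, all matrices Width x Width,
// stored as flat 1-D arrays.
void MatrixMulOnHost(const float *M, const float *N, float *P, int Width)
{
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;   // executed once per element of P
        }
    }
}
```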
15
Memory Indexing in C (and CUDA)
– A 2-D matrix is stored as a flat 1-D array: M(i, j) = M[i + j * width], where i is the column index and j is the row index
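A tiny illustrative helper that applies the same mapping (the function name is an assumption, not from the deck):

```
// Read element M(i, j) from a width x width matrix stored as a flat array;
// i is the column index, j is the row index.
__host__ __device__ float getElement(const float *M, int i, int j, int width)
{
    return M[i + j * width];
}
```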
16
CUDA version - I
17
CUDA program flow
– Allocate input and output memory on the host, and the same on the device
– Transfer input data from host -> device
– Launch the kernel on the device
– Transfer output data from device -> host
18
Allocating Device memory
– The host tells the device when to allocate and free device memory
– Functions for the host program: cudaMalloc(memory reference, size) and cudaFree(memory reference)
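A short sketch of those calls; the pointer name and the size are illustrative:

```
float *d_M = NULL;                          // device pointer, lives in GPU memory
size_t size = Width * Width * sizeof(float);

cudaMalloc((void **)&d_M, size);            // allocate on the device
/* ... use d_M ... */
cudaFree(d_M);                              // release when done
```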
19
Transfer Data to/from device
– Again, the host tells the device when to transfer data
– cudaMemcpy(destination, source, size, direction flag)
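A short sketch, assuming h_M/d_M and h_P/d_P are matching host and device buffers of `size` bytes:

```
// Copy matrix M from host memory to device memory, and later copy the
// result P back; the last argument gives the direction of the copy.
cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);
```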
20
CUDA version - 2
– Allocate matrix M on the device; transfer M from host -> device
– Allocate matrix N on the device; transfer N from host -> device
– Allocate matrix P on the device
– Execute the kernel on the device
– Transfer P from device -> host
– Free the device memories for M, N and P
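Putting the pieces together, a minimal host-side sketch of this flow; the function and kernel names, the single-block launch, and the assumption that Width x Width threads fit in one block are all illustrative:

```
void MatrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int Width)
{
    size_t size = Width * Width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // 1. Allocate M, N and P on the device
    cudaMalloc((void **)&d_M, size);
    cudaMalloc((void **)&d_N, size);
    cudaMalloc((void **)&d_P, size);

    // 2. Transfer the inputs from host to device
    cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

    // 3. Execute the kernel: one block of Width x Width threads
    //    (assumes Width * Width does not exceed the per-block thread limit)
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<1, dimBlock>>>(d_M, d_N, d_P, Width);

    // 4. Transfer the result from device back to host
    cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);

    // 5. Free the device memories
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}
```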
21
Matrix Multiplication Kernel
– The kernel specifies the function to be executed on the device
– Parameters: the device memories and the matrix width
– Each thread computes one element of the output matrix P
– It forms the dot product of a row of M and a column of N and writes it at the corresponding location
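A sketch of such a kernel, assuming a single thread block so that threadIdx alone identifies which element of P a thread owns:

```
__global__ void MatrixMulKernel(const float *M, const float *N, float *P, int Width)
{
    // Each thread computes one element of P: the dot product of one
    // row of M with one column of N.
    int col = threadIdx.x;
    int row = threadIdx.y;

    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += M[row * Width + k] * N[k * Width + col];

    P[row * Width + col] = sum;   // write the dot product at this thread's location
}
```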
22
Extensions : Function qualifiers
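A small illustrative sketch of the three CUDA function-type qualifiers the slide refers to:

```
__device__ float helper(float x)        // runs on the device, called from device code
{
    return 2.0f * x;
}

__global__ void myKernel(float *data)   // runs on the device, launched from the host
{
    data[threadIdx.x] = helper(data[threadIdx.x]);
}

__host__ float onCpu(float x)           // ordinary host function (the default)
{
    return 2.0f * x;
}
```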
23
Extensions : Thread indexing
– All threads execute the same code, but each needs to work on separate data in memory
– threadIdx.x & threadIdx.y: these variables automatically receive the corresponding values for each thread
24
Thread Grid
– Represents the group of all threads to be executed for a particular kernel
– Two-level hierarchy: the grid is composed of blocks, and each block is composed of threads (see the sketch below)
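A hedged sketch of the two-level launch configuration and of how a thread would recover its global position from its block and thread indices; the 16 x 16 block size and the kernel name are illustrative, and with such a multi-block launch the kernel must use blockIdx as shown:

```
// Host side: a grid of blocks, each block a 2-D tile of threads.
dim3 dimBlock(16, 16);                    // 16 x 16 threads per block
dim3 dimGrid(Width / 16, Width / 16);     // enough blocks to cover the matrix
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

// Device side (inside the kernel): combine block and thread indices
// to find the element this thread owns.
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
```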
25
Thread Grid
– Threads are indexed from (0, 0) up to (width - 1, width - 1)
26
Conclusion
– Sample code and tutorials
– CUDA nodes?
– Programming guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– SDK: https://developer.nvidia.com/cuda-downloads
– Available for Windows, Mac and Linux, with many sample programs
27
QUESTIONS?