1
CUDA - 101 Basics
2
Overview
– What is CUDA?
– Data Parallelism
– Host-Device model
– Thread execution
– Matrix multiplication
3
GPU revisited!
4
What is CUDA?
– Compute Unified Device Architecture
– Programming interface to the GPU
– Supports C/C++ and Fortran natively; third-party wrappers for Python, Java, MATLAB, etc.
– Various libraries available: cuBLAS, cuFFT and many more
– https://developer.nvidia.com/gpu-accelerated-libraries
5
CUDA computing stack
9
Data Parallel programming
– Inputs i1, i2, i3, …, iN are each processed by the same kernel, producing outputs o1, o2, o3, …, oN
10
Data parallel algorithm
– Example: dot product C = A · B
– Each pair (Ai, Bi) is multiplied by the same kernel; the partial products C1, C2, C3, …, CN are then added to form the result
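A minimal sketch of what such a kernel could look like in CUDA C; the kernel name and the one-thread-per-element, single-block layout are illustrative assumptions:

```
// Each thread multiplies one pair (A[i], B[i]); the partial products
// would then be summed (e.g. by a reduction) to finish the dot product.
__global__ void pairwiseMul(const float *A, const float *B, float *C, int N)
{
    int i = threadIdx.x;      // one thread per element, single block assumed
    if (i < N)
        C[i] = A[i] * B[i];
}
```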
11
Host-Device model
– The CPU acts as the Host, the GPU as the Device
12
Threads
– A thread is an instance of the kernel program
– Threads are independent in a data-parallel model and can execute on different cores
– The host tells the device to run a kernel program, and how many threads to launch (as sketched below)
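A minimal sketch of such a launch, assuming d_A, d_B, d_C are device pointers that have already been set up and pairwiseMul is the illustrative kernel above:

```
// Launch one block of N threads; each thread executes one instance of the kernel.
pairwiseMul<<<1, N>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();   // wait until the device has finished
```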
13
Matrix-Multiplication
14
CPU-only Matrix Multiplication
– Execute this code for all elements of P (see the C sketch below)
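For reference, a plain C sketch of that loop nest; the matrix names M, N, P and the flat row-major layout follow the deck, while the function name is illustrative:

```
// CPU-only matrix multiplication: P = M * N, all matrices Width x Width,
// stored as flat 1-D arrays.
void MatrixMulOnHost(const float *M, const float *N, float *P, int Width)
{
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;   // executed once per element of P
        }
    }
}
```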
15
Memory Indexing in C (and CUDA)
– A 2-D matrix is stored as a flat 1-D array: M(i, j) = M[i + j * width], where i is the column index and j is the row index
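A tiny illustrative helper that applies the same mapping (the function name is an assumption, not from the deck):

```
// Read element M(i, j) from a width x width matrix stored as a flat array;
// i is the column index, j is the row index.
__host__ __device__ float getElement(const float *M, int i, int j, int width)
{
    return M[i + j * width];
}
```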
16
CUDA version - I
17
CUDA program flow
– Allocate input and output memory on the host, and the same on the device
– Transfer input data from host -> device
– Launch the kernel on the device
– Transfer output data from device -> host
18
Allocating Device memory
– The host tells the device when to allocate and free device memory
– Functions for the host program: cudaMalloc(memory reference, size) and cudaFree(memory reference)
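A short sketch of those calls; the pointer name and the size are illustrative:

```
float *d_M = NULL;                          // device pointer, lives in GPU memory
size_t size = Width * Width * sizeof(float);

cudaMalloc((void **)&d_M, size);            // allocate on the device
/* ... use d_M ... */
cudaFree(d_M);                              // release when done
```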
19
Transfer Data to/from device
– Again, the host tells the device when to transfer data
– cudaMemcpy(destination, source, size, direction flag)
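A short sketch, assuming h_M/d_M and h_P/d_P are matching host and device buffers of `size` bytes:

```
// Copy matrix M from host memory to device memory, and later copy the
// result P back; the last argument gives the direction of the copy.
cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);
```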
20
CUDA version - 2
– Allocate matrix M on the device; transfer M from host -> device
– Allocate matrix N on the device; transfer N from host -> device
– Allocate matrix P on the device
– Execute the kernel on the device
– Transfer P from device -> host
– Free the device memories for M, N and P
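Putting the pieces together, a minimal host-side sketch of this flow; the function and kernel names, the single-block launch, and the assumption that Width x Width threads fit in one block are all illustrative:

```
void MatrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int Width)
{
    size_t size = Width * Width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // 1. Allocate M, N and P on the device
    cudaMalloc((void **)&d_M, size);
    cudaMalloc((void **)&d_N, size);
    cudaMalloc((void **)&d_P, size);

    // 2. Transfer the inputs from host to device
    cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

    // 3. Execute the kernel: one block of Width x Width threads
    //    (assumes Width * Width does not exceed the per-block thread limit)
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<1, dimBlock>>>(d_M, d_N, d_P, Width);

    // 4. Transfer the result from device back to host
    cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);

    // 5. Free the device memories
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}
```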
21
Matrix Multiplication Kernel
– The kernel specifies the function to be executed on the device
– Parameters: the device memories and the matrix width
– Each thread computes one element of the output matrix P
– It forms the dot product of a row of M and a column of N and writes it at the corresponding location
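A sketch of such a kernel, assuming a single thread block so that threadIdx alone identifies which element of P a thread owns:

```
__global__ void MatrixMulKernel(const float *M, const float *N, float *P, int Width)
{
    // Each thread computes one element of P: the dot product of one
    // row of M with one column of N.
    int col = threadIdx.x;
    int row = threadIdx.y;

    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += M[row * Width + k] * N[k * Width + col];

    P[row * Width + col] = sum;   // write the dot product at this thread's location
}
```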
22
Extensions : Function qualifiers
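A small illustrative sketch of the three CUDA function-type qualifiers the slide refers to:

```
__device__ float helper(float x)        // runs on the device, called from device code
{
    return 2.0f * x;
}

__global__ void myKernel(float *data)   // runs on the device, launched from the host
{
    data[threadIdx.x] = helper(data[threadIdx.x]);
}

__host__ float onCpu(float x)           // ordinary host function (the default)
{
    return 2.0f * x;
}
```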
23
Extensions : Thread indexing
– All threads execute the same code, but each needs to work on separate data in memory
– threadIdx.x & threadIdx.y: these variables automatically receive the corresponding values for each thread
24
Thread Grid
– Represents the group of all threads to be executed for a particular kernel
– Two-level hierarchy: the grid is composed of blocks, and each block is composed of threads (see the sketch below)
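A hedged sketch of the two-level launch configuration and of how a thread would recover its global position from its block and thread indices; the 16 x 16 block size and the kernel name are illustrative, and with such a multi-block launch the kernel must use blockIdx as shown:

```
// Host side: a grid of blocks, each block a 2-D tile of threads.
dim3 dimBlock(16, 16);                    // 16 x 16 threads per block
dim3 dimGrid(Width / 16, Width / 16);     // enough blocks to cover the matrix
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

// Device side (inside the kernel): combine block and thread indices
// to find the element this thread owns.
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
```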
25
Thread Grid
– Threads are indexed from (0, 0) up to (width - 1, width - 1)
26
Conclusion
– Sample code and tutorials
– CUDA nodes?
– Programming guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– SDK: https://developer.nvidia.com/cuda-downloads
– Available for Windows, Mac and Linux, with many sample programs
27
QUESTIONS?