CUDA Introduction
Martin Kruliš (v1.1)
GPU Programming Revision
GPU Device
A separate device connected via PCIe, quite independent of the host system
But all high-level commands are issued from the host
Tailored for data parallelism
Originally designed to process many vertices/fragments with the same piece of code (a shader)
Requires specific code optimizations and fine-tuning
A wrong approach to work decomposition or suboptimal memory access patterns can degrade performance significantly
CUDA
Compute Unified Device Architecture
NVIDIA parallel computing platform, implemented solely for GPUs
First API released in 2007
Used in various libraries: cuFFT, cuBLAS, …
Many additional features: OpenGL/DirectX interoperability, computing cluster support, integrated compiler, …
Alternatives: OpenCL, AMD Stream SDK, C++ AMP
CUDA vs. OpenCL

CUDA                                    OpenCL
GPU-specific platform (by NVIDIA)       Generic platform (by Khronos)
Faster changes                          Slower changes
Limited to NVIDIA hardware only         Supported by various vendors and devices
Host and device code together           Device code is compiled at runtime
Easier to tune for peak performance     Easier for writing portable applications
Terminology
Some differences between CUDA and OpenCL terminology:

CUDA                        OpenCL
GPU                         Device
Multiprocessor              Compute Unit
CUDA Core                   Processing Element
Thread                      Work Item
Thread Block                Work Group
Shared Memory               Local Memory
Local Memory, Registers     Private Memory
Device Detection

int deviceCount;
cudaGetDeviceCount(&deviceCount);
...
cudaSetDevice(deviceIdx);

The device index is taken from the range [0, deviceCount-1].

Querying Device Information

cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, deviceIdx);
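For illustration, a minimal host program (not from the original slides) that enumerates all devices and selects the first one; error checking is omitted for brevity:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // print a short summary of each detected device
        printf("#%d: %s (CC %d.%d, %d multiprocessors)\n",
            i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    cudaSetDevice(0);    // select device 0 for all subsequent CUDA calls
    return 0;
}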
Device Features
Compute Capability
A prescribed set of technologies and constants that a GPU device must implement
Defined incrementally; architecture dependent
CC of known architectures:
1.0, 1.3 – Tesla
2.0, 2.1 – Fermi (Tesla M2090 – CC 2.0)
3.x – Kepler (Tesla K20m – CC 3.5)
5.x – Maxwell (GTX 980 – CC 5.2)
6.x – Pascal (most recent), 7.x – Volta (planned)
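As a sketch of how compute capability is typically used in host code, the following hypothetical helper (findDevice is not part of the CUDA API) returns the first device meeting a minimal CC requirement:

#include <cuda_runtime.h>

// Returns the index of the first device with CC >= requiredMajor.requiredMinor,
// or -1 if no such device exists.
int findDevice(int requiredMajor, int requiredMinor)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.major > requiredMajor ||
            (prop.major == requiredMajor && prop.minor >= requiredMinor))
            return i;
    }
    return -1;
}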
Memory Allocation
Device Memory Allocation
C-like allocation system; the programmer must distinguish host and GPU pointers!

float *vec;
cudaMalloc((void**)&vec, count * sizeof(float));
cudaFree(vec);

Host-Device Data Transfers
Explicit blocking functions:

cudaMemcpy(vec, localVec, count * sizeof(float), cudaMemcpyHostToDevice);

Note that C++ bindings (with type control) also exist.
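Since all these functions return a cudaError_t, a common pattern is to wrap them in a checking macro. A minimal sketch (the CHECK macro is illustrative, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(1); \
        } \
    } while (0)

void process(const float *localVec, size_t count)
{
    float *vec = nullptr;
    CHECK(cudaMalloc((void**)&vec, count * sizeof(float)));
    CHECK(cudaMemcpy(vec, localVec, count * sizeof(float), cudaMemcpyHostToDevice));
    // ... launch kernels operating on vec ...
    CHECK(cudaFree(vec));
}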
Kernel Execution
Kernel
Special function declarations:

__device__ void foo(…) { … }
__global__ void bar(…) { … }

Kernel Execution

bar<<<Dg, Db [, Ns [, S]]>>>(args);

Dg – grid dimensions (number and layout of spawned blocks)
Db – block dimensions (number and layout of threads per block)
Ns – bytes of dynamically allocated shared memory per block
S – stream in which the kernel is enqueued (cudaStream_t)
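For illustration, a few possible launch configurations for the bar() kernel above (a sketch; args stands for the actual kernel arguments):

dim3 blocks(16, 16);       // Dg: a 16x16 grid of blocks
dim3 threads(8, 8, 4);     // Db: 8x8x4 threads in each block
bar<<<blocks, threads>>>(args);    // no dynamic shared memory, default stream

bar<<<64, 256, 1024>>>(args);      // 1D launch: 64 blocks of 256 threads,
                                   // 1024 B of shared memory per block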
Working Threads
Grid
Consists of blocks; up to 3 dimensions
Each block
Consists of threads; same dimensionality
Kernel constants
gridDim, blockDim
blockIdx, threadIdx
with .x, .y, .z components
Note that logically, the threads (and blocks) are organized so that the x-coordinate is the closest one. In other words, the linearized index of a thread in a block is idx = x + y*blockDim.x + z*blockDim.x*blockDim.y.
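A small device-side sketch (not from the original slides) that computes the globally linearized index of a thread according to the formula above:

__global__ void touch(float *data)
{
    // linearized index of the thread within its block (x is closest)
    int inBlock = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;

    // linearized index of the block within the grid (same ordering)
    int block = blockIdx.x
        + blockIdx.y * gridDim.x
        + blockIdx.z * gridDim.x * gridDim.y;

    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int global = block * threadsPerBlock + inBlock;
    data[global] = 0.0f;    // each thread touches exactly one element
}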
Code Example

__global__ void vec_mul(float *X, float *Y, float *res)
{
    // one thread computes one element of the result
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    res[idx] = X[idx] * Y[idx];
}

...
float *X, *Y, *res, *cuX, *cuY, *cuRes;
cudaSetDevice(0);
cudaMalloc((void**)&cuX, N * sizeof(float));
cudaMalloc((void**)&cuY, N * sizeof(float));
cudaMalloc((void**)&cuRes, N * sizeof(float));
cudaMemcpy(cuX, X, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(cuY, Y, N * sizeof(float), cudaMemcpyHostToDevice);
vec_mul<<<(N/64), 64>>>(cuX, cuY, cuRes);    // assumes N is divisible by 64
cudaMemcpy(res, cuRes, N * sizeof(float), cudaMemcpyDeviceToHost);
Compilation
The nvcc Compiler
Used for compiling both host and device code; defers compilation of the host code to gcc (Linux) or the Microsoft Visual C++ compiler (Windows)

$> nvcc -cudart static code.cu -o myapp

Can also be used for compilation only:

$> nvcc --compile … kernel.cu -o kernel.obj
$> cc -lcudart kernel.obj main.obj -o myapp

Device code is generated for the target architecture:

$> nvcc -arch sm_13 …
$> nvcc -arch compute_35 …

The first compiles for a real GPU with compute capability 1.3, the second to PTX with compute capability 3.5.
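Device code can also test the targeted architecture at compile time via the predefined __CUDA_ARCH__ macro (defined only during device compilation passes, e.g., 350 for compute capability 3.5). A minimal sketch:

__global__ void archDependent(int *out)
{
#if __CUDA_ARCH__ >= 350
    *out = 35;    // code path compiled for CC 3.5 and newer targets
#else
    *out = 10;    // fallback path for older targets
#endif
}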
NVIDIA Tools
System Management Interface
nvidia-smi (CLI application)
NVML library (C-based API)
Query GPU details:
$> nvidia-smi -q
Set various properties (ECC, compute mode, …):
$> nvidia-smi -pm 1
Sets the drivers to persistent mode (recommended)
NVIDIA Visual Profiler
$> nvvp &
Use X11 SSH forwarding from ubergrafik/knight
Note: for some reason, "nvidia-smi -pm 1" did not work with newer drivers under RHEL 7; the command "nvidia-persistenced --persistence-mode" has to be used instead.
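For completeness, a minimal sketch of the NVML C API mentioned above (typically linked with -lnvidia-ml); it lists the names of all detected GPUs:

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        printf("GPU %u: %s\n", i, name);
    }
    nvmlShutdown();
    return 0;
}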
Discussion