GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond
Overview
Dynamic Parallelism (CUDA 5+)
GPU Object Linking (CUDA 5+)
Unified Memory (CUDA 6+)
Other Developer Tools
Dynamic Parallelism
Before CUDA 5, kernels could only be launched from the host
- Limited ability to implement recursive algorithms
Dynamic Parallelism allows kernels to be launched from the device
- Improved load balancing
- Deep recursion
[Diagram: Kernels A to D, contrasting host-side (CPU) launches with device-side (GPU) launches]
An Example

// Host code
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

// Kernel code
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);      // launch kernel X from within the running kernel
    do_more_stuff(data);
}
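A complete, compilable sketch of device-side launching is given below (the kernel names parent and child and the launch configuration are illustrative, not from the slides). Dynamic Parallelism requires a device of compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true example.cu

__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                        // child grid doubles each element
}

__global__ void parent(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Launch a child kernel from the device - no host involvement
        child<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                // wait for the child grid to complete
    }
}

int main()
{
    const int n = 1024;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    parent<<<1, 1>>>(data, n);                  // a single parent thread launches the child grid
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}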
GPU Object Linking
CUDA 4 required all of the device code for a kernel to live in a single source file
- No linking of compiled device code
CUDA 5.0+ allows separately compiled object files to be linked together
- Kernels and host code can be built independently
[Diagram: a.cu, b.cu and c.cu are compiled to a.o, b.o and c.o, then linked with Main.cpp into Program.exe]
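As a rough illustration of separate compilation (the file and function names here are invented for the sketch), a __device__ function defined in one source file can be called from a kernel compiled in another, provided both are built as relocatable device code:

/* a.cu */
__device__ float scale(float x)
{
    return 2.0f * x;                              // device function compiled into its own object
}

/* b.cu */
extern __device__ float scale(float x);           // defined in a.cu

__global__ void apply_scale(float *data)
{
    data[threadIdx.x] = scale(data[threadIdx.x]); // cross-object device call
}

A typical build (the architecture flag is an assumption) compiles each file to a relocatable object with -dc and lets nvcc perform the device link at the final step:

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 a.o b.o Main.cpp -o Program.exe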
GPU Object Linking
Objects can also be built into static libraries
- Shared by different sources
- Much better code reuse
- Reduced compilation time
- Enables closed-source device libraries
[Diagram: a.cu and b.cu are compiled to a.o and b.o and packaged into ab.culib; the library is then linked with Main.cpp into Program.exe, and with Main2.cpp and further sources (foo.cu, bar.cu, ...) into Program2.exe]
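The library workflow in the diagram might look like the following on the command line (the nvcc flags and exact file names are assumptions about a typical CUDA 5.0+ build, not taken from the slides):

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -lib a.o b.o -o ab.culib
nvcc -arch=sm_20 Main.cpp  ab.culib -o Program.exe
nvcc -arch=sm_20 Main2.cpp ab.culib -o Program2.exe

The same ab.culib is linked into both executables, so the device code is compiled once and reused.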
Unified Memory
The traditional developer view is that the GPU and CPU have separate memories
- Memory must be explicitly copied between them
- Deep copies are required for complex data structures
Unified Memory changes that view
- A single pointer to data, accessible from anywhere
- Simpler porting of existing code
[Diagram: separate System Memory (CPU) and GPU Memory versus a single Unified Memory shared by the CPU and GPU]
Unified Memory Example

// CPU-only code
void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}

// CUDA 6 code with Unified Memory
void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);   // sort runs on the GPU, using the same pointer
    cudaDeviceSynchronize();

    use_data(data);

    cudaFree(data);                        // managed memory is released with cudaFree
}
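For a self-contained version of the same pattern, the sketch below (written for this document, not from the slides) allocates a managed buffer, initialises it on the host, updates it in a kernel and reads it back on the host with no explicit copies:

#include <cstdio>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;                            // GPU writes through the managed pointer
}

int main()
{
    const int n = 256;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; i++)                  // host initialisation, no cudaMemcpy
        data[i] = i;

    increment<<<(n + 127) / 128, 128>>>(data, n);
    cudaDeviceSynchronize();                     // ensure the GPU has finished before the CPU reads

    printf("data[0]=%d, data[%d]=%d\n", data[0], n - 1, data[n - 1]);
    cudaFree(data);
    return 0;
}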
Other Developer Tools
XT and Drop-in libraries
- cuFFT and cuBLAS optimised for multiple GPUs (on the same node)
GPUDirect
- Direct transfers between GPUs (cutting out the host)
- Support for direct transfer over InfiniBand (across a network)
Developer Tools
- Remote development using Nsight Eclipse Edition
- Enhanced Visual Profiler
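As an illustration of the drop-in/XT idea, the sketch below spreads a single SGEMM across the GPUs in one node with cuBLAS-XT. The function names (cublasXtCreate, cublasXtDeviceSelect, cublasXtSgemm) are from the cublasXt API, but the exact argument types and the two-GPU device list are assumptions to be checked against the cuBLAS documentation:

#include <cublasXt.h>

void multiply(const float *A, const float *B, float *C, size_t n)
{
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[2] = {0, 1};                     // assume the first two GPUs in the node
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // A, B and C are ordinary host pointers; the library tiles the work across
    // the selected GPUs and stages the data transfers itself.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
}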