GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond, GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Overview
- Dynamic Parallelism (CUDA 5+)
- GPU Object Linking (CUDA 5+)
- Unified Memory (CUDA 6+)
- Other Developer Tools
Dynamic Parallelism
- Before CUDA 5, kernels could only be launched from the host, which limited the ability to implement recursive algorithms.
- Dynamic Parallelism allows kernels to be launched from the device.
- Benefits: improved load balancing and deep recursion.
(Slide diagram: the CPU launches Kernel A on the GPU, which in turn launches Kernels B, C and D.)
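As a minimal sketch of a device-side launch (kernel names are illustrative; dynamic parallelism needs compute capability 3.5+ and compilation with -rdc=true):

```cuda
#include <cstdio>

// Child kernel, launched from the device rather than the host
__global__ void child(int depth) {
    printf("child at depth %d\n", depth);
}

// Parent kernel: device code can itself launch further kernels,
// enabling recursion entirely on the GPU
__global__ void parent(int depth) {
    if (depth < 3) {
        child<<<1, 1>>>(depth);       // device-side launch
        parent<<<1, 1>>>(depth + 1);  // deep recursion without the CPU
    }
}

int main() {
    parent<<<1, 1>>>(0);       // the host only launches the root kernel
    cudaDeviceSynchronize();   // wait for the whole launch tree to finish
    return 0;
}
```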
An Example

//Host Code
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code
__global__ void vectorAdd(float *data) {
    do_stuff(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
GPU Object Linking
- CUDA 4 required a single source file per kernel: compiled device code could not be linked.
- CUDA 5.0+ allows separately compiled device object files to be linked together, so kernels and host code can be built independently.
(Slide diagram: a.cu, b.cu and c.cu compile to a.o, b.o and c.o, which link with Main.cpp into Program.exe.)
GPU Object Linking (continued)
- Device objects can also be built into static libraries, shared by different sources.
- Benefits: much better code reuse, reduced compilation time, and the ability to ship closed-source device libraries.
(Slide diagram: foo.cu, bar.cu, ... build into ab.culib, which links with Main.cpp plus a.o and b.o into Program.exe, and with Main2.cpp into Program2.exe.)
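The separate-compilation workflow above can be sketched with nvcc (file names and the architecture flag are illustrative):

```shell
# Compile each .cu file to an object with relocatable device code (-dc)
nvcc -arch=sm_35 -dc a.cu -o a.o
nvcc -arch=sm_35 -dc b.cu -o b.o

# Link device and host objects into the final executable
nvcc -arch=sm_35 a.o b.o main.cpp -o program

# Alternatively, archive the objects into a static library for reuse
nvcc -arch=sm_35 -lib a.o b.o -o libab.a
```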
Unified Memory
- Traditional developer view: the GPU and CPU have separate memories, memory must be explicitly copied between them, and complex data structures require deep copies.
- Unified Memory (CUDA 6+) changes that view: a single pointer makes the data accessible anywhere, making code much simpler to port.
(Slide diagram: separate System Memory and GPU Memory are replaced by a single Unified Memory visible to both CPU and GPU.)
Unified Memory Example

CPU-only version:

void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

Unified Memory version:

void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
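A self-contained sketch of the same idea, assuming a CUDA 6+ device (the kernel and sizes are illustrative): one managed allocation is written by the CPU, updated by the GPU, and read back by the CPU through the same pointer, with no explicit copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel: increment every element of a managed array
__global__ void inc(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int)); // single pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = i;   // CPU writes directly
    inc<<<(n + 127) / 128, 128>>>(data, n);    // GPU reads and writes the same pointer
    cudaDeviceSynchronize();                   // synchronise before the CPU touches data again
    printf("data[0] = %d\n", data[0]);
    cudaFree(data);                            // managed memory is freed with cudaFree
    return 0;
}
```

Note that managed memory allocated with cudaMallocManaged must be released with cudaFree, not free.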
Other Developer Tools
- XT and drop-in libraries: cuFFT and cuBLAS optimised for multiple GPUs (on the same node).
- GPUDirect: direct transfer between GPUs, cutting out the host; to support direct transfer via InfiniBand (over a network).
- Remote development using Nsight Eclipse Edition.
- Enhanced Visual Profiler.
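Direct GPU-to-GPU transfer within a node can be sketched with the peer-to-peer runtime API (a simplified illustration assuming two peer-capable devices; error checking omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // let device 0 access device 1

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);  // let device 1 access device 0

    // Copy directly between the two GPUs, bypassing host memory
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```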