CUDA-3
3/12/2013, Computer Engg, IIT(BHU)
GPGPU
● General-purpose computation using the GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
● Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
GPGPU Constraints
● Dealing with the graphics API
  – Working with the corner cases of the graphics API
● Addressing modes
  – Limited texture size/dimension
● Shader capabilities
  – Limited outputs
● Instruction sets
  – Lack of integer & bit ops
● Communication limited
  – Between pixels
  – Scatter: a[i] = p
CUDA
● General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
● Targeted software stack
  – Compute-oriented drivers, language, and tools
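"Kicking off a batch of threads" can be sketched with a minimal kernel launch; the kernel name, array size, and block size below are illustrative, not from the slides:

```cuda
// Hypothetical sketch: launching a batch of GPU threads with CUDA.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;                          // each thread handles one element
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Kick off a batch of threads: enough 256-thread blocks to cover n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```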
CUDA
● Driver for loading computation programs onto the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: a graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
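Explicit GPU memory management means the programmer allocates device memory and moves data between host and device by hand. A minimal sketch using the CUDA runtime API (array size is illustrative):

```cuda
// Sketch: explicit device allocation, download, and readback.
#include <cuda_runtime.h>

int main(void)
{
    const int n = 256;
    float h_a[256];
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));                              // allocate device memory
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // download to GPU
    // ... run kernels on d_a ...
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // readback to host
    cudaFree(d_a);                                                    // release device memory
    return 0;
}
```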
Parallel Computing on a GPU
● NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
● 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
Parallel Computing on a GPU
● GPU parallelism is doubling every year
● Programming model scales transparently
● Programmable in C with CUDA tools
● Multithreaded SPMD model uses application data parallelism and thread parallelism
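One common idiom behind transparent scaling is the grid-stride loop: the same SPMD kernel stays correct whatever grid size the hardware (or a future GPU) runs it with. A hedged sketch, with illustrative names:

```cuda
// Sketch: a grid-stride loop lets one kernel scale across any number of
// blocks/multiprocessors without changing its code.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    // Each thread strides through the array by the total thread count,
    // so the launch configuration can shrink or grow with the hardware.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}
```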
CPU vs GPU
● GPU baseline speedup is approximately 60x
● For 500,000 particles, that is a reduction in calculation time from 33 minutes to 33 seconds!
Conclusion
● Even without optimization, we already get an impressive speedup with CUDA
● The N² (all-pairs) algorithm is “made” for CUDA
● Optimization tradeoffs are hard to predict in advance
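Why an N² all-pairs algorithm maps so naturally onto CUDA: one thread per particle, each looping over all N others, with no inter-thread communication. The interaction term below is a placeholder, not the course's force model:

```cuda
// Sketch: naive all-pairs (N^2) kernel — one thread per particle.
__global__ void all_pairs(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int j = 0; j < n; ++j)   // every thread independently reads all N positions
        if (j != i)
            acc += x[j] - x[i];   // placeholder pairwise interaction
    out[i] = acc;
}
```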
Conclusion
● There are ways to dynamically distribute workloads across a fixed number of blocks
● Biggest problem: how to handle a dynamic number of results in global memory
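One common way to handle a dynamic number of results in global memory is to reserve output slots atomically with `atomicAdd` on a shared counter; the slides do not prescribe this, and the filter condition below is a placeholder:

```cuda
// Sketch: threads producing a dynamic number of results append them to a
// global buffer by atomically reserving unique output indices.
__global__ void compact(const float *in, float *out, int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {          // placeholder filter condition
        int slot = atomicAdd(count, 1);   // reserve a unique output slot
        out[slot] = in[i];                // write result without collisions
    }
}
```

Note that the output order is nondeterministic, since slot assignment depends on thread scheduling.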
Uses
● CUDA has provided benefits for many applications; here are some examples:
  – Seismic database: 66x to 100x speedup
  – Molecular dynamics: 21x to 100x speedup
  – MRI processing: 245x to 415x speedup
  – Atmospheric cloud simulation: 50x speedup
References
● “CUDA, Supercomputing for the Masses” by Rob Farber
● CUDA, Wikipedia
● CUDA for developers, NVIDIA
● CUDA manual and binaries (download), NVIDIA