Algorithm Engineering „GPGPU“ Stefan Edelkamp
Graphics Processing Units: GPGPU = (GP)²U, General-Purpose Programming on the GPU. "Parallelism for the masses". Applications: Fourier transform, model checking, bioinformatics; see CUDA Zone.
Programming the Graphics Processing Unit with CUDA
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; GPGPU languages; CUDA; Small example
Cluster / Multicore / GPU: Cluster system. Many independent machines, each with one or more processors, internal memory, and often an HDD; communication over the network, which is slow compared to internal communication; no shared memory. [Diagram: three nodes, each with CPU, RAM, and HDD, connected by a switch]
Cluster / Multicore / GPU: Multicore systems. Multiple CPUs; shared RAM; external memory on HDD; communication over RAM. [Diagram: CPU1 to CPU4 attached to one RAM and one HDD]
Cluster / Multicore / GPU: System with a Graphics Processing Unit. Many (240) parallel processing units; hierarchical memory structure: RAM, video RAM (VRAM), shared RAM (SRAM); communication over the PCI bus. [Diagram: CPU with RAM and hard disk drive, connected to a graphics card holding GPU, SRAM, and VRAM]
Computing on the GPU: Hierarchical execution. Groups are executed sequentially; threads within a group are executed in parallel; threads are lightweight (creation and switching are nearly free); one kernel function is executed by each thread.
Computing on the GPU: Hierarchical memory. Video RAM: 1 GB, comparable to RAM. Shared RAM in the GPU: 16 KB, comparable to registers, parallel access by threads. [Diagram: graphics card with GPU, SRAM, and video RAM]
Example architecture: G200, e.g. in the GTX 280
Example problems
Ranking and unranking with parity
2-bit BFS
1-bit BFS
Sliding-tile puzzle
Some results…
Further results…
GPGPU languages. RapidMind: supports multicore, ATI, NVIDIA, and Cell; C++ is analysed and compiled for the target hardware. Accelerator (Microsoft): library for .NET languages. BrookGPU (Stanford University): supports ATI and NVIDIA; own language, a variant of ANSI C.
CUDA: Programming language similar to C; file suffix .cu; own compiler called nvcc; can be linked with C code.
CUDA: Build flow. C++ code is compiled with GCC and CUDA code with nvcc; both are linked with ld into one executable.
CUDA: Additional variable types: dim3, int3, char3
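The vector types above can be tried out on the host without launching a kernel. The following is a minimal sketch; the helpers `total_threads` and `add3` are hypothetical names introduced only for illustration, while `dim3`, `int3`, and `make_int3` come from the CUDA runtime headers.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of threads described by a launch configuration.
unsigned int total_threads(dim3 grid, dim3 block) {
    return grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

// Hypothetical helper: component-wise sum of two int3 values.
int3 add3(int3 p, int3 q) {
    return make_int3(p.x + q.x, p.y + q.y, p.z + q.z);
}
```

Components that are omitted when constructing a `dim3` default to 1, so `dim3 grid(4, 2)` describes a 4 x 2 x 1 grid. A dimension of 0 would describe an empty configuration and is rejected at launch.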
CUDA: Different types of functions: __global__ is invoked from the host; __device__ is called from the device. Different types of variables: __device__ is located in VRAM; __shared__ is located in SRAM.
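To show how these qualifiers fit together, here is a minimal sketch that reverses a small array inside one group using shared memory. The names `BLOCK`, `mirror`, `reverse_block`, and `reverse_on_gpu` are assumptions made for this example, and the array is assumed to fit into a single group of at most `BLOCK` threads.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // assumed group size for this sketch

// __device__ function: callable only from code running on the GPU.
__device__ int mirror(int i, int n) { return n - 1 - i; }

// __global__ function (kernel): invoked from the host, run by every thread.
__global__ void reverse_block(int *data, int n) {
    __shared__ int tile[BLOCK];        // lives in SRAM, shared by the group
    int t = threadIdx.x;
    if (t < n) tile[t] = data[t];      // stage the input in shared memory
    __syncthreads();                   // wait until the whole tile is loaded
    if (t < n) data[t] = tile[mirror(t, n)];
}

// Host wrapper: copies the array to VRAM, reverses it, and copies it back.
bool reverse_on_gpu(int *host, int n) {
    if (n <= 0 || n > BLOCK) return false;
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverse_block<<<1, BLOCK>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The `__syncthreads()` barrier matters here: without it a thread could read a `tile` entry that another thread of the same group has not written yet.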
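CUDA: Calling the kernel function: name<<<dimGrid, dimBlock>>>(...). The first launch parameter gives the grid dimensions (number of groups), the second gives the block dimensions (number of threads per group).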
CUDA: Memory handling: cudaMalloc(...) allocates VRAM; cudaMemcpy(...) copies memory between host and device; cudaFree(...) frees VRAM.
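The three calls are typically used as a triple. The sketch below copies an array to VRAM and straight back, checking each return code; the function name `round_trip` is hypothetical and the array is assumed to be non-empty.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Copies n ints from src to the device and back into dst.
// Returns false if any CUDA call fails.
bool round_trip(const int *src, int *dst, size_t n) {
    if (n == 0) return false;
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);    // sizes are given in bytes, not elements
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    bool ok =
        cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dst, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);                     // release the VRAM in every case
    return ok;
}
```

Note that the last argument of `cudaMemcpy` names the copy direction, and that all sizes are byte counts, which is an easy detail to get wrong when porting a loop over `N` elements.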
CUDA: Distinguishing threads: blockDim is the number of threads per group; gridDim is the number of groups; blockIdx is the ID of the group (starting with 0); threadIdx is the ID of the thread within its group (starting with 0). Global ID: id = blockDim.x*blockIdx.x+threadIdx.x
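As a quick check of the ID formula, the following sketch lets every thread store its own global ID. The names `write_ids` and `ids_on_gpu` are made up for this example; the grid size is rounded up so that arrays whose length is not a multiple of the group size are still fully covered.

```cuda
#include <cuda_runtime.h>

// Each thread computes its global ID and writes it to its own slot.
__global__ void write_ids(int *out, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) out[id] = id;          // surplus threads of the last group do nothing
}

// Host wrapper: launches enough groups of block_size threads to cover n slots.
bool ids_on_gpu(int *host_out, int n, int block_size) {
    if (n <= 0 || block_size <= 0) return false;
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    dim3 dimBlock(block_size);                          // threads per group
    dim3 dimGrid((n + block_size - 1) / block_size);    // groups, rounded up
    write_ids<<<dimGrid, dimBlock>>>(dev, n);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host_out, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

With `n = 10` and `block_size = 4`, three groups (12 threads) are launched and the `id < n` guard keeps the two surplus threads from writing past the array.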
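CUDA: Small example
Sequential C version:
void inc(int *a, int b, int N) {
    for (int i = 0; i < N; i++) a[i] = a[i] + b;
}
int main() { ... inc(a, b, N); }
CUDA version:
// Each thread increments one element; the guard skips surplus threads.
__global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N) a[id] = a[id] + b;
}
int main() {
    ...
    int *a_d;
    cudaMalloc((void **)&a_d, N * sizeof(int));                   // allocate VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy input to the device
    dim3 dimBlock(blocksize, 1, 1);                               // unused dimensions must be 1, not 0
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);          // round up so all N elements are covered
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
}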
Real-world example: LTL model checking. Traversing an implicit graph G=(V,E); vertices are called states; edges are represented by transitions; duplicate removal is needed.
Real-world example: External model checking. The graph is generated with external BFS; each BFS layer needs to be sorted; GPUs have proven to be fast at sorting.
Real-world example: Challenges. Millions of states in one layer; huge state size; fast access only in SRAM; elements need to be moved.
Real-world example: Existing solutions. GPUQSort: quicksort optimized for GPUs; intensive swapping in VRAM. Bitonic-based sorting: fast for subgroups; concatenating groups is slow.
Real-world example: Our solution. States S are presorted by their hash H(S); each bucket is sorted in SRAM by one group. [Diagram: VRAM, SRAM]
Real-world example: Our solution. The order is given by the pair (H(S), S).
Real-world example: Results
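The ordering by (H(S), S) can be illustrated with a small host-side sketch. It is not the GPU bucket sort from the slides: states are assumed to fit into a 64-bit integer, `H` is a hypothetical multiplicative hash chosen only for the example, and `std::sort` stands in for the actual sorting step. The comparator is marked `__host__ __device__` so that it could also be reused inside a kernel.

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical hash function; the slides do not specify H.
__host__ __device__ inline uint32_t H(uint64_t s) {
    return (uint32_t)((s * 0x9E3779B97F4A7C15ull) >> 32);
}

// Lexicographic order on the pair (H(S), S): the hash decides first,
// the state itself breaks ties, so the order is total.
struct ByHashThenState {
    __host__ __device__ bool operator()(uint64_t a, uint64_t b) const {
        uint32_t ha = H(a), hb = H(b);
        return ha < hb || (ha == hb && a < b);
    }
};

// Sorts one BFS layer and removes duplicates, which are adjacent after sorting.
std::vector<uint64_t> sort_and_dedup(std::vector<uint64_t> states) {
    std::sort(states.begin(), states.end(), ByHashThenState());
    states.erase(std::unique(states.begin(), states.end()), states.end());
    return states;
}
```

Because the state itself is the tie-breaker, equal states always end up next to each other, which is what makes duplicate removal a single linear pass after sorting.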
Questions??? Programming the GPU