Published by Dylan Allen Caldwell. Modified over 9 years ago.
1
Algorithm Engineering „GPGPU“ Stefan Edelkamp
2
Graphics Processing Units. GPGPU = (GP)²U: general-purpose programming on the GPU. „Parallelism for the masses“. Applications: Fourier transform, model checking, bioinformatics; see CUDA Zone.
3
Programming the Graphics Processing Unit with Cuda
4
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; GPGPU languages; CUDA; Small example
6
Cluster / Multicore / GPU. Cluster system: many individual systems, each with one (or more) processors, internal memory and often an HDD; communication over the network, slow compared to internal communication; no shared memory. [Diagram: three nodes, each with CPU, RAM and HDD, connected by a switch.]
7
Cluster / Multicore / GPU. Multicore systems: multiple CPUs; RAM; external memory on HDD; communication over RAM. [Diagram: CPU1 to CPU4 sharing RAM and an HDD.]
8
Cluster / Multicore / GPU. System with a graphics processing unit: many (240) parallel processing units; hierarchical memory structure (RAM, video RAM, shared RAM); communication over the PCI bus. [Diagram: CPU with RAM and hard disk drive, connected to a graphics card with GPU, SRAM and VRAM.]
9
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; GPGPU languages; CUDA; Small example
10
Computing on the GPU: hierarchical execution. Groups are executed sequentially; threads are executed in parallel and are lightweight (creation / switching nearly free); one kernel function is executed by each thread. [Diagram: Group 0.]
11
Computing on the GPU: hierarchical memory. Video RAM: 1 GB, comparable to RAM. Shared RAM in the GPU: 16 KB, comparable to registers, parallel access by threads. [Diagram: graphics card with GPU, SRAM and video RAM.]
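The difference between the two memory levels can be sketched with a per-group sum. This is a minimal illustration, not code from the slides: the names `blockSum`, `partialSums` and `kThreads` are hypothetical, the input is assumed to be non-empty, and error checking is omitted. Each group copies its slice from video RAM into shared RAM once and then reduces it there.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per group (illustrative choice, power of two)

// Each group sums up to kThreads input values in shared RAM (on-chip SRAM).
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int buf[kThreads];              // one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    buf[threadIdx.x] = (id < n) ? in[id] : 0;  // single read from video RAM
    __syncthreads();                           // wait until the whole group has loaded

    // Tree reduction: halve the number of active threads in each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Hypothetical host helper: returns one partial sum per group.
std::vector<int> partialSums(const std::vector<int> &h) {
    int n = (int)h.size();
    int groups = (n + kThreads - 1) / kThreads;
    int *in_d, *out_d;
    cudaMalloc((void **)&in_d, n * sizeof(int));
    cudaMalloc((void **)&out_d, groups * sizeof(int));
    cudaMemcpy(in_d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<groups, kThreads>>>(in_d, out_d, n);
    std::vector<int> res(groups);
    cudaMemcpy(res.data(), out_d, groups * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(out_d);
    return res;
}
```

For 600 ones, `partialSums` returns the three values 256, 256 and 88, one per group.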
12
Example architecture: G200, e.g. in the GTX 280
13
Example problems
14
Ranking and unranking with parity
15
2-Bit BFS
16
1-Bit BFS
17
Sliding-tile puzzle
18
Some Results…
19
Further results …
20
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; GPGPU languages; CUDA; Small example
21
GPGPU languages. RapidMind: supports multicore, ATI, NVIDIA and Cell; C++ analysed and compiled for the target hardware. Accelerator (Microsoft): library for .NET languages. BrookGPU (Stanford University): supports ATI and NVIDIA; own language, a variant of ANSI C.
22
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; Programming languages; CUDA; Small example
23
CUDA: programming language similar to C; file suffix .cu; own compiler called nvcc; can be linked to C.
24
CUDA build process. [Diagram: C++ code is compiled with GCC; CUDA code is compiled with nvcc; both are linked with ld into one executable.]
25
CUDA: additional variable types: dim3, int3, char3
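A short sketch of how these vector types are typically used. The helpers `totalThreads` and `add` are hypothetical names introduced only for illustration. In the CUDA headers the types are spelled in lower case (`dim3`, `int3`, `char3`); components of a `dim3` that are not given default to 1, and `make_int3` is the constructor function provided for `int3`.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads in a launch configuration.
unsigned int totalThreads(dim3 grid, dim3 block) {
    return grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

// Hypothetical helper: component-wise sum of two int3 values.
int3 add(int3 p, int3 q) {
    return make_int3(p.x + q.x, p.y + q.y, p.z + q.z);
}
```

For example, `totalThreads(dim3(4), dim3(8, 2))` evaluates to 64, because the unspecified y and z components are 1.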
26
CUDA: different types of functions: __global__ is invoked from the host; __device__ is called from the device. Different types of variables: __device__ is located in VRAM; __shared__ is located in SRAM.
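The qualifiers can be seen together in one small sketch. All names here (`offset_d`, `square`, `squarePlusOffset`, `run`) are hypothetical and chosen only for illustration; the group size of 128 is an arbitrary assumption and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

__device__ int offset_d = 10;   // __device__ variable: lives in VRAM (global memory)

// __device__ function: callable only from code already running on the GPU.
__device__ int square(int x) { return x * x; }

// __global__ function (kernel): invoked from the host, executed on the device.
__global__ void squarePlusOffset(int *a, int n) {
    __shared__ int tile[128];   // __shared__ variable: lives in SRAM, one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {
        tile[threadIdx.x] = a[id];                    // each thread uses its own slot
        a[id] = square(tile[threadIdx.x]) + offset_d;
    }
}

// Hypothetical host helper that runs the kernel on a copy of the input.
std::vector<int> run(std::vector<int> h) {
    int n = (int)h.size();
    int *a_d;
    cudaMalloc((void **)&a_d, n * sizeof(int));
    cudaMemcpy(a_d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    squarePlusOffset<<<(n + 127) / 128, 128>>>(a_d, n);
    cudaMemcpy(h.data(), a_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    return h;
}
```

Calling `run({1, 2, 3})` yields 11, 14 and 19.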
27
CUDA: calling the kernel function: name<<<dimGrid, dimBlock>>>(...), where dimGrid gives the grid dimensions (groups) and dimBlock gives the block dimensions (threads).
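A kernel launch can be sketched as follows, assuming a hypothetical kernel `fill` and host helper `filled` that do not appear in the slides. Inside the triple angle brackets the grid dimensions (number of groups) come first and the block dimensions (threads per group) second. The group count is rounded up so that every element is covered even when `n` is not a multiple of the group size.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: every thread writes one element.
__global__ void fill(int *a, int value, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) a[id] = value;   // guard: the last group may be only partly used
}

// Hypothetical host helper; returns an empty vector if the launch fails.
std::vector<int> filled(int n, int value, int threadsPerGroup) {
    int *a_d;
    cudaMalloc((void **)&a_d, n * sizeof(int));

    dim3 dimBlock(threadsPerGroup);                            // threads per group
    dim3 dimGrid((n + threadsPerGroup - 1) / threadsPerGroup); // number of groups, rounded up
    fill<<<dimGrid, dimBlock>>>(a_d, value, n);                // grid first, block second

    std::vector<int> h(n);
    if (cudaGetLastError() != cudaSuccess) h.clear();          // e.g. invalid configuration
    else cudaMemcpy(h.data(), a_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    return h;
}
```

With `filled(1000, 7, 128)` the launch uses 8 groups of 128 threads, and all 1000 elements come back as 7.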
28
CUDA: memory handling: cudaMalloc(...) allocates VRAM; cudaMemcpy(...) copies memory; cudaFree(...) frees VRAM.
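The three calls can be sketched as a round trip from host memory to VRAM and back. The helper name `roundTrip` is hypothetical. In the CUDA runtime API the functions are spelled `cudaMalloc`, `cudaMemcpy` and `cudaFree`, sizes are given in bytes, and the copy direction is passed as the last argument of `cudaMemcpy`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copy a host array to VRAM and back again.
std::vector<float> roundTrip(const std::vector<float> &h) {
    size_t bytes = h.size() * sizeof(float);   // sizes are in bytes, not elements
    float *d = nullptr;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess)   // allocate VRAM
        return {};

    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);     // host -> device
    std::vector<float> back(h.size());
    cudaMemcpy(back.data(), d, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d);                                                // release VRAM
    return back;
}
```

The returned vector should equal the input; an empty result signals that the allocation failed.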
29
CUDA: distinguishing threads: blockDim: number of threads per group; gridDim: number of groups; blockIdx: ID of the group (starting with 0); threadIdx: ID of the thread within its group (starting with 0). Global ID: id = blockDim.x * blockIdx.x + threadIdx.x
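The index formula can be checked with a small sketch in which every thread records the group it belongs to. The names `recordGroup` and `groupOfEachElement` are hypothetical. Note that `blockDim.x` is the number of threads per group, so `blockDim.x * blockIdx.x` is the global ID of the first thread in a group.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: element id remembers which group processed it.
__global__ void recordGroup(int *group, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;   // global thread ID
    if (id < n) group[id] = blockIdx.x;
}

// Hypothetical host helper: returns the group index for each of n elements.
std::vector<int> groupOfEachElement(int n, int threadsPerGroup) {
    int *group_d;
    cudaMalloc((void **)&group_d, n * sizeof(int));
    int groups = (n + threadsPerGroup - 1) / threadsPerGroup;
    recordGroup<<<groups, threadsPerGroup>>>(group_d, n);
    std::vector<int> group(n);
    cudaMemcpy(group.data(), group_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(group_d);
    return group;
}
```

For 10 elements and 4 threads per group the result is 0, 0, 0, 0, 1, 1, 1, 1, 2, 2: three groups are launched and the last one uses only two of its threads.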
30
Overview: Cluster / Multicore / GPU comparison; Computing on the GPU; Programming languages; CUDA; Small example
31
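CUDA. CPU version:

void inc(int *a, int b, int N) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}
int main() {
    ...
    inc(a, b, N);
}

GPU version:

__global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)                       // the last group may be only partly used
        a[id] = a[id] + b;
}
int main() {
    ...
    int *a_d;
    cudaMalloc((void **)&a_d, N * sizeof(int));                    // sizes in bytes
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(blocksize, 1, 1);                                // unused dimensions are 1, not 0
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);           // round up
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
}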
32
Real-world example: LTL model checking. Traversing an implicit graph G = (V, E); vertices are called states; edges are represented by transitions; duplicate removal is needed.
33
Real-world example: external model checking. Generate the graph with external BFS; each BFS layer needs to be sorted; GPUs have proven to be fast at sorting.
34
Real-world example: challenges. Millions of states in one layer; huge state size; fast access only in SRAM; elements need to be moved.
35
Real-world example: solutions. Gpuqsort: Qsort optimized for GPUs; intensive swapping in VRAM. Bitonic-based sorting: fast for subgroups; concatenating groups is slow.
36
Real-world example: our solution. States S are presorted by hash H(S); each bucket is sorted in SRAM by a group. [Diagram: VRAM, SRAM.]
37
Real-world example: our solution. Order given by (H(S), S).
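The idea of sorting each bucket in SRAM by one group, with the order given by (H(S), S), can be sketched as below. This is an illustration under strong assumptions and not the authors' implementation: states are simplified to 32-bit values (real states are much larger), hash and state are packed into one 64-bit key, every bucket holds exactly `kBucket` keys, and a bitonic network is used as the in-SRAM sorting method. The names `makeKey`, `sortBuckets` and `sortedBuckets` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

constexpr unsigned kBucket = 256;   // keys per bucket; must be a power of two (assumption)

// Pack hash and state so that comparing keys orders by (H(S), S).
__host__ __device__ inline uint64_t makeKey(uint32_t hash, uint32_t state) {
    return ((uint64_t)hash << 32) | state;
}

// One group sorts one bucket entirely in shared RAM using a bitonic network.
__global__ void sortBuckets(uint64_t *keys) {
    __shared__ uint64_t s[kBucket];
    unsigned t = threadIdx.x;
    uint64_t *bucket = keys + (size_t)blockIdx.x * kBucket;
    s[t] = bucket[t];                        // load the bucket from VRAM once
    __syncthreads();
    for (unsigned k = 2; k <= kBucket; k <<= 1) {        // size of sorted runs
        for (unsigned j = k >> 1; j > 0; j >>= 1) {      // compare distance
            unsigned partner = t ^ j;
            if (partner > t) {                           // lower thread of a pair does the work
                bool ascending = (t & k) == 0;
                if ((s[t] > s[partner]) == ascending) {
                    uint64_t tmp = s[t];
                    s[t] = s[partner];
                    s[partner] = tmp;
                }
            }
            __syncthreads();
        }
    }
    bucket[t] = s[t];                        // write the sorted bucket back to VRAM
}

// Hypothetical host helper; keys.size() must be a non-zero multiple of kBucket.
std::vector<uint64_t> sortedBuckets(std::vector<uint64_t> keys) {
    size_t bytes = keys.size() * sizeof(uint64_t);
    uint64_t *keys_d;
    cudaMalloc((void **)&keys_d, bytes);
    cudaMemcpy(keys_d, keys.data(), bytes, cudaMemcpyHostToDevice);
    sortBuckets<<<(unsigned)(keys.size() / kBucket), kBucket>>>(keys_d);
    cudaMemcpy(keys.data(), keys_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(keys_d);
    return keys;
}
```

Each bucket comes back sorted on its own; buckets are not merged with each other, which matches the assumption that the presorting by hash has already placed every state in the right bucket.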
38
Real-world example: results
39
Questions??? Programming the GPU