Algorithm Engineering „GPGPU“ Stefan Edelkamp

Graphics Processing Units
- GPGPU = (GP)²U: General Purpose Programming on the GPU
- "Parallelism for the masses"
- Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone

Programming the Graphics Processing Unit with CUDA

Overview
- Cluster / Multicore / GPU comparison
- Computing on the GPU
- GPGPU languages
- CUDA
- Small Example

Cluster / Multicore / GPU: Cluster system
- Many separate systems, each with
  - one (or more) processors
  - internal memory
  - often an HDD
- Communication over the network
  - slow compared to internal communication
  - no shared memory
[Diagram: three nodes, each with CPU, RAM and HDD, connected by a switch]

Cluster / Multicore / GPU: Multicore systems
- Multiple CPUs
- RAM
- External memory on HDD
- Communication over RAM
[Diagram: CPU1 to CPU4 sharing RAM and HDD]

Cluster / Multicore / GPU: System with a Graphics Processing Unit
- Many (240) parallel processing units
- Hierarchical memory structure
  - RAM
  - Video RAM
  - Shared RAM
- Communication over the PCI bus
[Diagram: graphics card with GPU, SRAM and VRAM; host with RAM, CPU and hard disk drive]

Computing on the GPU: Hierarchical execution
- Groups
  - executed sequentially
- Threads
  - executed in parallel
  - lightweight (creation / switching nearly free)
- One kernel function
  - executed by each thread
[Diagram: Group 0]

Computing on the GPU: Hierarchical memory
- Video RAM
  - 1 GB
  - comparable to RAM
- Shared RAM in the GPU
  - 16 KB
  - comparable to registers
  - parallel access by threads
[Diagram: graphics card with GPU, SRAM and Video RAM]
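The split between Video RAM and shared RAM can be sketched in CUDA as follows. This is a minimal illustration, not taken from the slides: the kernel name `reverseChunks`, the helper `runReverse` and the group size of 256 threads are assumptions. Each group copies its chunk from Video RAM into a `__shared__` array, synchronizes, and writes the chunk back in reverse order.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per group (assumed value for this sketch)

// Each group reverses its own chunk of BLOCK ints via shared RAM.
__global__ void reverseChunks(int *data)
{
    __shared__ int tile[BLOCK];                    // 1 KB of on-chip shared RAM per group
    int base = blockIdx.x * BLOCK;
    tile[threadIdx.x] = data[base + threadIdx.x];  // Video RAM -> shared RAM
    __syncthreads();                               // wait until the whole tile is loaded
    data[base + threadIdx.x] = tile[BLOCK - 1 - threadIdx.x];  // shared RAM -> Video RAM
}

// Hypothetical host helper: reverses `groups` chunks on the GPU and checks the result.
bool runReverse(int groups)
{
    int n = groups * BLOCK;
    size_t bytes = n * sizeof(int);
    int *h = new int[n];
    for (int i = 0; i < n; i++) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { delete[] h; return false; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    reverseChunks<<<groups, BLOCK>>>(d);
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Element t of group g must now hold the value that was at position BLOCK-1-t.
    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; i++) {
        int g = i / BLOCK, t = i % BLOCK;
        ok = (h[i] == g * BLOCK + (BLOCK - 1 - t));
    }
    delete[] h;
    return ok;
}

int main()
{
    bool ok = runReverse(4);
    printf("reverse in shared RAM: %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

The tile of 256 ints occupies 1 KB, well below the 16 KB of shared RAM mentioned on the slide.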

Example architecture: G200, e.g. in the GTX 280

Example problems

Ranking and unranking with parity

2-Bit BFS

1-Bit BFS

Sliding-tile puzzle

Some Results…

Further results …

GPGPU Languages
- RapidMind
  - Supports multicore, ATI, NVIDIA and Cell
  - C++, analysed and compiled for the target hardware
- Accelerator (Microsoft)
  - Library for .NET languages
- BrookGPU (Stanford University)
  - Supports ATI, NVIDIA
  - Own language, a variant of ANSI C

Overview
- Cluster / Multicore / GPU comparison
- Computing on the GPU
- Programming languages
- CUDA
- Small Example

CUDA
- Programming language
- Similar to C
- File suffix .cu
- Own compiler called nvcc
- Can be linked to C

CUDA
- C++ code: compile with GCC
- CUDA code: compile with nvcc
- Link with ld
- Result: executable

CUDA: Additional variable types
- dim3
- int3
- char3

CUDA
- Different types of functions
  - __global__: invoked from the host
  - __device__: called from the device
- Different types of variables
  - __device__: located in VRAM
  - __shared__: located in SRAM
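A small sketch that uses all four qualifiers together may make the distinction concrete. The names `evenCount`, `isEven`, `countEven` and `countEvenOnGpu` are made up for this illustration, and the group size of 128 threads is an assumption. The kernel counts the even entries of an array: each group accumulates in a `__shared__` counter and then adds its total to a `__device__` counter in VRAM.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__device__ int evenCount;   // __device__ variable: located in VRAM

// __device__ function: can only be called from device code
__device__ bool isEven(int x) { return (x % 2) == 0; }

// __global__ function (kernel): invoked from the host, executed by every thread
__global__ void countEven(const int *a, int n)
{
    __shared__ int groupCount;   // __shared__ variable: located in SRAM, one per group
    if (threadIdx.x == 0) groupCount = 0;
    __syncthreads();

    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n && isEven(a[id])) atomicAdd(&groupCount, 1);
    __syncthreads();

    // One thread per group adds the group total to the counter in VRAM.
    if (threadIdx.x == 0) atomicAdd(&evenCount, groupCount);
}

// Hypothetical host helper: returns the number of even entries, or -1 on error.
int countEvenOnGpu(const int *a, int n)
{
    int *a_d = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&a_d, bytes) != cudaSuccess) return -1;
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);

    int zero = 0;
    cudaMemcpyToSymbol(evenCount, &zero, sizeof(int));   // reset the VRAM counter

    int threads = 128;                                   // assumed group size
    int groups = (n + threads - 1) / threads;
    countEven<<<groups, threads>>>(a_d, n);

    int result = -1;
    cudaError_t err = cudaMemcpyFromSymbol(&result, evenCount, sizeof(int));
    cudaFree(a_d);
    return err == cudaSuccess ? result : -1;
}

int main()
{
    int a[1000];
    for (int i = 0; i < 1000; i++) a[i] = i;
    printf("even entries among 0..999: %d\n", countEvenOnGpu(a, 1000));
    return 0;
}
```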

CUDA: Calling the kernel function
- name<<<dimGrid, dimBlock>>>(...)
- dimGrid: grid dimensions (groups)
- dimBlock: block dimensions (threads)

CUDA: Memory handling
- cudaMalloc(...) - allocating VRAM
- cudaMemcpy(...) - copying memory
- cudaFree(...) - freeing VRAM
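The three calls are typically used together in an allocate, copy, free pattern. The following sketch copies an array to VRAM and straight back without running a kernel; the helper name `roundTrip` is an assumption made for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies n ints to VRAM and back, then compares both copies.
bool roundTrip(int n)
{
    std::vector<int> src(n), dst(n, 0);
    for (int i = 0; i < n; i++) src[i] = i * i;

    size_t bytes = n * sizeof(int);     // sizes are given in bytes, not in elements
    int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;        // allocate VRAM

    cudaError_t up   = cudaMemcpy(d, src.data(), bytes, cudaMemcpyHostToDevice);
    cudaError_t down = cudaMemcpy(dst.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);                                                   // free VRAM

    return up == cudaSuccess && down == cudaSuccess && src == dst;
}

int main()
{
    bool ok = roundTrip(1024);
    printf("round trip through VRAM: %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Every runtime call returns a `cudaError_t`; checking it, as done here, is the usual way to notice a failed allocation or copy.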

CUDA: Distinguishing threads
- blockDim: number of threads per group
- blockIdx: ID of the group (starting with 0)
- threadIdx: ID of the thread within its group (starting with 0)
- Id = blockDim.x*blockIdx.x+threadIdx.x
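To see how the global Id is composed, the sketch below lets every thread record its own `blockIdx.x` and `threadIdx.x` at position Id. The names `recordIds` and `checkIds` are hypothetical. The number of groups is rounded up so that all n elements are covered, which is why the kernel needs the `id < n` guard.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its group ID and its thread ID at its global position.
__global__ void recordIds(int *group, int *thread, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {                 // the last group may have surplus threads
        group[id]  = blockIdx.x;
        thread[id] = threadIdx.x;
    }
}

// Hypothetical host helper: checks that position i was written by
// group i / threadsPerGroup and thread i % threadsPerGroup. Assumes n > 0.
bool checkIds(int n, int threadsPerGroup)
{
    int groups = (n + threadsPerGroup - 1) / threadsPerGroup;   // round up
    size_t bytes = n * sizeof(int);

    int *dGroup = nullptr, *dThread = nullptr;
    if (cudaMalloc(&dGroup, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dThread, bytes) != cudaSuccess) { cudaFree(dGroup); return false; }

    recordIds<<<groups, threadsPerGroup>>>(dGroup, dThread, n);

    std::vector<int> hGroup(n), hThread(n);
    cudaError_t e1 = cudaMemcpy(hGroup.data(), dGroup, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hThread.data(), dThread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dGroup);
    cudaFree(dThread);

    bool ok = (e1 == cudaSuccess && e2 == cudaSuccess);
    for (int i = 0; ok && i < n; i++)
        ok = (hGroup[i] == i / threadsPerGroup && hThread[i] == i % threadsPerGroup);
    return ok;
}

int main()
{
    // 1000 elements with 256 threads per group: 4 groups, 24 idle threads.
    bool ok = checkIds(1000, 256);
    printf("thread IDs: %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```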

Overview
- Cluster / Multicore / GPU comparison
- Computing on the GPU
- Programming languages
- CUDA
- Small Example

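CUDA
Sequential C version:
    void inc(int *a, int b, int N)
    {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b;
    }
    int main()
    {
        ...
        inc(a, b, N);
    }
CUDA version:
    __global__ void inc(int *a, int b, int N)
    {
        // one thread per array element
        int id = blockDim.x * blockIdx.x + threadIdx.x;
        if (id < N)
            a[id] = a[id] + b;
    }
    int main()
    {
        ...
        int *a_d;
        cudaMalloc((void **)&a_d, N * sizeof(int));
        cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
        dim3 dimBlock(blocksize, 1, 1);   // unused dimensions must be 1, not 0
        dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // rounded up to cover all N elements
        inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    }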

Real-world Example: LTL model checking
- Traversing an implicit graph G=(V,E)
- Vertices are called states
- Edges are represented by transitions
- Duplicate removal needed

Real-world Example: External model checking
- Generate the graph with external BFS
- Each BFS layer needs to be sorted
  - GPUs have proven to be fast at sorting

Real-world Example: Challenges
- Millions of states in one layer
- Huge state size
- Fast access only in SRAM
- Elements need to be moved

Real-world Example: Solutions
- Gpuqsort
  - Qsort optimized for GPUs
  - Intensive swapping in VRAM
- Bitonic-based sorting
  - Fast for subgroups
  - Concatenating groups is slow

Real-world Example: Our solution
- States S presorted by hash H(S)
- Each bucket sorted in SRAM by a group
[Diagram: VRAM, SRAM]
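The idea of presorting by hash and then sorting each bucket in SRAM can be sketched as follows. This is an illustration under several assumptions and not the authors' implementation: states are plain `unsigned int` values, the hash `hashOf` is a made-up modulo, buckets have a fixed capacity of 256 slots padded with a sentinel, the presorting happens on the host, and the in-bucket sort is a simple odd-even transposition sort chosen for brevity. One group loads one bucket into shared RAM, sorts it there, and writes it back to VRAM.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BUCKET 256          // slots per bucket = threads per group (assumed)
#define NUM_BUCKETS 8       // number of hash buckets (assumed)
#define EMPTY 0xFFFFFFFFu   // padding value; assumed never to occur as a state

// Hypothetical hash H(S); the slides do not say which hash is used.
inline unsigned int hashOf(unsigned int s) { return s % NUM_BUCKETS; }

// One group sorts one bucket in shared RAM (odd-even transposition sort).
__global__ void sortBuckets(unsigned int *buckets)
{
    __shared__ unsigned int s[BUCKET];
    unsigned int *mine = buckets + blockIdx.x * BUCKET;
    int t = threadIdx.x;
    s[t] = mine[t];                          // VRAM -> SRAM
    __syncthreads();
    for (int phase = 0; phase < BUCKET; phase++) {
        int i = 2 * t + (phase & 1);         // left element of this thread's pair
        if (i + 1 < BUCKET && s[i] > s[i + 1]) {
            unsigned int tmp = s[i]; s[i] = s[i + 1]; s[i + 1] = tmp;
        }
        __syncthreads();                     // all pairs done before the next phase
    }
    mine[t] = s[t];                          // SRAM -> VRAM
}

// Reorders `states` by (H(S), S). Returns false if a bucket overflows or CUDA fails.
bool sortByHashThenState(std::vector<unsigned int> &states)
{
    std::vector<unsigned int> buckets(NUM_BUCKETS * BUCKET, EMPTY);
    int fill[NUM_BUCKETS] = {0};
    for (unsigned int st : states) {         // presort: distribute by hash
        unsigned int b = hashOf(st);
        if (fill[b] == BUCKET) return false;
        buckets[b * BUCKET + fill[b]++] = st;
    }
    size_t bytes = buckets.size() * sizeof(unsigned int);
    unsigned int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, buckets.data(), bytes, cudaMemcpyHostToDevice);
    sortBuckets<<<NUM_BUCKETS, BUCKET>>>(d);
    cudaError_t err = cudaMemcpy(buckets.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    states.clear();                          // concatenate buckets, skipping padding
    for (int b = 0; b < NUM_BUCKETS; b++)
        for (int i = 0; i < fill[b]; i++)
            states.push_back(buckets[b * BUCKET + i]);
    return true;
}

// Checks that consecutive states are ordered by (H(S), S).
bool isOrdered(const std::vector<unsigned int> &v)
{
    for (size_t i = 1; i < v.size(); i++) {
        unsigned int ha = hashOf(v[i - 1]), hb = hashOf(v[i]);
        if (ha > hb || (ha == hb && v[i - 1] > v[i])) return false;
    }
    return true;
}

int main()
{
    std::vector<unsigned int> states;
    for (unsigned int i = 0; i < 1024; i++) states.push_back((1023 - i) * 7 + 3);
    bool ok = sortByHashThenState(states) && isOrdered(states);
    printf("bucket sort by (H(S), S): %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Because each bucket fits into shared RAM, the groups never have to exchange elements with each other; the final order is obtained by concatenating the buckets.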

Real-world Example: Our solution
- Order given by (H(S), S)

Real-world Example: Results

Questions??? Programming the GPU
