Algorithm Engineering „GPGPU“ Stefan Edelkamp

Graphics Processing Units  GPGPU = (GP)²U: General Purpose Programming on the GPU  "Parallelism for the masses"  Applications: Fourier transforms, model checking, bioinformatics; see the CUDA Zone

Programming the Graphics Processing Unit with Cuda

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

Cluster / Multicore / GPU  Cluster system many unique systems each one  one (or more) processors  internal memory  often HDD communication over network  slow compared to internal  no shared memory CPURAM HDD CPURAM HDD CPURAM HDD Switch

Cluster / Multicore / GPU  Multicore systems multiple CPUs RAM external memory on HDD communication over RAM CPU1CPU2 CPU4CPU3 RAM HDD

Cluster / Multicore / GPU  System with a Graphic Processing Unit Many (240) Parallel processing units Hierarchical memory structure  RAM  VideoRAM  SharedRAM Communication  PCI BUS Graphics Card GPU SRAM VRAM RAM CPU Hard Disk Drive

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

Computing on the GPU  Hierarchical execution Groups  executed sequentially Threads  executed parallel  lightweight (creation / switching nearly free)‏ one Kernel function  executed by each thread Group 0

Computing on the GPU  Hierarchical memory Video RAM Video RAM  1 GB  Comparable to RAM Shared RAM in the GPU  16 KB  Comparable to registers  parallel access by threads Graphic Card GPU SRAM VideoRAM

Example architecture: the G200, e.g. in the 280 GTX

Example problems

Ranking and unranking with parity

2-Bit BFS

1-Bit BFS

Sliding-tile puzzles

Some Results…

Further results…

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

GPGPU Languages  RapidMind  supports multicore CPUs, ATI, NVIDIA, and Cell  C++ code is analysed and compiled for the target hardware  Accelerator (Microsoft)  library for .NET languages  BrookGPU (Stanford University)  supports ATI, NVIDIA  own language, a variant of ANSI C

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  Programming languages  CUDA  Small Example

CUDA  Programming language  Similar to C  File suffix.cu  Own compiler called nvcc  Can be linked to C

CUDA build flow: C++ code is compiled with GCC, CUDA code with nvcc; both are linked with ld into one executable.

CUDA  Additional variable types Dim3 Int3 Char3

CUDA  Different types of functions __global__ invoked from host __device__ called from device  Different types of variables __device__ located in VRAM __shared__ located in SRAM

CUDA  Calling the kernel function name >>(...)‏  Grid dimensions (groups)‏  Block dimensions (threads)‏

CUDA  Memory handling CudaMalloc(...) - allocating VRAM CudaMemcpy(...) - copying Memory CudaFree(...) - free VRAM

CUDA  Distinguish threads blockDim – Number of all groups blockIdx – Id of Group (starting with 0)‏ threadIdx – Id of Thread (starting with 0)‏ Id = blockDim.x*blockIdx.x+threadIdx.x

Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  Programming languages  CUDA  Small Example

CUDA  CPU version:

  void inc(int *a, int b, int N) {
    for (int i = 0; i < N; i++)
      a[i] = a[i] + b;
  }

  int main() {
    ...
    inc(a, b, N);
  }

GPU version:

  __global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
      a[id] = a[id] + b;
  }

  int main() {
    ...
    int *a_d;
    cudaMalloc((void**)&a_d, N * sizeof(int));
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(blocksize, 1, 1);
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
  }

Realworld Example  LTL Model checking Traversing an implicit Graph G=(V,E)‏ Vertices called states Edges represented by transitions Duplicate removal needed

Realworld Example  External Model checking Generate Graph with external BFS Each BFS layer needs to be sorted  GPU proven to be fast in sorting

Realworld Example  Challenges Millions of states in one layer Huge state size Fast access only in SRAM Elements needs to be moved

Realworld Example  Solutions: Gpuqsort  Qsort optimized for GPUs  Intensive swapping in VRAM Bitonic based sorting  Fast for subgroups  Concatenating Groups slow

Realworld Example  Our solution States S presorted by Hash H(S) Bucket sorted in SRAM by a Group VRAM SRAM

Realworld Example  Our solution Order given by H(S),S

Realworld Example  Results

Questions?  Programming the GPU