1 GPU programming Dr. Bernhard Kainz

2 Dr Bernhard Kainz Overview: About myself; Motivation; GPU hardware and system architecture; GPU programming languages; GPU programming paradigms; Example program; Memory model; Tiling; Reduction; State-of-the-art applications. The first part of this list was covered last week; the remainder is covered this week.

3 Dr Bernhard Kainz Distinguishing between threads: blockIdx and threadIdx. (Figure: a 2D grid of thread blocks and, within each block, a 2D arrangement of threads; every thread is uniquely identified by the pair of (x,y) indices blockIdx and threadIdx.)

4 Dr Bernhard Kainz 2D kernel example - execution paths are chosen using threadIdx and blockIdx - the total number of threads can be determined from blockDim and gridDim

__global__ void myfunction(float *input, float *output)
{
    // flatten the 2D block and thread indices into one global, linear thread id
    uint bid = blockIdx.x + blockIdx.y * gridDim.x;
    uint tid = bid * (blockDim.x * blockDim.y)
             + threadIdx.y * blockDim.x + threadIdx.x;
    output[tid] = input[tid];
}

dim3 blockSize(32, 32, 1);
dim3 gridSize((iSpaceX + blockSize.x - 1) / blockSize.x,
              (iSpaceY + blockSize.y - 1) / blockSize.y, 1);
myfunction<<<gridSize, blockSize>>>(input, output);

5 Dr Bernhard Kainz Matrix Multiplication Example

6 Dr Bernhard Kainz Matrix Multiplication Example Loop-based parallelism

7 Dr Bernhard Kainz Matrix Multiplication Example

8 Dr Bernhard Kainz Matrix Multiplication Example

float* A = new float[A_rows*A_cols];
float* B = new float[B_rows*B_cols];
float* C = new float[A_rows*B_cols];
// some matrix initialization

float *d_A, *d_B, *d_C;   // note: each pointer needs its own *

cudaMalloc((void**)&d_A, A_rows*A_cols*sizeof(float));
cudaMalloc((void**)&d_B, B_rows*B_cols*sizeof(float));
cudaMalloc((void**)&d_C, A_rows*B_cols*sizeof(float));

cudaMemcpy(d_A, A, A_rows*A_cols*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, B_rows*B_cols*sizeof(float), cudaMemcpyHostToDevice);

// ... kernel launch ...

cudaMemcpy(C, d_C, A_rows*B_cols*sizeof(float), cudaMemcpyDeviceToHost);
// free device and host memory
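The kernel itself appeared only as an image on the slides. A minimal sketch of the naive (non-tiled) version, assuming row-major storage and the dimension names from the host code above; the kernel name and bounds check are illustrative, not from the slides:

__global__ void matMulNaive(const float* A, const float* B, float* C,
                            int A_rows, int A_cols, int B_cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < A_rows && col < B_cols) {
        float sum = 0.0f;
        // every operand is fetched from slow global memory
        for (int k = 0; k < A_cols; ++k)
            sum += A[row * A_cols + k] * B[k * B_cols + col];
        C[row * B_cols + col] = sum;
    }
}

// launch, e.g. with 16x16 blocks:
dim3 block(16, 16, 1);
dim3 grid((B_cols + block.x - 1) / block.x,
          (A_rows + block.y - 1) / block.y, 1);
matMulNaive<<<grid, block>>>(d_A, d_B, d_C, A_rows, A_cols, B_cols);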

9 Dr Bernhard Kainz Matrix Multiplication Example

10 Dr Bernhard Kainz Memory Model (Figure: the memory hierarchy, with host memory on the CPU side and device memory on the GPU side.)

11 Dr Bernhard Kainz Matrix Multiplication Example A lot of memory access with only little computation. All memory accesses go to slow global device memory. Within a block, the same data is needed by multiple threads → use shared memory to load one tile of data, consume the data together, then advance to the next tile.

12 Dr Bernhard Kainz Matrix Multiplication Example

13 Dr Bernhard Kainz Matrix Multiplication Example Without tiling, each input element is loaded BLOCK_SIZE times! Load tiles, work on tiles, load next tiles …

14 Dr Bernhard Kainz Matrix Multiplication Example Blocksize: TILE_WIDTH x TILE_WIDTH
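The tiled kernel was also shown only as an image. A minimal sketch, assuming square N×N matrices with N a multiple of TILE_WIDTH (names are illustrative):

#define TILE_WIDTH 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N)
{
    // per-block scratchpad for one tile of A and one tile of B
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_WIDTH; ++t) {
        // each thread loads exactly one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();   // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // wait until all reads finish before the tiles are overwritten
    }
    C[row * N + col] = sum;
}

Each input element now passes through global memory only N/TILE_WIDTH times instead of N times; the two barriers are exactly what the "problems" slides below are about.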

15 Dr Bernhard Kainz Matrix multiplication problems

16 Dr Bernhard Kainz Matrix multiplication problems Without a barrier, a tile element may be read by one thread before another thread has loaded it!! The __syncthreads() calls in the sketch above prevent this race.

17 Dr Bernhard Kainz Matrix multiplication problems

18 Dr Bernhard Kainz Matrix multiplication problems

19 Dr Bernhard Kainz Matrix multiplication problems

20 Dr Bernhard Kainz Memory statistics: non-tiled

21 Dr Bernhard Kainz Memory statistics: tiled

22 Parallel Reduction Illustrations by Mark Harris, Nvidia

23 Dr Bernhard Kainz Parallel Reduction - Common and important data parallel primitive - Easy to implement in CUDA - Harder to get it right - Serves as a great optimization example - Several different versions possible - Demonstrates several important optimization strategies

24 Dr Bernhard Kainz Parallel Reduction - Tree-based approach used within each thread block - Need to be able to use multiple thread blocks - To process very large arrays - To keep all multiprocessors on the GPU busy - Each thread block reduces a portion of the array - Communicate partial results between thread blocks?

25 Dr Bernhard Kainz Parallel Reduction - If we could synchronize across all thread blocks, could easily reduce very large arrays, right? - Global sync after each block produces its result - Once all blocks reach sync, continue recursively - But CUDA has no global synchronization. Why? - Expensive to build in hardware for GPUs with high processor count - Would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency - Solution: decompose into multiple kernels - Kernel launch serves as a global synchronization point - Kernel launch has negligible HW overhead, low SW overhead

26 Dr Bernhard Kainz Parallel Reduction - Avoid global sync by decomposing computation into multiple kernel invocations - In the case of reductions, code for all levels is the same - Recursive kernel invocation
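A sketch of that host-side pattern. The kernel and buffer names (reduceKernel, d_in, d_scratch) are illustrative: each block is assumed to sum 2*blockDim.x consecutive input elements and write its single partial sum to out[blockIdx.x]:

// assumed signature:
// __global__ void reduceKernel(const float* in, float* out, int n);

int n = N;                  // elements left to reduce
float *src = d_in, *dst = d_scratch;
while (n > 1) {
    int threads = 128;
    int blocks  = (n + 2 * threads - 1) / (2 * threads);
    reduceKernel<<<blocks, threads, threads * sizeof(float)>>>(src, dst, n);
    // the next launch only starts once all blocks above have finished:
    // the kernel boundary acts as the missing global barrier
    n = blocks;
    float* tmp = src; src = dst; dst = tmp;   // ping-pong the buffers
}
// src[0] now holds the final result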

27 Dr Bernhard Kainz Parallel Reduction Your turn: - Take two memory elements, add them together, and put one in your pocket. - On my command, take the memory element from your neighbour on your left. - Add the two numbers together, write down the result under “Step 1”, and put the other memory element in your pocket. - Take the memory element from the next neighbour on your left who still has a memory element. - Add the two numbers together, write down the result under “Step 2”, and pocket the other one. - Continue on my command until only the rightmost column has memory elements. - Pass down the column to the next neighbour with a memory element, add the numbers together, write the result in the next empty “Step” field, and put the other element away. - Continue until only one thread (student) is left = the result.

28 Dr Bernhard Kainz Parallel Reduction – Interleaved Addressing
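The slide showed Mark Harris's interleaved-addressing kernel as an image; a sketch of that first, unoptimized version, with variable names following his whitepaper (the launch configuration, with dynamic shared memory of threads * sizeof(int) bytes, is assumed):

__global__ void reduce0(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // interleaved addressing: in each step, surviving threads are 2*s apart
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // thread 0 writes this block's partial result to global memory
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

The modulo test leaves most threads idle and causes heavy warp divergence, which is why this is the least efficient version (see slide 30).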

29 Dr Bernhard Kainz Parallel Reduction - We should strive to reach GPU peak performance - Choose the right metric: - GFLOP/s: for compute-bound kernels - Bandwidth: for memory-bound kernels - Reductions have very low arithmetic intensity - 1 flop per element loaded (bandwidth-optimal) - Therefore we should strive for peak bandwidth - Will use the G80 GPU for this example: 384-bit memory interface, 900 MHz DDR - 384 * 1800 / 8 = 86.4 GB/s
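To compare a kernel against that peak, the achieved bandwidth can be computed from the element count and the measured kernel time. A sketch using CUDA events, reusing the reduce0 kernel above (blocks, threads, d_in, d_out are assumed to be set up as before):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduce0<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

// a reduction reads each of the N input elements exactly once
double gbps = (N * sizeof(int)) / (ms * 1.0e6);   // achieved GB/s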

30 Dr Bernhard Kainz Parallel Reduction Many optimizations are possible; the one we did is the least efficient. A very good discussion (on the G80 architecture) is Mark Harris's “Optimizing Parallel Reduction in CUDA”.

31 Examples and applications

32 Dr Bernhard Kainz Real-time optical flow

33 Dr Bernhard Kainz Real time medical image analysis and visualization

34 Dr Bernhard Kainz KinectFusion Developed at

35 Dr Bernhard Kainz KinectFusion

36 Dr Bernhard Kainz Medical Imaging EPSRC Centre for Doctoral Training - 15 fully-funded PhD places for graduates in physics, chemistry, biology, engineering and mathematics - Integrated cross-institutional training programme (MRes + 3-year PhD) - Research projects available in: image acquisition & reconstruction; imaging chemistry & biology; image computing & computational modelling - Apply by 4 January 2016: Contact:

37 GPU programming Dr. Bernhard Kainz