Training Program on GPU Programming with CUDA. 31st July, 7th Aug, 14th Aug 2011. CUDA Teaching UoM

Training Program on GPU Programming with CUDA. Sanath Jayasena, CUDA Teaching UoM. Day 1, Session 2: CUDA Programming Model, CUDA Threads

Outline for Day 1, Session 2: CUDA Programming Model, CUDA Threads
- Data Parallelism
- CUDA Program Structure
- Memory Model & Data Transfer (brief)
- Kernel Functions & Threading (discussion with example: matrix multiplication)
July-Aug 2011, CUDA Training Program

Data Parallelism
- A problem/program property: many arithmetic operations can be safely performed on the data structures simultaneously
- Example: matrix multiplication (next slide)
- CUDA devices can exploit data parallelism to accelerate the execution of applications

Example: Matrix Multiplication
(Figure: square matrices M, N and P of dimension width, with P = M · N)
- Each element of P is computed as the dot product between a row of M and a column of N
- All elements of P can be computed independently and simultaneously

CUDA Program Structure
- A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source file
- Little or no data parallelism → host code: ANSI C, compiled with a standard compiler
- Significant data parallelism → device code: ANSI C extended with keywords to specify kernels and their data structures
- The NVIDIA C compiler separates the two during compilation

Execution of a CUDA Program
(Figure: alternating phases of serial host code and parallel device kernels, shown as two grids of threads)

Execution of a CUDA Program
- Execution starts on the host (CPU)
- When a kernel is invoked, execution moves to the device (GPU)
  - A large number of threads are generated
  - Grid: the collection of all threads generated by a kernel invocation
  - (The previous slide shows two grids of threads)
- Once all threads in a grid complete execution, the grid terminates and execution continues on the host

Example: Matrix Multiplication
A simple CUDA host code skeleton for matrix multiplication:

int main(void) {
    // 1. Allocate and initialize matrices M, N, P
    //    I/O to read the input matrices M and N
    ...
    // 2. Compute M * N on the device
    MatrixMulOnDevice(M, N, P, width);
    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ...
    return 0;
}

CUDA Device Memory Model
- The host and devices have separate memory spaces
  - E.g., hardware cards with their own DRAM
- To execute a kernel on a device:
  - Allocate memory on the device
  - Transfer data from host memory to device memory
- After device execution:
  - Transfer results from device memory back to host memory
  - Free device memory that is no longer needed

CUDA Device Memory Model
(Figure: the separate host and device memory spaces)

CUDA API: Memory Management
(Table: cudaMalloc() and cudaFree(), for allocating and freeing device global memory)

CUDA API: Memory Management Example

float *Md;
int size = Width * Width * sizeof(float);
cudaMalloc((void**)&Md, size);
...
cudaFree(Md);

CUDA API: Data Transfer
(Table: cudaMemcpy(), whose parameters give the destination, source, size, and direction of the transfer)
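A sketch of how cudaMemcpy() combines with the cudaMalloc()/cudaFree() pattern from the previous slide (Md, Pd and size as in the slides; host arrays M and P assumed allocated; error checking omitted):

```cuda
// Assumed context: host arrays M, P and device pointers Md, Pd
// obtained with cudaMalloc(..., size) as on the previous slide.
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // copy input to device
// ... kernel launch goes here ...
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // copy result back
cudaFree(Md);
cudaFree(Pd);
```

The last argument selects the direction; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are the two cases used in this session.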

Example: Matrix Multiplication
(Slide: code for the MatrixMulOnDevice() host function)
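The MatrixMulOnDevice() code itself is missing from the transcript; a sketch consistent with the skeleton, the memory-management calls, and the single-block launch discussed later (error checking omitted) would be:

```cuda
void MatrixMulOnDevice(float *M, float *N, float *P, int Width) {
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy the inputs to the device
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // 2. Launch the kernel: one block of Width x Width threads
    dim3 dimBlock(Width, Width, 1);
    dim3 dimGrid(1, 1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Copy the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
```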

Kernel Functions & Threading
- A kernel function specifies the code to be executed by all threads of a parallel phase
  - All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing
- We need a mechanism to:
  - Allow threads to distinguish themselves from one another
  - Direct each thread to the specific part of the data it is supposed to work on

Kernel Functions & Threading
- The keywords threadIdx.x and threadIdx.y give the thread indices of a thread
  - They allow a thread to identify itself at runtime (by accessing hardware registers associated with it)
- We can refer to a thread as Thread(threadIdx.x, threadIdx.y)
- Thread indices reflect a multi-dimensional organization of threads

Example: Matrix Multiplication Kernel
(Slide: code for MatrixMulKernel(); the next slide gives more details on how each thread accesses its data)
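The kernel code is not preserved in the transcript; a sketch matching the description (each thread uses its threadIdx coordinates to compute one element of Pd as a dot product) is:

```cuda
// Sketch: each thread computes one element of Pd from its 2-D thread indices.
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    int tx = threadIdx.x;   // column of Pd this thread computes
    int ty = threadIdx.y;   // row of Pd this thread computes

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

    Pd[ty * Width + tx] = Pvalue;
}
```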

Thread Indices & Accessing Data Relevant to a Thread
(Figure: matrices Md, Nd and Pd of dimension width, with a thread at coordinates (tx, ty); Pd is laid out in memory row by row, as it is a 1-D array, so row ty starts at offset ty * width)
- Each thread uses tx and ty to identify the relevant row of Md, the column of Nd, and the element of Pd in the for loop
- E.g., Thread(2,3) performs the dot product between row 2 of Md and column 3 of Nd and writes the result into element (2,3) of Pd

Threading & Grids
- When a kernel is invoked/launched, it is executed as a grid of parallel threads
- A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
  - Fully utilizing the hardware requires enough threads, which in turn requires large data parallelism
- Threads in a grid have a two-level hierarchy
  - A grid consists of one or more thread blocks
  - All blocks in a grid have the same number of threads

CUDA Thread Organization
(Figure: a grid of thread blocks, each block containing an array of threads)

Threading with Grids & Blocks
- Each thread block has a unique 2-D coordinate, given by the CUDA keywords blockIdx.x and blockIdx.y
  - All blocks must have the same structure and number of threads
- Each block has a 3-D array of threads, up to a maximum of 1024 threads in total
  - The coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z
  - (Not all applications will use all three dimensions)

Our Example: Matrix Multiplication
- The kernel shown earlier uses only one thread block
  - The block is organized as a 2-D array of threads
- The code can therefore compute a product matrix Pd of only up to 1024 elements
  - A block can have at most 1024 threads
  - Each thread computes one element of Pd
  - Is this sufficient / acceptable?

Our Example: Matrix Multiplication
When the host code invokes the kernel, it sets the grid and block dimensions by passing them as parameters. Example:

// Set up the execution configuration
dim3 dimBlock(16, 16, 1);  // Width = 16, as an example
dim3 dimGrid(1, 1, 1);     // the last 1 is ignored
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);

Here is an Exercise…
Implement matrix multiplication and execute it with different matrix dimensions using (a) CPU only, (b) GPUs, and (c) GPUs with different grid/block organizations. Fill in a table like the following:

Dimensions (M, N)          | CPU time (s) | GPU time (s) | Speedup
[400,800], [400,400]       |              |              |
[800,1600], [800,800]      |              |              |
....                       |              |              |
[2400,4800], [2400,4800]   |              |              |

Conclusion
We discussed the CUDA programming model and CUDA thread basics:
- Data Parallelism
- CUDA Program Structure
- Memory Model & Data Transfer (briefly)
- Kernel Functions & Threading
- (Discussion with example: matrix multiplication)

References for this Session
- Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
- Chapters 4-5 of: E. Kandrot and J. Sanders, CUDA by Example, Addison-Wesley, 2010
- Chapter 2 of: NVIDIA CUDA C Programming Guide, v3.2/4.0, NVIDIA Corp.