1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 26, 2013, DyanmicParallelism.ppt CUDA Dynamic Parallelism These notes will outline the CUDA dynamic parallelism facility introduced in the last set of slides and provide code examples and an application.

2 CUDA Dynamic Parallelism A facility introduced by NVIDIA in their GK110 chip/architecture and embodied in our new K20 GPU server, coit-grid08.uncc.edu. It allows a kernel to call another kernel from within it without returning to the host. Each such kernel call has the same calling construction: the grid and block sizes and dimensions are set at the time of the call. The facility allows computations to be done with dynamically altering grid structures and recursion, to suit the computation. For example, it allows a 2-D/3-D simulation mesh to be non-uniform, with increased precision at places of interest; see previous slides.

3 Dynamic Parallelism: Kernels calling other kernels

Host code:
    kernel1<<<…>>>(…);

Device code:
    __global__ void kernel1(…) {
        kernel2<<<…>>>(…);
    }
    __global__ void kernel2(…) {
        kernel3<<<…>>>(…);
    }

Notice the kernel call is the standard launch syntax, and each kernel call can have a different grid/block structure. Nesting depth is limited by memory and is at most 63 or 64.
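For concreteness, a complete compilable sketch of the nesting above (kernel names, launch shapes, and the depth argument are illustrative, not from the slide). A child launch inside a kernel uses the same <<<...>>> syntax as a host launch; dynamic parallelism needs compute capability 3.5 and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true nested.cu -lcudadevrt.

    #include <cstdio>

    __global__ void kernel3(int depth) {
        if (threadIdx.x == 0) printf("kernel3 reached at depth %d\n", depth);
    }

    __global__ void kernel2(int depth) {
        if (threadIdx.x == 0)                    // one thread per block launches the next level
            kernel3<<<1, 32>>>(depth + 1);
    }

    __global__ void kernel1(int depth) {
        if (threadIdx.x == 0)
            kernel2<<<2, 64>>>(depth + 1);       // child may use a different grid/block shape
    }

    int main() {
        kernel1<<<4, 128>>>(0);                  // host launch of the parent kernel
        cudaDeviceSynchronize();                 // parent exits only after all its children exit
        return 0;
    }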

4 Dynamic Parallelism: Apparently recursion is allowed (to confirm).

Host code:
    kernel1<<<…>>>(…);

Device code:
    __global__ void kernel1(…) {
        kernel1<<<…>>>(…);
    }

Question: How do you get out of an infinite loop in recursion?
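One usual answer to the question above: pass the current depth as a kernel argument and stop launching below some limit, so the recursion bottoms out. A minimal sketch (the depth limit and launch shapes are arbitrary illustrative choices; build as in the previous sketch):

    #include <cstdio>

    #define MAX_DEPTH 4                          // illustrative; hardware also caps the nesting depth

    __global__ void kernel1(int depth) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            printf("recursion depth %d\n", depth);
            if (depth + 1 < MAX_DEPTH)           // stopping condition ends the recursion
                kernel1<<<1, 32>>>(depth + 1);
        }
    }

    int main() {
        kernel1<<<1, 32>>>(0);
        cudaDeviceSynchronize();
        return 0;
    }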

5 Parent-Child Launch Nesting (figure derived from Fig 20.4 of Kirk and Hwu, 2nd ed.): the host (CPU) thread launches Grid A (the parent); a Grid A thread launches Grid B (the child); Grid B completes before Grid A completes. "Implicit synchronization between parent and child forcing parent to wait until all children exit before it can exit."
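Besides this implicit synchronization at exit, the CUDA 5 documentation also lets a parent wait explicitly for its children, so it can use their results before exiting. A sketch of that documented pattern (array name and sizes are illustrative; data is assumed to hold at least 256 floats, and device-side cudaDeviceSynchronize() has been deprecated in recent CUDA releases):

    __global__ void child(float *data) {
        data[threadIdx.x] += 1.0f;
    }

    __global__ void parent(float *data) {
        if (threadIdx.x == 0) {
            child<<<1, 256>>>(data);
            cudaDeviceSynchronize();             // thread 0 waits for the child grid to finish
        }
        __syncthreads();                         // make the child's results visible to the whole block
        // ... the parent block can now read data[] updated by the child ...
    }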

6 Kernel execution configuration re-visited (derived from the CUDA 5 Toolkit documentation*). The execution configuration is specified by:

    kernel<<<G, B, Ns, S>>>(…)

G specifies the dimension and size of the grid; B specifies the dimension and size of each block; Ns specifies the number of bytes in shared memory that is dynamically allocated per block for this call, in addition to statically allocated memory (Ns is optional and defaults to 0); S specifies the associated stream (S is optional and defaults to 0).

*C language extensions. Earlier CUDA versions appear to have three arguments, G, B, and S?
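A small self-contained sketch using all four arguments (the kernel, sizes, and data are illustrative assumptions): Ns sizes the extern __shared__ array used inside the kernel, and the launch is issued into a created stream instead of the default stream 0.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor) {
        extern __shared__ float buf[];           // sized by the Ns launch argument
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = data[i] * factor;
        __syncthreads();
        data[i] = buf[threadIdx.x];
    }

    int main() {
        const int n = 1024;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        dim3 G(n / 256);                         // G: dimension and size of the grid
        dim3 B(256);                             // B: dimension and size of each block
        size_t Ns = 256 * sizeof(float);         // Ns: dynamic shared memory bytes per block
        cudaStream_t S;
        cudaStreamCreate(&S);                    // S: stream the launch is issued into

        scale<<<G, B, Ns, S>>>(d_data, 2.0f);    // Ns and S may be omitted and then default to 0

        cudaStreamSynchronize(S);
        cudaStreamDestroy(S);
        cudaFree(d_data);
        return 0;
    }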

7 Streams re-visited A stream is a sequence of operations that execute in order on the GPU. Streams provide for concurrency: multiple streams can be supported simultaneously on the GPU. Compute capability 2.0+: up to 16 CUDA streams*. Compute capability 3.5: up to 32 CUDA streams. A stream can be specified in a kernel launch as the 4th parameter of the execution configuration:

    kernel<<<G, B, Ns, S>>>(…);   // here S identifies stream 3

If missing, the default is stream 0. So technically one can launch multiple concurrent kernels from the host. Apparently it is difficult to get more than 4 streams to run concurrently. *Some devices only; query the concurrentKernels device property.
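A sketch of launching the same kernel into several streams from the host (stream count, kernel, and sizes are illustrative); whether the launches actually overlap depends on resources and on the concurrentKernels property mentioned above.

    #include <cuda_runtime.h>

    __global__ void work(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i] + 1.0f;
    }

    int main() {
        const int nStreams = 4, n = 1 << 20;
        cudaStream_t streams[nStreams];
        float *d[nStreams];

        for (int s = 0; s < nStreams; s++) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&d[s], n * sizeof(float));
            cudaMemset(d[s], 0, n * sizeof(float));
        }

        // Kernels issued into different streams may execute concurrently.
        for (int s = 0; s < nStreams; s++)
            work<<<(n + 255) / 256, 256, 0, streams[s]>>>(d[s], n);

        cudaDeviceSynchronize();                 // wait for all streams to finish

        for (int s = 0; s < nStreams; s++) {
            cudaStreamDestroy(streams[s]);
            cudaFree(d[s]);
        }
        return 0;
    }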

8 Application: Static Heat Equation / Laplace's Equation The 2-D heat equation is ∂u/∂t = ∂²u/∂x² + ∂²u/∂y². With boundary conditions defined, ∂u/∂t tends to 0 as t tends to infinity. Then we get the time-independent equation, called Laplace's equation: ∂²u/∂x² + ∂²u/∂y² = 0.

9 Solving the Static Heat Equation / Laplace's Equation: Finite Difference Method Solve for f over the two-dimensional x-y space. For a computer solution, finite difference methods are appropriate: the two-dimensional solution space is "discretized" into a large number of solution points.

10

Divide the area into a fine mesh of points, h_{i,j}. The temperature at an inside point is taken to be the average of the temperatures of the four neighbouring points. It is convenient to describe the edges by points. Compute the temperature of each point by iterating the equation

    h_{i,j} = (h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}) / 4    (0 < i < n, 0 < j < n)

for a fixed number of iterations or until the difference between iterations is less than some very small amount. 6.11
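A minimal CUDA sketch of this iteration (names, mesh size, and the hot-edge initialization are illustrative assumptions): each kernel launch performs one Jacobi step over the interior points, and the host loop double-buffers between two copies of the mesh.

    #include <algorithm>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void jacobiStep(const float *h, float *hNew, int n) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n && j > 0 && j < n)    // interior points only; edges stay fixed
            hNew[i * (n + 1) + j] = 0.25f * (h[(i - 1) * (n + 1) + j] + h[(i + 1) * (n + 1) + j] +
                                             h[i * (n + 1) + j - 1] + h[i * (n + 1) + j + 1]);
    }

    int main() {
        const int n = 256, maxIter = 1000;
        const size_t bytes = (n + 1) * (n + 1) * sizeof(float);

        // Both buffers start with the same values so the fixed boundary survives the swaps.
        std::vector<float> host((n + 1) * (n + 1), 0.0f);
        for (int j = 0; j <= n; j++) host[j] = 100.0f;     // e.g. a hot top edge

        float *d_h, *d_hNew;
        cudaMalloc(&d_h, bytes);
        cudaMalloc(&d_hNew, bytes);
        cudaMemcpy(d_h, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_hNew, host.data(), bytes, cudaMemcpyHostToDevice);

        dim3 B(16, 16);
        dim3 G((n + B.x) / B.x, (n + B.y) / B.y);
        for (int iter = 0; iter < maxIter; iter++) {
            jacobiStep<<<G, B>>>(d_h, d_hNew, n);
            std::swap(d_h, d_hNew);                        // double buffering
        }

        cudaMemcpy(host.data(), d_h, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_h);
        cudaFree(d_hNew);
        return 0;
    }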

Heat Distribution Problem For convenience, edges are also represented by points, but they have fixed values and are used in computing the internal values. 6.12

13 Multigrid Method First, a coarse grid of points is used. With these points, the iteration process will start to converge quickly. At some stage, the number of points is increased to include the points of the coarse grid plus extra points between the points of the coarse grid. Initial values of the extra points are found by interpolation. Computation continues with this finer grid. The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids. Coarser grids take distant effects into account more quickly and provide a good starting point for the next finer grid.
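A device-kernel sketch of the interpolation step described above, assuming a factor-of-two refinement from an (n+1) x (n+1) coarse grid to a (2n+1) x (2n+1) fine grid: fine points that coincide with coarse points are copied, the others are averaged from their coarse neighbours. It would be launched over the fine grid, from the host or (with dynamic parallelism) from another kernel, much like the Jacobi kernel above.

    // Prolongation: fill the fine grid from the coarse grid (illustrative names).
    __global__ void prolongate(const float *coarse, float *fine, int n) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;     // fine-grid row
        int j = blockIdx.x * blockDim.x + threadIdx.x;     // fine-grid column
        if (i > 2 * n || j > 2 * n) return;

        int cw = n + 1, fw = 2 * n + 1;                    // row widths of the two grids
        int ci = i / 2, cj = j / 2;
        float v;
        if (i % 2 == 0 && j % 2 == 0)                      // coincides with a coarse point
            v = coarse[ci * cw + cj];
        else if (i % 2 == 0)                               // midway between two coarse points in x
            v = 0.5f * (coarse[ci * cw + cj] + coarse[ci * cw + cj + 1]);
        else if (j % 2 == 0)                               // midway between two coarse points in y
            v = 0.5f * (coarse[ci * cw + cj] + coarse[(ci + 1) * cw + cj]);
        else                                               // centre of a coarse cell
            v = 0.25f * (coarse[ci * cw + cj] + coarse[ci * cw + cj + 1] +
                         coarse[(ci + 1) * cw + cj] + coarse[(ci + 1) * cw + cj + 1]);
        fine[i * fw + j] = v;
    }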

14 Multigrid processor allocation

15 Various strategies V-cycles and W-cycles between resolutions; gradually decreasing resolution. Could make an interesting project. There is a mathematical basis behind the method, and it leads to much faster results than using a single resolution. Grid spacings: h, 2h, 4h, 8h.

Questions

17 More Information on CUDA Dynamic Parallelism Chapter 20 of Programming Massively Parallel Processors, 2nd edition, by D. B. Kirk and W-m W. Hwu, Morgan Kaufmann.