Dynamic Parallelism, by Martin Kruliš (v1.0), 15. 12. 2017

Types of Parallelism
- Task parallelism: the problem is divided into tasks, which are processed independently.
- Data parallelism: the same operation is performed over many data items.
- Pipeline parallelism: data flow through a sequence (or an oriented graph) of stages, which operate concurrently.
- Other types of parallelism: event-driven, …

GPU Execution Model
Parallelism on the GPU:
- Data parallelism: the same kernel is executed by many threads, and each thread processes one data item (see the sketch below).
- Limited task parallelism: multiple kernels may execute simultaneously (since Fermi), but at most as many kernels as there are SMPs.
- However, we have no means of kernel-wide synchronization (barrier), and no guarantee that two blocks or kernels will actually run concurrently.
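A minimal sketch of the data-parallel model (not from the original slides; the kernel name and launch sizes are illustrative), where each thread handles exactly one element:

__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last partial block
        out[i] = 2.0f * in[i];                      // one thread, one data item
}

// host side: one thread per item
// scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);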

Problematic Cases
Problems unsuitable for GPUs:
- Processing irregular data structures (trees, graphs, …)
- Regular structures with irregular processing workload (difficult simulations, iterative approximations)
- Iterative tasks with explicit synchronization
- Pipeline-oriented tasks with many simple stages

Problematic Cases: Solutions
- Iterative kernel execution (see the sketch below): usually applicable only when there are no or few dependencies between subsequent kernels; the state (or most of it) is kept on the GPU.
- Mapping irregular structures to regular grids: may be too fine or too coarse grained, and is not always possible.
- Two-phase algorithms: the first phase determines the amount of work (items, …); the second phase processes the tasks mapped out by the first phase.
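A minimal sketch of the iterative-kernel pattern, assuming a hypothetical step kernel and update rule; the state stays on the GPU and only a small convergence flag crosses the bus between launches:

#include <cuda_runtime.h>
#include <math.h>

__global__ void step(float *state, int n, int *changed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float updated = 0.5f * (state[i] + 1.0f);   // illustrative update rule
        if (fabsf(updated - state[i]) > 1e-6f)
            atomicExch(changed, 1);                 // mark: not yet converged
        state[i] = updated;
    }
}

void iterate(float *d_state, int n) {
    int h_changed, *d_changed;
    cudaMalloc(&d_changed, sizeof(int));
    do {
        cudaMemset(d_changed, 0, sizeof(int));
        step<<<(n + 255) / 256, 256>>>(d_state, n, d_changed);
        cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
    } while (h_changed);                            // relaunch until converged
    cudaFree(d_changed);
}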

Persistent Threads
- A single kernel is executed with only as many thread blocks as there are SMPs.
- The workload is generated and processed on the GPU, using a global queue implemented with atomic operations (see the sketch below).
- Possible problems: the kernel may become quite complex; there is no guarantee that two blocks will ever run concurrently, so a deadlock may be created; and the synchronization overhead is significant (though it might still be lower than host-based synchronization).
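A minimal sketch of the persistent-threads pattern (kernel and variable names are hypothetical); each block's leader claims work items from a global queue with atomicAdd until the queue drains:

__device__ int queueHead = 0;                  // next item to claim; reset from the host

__global__ void persistent(const int *work, int workCount, float *out) {
    __shared__ int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&queueHead, 1);   // block leader claims one item
        __syncthreads();
        if (item >= workCount)
            return;                            // queue drained, the whole block exits
        // ... all threads of the block cooperate on work[item] ...
        __syncthreads();                       // everyone done with item before the next claim
    }
}

// launched with one block per SMP, e.g.:
// int sms; cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
// persistent<<<sms, 256>>>(d_work, n, d_out);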

Dynamic Parallelism
CUDA Dynamic Parallelism:
- A new feature introduced with compute capability 3.5 (Kepler).
- GPU threads are allowed to launch other kernels.

Dynamic Parallelism: Purpose
- The device does not need to synchronize with the host to issue new work to itself.
- Irregular parallelism may be expressed more easily.

Dynamic Parallelism: How It Works
- Portions of the CUDA runtime are ported to the device side: kernel execution, device synchronization, and streams, events, and async memory operations.
- Kernel launches are asynchronous; there is no guarantee that the child kernel starts immediately.
- Synchronization points may cause a context switch; entire blocks are switched out on an SMP.
- Resources have block-wise locality: streams and events are shared within a thread block.
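A minimal sketch of a device-side launch on a per-block stream (kernel names are hypothetical); the device runtime requires streams created with the non-blocking flag, which keeps child grids launched by different blocks independent:

__global__ void child(float *data) { /* ... */ }

__global__ void parent(float *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s;                    // device-side stream, local to this block
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child<<<1, 128, 0, s>>>(data);     // asynchronous child launch
        cudaStreamDestroy(s);              // released once the child work finishes
    }
}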

Dynamic Parallelism: Example

__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        // thread 0 invokes a grid of child threads
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();   // make the child's results visible to the whole block
}

void host_launch(int *data) {
    parent_launch<<<1, 256>>>(data);
}

Notes:
- Thread 0 invokes a grid of child threads.
- Synchronization does not have to be invoked by all threads.
- Device synchronization does not synchronize the threads in the block; the trailing __syncthreads() does that.

Dynamic Parallelism: Compilation Issues
- Requires compute capability 3.5 or higher.
- Relocatable device code must be enabled (-rdc=true).
- Device code is dynamically linked (before host linking) against the cudadevrt library.
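For instance (file names are illustrative), a single-step build and the equivalent separate device-link flow might look like:

# single-step compile and link
nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar

# or separate compilation: device link before host link
nvcc -arch=sm_35 -rdc=true -c dynpar.cu -o dynpar.o
nvcc -arch=sm_35 -dlink dynpar.o -lcudadevrt -o dlink.o
g++ dynpar.o dlink.o -L/usr/local/cuda/lib64 -lcudadevrt -lcudart -o dynpar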

Discussion