Martin Kruliš, 3. 12. 2015 (v1.0)

Presentation transcript:

Martin Kruliš

• Different Types of Parallelism
  ◦ Task parallelism
    ▪ The problem is divided into tasks, which are processed independently
  ◦ Data parallelism
    ▪ The same operation is performed over many data items
  ◦ Pipeline parallelism
    ▪ Data flow through a sequence (or an oriented graph) of stages, which operate concurrently
  ◦ Other types of parallelism
    ▪ Event-driven, …

• Parallelism in GPU
  ◦ Data parallelism
    ▪ The same kernel is executed by many threads
    ▪ Each thread processes one data item (see the sketch below)
  ◦ Limited task parallelism
    ▪ Multiple kernels may execute simultaneously (since Fermi)
    ▪ At most as many kernels as there are SMPs
  ◦ But we do not have
    ▪ Any means of kernel-wide synchronization (barrier)
    ▪ Any guarantee that two blocks/kernels will actually run concurrently
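To make the data-parallel model concrete, here is a minimal sketch of a kernel in which each thread handles exactly one item (the kernel name add_one, the helper function, and the block size are illustrative, not from the slides):

#include <cuda_runtime.h>

// Each thread computes the index of its own data item and processes it.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard threads that fall past the end of the data
        data[i] += 1;
}

// Host side: launch one thread per item, rounded up to whole blocks.
void host_launch_add_one(int *d_data, int n)
{
    int block = 256;
    int grid = (n + block - 1) / block;
    add_one<<<grid, block>>>(d_data, n);
}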

• Unsuitable Problems for GPUs
  ◦ Processing of irregular data structures
    ▪ Trees, graphs, …
  ◦ Regular structures with irregular processing workload
    ▪ Difficult simulations, iterative approximations
  ◦ Iterative tasks with explicit synchronization
  ◦ Pipeline-oriented tasks with many simple stages

• Solutions
  ◦ Iterative kernel execution (see the sketch below)
    ▪ Usually applicable only when there are few or no dependencies between the subsequent kernels
    ▪ The state (or most of it) is kept on the GPU
  ◦ Mapping irregular structures to regular grids
    ▪ May be too fine/coarse grained
    ▪ Not always possible
  ◦ 2-phase algorithms
    ▪ The first phase determines the amount of work (items, …)
    ▪ The second phase processes the tasks mapped out by the first phase
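As an illustration of iterative kernel execution, the following sketch keeps the state on the GPU and copies only a single convergence flag back after each iteration (the step kernel and its relaxation rule are hypothetical stand-ins):

#include <cuda_runtime.h>

// One hypothetical relaxation step; raises *changed if anything was updated.
__global__ void step(float *state, int n, int *changed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && state[i] > 1.0f) {
        state[i] *= 0.5f;
        atomicExch(changed, 1);
    }
}

void iterate(float *d_state, int n)
{
    int *d_changed, h_changed;
    cudaMalloc(&d_changed, sizeof(int));
    do {
        cudaMemset(d_changed, 0, sizeof(int));
        step<<<(n + 255) / 256, 256>>>(d_state, n, d_changed);
        // Only this flag crosses the bus; the state itself stays on the GPU.
        cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
    } while (h_changed);
    cudaFree(d_changed);
}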

• Persistent Threads
  ◦ A single kernel is executed with exactly as many thread blocks as there are SMPs
  ◦ Workload is generated and processed on the GPU
    ▪ Using a global queue implemented with atomic operations (see the sketch below)
  ◦ Possible problems
    ▪ The kernel may become quite complex
    ▪ There is no guarantee that two blocks will ever run concurrently
      ▪ Possibility of creating a deadlock
    ▪ The synchronization overhead is significant
      ▪ But it might still be lower than host-based synchronization
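A minimal persistent-threads sketch, assuming the queue only distributes item indices and the per-item work is a placeholder (all names here are illustrative):

__device__ int queue_head = 0;   // next unclaimed item; reset before relaunching

// Each block loops, atomically claiming one item at a time
// until the global queue is drained.
__global__ void persistent_kernel(float *items, int queue_size)
{
    __shared__ int idx;
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(&queue_head, 1);  // claim one item per block
        __syncthreads();
        if (idx >= queue_size)
            return;                           // queue is empty
        if (threadIdx.x == 0)
            items[idx] *= 2.0f;               // stand-in for real per-item work
        __syncthreads();                      // finish before idx is overwritten
    }
}

// Host side: launch exactly one block per SMP.
void run_persistent(float *d_items, int n)
{
    int sms;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    persistent_kernel<<<sms, 128>>>(d_items, n);
}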

• CUDA Dynamic Parallelism
  ◦ New feature introduced in CC 3.5 (Kepler)
  ◦ GPU threads are allowed to launch other kernels

• Dynamic Parallelism Purpose
  ◦ The device does not need to synchronize with the host to issue new work to the device
  ◦ Irregular parallelism may be expressed more easily

• How It Works
  ◦ Portions of the CUDA runtime are ported to the device side
    ▪ Kernel execution
    ▪ Device synchronization
    ▪ Streams, events, and asynchronous memory operations
  ◦ Kernel launches are asynchronous
    ▪ There is no guarantee that the child kernel starts immediately
    ▪ Synchronization points may cause a context switch
      ▪ Entire blocks are switched out on an SMP
  ◦ Block-wise locality of resources
    ▪ Streams and events are shared within a thread block (see the sketch below)
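A small sketch of block-local device-side streams (the kernel names are assumptions, not from the slides). The device runtime requires streams created on the device to use the cudaStreamNonBlocking flag, and such a stream is valid only within the block that created it:

__global__ void child(int *data) { data[threadIdx.x] += 1; }

__global__ void parent(int *data)
{
    if (threadIdx.x == 0) {
        cudaStream_t s;
        // Device-created streams must be non-blocking.
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child<<<1, 32, 0, s>>>(data);   // asynchronous child launch
        cudaStreamDestroy(s);           // released once the child finishes
    }
}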

• Example

__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        // Thread 0 invokes a grid of child threads;
        // synchronization does not have to be invoked by all threads
        child_launch<<<1, 256>>>(data);
        // Device synchronization does not synchronize threads in the block
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<<1, 256>>>(data);
}

• Compilation Issues
  ◦ Requires CC 3.5 or higher
  ◦ Requires relocatable device code: -rdc=true
  ◦ Dynamic linking (example command below)
    ▪ Device code is linked before host linking
    ▪ Linked with the cudadevrt library
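For illustration, a single-step nvcc invocation that meets these requirements might look as follows (the file and program names are placeholders):

nvcc -arch=sm_35 -rdc=true program.cu -lcudadevrt -o program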
