GPU Computing CIS-543 Lecture 10: Streams and Events


GPU Computing CIS-543 Lecture 10: Streams and Events Dr. Muhammad Abid, DCIS, PIEAS

Introduction
Two levels of concurrency in CUDA C programming:
Kernel-level concurrency: many threads execute the same kernel concurrently on the device. Performance is improved by exploiting the programming model, execution model, and memory model.
Grid-level concurrency: multiple kernels execute simultaneously on a single device, leading to better device utilization. CUDA streams are used to implement grid-level concurrency.

Introduction to Streams
A stream refers to a sequence of CUDA operations that execute on a device in the order issued by the host code. A stream:
encapsulates these operations,
maintains their ordering,
permits operations queued in the stream to execute after all preceding operations have completed, and
allows for querying the status of queued operations.

Introduction to Streams
Operations in a stream can include host-device data transfers, kernel launches, and most other commands that are issued by the host but handled by the device. Operations in different streams have no restriction on execution order. By using multiple streams to launch multiple simultaneous kernels, you can implement grid-level concurrency, as the sketch below shows.
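
To make this concrete, here is a minimal sketch of grid-level concurrency with two streams. The kernel, buffer names, and sizes are placeholders; the host buffers are assumed to be page-locked (allocated with cudaMallocHost), since asynchronous copies require pinned memory:

    // Placeholder kernel; h_a/h_b are pinned host buffers (cudaMallocHost),
    // d_a/d_b are device buffers (cudaMalloc), bytes = N * sizeof(float).
    __global__ void kernel(float *p) { p[threadIdx.x] += 1.0f; }

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Within each stream, operations run in issue order;
    // across s0 and s1 they are free to overlap.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
    kernel<<<1, N, 0, s0>>>(d_a);
    kernel<<<1, N, 0, s1>>>(d_b);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);

On devices with separate copy and compute engines, the copy in one stream can overlap the kernel running in the other.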

Introduction to Streams
In general: concurrency on the device = asynchronous operations + streams.
While from a software point of view CUDA operations in different streams run concurrently, that may not always be the case on the physical hardware: depending on PCIe bus contention or the availability of device resources, different CUDA streams may still need to wait for each other in order to complete.

CUDA Streams
All CUDA operations (both kernels and data transfers) run either explicitly or implicitly in a stream. Two types of streams:
Implicitly declared stream (NULL stream)
Explicitly declared stream (non-NULL stream)
The NULL stream is the default stream that kernel launches and data transfers use if you do not explicitly specify a stream.

CUDA Streams
Non-null streams are explicitly created and managed; overlapping different CUDA operations requires non-null streams. Asynchronous, stream-based kernel launches and data transfers enable the following types of coarse-grain concurrency:
Overlapped host computation and device computation
Overlapped host computation and host-device data transfer

CUDA Streams
Overlapped host-device data transfer and device computation
Concurrent device computation
Issued to the default stream:
cudaMemcpy(..., cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(...);
cudaMemcpy(..., cudaMemcpyDeviceToHost);
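
For contrast, here is a hedged sketch of the same sequence issued to a non-default stream; h_data is assumed to be page-locked so that the copies are truly asynchronous:

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    kernel<<<grid, block, 0, stream>>>(d_data);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
    // The host is free to do other work here while the GPU drains the stream.
    cudaStreamSynchronize(stream);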

CUDA Streams
Creating and destroying streams:
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaStreamDestroy(stream);
Synchronizing with and querying a stream:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);
cudaStreamSynchronize blocks the host until all operations in the stream have completed; cudaStreamQuery checks without blocking, returning cudaSuccess if all operations have completed and cudaErrorNotReady otherwise.
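
A small usage sketch contrasting the non-blocking query with the blocking synchronize (do_host_work is a hypothetical host-side function):

    kernel<<<grid, block, 0, stream>>>(d_data);
    // Poll the stream and overlap host work until the GPU is done...
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        do_host_work();
    }
    // ...or simply block until the stream drains:
    cudaStreamSynchronize(stream);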

CUDA Streams
cudaError_t cudaStreamCreateWithPriority(cudaStream_t* pStream, unsigned int flags, int priority);
cudaError_t cudaDeviceGetStreamPriorityRange(int *leastPriority, int *greatestPriority);
Kernels queued to a higher-priority stream may preempt work already executing in a lower-priority stream. Stream priorities have no effect on data transfer operations, only on compute kernels.
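
A hedged sketch of creating one high-priority and one low-priority stream; note that numerically lower priority values mean higher priority, so greatestPriority is the smallest number returned:

    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPriStream, lowPriStream;
    cudaStreamCreateWithPriority(&highPriStream, cudaStreamDefault, greatestPriority);
    cudaStreamCreateWithPriority(&lowPriStream, cudaStreamDefault, leastPriority);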

Introduction to Streams
Notice: data transfer operations are not executed concurrently, even when issued in separate streams, because they share a common resource: the PCIe bus. Although these operations are independent from the point of view of the programming model, their execution must be serialized because they contend for the same hardware.
Devices with a duplex PCIe bus can overlap two data transfers, but the transfers must be in different streams and in different directions: a host-to-device transfer in one stream can overlap a device-to-host transfer in another.
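
A minimal sketch of the duplex case (placeholder buffers, with the host buffers assumed page-locked):

    // On a duplex PCIe bus, the H2D copy in s0 can overlap the D2H copy in s1.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);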

Stream Scheduling
Conceptually, all streams can run simultaneously. However, as noted above, this is not always the reality when streams are mapped to physical hardware: PCIe bus contention or limited per-SM resources may force streams to wait for each other.

CUDA Events
An event is essentially a marker in a CUDA stream associated with a certain point in the flow of operations in that stream. Applications of events:
Synchronizing stream execution
Monitoring device progress
Measuring elapsed time

CUDA Events The CUDA API provides functions that allow you to insert events at any point in a stream as well as query for event completion. An event recorded on a given stream will only be completed when all preceding operations in the same stream have completed.

CUDA Events
Creating and destroying events:
cudaEvent_t event;
cudaEventCreate(&event);
cudaEventDestroy(event);
Inserting an event into a stream and waiting on it:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);
cudaError_t cudaEventSynchronize(cudaEvent_t event);
cudaError_t cudaEventQuery(cudaEvent_t event);
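
A common use of events is timing device work. A minimal sketch, assuming a kernel launch in stream to be timed:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);            // mark the point before the kernel
    kernel<<<grid, block, 0, stream>>>(d_data);
    cudaEventRecord(stop, stream);             // mark the point after the kernel

    cudaEventSynchronize(stop);                // block until 'stop' has completed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);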

Configurable Events
Customizing the behavior and properties of events:
cudaError_t cudaEventCreateWithFlags(cudaEvent_t* event, unsigned int flags);
Valid flags include:
cudaEventDefault: default behavior
cudaEventBlockingSync: the calling thread goes to sleep
cudaEventDisableTiming: no timing data is recorded
cudaEventInterprocess: inter-process event
The flag cudaEventBlockingSync specifies that synchronizing on this event with cudaEventSynchronize will block the calling thread. The default behavior of cudaEventSynchronize is to spin on the event, using CPU cycles to constantly check the event's status. With cudaEventBlockingSync set, the calling thread instead gives up the core it is running on to another thread or process by going to sleep until the event is satisfied. While this can lead to fewer wasted CPU cycles if other useful work can be done, it can also lead to longer latencies between an event being satisfied and the calling thread being activated.
Passing cudaEventDisableTiming indicates that the created event is used only for synchronization and does not need to record timing data. Removing the overhead of taking timestamps improves the performance of calls to cudaStreamWaitEvent and cudaEventQuery.
The flag cudaEventInterprocess indicates that the created event may be used as an inter-process event.
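
A short sketch of creating an event intended purely for synchronization, combining two of the flags above:

    cudaEvent_t syncEvent;
    // Skip timestamps and let the waiting thread sleep instead of spinning.
    cudaEventCreateWithFlags(&syncEvent,
                             cudaEventDisableTiming | cudaEventBlockingSync);
    cudaEventRecord(syncEvent, stream);
    cudaEventSynchronize(syncEvent);   // calling thread sleeps until the event fires
    cudaEventDestroy(syncEvent);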

Stream Synchronization
Two types of streams with respect to the host:
Asynchronous streams (non-NULL streams)
Synchronous streams (the NULL/default stream)
A non-null stream is an asynchronous stream with respect to the host: operations applied to it do not block host execution. The NULL stream, declared implicitly, is a synchronous stream with respect to the host: most operations added to it cause the host to block on all preceding operations, the main exception being kernel launches.

Stream Synchronization
Two types of non-NULL streams:
Blocking streams
Non-blocking streams
Blocking streams are created using cudaStreamCreate(); the NULL stream can block operations in them, because the NULL stream synchronizes with all other blocking streams in the same CUDA context. Consider:
kernel_1<<<1, 1, 0, stream_1>>>();
kernel_2<<<1, 1>>>();
kernel_3<<<1, 1, 0, stream_2>>>();
If stream_1 and stream_2 are blocking streams, kernel_2 does not start executing on the GPU until kernel_1 completes, and kernel_3 does not start until kernel_2 completes. Note that from the host's perspective, each kernel launch is still asynchronous.

Stream Synchronization
Non-blocking streams are created using cudaStreamCreateWithFlags() with the cudaStreamNonBlocking flag; the NULL stream cannot block operations in them. If stream_1 and stream_2 above are non-blocking streams, none of the kernel executions is blocked:
kernel_1<<<1, 1, 0, stream_1>>>();
kernel_2<<<1, 1>>>();
kernel_3<<<1, 1, 0, stream_2>>>();
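
A one-line sketch of creating such a stream:

    cudaStream_t stream_1;
    // Work in this stream does not synchronize with the NULL stream.
    cudaStreamCreateWithFlags(&stream_1, cudaStreamNonBlocking);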

Inter-stream dependencies
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags);
cudaStreamWaitEvent causes the specified stream to wait on the specified event before executing any operations queued in the stream after the call to cudaStreamWaitEvent. The event may be associated with either the same stream or a different stream. (The flags argument is passed as 0 here.)
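
A hedged sketch of making stream_2 wait at a specific point in stream_1 without blocking the host (producer_kernel and consumer_kernel are hypothetical):

    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    producer_kernel<<<grid, block, 0, stream_1>>>(d_data);
    cudaEventRecord(ev, stream_1);         // mark the completion point in stream_1

    cudaStreamWaitEvent(stream_2, ev, 0);  // stream_2 waits for ev on the device
    consumer_kernel<<<grid, block, 0, stream_2>>>(d_data);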

Stream Callbacks
A stream callback is another type of operation that can be queued to a CUDA stream. After all preceding operations in the stream have completed, a host-side function specified by the stream callback is called by the CUDA runtime. This function is provided by the application. Stream callbacks:
allow arbitrary host-side logic to be inserted into CUDA streams,
are another CPU-GPU synchronization mechanism, and
are a first example of GPU operations creating work on the host.

Stream Callbacks
Queuing/registering a callback host function in a stream:
cudaError_t cudaStreamAddCallback(cudaStream_t stream, cudaStreamCallback_t callback, void *userData, unsigned int flags);
A callback is executed only once per cudaStreamAddCallback call, and blocks other work queued in the stream after it until the callback function completes.
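
A minimal registration sketch (the callback body and userData string are placeholders):

    #include <stdio.h>

    // Signature required by cudaStreamCallback_t.
    void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status,
                               void *userData) {
        printf("stream done, status = %d, tag = %s\n",
               (int)status, (const char *)userData);
    }

    kernel<<<grid, block, 0, stream>>>(d_data);
    // flags is reserved and must be 0; my_callback runs on a CUDA runtime
    // thread after the kernel above completes.
    cudaStreamAddCallback(stream, my_callback, (void *)"batch 0", 0);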

Stream Callbacks
Two restrictions on callback functions:
No CUDA API function can be called from a callback function.
No synchronization can be performed within the callback function.