CUDA Execution Model – III Streams and Events


CUDA Execution Model – III Streams and Events ©Sudhakar Yalamanchili unless otherwise noted

Objectives
- Become familiar with CUDA streams
  - Usage to overlap GPU computation and data transfers
  - Usage to improve GPU utilization: multi-kernel execution on the GPU (on later devices)
- Become familiar with CUDA events
  - Usage in timing of GPU kernel execution

Reading Assignment
- NVIDIA presentation and links on the class Resources page; this presentation is in part a summary of that NVIDIA presentation
- Example programs (see the class Resources page)
- CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

Offloading Activities via Streams
[Figure: the host processor issues CUDA API operations through streams; streams map to hardware queues feeding the kernel management unit on the device, which dispatches kernels to the SMs]
- CUDA runtime commands issued by the host are placed into streams
- Operationally, a stream is a FIFO structure: commands in the same stream execute sequentially, while commands in different streams may execute concurrently
- Streams are mapped to hardware queues in the device; mapping multiple streams to the same queue serializes their commands
- The default stream is stream 0; it need not be explicitly specified

Single Stream Synchronization Model
[Figure: the host processor enqueues cudaMalloc(), cudaMemcpy(), and Kernel<<<>>>() calls into a single CUDA stream feeding the GPU]
- Synchronous API calls: enqueue the call into the stream and wait for completion
- Asynchronous API calls: enqueue the call into the stream and return; host execution is overlapped with call execution
- Kernel calls are asynchronous
- How do we know when all calls have completed?

Single Stream Synchronization Model
[Figure: the host processor enqueues cudaMalloc(), cudaMemcpy(), and Kernel<<<>>>() into a stream, then calls cudaStreamSynchronize()]
- cudaStreamSynchronize(stream): the host waits until the stream is empty
- The performance gain comes primarily from host/device overlap, as in the sketch below
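A minimal sketch of this pattern, assuming hypothetical names (myKernel, d_data, N, numBlocks, threadsPerBlock, doIndependentHostWork are placeholders, not names from the example programs):

// Host/device overlap on a single user-created stream
cudaStream_t stream;
cudaStreamCreate(&stream);

myKernel<<<numBlocks, threadsPerBlock, 0, stream>>>(d_data, N);  // asynchronous: returns to the host immediately

doIndependentHostWork();              // CPU work overlaps with the kernel execution

cudaStreamSynchronize(stream);        // host blocks until all work in the stream has completed
cudaStreamDestroy(stream);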

Creating Streams
cudaStream_t stream;         // declares a stream handle
cudaStreamCreate(&stream);   // allocates a stream
cudaStreamDestroy(stream);   // deallocates a stream; its resources are released once outstanding work in the stream has completed
The stream is an argument to API calls and kernel launches, e.g. kernel<<<blocks, threads, 0, stream>>>(); (the third launch parameter is the dynamic shared memory size). Vector addition with a stream is sketched below.
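A minimal sketch of vector addition launched in a user-created stream (the kernel and the host/device array names h_a/h_b/h_c, d_a/d_b/d_c are illustrative assumptions, not the repository code):

// Kernel: one thread per element
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

// Host side (inside some function; d_a, d_b, d_c and h_a, h_b, h_c are
// device and host arrays of N floats allocated elsewhere):
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

vecAdd<<<(N + 255) / 256, 256, 0, stream>>>(d_a, d_b, d_c, N);   // launch in the user-created stream

cudaStreamSynchronize(stream);                                    // wait for the kernel
cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);

cudaStreamDestroy(stream);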

CUDA Events
[Figure: the host processor enqueues cudaMalloc(), cudaMemcpy(), Kernel<<<>>>(), and events into a CUDA stream feeding the GPU]
- We cannot get accurate timing information for asynchronous calls with CPU timers; timing needs to be performed on the GPU
- Create events that record timestamps
- Place the events into the stream
- Timestamps are recorded on the GPU when the event is processed by the GPU

CUDA Events
[Figure: events placed around the kernel launch in the stream]
// Create events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Place events in the stream
cudaEventRecord(start, stream);
...   // work to be timed, e.g. a kernel launch in the same stream
cudaEventRecord(stop, stream);
// Ensure the stop event has occurred
cudaEventSynchronize(stop);
// Measure elapsed time (in ms) and free the events
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

Event Synchronization
[Figure: the host processor records an event with cudaEventRecord(); the GPU processes it in stream order]
- Need to ensure that this event has executed before reading the counters
- Use cudaEventSynchronize() on the host
- Code example on the next slide

Example Code Segment
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

See the vector add example in the repository.
Example from https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/

Asynchronous Communication
[Figure: the host processor enqueues cudaMalloc(), cudaMemcpy(), Kernel<<<>>>(), and cudaStreamSynchronize() into a stream feeding the GPU]
- Like kernel calls, we want communication to be asynchronous
- Use DMA engines that operate concurrently with the host
- We would like to launch a cudaMemcpy() and let the host continue
- DMA engines operate on physical addresses, while the host uses virtual memory and paged address spaces, so asynchronous copies must operate on pinned memory

Using Pinned Memory
[Figure: pageable vs. pinned host memory pages; a DMA transfer moves pinned pages between host DRAM and the GPU]
- Pageable vs. pinned memory: pinned pages avoid page faults and can be addressed physically, enabling direct memory access (DMA)
- Allocate pinned memory with cudaHostAlloc()
- Overlap DMA with host execution
- Pinned memory is required by cudaMemcpyAsync(dst, src, size, dir, stream)
- Code example: the single-stream vector add from the repository (template on the next slide)

Code Template
.....
cudaStreamCreate(&stream1);                              // create the stream
cudaMallocHost(&a, sizeof(int) * size);                  // pin host memory for each array
cudaMalloc(&dev_a, sizeof(int) * size);                  // allocate device memory
cudaMemcpyAsync( , , , , stream1);                       // asynchronous copy call
vecAdd<<<GRID_SIZE, CTA_SIZE, 0, stream1>>>( );          // kernel call in the stream
......
cudaDeviceSynchronize();                                 // wait for the kernel to complete
cudaMemcpyAsync( , , , stream1);                         // copy back data
cudaStreamDestroy(stream1);                              // remove the stream
cudaFree(dev_a); cudaFreeHost(a);                        // free device and host memory
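A fuller sketch of the same single-stream pattern, assuming a hypothetical one-array kernel incKernel; unlike the bare template above, it waits on the stream before freeing the pinned buffer:

// Single stream with pinned host memory and asynchronous copies (error checking omitted)
int *a, *dev_a;
int size = 1 << 20;
cudaStream_t stream1;

cudaStreamCreate(&stream1);
cudaMallocHost(&a, sizeof(int) * size);      // pinned host allocation
cudaMalloc(&dev_a, sizeof(int) * size);      // device allocation

cudaMemcpyAsync(dev_a, a, sizeof(int) * size, cudaMemcpyHostToDevice, stream1);
incKernel<<<(size + 255) / 256, 256, 0, stream1>>>(dev_a, size);   // hypothetical kernel
cudaMemcpyAsync(a, dev_a, sizeof(int) * size, cudaMemcpyDeviceToHost, stream1);

cudaStreamSynchronize(stream1);              // wait for copies and kernel before touching or freeing a
cudaStreamDestroy(stream1);
cudaFree(dev_a);
cudaFreeHost(a);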

Performance Optimizations
With asynchronous operation, how can we improve performance? Multiple streams! Pipelining and concurrency across communication and computation.

Pipelined Operation: The Opportunity
[Figure: pipeline of chunks i on the GPU, each passing through an H2Di copy, kernel Ki, and D2Hi copy stage, overlapped across stages using pinned memory]
- Pipelining execution of stream commands
- Need for asynchrony between communication and computation
- Degree of concurrency depends on the GPU family, i.e., compute capability

Pipelined Operation: A Functional View
[Figure: H2D copy, kernel K, and D2H copy stages operating on pinned memory]
- Break up a single problem into multiple chunks, or
- Exploit concurrency across multiple kernels

Multiple Streams
[Figure: the host processor enqueues cudaMalloc(), cudaMemcpy(), and Kernel<<<>>>() calls into Stream 0 and Stream 1, both feeding the GPU]
Let's look at the vector add implementation parallelized across streams.

Stream Level Concurrency
[Figure: vectors a, b, and c divided into chunks a0/a1, b0/b1, c0/c1; chunk 0 is processed by kernel K0 in Stream 0 and chunk 1 by kernel K1 in Stream 1 on the GPU]
- Break the vectors up into chunks: a0, a1, b0, b1, c0, etc.
- Compute the vector sum for each chunk separately, e.g., c0[i] = a0[i] + b0[i] across the chunk
- Interleave chunk computations across the two streams, with a kernel launch for each chunk
- How do we distribute API calls across streams?
- Code example: a simple, straightforward version that does not improve performance much is sketched below
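A sketch of that straightforward version, issued stream-at-a-time. Here vecAdd is the kernel from earlier; h_a/h_b/h_c are pinned host arrays, d_a0/d_b0/d_c0 and d_a1/d_b1/d_c1 are per-stream device buffers of CHUNK elements, stream0/stream1 were created earlier, and FULL_SIZE is assumed to be a multiple of 2*CHUNK (all names are illustrative, not the repository code):

// Chunked vector add across two streams, issued stream by stream:
// "copy a0, copy b0, k0, copy c0" for stream 0, then the same for stream 1
for (int i = 0; i < FULL_SIZE; i += 2 * CHUNK) {
    // stream 0 works on the chunk starting at i
    cudaMemcpyAsync(d_a0, h_a + i, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_b0, h_b + i, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<(CHUNK + 255) / 256, 256, 0, stream0>>>(d_a0, d_b0, d_c0, CHUNK);
    cudaMemcpyAsync(h_c + i, d_c0, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream0);

    // stream 1 works on the chunk starting at i + CHUNK
    cudaMemcpyAsync(d_a1, h_a + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_b1, h_b + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<(CHUNK + 255) / 256, 256, 0, stream1>>>(d_a1, d_b1, d_c1, CHUNK);
    cudaMemcpyAsync(h_c + i + CHUNK, d_c1, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);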

Using Multiple Streams
- Hardware data transfer engines and the kernel scheduling queues are distinct entities
- Must maintain correct ordering of API calls between data transfer and kernel execution
- They are placed in two different hardware queues, so be conservative!
- The driver sees a sequence of calls: does the sequence matter?

Maximizing Concurrency
[Figure: two host call orderings and how they fill the hardware copy queue and kernel dispatch queue over time]
- Ordering 1 (stream at a time): copy a0, copy b0, kernel call k0, copy c0, copy a1, copy b1, kernel call k1, copy c1, ...
- Ordering 2 (interleaved): copy a0, copy a1, copy b0, copy b1, kernel call k0, kernel call k1, copy c0, copy c1, ...
- The application issues calls into Stream 0 and Stream 1, but the CUDA driver sees a single serial sequence of calls
- Calls are placed into separate hardware queues (a copy queue and a kernel dispatch queue); dependencies between queued operations must be honored, which can serialize the stream-at-a-time ordering, as sketched below
Example from "CUDA by Example," J. Sanders and E. Kandrot, Chapters 10.6-10.7
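A sketch of the interleaved (breadth-first) issue order, using the same illustrative names as the chunked sketch above; this follows the spirit of the "CUDA by Example" Chapter 10.7 version:

// Breadth-first issue order across the two streams, which avoids the false
// serialization that the stream-at-a-time ordering can cause in the hardware queues
for (int i = 0; i < FULL_SIZE; i += 2 * CHUNK) {
    cudaMemcpyAsync(d_a0, h_a + i,         CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_a1, h_a + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_b0, h_b + i,         CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_b1, h_b + i + CHUNK, CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream1);

    vecAdd<<<(CHUNK + 255) / 256, 256, 0, stream0>>>(d_a0, d_b0, d_c0, CHUNK);
    vecAdd<<<(CHUNK + 255) / 256, 256, 0, stream1>>>(d_a1, d_b1, d_c1, CHUNK);

    cudaMemcpyAsync(h_c + i,         d_c0, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_c + i + CHUNK, d_c1, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);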

Lessons
- There are many subtle performance issues in using multiple streams
- The number and form of stream support depends on the GPU generation
- Check device properties for more information on GPU capabilities (see the sketch below)
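A minimal sketch of such a device-property check; the printed fields are one reasonable selection from the runtime API's cudaDeviceProp, not an exhaustive list:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // query device 0
    printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Copy/compute overlap supported: %d\n", prop.deviceOverlap);
    printf("Async copy engines:             %d\n", prop.asyncEngineCount);
    printf("Concurrent kernels supported:   %d\n", prop.concurrentKernels);
    return 0;
}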

Questions?