CUDA Streams
These notes will introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations. Also introduced is page-locked memory.



Page-locked host memory (also called pinned host memory)
Page-locked memory is not paged in and out of main memory by the OS; it remains resident in physical memory. This allows:
- Concurrent host/device memory transfers with kernel operations (compute capability 2.x) – see the following slides
- Host memory to be mapped into the device address space (compute capability > 1.0) – a sketch follows below
- Higher memory bandwidth, since transfers use real (physical) addresses rather than virtual addresses and need no intermediate copy buffering
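To illustrate the mapped (zero-copy) case in the list above, here is a minimal sketch using the standard CUDA runtime calls; the function name map_host_buffer, the buffer names h_data and d_data, and the size N are illustrative, not from the original notes.

    #include <cuda_runtime.h>

    // Hedged sketch: map page-locked host memory into the device address space.
    void map_host_buffer(size_t N) {
        float *h_data, *d_data;
        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping; call before the CUDA context is created
        cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
        // d_data can now be passed to a kernel, which reads/writes h_data directly
        cudaFreeHost(h_data);
    }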

Note on using page-locked memory
Using page-locked memory reduces the memory available to the OS for paging, so be careful not to over-allocate it.

Allocating page-locked memory
cudaMallocHost(void **ptr, size_t size)
Allocates page-locked host memory that is accessible to the device.
cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
Also allocates page-locked host memory that is accessible to the device, with additional options selected through its flags argument.
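A minimal allocation sketch showing both calls; the buffer name h_buf and the size N are illustrative.

    #include <cuda_runtime.h>

    void allocate_pinned(size_t N) {
        float *h_buf;
        cudaMallocHost((void **)&h_buf, N * sizeof(float));                       // simple form
        cudaFreeHost(h_buf);                                                      // page-locked memory is freed with cudaFreeHost
        cudaHostAlloc((void **)&h_buf, N * sizeof(float), cudaHostAllocDefault);  // form with a flags argument
        cudaFreeHost(h_buf);                                                      // same free call for both forms
    }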

CUDA Streams
A CUDA stream is a sequence of operations (commands) that are executed in order. Multiple CUDA streams can be created and executed together, with their operations interleaved, although "program order" is always maintained within each stream. Streams provide a mechanism to overlap memory transfers and computation in different streams for increased performance, if sufficient resources are available.

Creating a stream
Done by creating a stream object and associating it with a series of CUDA commands, which then become the stream. CUDA commands take a stream identifier as an argument:

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaMemcpyAsync(..., stream1);
    MyKernel<<<grid, block, 0, stream1>>>(...);   // third launch argument is the shared memory size
    cudaMemcpyAsync(..., stream1);

Regular cudaMemcpy cannot be used with streams; the asynchronous commands are needed for concurrent operation (see the next slide).
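A stream created this way should eventually be drained and released; a minimal sketch of the standard cleanup calls, continuing with stream1 from above:

    cudaStreamSynchronize(stream1);   // block the host until all work queued in stream1 has finished
    cudaStreamDestroy(stream1);       // release the stream object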

cudaMemcpyAsync(..., stream)
Asynchronous version of cudaMemcpy that copies data to/from the host and the device. May return before the copy is complete. Takes a stream argument, and requires "page-locked" host memory.
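Because the call may return before the copy is complete, the host should not reuse the buffer until the stream has caught up. A minimal sketch of a non-blocking check, using the standard cudaStreamQuery call with stream1 as above:

    // Non-blocking check: has everything issued to stream1 finished yet?
    if (cudaStreamQuery(stream1) == cudaSuccess) {
        // safe to reuse the host buffer involved in the copy
    }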

Simply concatenating the statements of each stream (issuing all of stream 1's commands, then all of stream 2's) does not work well because of the way the GPU schedules work: the copy engine and the kernel engine each process their queues in issue order, so one stream's queued operations can block the other's from overlapping. (See CUDA by Example, page 206.)

(Figure: see CUDA by Example, page 207.)

(Figure: see CUDA by Example, page 208.)

Interleave statements of each stream

    for (int i = 0; i < SIZE; i += N*2) {   // loop over data in chunks
        // interleave stream 1 and stream 2
        cudaMemcpyAsync(dev_a1, a+i,   N*sizeof(int), cudaMemcpyHostToDevice, stream1);
        cudaMemcpyAsync(dev_a2, a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
        cudaMemcpyAsync(dev_b1, b+i,   N*sizeof(int), cudaMemcpyHostToDevice, stream1);
        cudaMemcpyAsync(dev_b2, b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
        kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
        kernel<<<N/256, 256, 0, stream2>>>(dev_a2, dev_b2, dev_c2);
        cudaMemcpyAsync(c+i,   dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1);
        cudaMemcpyAsync(c+i+N, dev_c2, N*sizeof(int), cudaMemcpyDeviceToHost, stream2);
    }
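After the loop, both streams still have queued work, so the host must synchronize before using the results in c; a short sketch of the standard calls:

    cudaStreamSynchronize(stream1);   // block until stream 1 has drained
    cudaStreamSynchronize(stream2);   // block until stream 2 has drained
    // the results in c are now safe to read on the host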

(Figure: see CUDA by Example, page 210.)

Questions