CUDA Streams. Presented by Savitha Parur Venkitachalam.

Page-locked memory / pinned memory
malloc() was used to allocate memory on the host; malloc() allocates pageable host memory. cudaHostAlloc() allocates a buffer of page-locked memory:
cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
cudaFreeHost( a );
Page-locked memory guarantees that the data will reside in physical memory, i.e. the OS will never page this memory out to disk.

When using pageable memory (malloc()), the CPU first copies the data from the pageable buffer into a page-locked staging buffer, and the GPU then uses direct memory access (DMA) to copy the data to or from that page-locked buffer; so with malloc() every transfer involves two copies. When using page-locked memory (cudaHostAlloc()), the first copy is not needed. Page-locked memory is fast, but it occupies physical memory that cannot be swapped to disk, so its use should be restricted or the system may run out of memory.
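The cost of the extra staging copy can be observed directly by timing the same host-to-device transfer from a pageable and a pinned buffer. A minimal sketch (the 64 MB buffer size and the use of CUDA events for timing are my additions, not from the slides):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define SIZE (64 * 1024 * 1024)   /* 64 MB transfer */

/* Time one host-to-device copy from the given host buffer, in ms. */
static float timeCopy(void *dev, void *host) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, SIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, SIZE);
    pageable = malloc(SIZE);                          /* pageable host memory */
    cudaHostAlloc(&pinned, SIZE, cudaHostAllocDefault); /* page-locked memory */

    printf("pageable: %.2f ms\n", timeCopy(dev, pageable));
    printf("pinned:   %.2f ms\n", timeCopy(dev, pinned));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```

On typical PCIe systems the pinned transfer runs noticeably faster, since the driver can DMA from the user buffer directly instead of staging through its own page-locked buffer first.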

CUDA Streams
Streams introduce task parallelism and play an important role in accelerating applications. A CUDA stream represents a queue of GPU operations that execute in a specific order: the order in which operations are added to a stream is the order in which they will be executed.

Steps – using one stream
The device should support the property 'device overlap'. Use cudaGetDeviceProperties( &prop, device ) to check whether the device supports device overlap:
cudaDeviceProp prop;
int whichDevice;
HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
if (!prop.deviceOverlap) {
    printf( "Device will not handle overlaps" );
    return 0;
}
A GPU that supports device overlap can execute a kernel while performing a copy between device and host memory.

Create the stream using cudaStreamCreate():
// initialize and create the stream
cudaStream_t stream;
HANDLE_ERROR( cudaStreamCreate( &stream ) );
Allocate memory on the host and on the GPU:
// device memory on the GPU
HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N*sizeof(int) ) );
// page-locked memory on the host
HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
Copy the data from CPU to GPU using cudaMemcpyAsync(). When the call returns, there is no guarantee that the copy has completed:
HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream ) );

Kernel launch:
kernel<<< N/256, 256, 0, stream >>>( dev_a, dev_b, dev_c );
Copy the data back from the device to the page-locked host memory:
HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost, stream ) );
Stream synchronization: wait for the stream to finish:
HANDLE_ERROR( cudaStreamSynchronize( stream ) );
Free the allocated memory and destroy the stream:
cudaFreeHost( host_a );
cudaFree( dev_a );
cudaStreamDestroy( stream );
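The steps above can be consolidated into one single-stream sketch, modeled on the CUDA by Example code the slides quote. The chunk size N, FULL_DATA_SIZE, the HANDLE_ERROR macro body, and the averaging kernel are assumptions for illustration, not taken from the slides:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N (1024 * 1024)           /* elements per chunk (assumed) */
#define FULL_DATA_SIZE (N * 20)   /* total elements (assumed)     */

#define HANDLE_ERROR(call) do {                              \
    cudaError_t err = (call);                                \
    if (err != cudaSuccess) {                                \
        fprintf(stderr, "%s\n", cudaGetErrorString(err));    \
        exit(1);                                             \
    } } while (0)

/* Placeholder kernel: averages elements of a and b (an assumption). */
__global__ void kernel(int *a, int *b, int *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) c[idx] = (a[idx] + b[idx]) / 2;
}

int main(void) {
    cudaStream_t stream;
    int *host_a, *host_b, *host_c, *dev_a, *dev_b, *dev_c;

    HANDLE_ERROR( cudaStreamCreate( &stream ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault ) );

    /* Process the full data set one chunk at a time through the stream. */
    for (int i = 0; i < FULL_DATA_SIZE; i += N) {
        HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a + i, N * sizeof(int), cudaMemcpyHostToDevice, stream ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b, host_b + i, N * sizeof(int), cudaMemcpyHostToDevice, stream ) );
        kernel<<< N / 256, 256, 0, stream >>>( dev_a, dev_b, dev_c );
        HANDLE_ERROR( cudaMemcpyAsync( host_c + i, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost, stream ) );
    }
    HANDLE_ERROR( cudaStreamSynchronize( stream ) );  /* wait for all queued work */

    cudaFreeHost( host_a ); cudaFreeHost( host_b ); cudaFreeHost( host_c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    cudaStreamDestroy( stream );
    return 0;
}
```

Because all operations go into one stream, each loop iteration's copy-kernel-copy sequence still executes in order; the benefit of async copies only appears once a second stream is added.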

Multiple Streams
Kernels and memory copies can be performed concurrently as long as they are issued in different streams. Some GPU architectures also support concurrent memory copies when the copies are in opposite directions. This concurrency across multiple streams improves performance.

Execution timeline for 2 streams

GPU Work Scheduling
The hardware has no notion of streams. It has separate engines: one (or more) to perform memory copies and one to execute kernels. Each engine maintains its own queue of commands, and work is scheduled in the order it arrives in these queues. When using multiple streams, the order in which the program enqueues operations therefore affects performance.

GPU Scheduling (depth-first enqueue order)
Copy engine queue:
Stream 0: memcpy A
Stream 0: memcpy B
Stream 0: memcpy C
Stream 1: memcpy A
Stream 1: memcpy B
Stream 1: memcpy C
Kernel engine queue:
Kernel 0
Kernel 1

More efficient way: enqueue the work breadth-first, alternating operations between the streams (Stream 0: memcpy A, Stream 1: memcpy A, Stream 0: memcpy B, ...), so that one stream's copies can overlap the other stream's kernel execution.
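The breadth-first issue order can be sketched as follows. Each stream gets its own set of device buffers, and the loop alternates every operation between the two streams so the copy engine and kernel engine queues interleave. As before, N, FULL_DATA_SIZE, and the kernel body are assumptions for illustration (error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

#define N (1024 * 1024)           /* elements per chunk (assumed) */
#define FULL_DATA_SIZE (N * 20)   /* total elements (assumed)     */

/* Placeholder kernel: averages elements of a and b (an assumption). */
__global__ void kernel(int *a, int *b, int *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) c[idx] = (a[idx] + b[idx]) / 2;
}

int main(void) {
    cudaStream_t stream0, stream1;
    cudaStreamCreate( &stream0 );
    cudaStreamCreate( &stream1 );

    /* One set of device buffers per stream; pinned host buffers for async copies. */
    int *dev_a0, *dev_b0, *dev_c0, *dev_a1, *dev_b1, *dev_c1;
    int *host_a, *host_b, *host_c;
    cudaMalloc( (void**)&dev_a0, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b0, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c0, N * sizeof(int) );
    cudaMalloc( (void**)&dev_a1, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b1, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c1, N * sizeof(int) );
    cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
    cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
    cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );

    /* Breadth-first enqueue: alternate each operation between the streams. */
    for (int i = 0; i < FULL_DATA_SIZE; i += N * 2) {
        cudaMemcpyAsync( dev_a0, host_a + i,     N * sizeof(int), cudaMemcpyHostToDevice, stream0 );
        cudaMemcpyAsync( dev_a1, host_a + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1 );
        cudaMemcpyAsync( dev_b0, host_b + i,     N * sizeof(int), cudaMemcpyHostToDevice, stream0 );
        cudaMemcpyAsync( dev_b1, host_b + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1 );
        kernel<<< N / 256, 256, 0, stream0 >>>( dev_a0, dev_b0, dev_c0 );
        kernel<<< N / 256, 256, 0, stream1 >>>( dev_a1, dev_b1, dev_c1 );
        cudaMemcpyAsync( host_c + i,     dev_c0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0 );
        cudaMemcpyAsync( host_c + i + N, dev_c1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1 );
    }
    cudaStreamSynchronize( stream0 );
    cudaStreamSynchronize( stream1 );

    cudaFreeHost( host_a ); cudaFreeHost( host_b ); cudaFreeHost( host_c );
    cudaFree( dev_a0 ); cudaFree( dev_b0 ); cudaFree( dev_c0 );
    cudaFree( dev_a1 ); cudaFree( dev_b1 ); cudaFree( dev_c1 );
    cudaStreamDestroy( stream0 );
    cudaStreamDestroy( stream1 );
    return 0;
}
```

With this issue order, stream 1's host-to-device copies can proceed on the copy engine while stream 0's kernel runs on the kernel engine, instead of all of stream 0's copies blocking the copy queue first.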

References
CUDA by Example - Jason Sanders, Edward Kandrot
sAndConcurrencyWebinar.pdf

Questions?