CS/EE 217 GPU Architecture and Parallel Programming Lecture 17: Data Transfer and CUDA Streams.

Objective
To learn more advanced features of the CUDA APIs for data transfer and kernel launch:
– Task parallelism for overlapping data transfer with kernel computation
– CUDA streams

Serialized Data Transfer and GPU Computation
So far, the way we use cudaMemcpy serializes data transfer and GPU computation. The timeline looks like:
Trans. A → Trans. B → Vector Add → Transfer C
During each transfer, only one PCIe direction is in use and the GPU is idle; during the kernel computation, the PCIe bus is idle.
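As a baseline, this is roughly what the serialized pattern looks like in host code (a sketch: the vecAdd kernel signature, n, and the 256-thread block size are assumptions for illustration):

cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);  // Transfer A
cudaMemcpy(d_B, h_B, n * sizeof(float), cudaMemcpyHostToDevice);  // Transfer B
vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);               // Vector Add
cudaMemcpy(h_C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost);  // Transfer C

Each call finishes (or, for the kernel, is waited on by the following copy) before the next begins, which produces the idle gaps described above.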

Device Overlap
Some CUDA devices support device overlap: simultaneously executing a kernel while performing a copy between device and host memory.

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) ...
}
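Note that recent CUDA releases deprecate prop.deviceOverlap in favor of prop.asyncEngineCount, which also tells you how many copy engines the device has; a sketch:

if (prop.asyncEngineCount > 0) {
    // 1: one copy can overlap kernel execution;
    // 2: copies in both directions can overlap kernel execution.
}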

Overlapped (Pipelined) Timing
Divide large vectors into segments; overlap the transfer and compute of adjacent segments:
Trans A.1, B.1 → Comp C.1 = A.1 + B.1 (while Trans A.2, B.2) → Comp C.2 = A.2 + B.2 (while Trans C.1, A.3, B.3) → Comp C.3 = A.3 + B.3 (while Trans C.2, A.4, B.4) → …
While segment i is computed, segment i−1's result is copied back and segment i+1's inputs are copied in.

Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and cudaMemcpy with "streams".
Each stream is a queue of operations (kernel launches and cudaMemcpy's).
Operations (tasks) in different streams can go in parallel: "task parallelism".
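For instance, operations enqueued into one stream execute in order relative to each other, but asynchronously with respect to the host (a sketch; vecAdd, the buffers, and n are placeholders):

cudaStream_t s;
cudaStreamCreate(&s);
cudaMemcpyAsync(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice, s);  // op 1
cudaMemcpyAsync(d_B, h_B, n * sizeof(float), cudaMemcpyHostToDevice, s);  // op 2
vecAdd<<<(n + 255) / 256, 256, 0, s>>>(d_A, d_B, d_C, n);                 // op 3, after 1 and 2
cudaMemcpyAsync(h_C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost, s);  // op 4, after 3
cudaStreamSynchronize(s);  // wait before the host reads h_C
cudaStreamDestroy(s);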

Streams
Device requests made from the host code are put into a queue:
– The queue is read and processed asynchronously by the driver and device.
– The driver ensures that commands in the queue are processed in sequence: memory copies end before kernel launches, etc.
(Diagram: the host thread pushes cudaMemcpy calls, kernel launches, and sync requests into a FIFO that the device driver consumes.)

Streams (cont.)
To allow concurrent copying and kernel execution, you need to use multiple queues, called "streams".
– CUDA "events" allow the host thread to query and synchronize with the individual queues.
(Diagram: the host thread feeds two FIFOs, Stream 1 and Stream 2, read by the device driver; an event marks a point in a stream.)
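A minimal hedged sketch of the event API, reusing the vecAdd placeholders from the examples in this lecture:

cudaEvent_t done;
cudaEventCreate(&done);
vecAdd<<<(n + 255) / 256, 256, 0, stream1>>>(d_A, d_B, d_C, n);  // work queued in stream1
cudaEventRecord(done, stream1);  // marker enqueued after the kernel
// ... host does other work; cudaEventQuery(done) polls without blocking ...
cudaEventSynchronize(done);      // block the host until stream1 passes the marker
cudaEventDestroy(done);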

Conceptual View of Streams
(Diagram: two streams, each a queue of operations (kernels, MemCpys). Stream 0: MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1. Stream 1: MemCpy A.2, MemCpy B.2, Kernel 2, MemCpy C.2. The hardware exposes a copy engine (PCIe up / PCIe down) and a kernel engine that consume these operations.)

A Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0;  // device memory for stream 0
float *d_A1, *d_B1, *d_C1;  // device memory for stream 1

// cudaMalloc calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

continued

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(…);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);

A Simple Multi-Stream Host Code (Cont.)

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, …);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, …);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
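One requirement the slides leave implicit: cudaMemcpyAsync only overlaps copies with computation when the host buffers are page-locked (pinned); with ordinary pageable memory the copy falls back to synchronous behavior. A sketch of the allocation (the buffer names are the ones assumed above):

float *h_A, *h_B, *h_C;
cudaMallocHost((void**)&h_A, n * sizeof(float));  // pinned host memory
cudaMallocHost((void**)&h_B, n * sizeof(float));
cudaMallocHost((void**)&h_C, n * sizeof(float));
// ... transfers and kernels ...
cudaFreeHost(h_A); cudaFreeHost(h_B); cudaFreeHost(h_C);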

A View Closer to Reality
(Diagram: the operations (kernels, MemCpys) from both streams multiplex onto two hardware queues. Copy engine (PCIe up / PCIe down): MemCpy A.1, MemCpy B.1, MemCpy C.1, MemCpy A.2, MemCpy B.2, MemCpy C.2. Kernel engine: Kernel 1, Kernel 2.)

Not Quite the Overlap We Want
C.1 blocks A.2 and B.2 in the copy engine queue:
Copy engine:   Trans A.1, Trans B.1, Trans C.1 (must wait for Comp C.1), Trans A.2, Trans B.2
Kernel engine: Comp C.1 = A.1 + B.1, then Comp C.2 = A.2 + B.2 only after A.2 and B.2 arrive
Segment 2's transfers cannot start until C.1 finishes, so Kernel 2 is pushed back accordingly.

A Better Multi-Stream Host Code

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, …);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, …);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
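After the loop, the host must wait for both streams to drain before reading h_C; a hedged sketch of the epilogue:

cudaStreamSynchronize(stream0);  // all queued work in stream 0 finished
cudaStreamSynchronize(stream1);  // all queued work in stream 1 finished
cudaStreamDestroy(stream0);
cudaStreamDestroy(stream1);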

A View Closer to Reality
(Diagram: after reordering, the copy engine queue is MemCpy A.1, MemCpy B.1, MemCpy A.2, MemCpy B.2, MemCpy C.1, MemCpy C.2, and the kernel engine queue is Kernel 1, Kernel 2. Segment 2's input copies no longer wait behind C.1.)

Overlapped (Pipelined) Timing (revisited)
With the reordered copies we obtain the pipelined timeline shown earlier: while segment i is computed, segment i+1's inputs are transferred in and segment i−1's result is transferred out.

Hyper-Q (Hyper Queue)
Provides multiple real queues for each engine.
Allows much more concurrency by letting some streams make progress for an engine while others are blocked.

Fermi (and Older) Concurrency
Fermi allows 16-way concurrency:
– Up to 16 grids can run at once
– But CUDA streams multiplex into a single hardware work queue
– Overlap only at stream edges
(Diagram: Stream 1 (A–B–C), Stream 2 (P–Q–R), and Stream 3 (X–Y–Z) all feed one hardware work queue: A–B–C, P–Q–R, X–Y–Z.)

Kepler Improved Concurrency
Kepler allows 32-way concurrency:
– One work queue per stream
– Concurrency at full-stream level
– No inter-stream dependencies
(Diagram: Stream 1 (A–B–C), Stream 2 (P–Q–R), and Stream 3 (X–Y–Z) each feed their own hardware work queue.)

ANY QUESTIONS?

Synchronization
cudaStreamSynchronize(stream_id)
– Used in host code
– Takes a stream identifier parameter
– Waits until all tasks in the stream have completed
This is different from cudaDeviceSynchronize():
– Also used in host code
– Takes no parameter
– Waits until all tasks in all streams have completed for the current device
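Contrasting the two calls on the two-stream example above (a sketch; the vecAdd argument lists are assumptions):

vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);
cudaStreamSynchronize(stream0);  // waits only for stream0; stream1 may still be running
cudaDeviceSynchronize();         // waits for all streams on the current device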