ECE408 / CS483 Applied Parallel Programming
Lecture 18: Data Transfer and CUDA Streams
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign

Objective
To learn more advanced features of the CUDA APIs for data transfer and kernel launch:
– Task parallelism for overlapping data transfer with kernel computation
– CUDA streams

Serialized Data Transfer and GPU Computation
So far, the way we have used cudaMemcpy serializes data transfer and GPU computation: Transfer A, Transfer B, Vector Add, and Transfer C occur one after another on the timeline. During each transfer only one PCIe direction is in use and the GPU is idle; during the computation the PCIe link is idle.
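For reference, a minimal sketch of this serialized pattern. The buffer names (h_A, d_A, etc.) and the vecAdd kernel signature are illustrative assumptions, not taken from the slides:

  // Fully serialized baseline: every operation goes into the default stream,
  // so copies and compute never overlap on the device.
  cudaMemcpy(d_A, h_A, n*sizeof(float), cudaMemcpyHostToDevice);   // Transfer A
  cudaMemcpy(d_B, h_B, n*sizeof(float), cudaMemcpyHostToDevice);   // Transfer B
  vecAdd<<<(n+255)/256, 256>>>(d_A, d_B, d_C, n);                  // Vector Add
  cudaMemcpy(h_C, d_C, n*sizeof(float), cudaMemcpyDeviceToHost);   // Transfer C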

Device Overlap
Some CUDA devices support device overlap: they can execute a kernel while simultaneously performing a copy between device and host memory. The capability can be queried per device:

  int dev_count;
  cudaDeviceProp prop;
  cudaGetDeviceCount(&dev_count);
  for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) ...
  }
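A side note not on the slides: in later CUDA versions deviceOverlap is a legacy field, and cudaDeviceProp::asyncEngineCount is the recommended query (1 means a copy can overlap kernel execution in one direction at a time; 2 means copies in both directions can overlap with kernels).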

Overlapped (Pipelined) Timing
Divide large vectors into segments, and overlap the transfer and compute of adjacent segments.
[Timeline figure: while Comp C.i = A.i + B.i runs, Trans A.(i+1) and Trans B.(i+1) stream in and Trans C.(i-1) streams out, so both PCIe directions and the kernel engine stay busy.]

Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and cudaMemcpy through "streams". Each stream is a queue of operations (kernel launches and cudaMemcpy's). Operations (tasks) in different streams can proceed in parallel: "task parallelism".
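One prerequisite the later slides rely on implicitly: cudaMemcpyAsync only truly overlaps with kernel execution when the host buffers are page-locked (pinned), i.e., allocated with cudaMallocHost rather than malloc. A minimal sketch, with illustrative names:

  float *h_A, *d_A;
  cudaMallocHost((void**)&h_A, n*sizeof(float));  // pinned host memory, required for async copies
  cudaMalloc((void**)&d_A, n*sizeof(float));
  cudaStream_t s;
  cudaStreamCreate(&s);
  cudaMemcpyAsync(d_A, h_A, n*sizeof(float), cudaMemcpyHostToDevice, s);  // returns immediately
  // ... kernel launches and device-to-host copies issued into s ...
  cudaStreamSynchronize(s);  // block the host until all work queued in s has finished
  cudaFreeHost(h_A);
  cudaFree(d_A);
  cudaStreamDestroy(s);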

Streams
Device requests made from the host code are put into a queue:
– The queue is read and processed asynchronously by the driver and device.
– The driver ensures that commands in the queue are processed in sequence; e.g., memory copies end before the kernel launch that follows them.
[Figure: host thread pushes cudaMemcpy, kernel launch, and sync commands into a FIFO consumed by the device driver.]

Streams (cont.)
To allow concurrent copying and kernel execution, you need to use multiple queues, called "streams".
– CUDA "events" allow the host thread to query and synchronize with the individual queues.
[Figure: host thread feeding Stream 1 and Stream 2 queues to the device driver, with an Event marking a point in one queue.]
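A short sketch of this event mechanism. The stream, buffer, and kernel names follow the later slides, but the full vecAdd argument list is an assumption:

  cudaEvent_t done;
  cudaEventCreate(&done);
  vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);  // queued in stream1
  cudaEventRecord(done, stream1);        // marker placed after the kernel in stream1's queue
  if (cudaEventQuery(done) != cudaSuccess) {
    // event not reached yet; the host can do other work here without blocking
  }
  cudaEventSynchronize(done);            // block until stream1 has passed the marker
  cudaEventDestroy(done);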

Conceptual View of Streams
[Figure: Stream 0 queues MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1; Stream 1 queues MemCpy A.2, MemCpy B.2, Kernel 2, MemCpy C.2. The operations (kernels, MemCpys) feed a Copy Engine (PCIe up and down) and a Kernel Engine.]

A Simple Multi-Stream Host Code

  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  float *d_A1, *d_B1, *d_C1;  // device memory for stream 1
  float *d_A2, *d_B2, *d_C2;  // device memory for stream 2
  // cudaMalloc calls for d_A1, d_B1, d_C1, d_A2, d_B2, d_C2 go here

  for (int i = 0; i < n; i += SegSize*2) {
    cudaMemcpyAsync(d_A1, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    cudaMemcpyAsync(h_C+i, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    // (loop continues on the next slide)

A Simple Multi-Stream Host Code (Cont.)

  for (int i = 0; i < n; i += SegSize*2) {
    cudaMemcpyAsync(d_A1, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    cudaMemcpyAsync(h_C+i, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(d_A2, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream2);
    cudaMemcpyAsync(d_B2, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream2);
    vecAdd<<<SegSize/256, 256, 0, stream2>>>(d_A2, d_B2, ...);
    cudaMemcpyAsync(h_C+i+SegSize, d_C2, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream2);
  }

A View Closer to Reality
[Figure: the same operations as issued to the hardware. The single Copy Engine queue holds MemCpy A.1, B.1, C.1, A.2, B.2, C.2 in issue order; the Kernel Engine queue holds Kernel 1, Kernel 2.]

Not Quite the Overlap We Want
C.1 blocks A.2 and B.2 in the copy engine queue, so segment 2's input transfers cannot begin until segment 1's copy-back completes.
[Timeline: Trans A.1, Trans B.1, Comp C.1 = A.1 + B.1 (the copy engine waits, since Trans C.1 depends on the kernel), Trans C.1, Trans A.2, Trans B.2, then Comp C.2 = A.2 + B.2.]

A Better Multi-Stream Host Code

  for (int i = 0; i < n; i += SegSize*2) {
    cudaMemcpyAsync(d_A1, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_A2, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream2);
    cudaMemcpyAsync(d_B2, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream2);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    vecAdd<<<SegSize/256, 256, 0, stream2>>>(d_A2, d_B2, ...);
    cudaMemcpyAsync(h_C+i, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_C+i+SegSize, d_C2, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream2);
  }

C.1 No Longer Blocks A.2 and B.2
[Figure: the Copy Engine queue now holds MemCpy A.1, B.1, A.2, B.2, then C.1, C.2; the Kernel Engine queue holds Kernel 1, Kernel 2. Segment 2's input copies are issued ahead of C.1's copy-back.]

Better, But Not Quite the Best Overlap
C.2 blocks the next iteration's A.1 and B.1 in the copy engine queue.
[Timeline: within an iteration, Trans A.2/B.2 now overlap Comp C.1 = A.1 + B.1, and Trans C.1 overlaps Comp C.2 = A.2 + B.2; but the next iteration's Trans A.1/B.1 must wait behind Trans C.2.]

Ideal (Pipelined) Timing
Achieving the ideal pipeline needs at least three buffers per vector, and the code becomes much more complicated.
[Timeline: Trans A.1, B.1; then Comp C.1 overlapping Trans A.2, B.2; then Comp C.2 overlapping Trans C.1 and Trans A.3, B.3; then Comp C.3 overlapping Trans C.2 and Trans A.4, B.4; and so on, keeping both copy directions and the kernel engine busy.]
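A minimal sketch of one way to structure such a pipeline, assuming three rotating device buffer sets with one stream each; all names, the vecAdd signature, and n being a multiple of SegSize are assumptions, not from the slides. On hardware with per-stream work queues (Kepler's Hyper-Q and later, next slides) this simple depth-first issue order suffices; on older single-queue devices the copies would still need breadth-first reordering as shown above.

  #define NBUF 3
  // Assumed to exist: pinned host arrays h_A, h_B, h_C of n floats, and
  // device arrays d_A[NBUF], d_B[NBUF], d_C[NBUF] of SegSize floats each.
  cudaStream_t stream[NBUF];
  for (int s = 0; s < NBUF; s++) cudaStreamCreate(&stream[s]);

  for (int i = 0, seg = 0; i < n; i += SegSize, seg++) {
    int s = seg % NBUF;  // rotate through buffer sets / streams
    // Reusing a buffer set is safe: operations within stream[s] are ordered,
    // so this copy waits until the segment issued NBUF iterations ago is done.
    cudaMemcpyAsync(d_A[s], h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(d_B[s], h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream[s]);
    vecAdd<<<SegSize/256, 256, 0, stream[s]>>>(d_A[s], d_B[s], d_C[s], SegSize);
    cudaMemcpyAsync(h_C+i, d_C[s], SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream[s]);
  }
  cudaDeviceSynchronize();  // drain all streams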

Hyper-Q (Hyper Queue)
Provides multiple real queues for each engine. This allows much more concurrency, letting some streams make progress on an engine while others are blocked.

Fermi (and Older) Concurrency
Fermi allows 16-way concurrency:
– Up to 16 grids can run at once
– But CUDA streams multiplex into a single hardware work queue
– Overlap is only possible at stream edges
[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R), and Stream 3 (X--Y--Z) are serialized into one hardware work queue: A--B--C--P--Q--R--X--Y--Z.]

Kepler Improved Concurrency
Kepler allows 32-way concurrency:
– One hardware work queue per stream
– Concurrency at full-stream level
– No inter-stream dependencies
[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R), and Stream 3 (X--Y--Z) each map to their own hardware work queue.]

Any questions?