© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign

ECE408 Fall 2015 Applied Parallel Programming
Lecture 19: GPU System Architecture

Imaging for IoT
Principles of a camera; design of a CMOS imaging system; mobile imaging in Android / iOS; image processing basics; image / video coding; computer vision basics; 3D imaging. Applications: robots, cars, drones, VR/AR, mobile devices, things. Project: build something cool. Looking for ~4 students to help me create this course: juniors/seniors/MS students with a CompE+DSP background.

Project Timeline
Today: project proposal (5 slides max / PPT)
Week of Nov 1st: project review #1 with staff
Week of Nov 16th: project review #2 with staff
Week of Dec 7th: demo to staff
Week of Dec 7th: poster session

PCIe PC Architecture

Data Transfer Overheads
The overheads between key components ultimately dictate system performance:
– Especially true for massively parallel systems processing massive amounts of data
– Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
– Ultimately, performance falls back to what the data transfer architecture dictates

GeForce 7800 GTX Board Details
– 256MB / 256-bit DDR3, 600 MHz, 8 pieces of 8Mx32
– 16x PCI-Express
– SLI connector
– DVI x 2, sVideo TV out
– Single-slot cooling

PCIe Data Transfer Using DMA
DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus:
– DMA uses physical addresses for source and destination
– Transfers the number of bytes requested by the OS
– But what about paging?
[Diagram: CPU and main memory (DRAM) on one side, the GPU card (or other I/O card) with its global memory on the other; the DMA engine moves data between DRAM and global memory over PCIe.]

Pinned Memory
DMA uses physical addresses. The OS could accidentally page out the data that is being read or written by a DMA and page another virtual page into the same location. Pinned memory cannot be paged out. If the source or destination of a cudaMemcpy() in host memory is not pinned, the data first has to be copied to a pinned staging buffer, adding extra overhead. cudaMemcpy() is much faster when the host source or destination is pinned.

Allocate/Free Pinned Memory (a.k.a. Page-Locked Memory)
cudaHostAlloc(): three parameters
– Address of pointer to the allocated memory
– Size of the allocated memory in bytes
– Options flag: use cudaHostAllocDefault for now
cudaFreeHost(): one parameter
– Pointer to the memory to be freed
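
A minimal sketch of the allocation pattern above (the vector length N and the buffer names are assumed for illustration; error checking omitted):

#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;                  // assumed vector length
    float *h_A;                             // pinned (page-locked) host buffer
    cudaHostAlloc((void**)&h_A, N * sizeof(float), cudaHostAllocDefault);

    float *d_A;                             // device buffer
    cudaMalloc((void**)&d_A, N * sizeof(float));

    // A copy from pinned memory can be DMA'd directly, with no staging copy
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_A);
    cudaFreeHost(h_A);                      // pinned memory is freed with cudaFreeHost(), not free()
    return 0;
}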

Using Pinned Memory
Use the allocated memory and its pointer the same way as those returned by malloc(); the only difference is that the allocated memory cannot be paged out by the OS. The cudaMemcpy() function should be about 2X faster with pinned memory. Pinned memory is a limited resource: over-subscribing it reduces the physical memory available for paging and can have serious consequences.
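
A rough way to observe that speedup is to time the same copy from a pageable and a pinned buffer (a sketch; the 64 MB size is assumed, and the actual ratio varies by system):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy of 'bytes' bytes using CUDA events
float timeCopy(float *d_buf, float *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 << 20;          // 64 MB, assumed
    float *d_buf, *h_pageable, *h_pinned;
    cudaMalloc((void**)&d_buf, bytes);
    h_pageable = (float*)malloc(bytes);     // ordinary pageable allocation
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
    printf("pageable: %.2f ms\n", timeCopy(d_buf, h_pageable, bytes));
    printf("pinned:   %.2f ms\n", timeCopy(d_buf, h_pinned, bytes));
    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}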

Serialized Data Transfer and GPU Computation
So far, the way we use cudaMemcpy serializes data transfer and GPU computation.
[Timeline: Trans. A, then Trans. B, then Vector Add, then Trans. C. During each transfer only one PCIe direction is used and the GPU is idle; during the vector add, PCIe is idle.]
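
The serialized pattern in question looks like this (a sketch; vecAdd, the h_*/d_* buffers, and n are assumed from the running vector-add example, with n a multiple of 256):

// Copy inputs, compute, copy result back: each step waits for the previous one
cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);   // Trans. A
cudaMemcpy(d_B, h_B, n * sizeof(float), cudaMemcpyHostToDevice);   // Trans. B
vecAdd<<<n / 256, 256>>>(d_A, d_B, d_C);                           // Vector Add
cudaMemcpy(h_C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost);   // Trans. C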

Overlapped (Pipelined) Timing
Divide large vectors into segments; overlap the transfer and compute of adjacent segments.
[Timeline: while Comp C.1 = A.1 + B.1 runs, Trans A.2 and B.2 proceed; while Comp C.2 = A.2 + B.2 runs, Trans C.1 goes out and Trans A.3, B.3 come in; and so on through Trans A.4, B.4.]

Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and cudaMemcpy through "streams". Each stream is a queue of operations (kernel launches and cudaMemcpy calls). Operations (tasks) in different streams can execute in parallel: "task parallelism".
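
The basic stream API, as a minimal sketch (d_data, h_data, bytes, myKernel, grid, and block are assumed placeholders):

cudaStream_t stream;
cudaStreamCreate(&stream);                         // create a queue of device work

// Asynchronous operations are enqueued and return immediately on the host;
// h_data must be pinned for the copy to actually be asynchronous
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_data);      // 4th launch parameter selects the stream

cudaStreamSynchronize(stream);                     // host waits for this stream to drain
cudaStreamDestroy(stream);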

Streams
Device requests made from the host code are put into a queue:
– The queue is read and processed asynchronously by the driver and device
– The driver ensures that commands in the queue are processed in sequence: memory copies complete before the kernel launches that follow them, etc.
[Diagram: the host thread pushes cudaMemcpy, kernel launch, and sync commands into a FIFO that is drained by the device driver.]

Streams (cont.)
To allow concurrent copying and kernel execution, you need to use multiple queues, called "streams":
– CUDA "events" allow the host thread to query and synchronize with the individual queues.
[Diagram: the host thread feeds two queues, Stream 1 and Stream 2, through the device driver; an event marks a point in one of the queues.]
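
A sketch of event-based querying and synchronization (the stream and buffer names are assumed):

cudaEvent_t event;
cudaEventCreate(&event);

cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(event, stream1);            // mark this point in stream1's queue

// The host can poll without blocking...
if (cudaEventQuery(event) != cudaSuccess) {
    // ...do other host work while the copy is still in flight...
}
cudaEventSynchronize(event);                // ...or block until the marked point is reached

// Another stream can also be made to wait on the event (inter-stream ordering)
cudaStreamWaitEvent(stream2, event, 0);

cudaEventDestroy(event);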

Conceptual View of Streams
[Diagram: Stream 0 holds MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1; Stream 1 holds MemCpy A.2, MemCpy B.2, Kernel 2, MemCpy C.2. The hardware has a copy engine (PCIe up / PCIe down) and a kernel engine that consume these operations (kernels, MemCpys).]

A Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0;  // device memory for stream 0
float *d_A1, *d_B1, *d_C1;  // device memory for stream 1

// cudaMalloc calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);

A Simple Multi-Stream Host Code (cont.)

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

A View Closer to Reality
[Diagram: in hardware there is one copy-engine queue (PCIe up / PCIe down) and one kernel-engine queue. The operations from both streams are interleaved into these two queues in issue order: MemCpy A.1, B.1, C.1, A.2, B.2, C.2 in the copy engine; Kernel 1, Kernel 2 in the kernel engine.]

Not Quite the Overlap We Want
C.1 blocks A.2 and B.2 in the copy engine queue.
[Timeline: Trans A.1, Trans B.1, Comp C.1 = A.1 + B.1; Trans C.1 must finish before Trans A.2 and B.2 can start, delaying Comp C.2 = A.2 + B.2.]

A Better Multi-Stream Host Code

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
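
One detail the slide leaves implicit: the loop only enqueues work, so the host must wait before reading h_C (a short sketch):

// After the loop, all operations are merely enqueued; wait for both streams
cudaStreamSynchronize(stream0);   // or cudaDeviceSynchronize() to wait on everything
cudaStreamSynchronize(stream1);
// h_C is now safe to read on the host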

A View Closer to Reality
[Diagram: with the reordered code, the copy-engine queue is MemCpy A.1, B.1, A.2, B.2, then C.1, C.2, with Kernel 1 and Kernel 2 in the kernel-engine queue; segment 2's input copies no longer wait behind C.1.]

Overlapped (Pipelined) Timing
[Timeline: as intended, Comp C.1 = A.1 + B.1 overlaps Trans A.2/B.2; Comp C.2 = A.2 + B.2 overlaps Trans C.1 and Trans A.3/B.3; Comp C.3 = A.3 + B.3 overlaps Trans C.2 and Trans A.4/B.4.]

Hyper-Q
Provides multiple real hardware work queues for each engine, allowing much more concurrency: some streams can make progress on an engine while others are blocked.

Fermi (and Older) Concurrency
Fermi allows 16-way concurrency:
– Up to 16 grids can run at once
– But CUDA streams multiplex into a single hardware work queue
– Overlap is possible only at stream edges
[Diagram: Stream 1 (A-B-C), Stream 2 (P-Q-R), and Stream 3 (X-Y-Z) all feed one hardware work queue: A-B-C, P-Q-R, X-Y-Z back to back.]

Kepler Improved Concurrency
Kepler allows 32-way concurrency:
– One hardware work queue per stream
– Concurrency at full-stream level
– No false inter-stream dependencies
[Diagram: Stream 1 (A-B-C), Stream 2 (P-Q-R), and Stream 3 (X-Y-Z) each feed their own hardware work queue.]
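
Whether Hyper-Q is available can be checked at runtime from the compute capability (Hyper-Q arrived with the compute capability 3.5 Kepler GK110 parts); a sketch, assuming device 0:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);        // query device 0

// Kepler GK110 (compute capability 3.5) introduced Hyper-Q's 32 hardware queues
if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5)) {
    printf("Hyper-Q available (CC %d.%d)\n", prop.major, prop.minor);
} else {
    printf("Single hardware work queue (CC %d.%d)\n", prop.major, prop.minor);
}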

GPU/CPU Integration
Examples of integrated GPU/CPU parts: AMD Trinity (Oct 2012), Qualcomm Snapdragon.