Heterogeneous Programming


Heterogeneous Programming
Martin Kruliš (v1.0), 01.12.2016

Heterogeneous Programming
- GPU
  - An “independent” device, controlled by the host
  - Used for “offloading” work
- Host code
  - Needs to be designed in a way that
    - utilizes the GPU(s) efficiently,
    - utilizes the CPU while the GPU is working,
    - and lets neither the CPU nor the GPU wait for the other
  - Remember Amdahl’s Law
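For reference, Amdahl’s Law (a standard formula, stated here for completeness) bounds the overall speedup S when a fraction p of the runtime is accelerated by a factor s; the serial remainder (1 - p) quickly dominates:

    S(p, s) = \frac{1}{(1 - p) + \frac{p}{s}}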

Heterogeneous Programming
- A bad example: the CPU blocks in cudaDeviceSynchronize() and sits idle while the device is working

    cudaMemcpy(..., HostToDevice);
    Kernel1<<<...>>>(...);
    cudaDeviceSynchronize();
    cudaMemcpy(..., DeviceToHost);
    ...
    Kernel2<<<...>>>(...);

Overlapping Work
- Overlapping CPU and GPU work
  - Kernels
    - Started asynchronously
    - Can be waited for (cudaDeviceSynchronize())
    - A little more can be done with streams
  - Memory transfers
    - cudaMemcpy() is synchronous and blocking
    - Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
    - Can be synchronized the same way as a kernel

Overlapping Work
- Using asynchronous transfers

    cudaMemcpyAsync(HostToDevice);
    Kernel1<<<...>>>(...);
    cudaMemcpyAsync(DeviceToHost);
    ...
    do_something_on_cpu();
    cudaDeviceSynchronize();

- The GPU processes the copies and the kernel while the CPU executes do_something_on_cpu(); workload balance between CPU and GPU becomes an issue
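A complete, minimal sketch of this pattern is shown below; the kernel, buffer size, and names (scale2, n) are illustrative assumptions, not from the slides. Pinned host memory (cudaMallocHost) is used, since asynchronous copies from pageable memory fall back to synchronous behavior:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: doubles every element.
    __global__ void scale2(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *host, *dev;
        cudaMallocHost((void**)&host, n * sizeof(float)); // pinned host buffer
        cudaMalloc((void**)&dev, n * sizeof(float));
        for (int i = 0; i < n; ++i) host[i] = 1.0f;

        // All three operations return immediately...
        cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        scale2<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

        // ...so the CPU can do useful work here while the GPU is busy.

        cudaDeviceSynchronize(); // wait for the GPU before touching results
        printf("host[0] = %f\n", host[0]); // prints 2.0
        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }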

Overlapping Work
- CPU threads
  - Multiple CPU threads may use the GPU
- GPU overlapping capabilities
  - Multiple kernels may run simultaneously
    - Since the Fermi architecture (cudaDeviceProp.concurrentKernels)
  - Kernel execution may overlap with data transfers, or even with multiple data transfers
    - cudaDeviceProp.asyncEngineCount
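These capabilities can be queried at runtime; a small sketch (device 0 is an arbitrary choice):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0); // query device 0
        printf("concurrentKernels: %d\n", prop.concurrentKernels);
        printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
        return 0;
    }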

Streams
- Stream: an in-order GPU command queue (like in OpenCL)
  - Asynchronous GPU operations are registered in the queue
    - Kernel execution
    - Memory data transfers
  - Commands in different streams may overlap
  - Streams provide means for explicit and implicit synchronization
- Default stream (stream 0)
  - Always present, does not have to be created
  - Has global synchronization capabilities

Streams
- Stream creation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

- Stream usage

    cudaMemcpyAsync(dst, src, size, kind, stream);
    kernel<<<grid, block, sharedMem, stream>>>(...);

- Stream destruction

    cudaStreamDestroy(stream);
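Putting the three fragments together, a minimal sketch of one stream carrying a copy-kernel-copy sequence (the kernel and the assumption that hostBuf is pinned are illustrative):

    __global__ void dummy(float *data, int n) { /* illustrative kernel */ }

    void run(float *hostBuf, float *devBuf, int n) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Issued into the same stream, these execute in order on the GPU.
        cudaMemcpyAsync(devBuf, hostBuf, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        dummy<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
        cudaMemcpyAsync(hostBuf, devBuf, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream); // wait for this stream only
        cudaStreamDestroy(stream);
    }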

Streams
- Synchronization
  - Explicit
    - cudaStreamSynchronize(stream) waits until all commands issued to the stream have completed
    - cudaStreamQuery(stream) is a non-blocking test of whether the stream has finished
  - Implicit
    - Operations in different streams cannot overlap if a special operation is issued between them:
      - Memory allocation
      - A CUDA command issued to the default stream
      - A switch between L1/shared memory configurations
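For example, a host thread may poll a stream instead of blocking on it; a small sketch:

    // Non-blocking: keep doing CPU work until the stream drains.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        // do_other_cpu_work();
    }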

Streams
- Overlapping behavior
  - Commands in different streams overlap if the hardware is capable of running them concurrently
  - Unless implicit/explicit synchronization prevents it

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(…HostToDevice, stream[i]);
        MyKernel<<<g, b, 0, stream[i]>>>(...);
        cudaMemcpyAsync(…DeviceToHost, stream[i]);
    }

  - This depth-first issue order may trigger many implicit synchronizations, depending on the compute capability and the hardware's overlapping capabilities:
    - If the device does not support concurrent data transfers, the streams will not overlap at all (the HostToDevice copy in stream[1] must wait until the DeviceToHost transfer in stream[0] finishes).
    - For compute capability < 3.0, the kernel executions cannot overlap, since the second kernel is issued after the DeviceToHost copy in stream[0].

Streams
- Overlapping behavior
  - Issuing the commands breadth-first (one loop per operation type) leaves much less opportunity for implicit synchronization:

    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(…HostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        MyKernel<<<g, b, 0, stream[i]>>>(...);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(…DeviceToHost, stream[i]);

Streams
- Callbacks
  - Registered into a stream by cudaStreamAddCallback(stream, fnc, data, 0);
  - The callback function is invoked asynchronously after all preceding commands in the stream terminate
  - A callback registered to the default stream is invoked after the previous commands in all streams terminate
  - Operations issued after the callback registration start only after the callback returns
  - The callback looks like:

    void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status,
                              void *userData) {
        ...
    }
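A minimal registration sketch (the callback name and user data are illustrative; note that the callback itself must not invoke CUDA API functions):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Runs on a CUDA-internal thread once preceding stream work finishes.
    void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status,
                              void *userData) {
        printf("stream done: %s\n", (const char *)userData);
    }

    // ... after issuing copies/kernels into `stream`:
    // cudaStreamAddCallback(stream, myCallback, (void *)"batch 0", 0);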

Streams
- Events
  - Special markers that can be used for synchronization and performance monitoring
  - Typical usage:
    - Waiting until all commands issued before the marker have finished
    - Explicit synchronization between selected streams
    - Measuring the time between two events
  - Example:

    cudaEvent_t event;
    cudaEventCreate(&event);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
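For instance, measuring the time a kernel spends in a stream with a pair of events (a sketch; the kernel and launch configuration are placeholders):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    MyKernel<<<g, b, 0, stream>>>(...); // the work being measured
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);             // wait until `stop` is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);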

Pipelining
- Making good use of overlapping
  - Split the work into smaller fragments
  - Create a pipeline effect (load, process, store)
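A sketch of such a pipeline with two streams and chunked transfers; the kernel (process), the chunk count, and the assumption that host points to pinned memory are all illustrative:

    __global__ void process(float *data, int n); // hypothetical kernel

    void pipelined(float *host, float *dev, int n) { // host must be pinned
        const int chunks = 8, chunkLen = n / chunks; // assumes n % chunks == 0
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < chunks; ++c) {
            cudaStream_t st = s[c % 2]; // alternate between the two streams
            size_t off = (size_t)c * chunkLen;
            // Load -> process -> store for this fragment; fragments in the
            // other stream can overlap with these operations.
            cudaMemcpyAsync(dev + off, host + off, chunkLen * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            process<<<(chunkLen + 255) / 256, 256, 0, st>>>(dev + off, chunkLen);
            cudaMemcpyAsync(host + off, dev + off, chunkLen * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }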

Feeding Threads
- The gather and scatter problem
  - Input data are gathered from host memory, copied to GPU memory, processed by a kernel, and the results are copied back and scattered to their destinations in host memory
  - Using multiple cudaMemcpy() calls to gather/scatter the data directly may be quite inefficient

Feeding Threads
- Gather and scatter
  - Reducing the overhead
    - Gather and scatter are performed by a CPU thread before/after each cudaMemcpy()
    - The main thread gathers and scatters while the HtD copies, kernels, and DtH copies proceed in stream 0, stream 1, ...
  - The number of threads per GPU and the number of streams per thread depend on the workload structure

Page-locked Memory
- Page-locked (pinned) host memory
  - Host memory that is prevented from being swapped out
  - Created/dismissed by
    - cudaHostAlloc() / cudaFreeHost()
    - cudaHostRegister() / cudaHostUnregister()
  - Optionally with flags:
    - cudaHostAllocWriteCombined (optimized for writing, not cached on the CPU)
    - cudaHostAllocMapped
    - cudaHostAllocPortable
  - Copies between pinned host memory and the device can run truly asynchronously (cudaMemcpyAsync on pageable memory falls back to a staged, synchronous transfer)
  - Pinned memory is a scarce resource
- Notes on write-combining memory: on systems with a front-side bus, bandwidth between host memory and device memory is higher if the host memory is allocated as page-locked, and even higher if it is in addition allocated as write-combining, as described in Write-Combining Memory. Write-combining memory frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%.
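A small allocation sketch (size and flag choice are illustrative):

    float *pinned = nullptr;
    // Use cudaHostAllocWriteCombined only if the CPU mostly writes this
    // buffer: it is optimized for writing and not cached on the CPU.
    cudaHostAlloc((void**)&pinned, 1024 * sizeof(float), cudaHostAllocDefault);
    ...
    cudaFreeHost(pinned);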

Memory Mapping
- Device memory mapping
  - Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations), for both reading and writing
  - The memory must be allocated/registered with the cudaHostAllocMapped flag
  - The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
  - cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer
- Note: atomic operations on mapped memory are not atomic from the perspective of the host (or of other devices)
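A minimal mapping sketch (buffer size is illustrative):

    cudaSetDeviceFlags(cudaDeviceMapHost); // before the context is created

    float *hostPtr, *devPtr;
    cudaHostAlloc((void**)&hostPtr, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

    // devPtr can now be passed to kernels; the GPU accesses the host
    // buffer directly over the bus, with no explicit cudaMemcpy.
    ...
    cudaFreeHost(hostPtr);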

Asynchronous Error Handling
- Asynchronous errors
  - An error may occur outside of any CUDA call, e.g., during an asynchronous memory transfer or a kernel execution
  - Such an error is reported by the next CUDA call
  - To make sure all errors have been reported, the device must synchronize (cudaDeviceSynchronize())
- Error handling functions
  - cudaGetLastError()
  - cudaPeekAtLastError()
  - cudaGetErrorString(error)
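A typical checking pattern around a kernel launch (a sketch; MyKernel and its launch configuration are placeholders):

    MyKernel<<<g, b>>>(...);
    // Catches launch-time errors (e.g., a bad configuration) immediately...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    // ...while errors raised during execution surface only after a sync.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));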

Discussion