Martin Kruliš 4 11. 2014 by Martin Kruliš (v1.0)1.

Slides:

Advertisements

Similar presentations

CS179: GPU Programming Lecture 5: Memory. Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling.

Advertisements

Instructor Notes This lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts),

Intermediate GPGPU Programming in CUDA

Multi-GPU and Stream Programming Kishan Wimalawarne.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.

ME964 High Performance Computing for Engineering Applications “Software is like entropy: It is difficult to grasp, weighs nothing, and obeys the Second.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 14, 2011 Streams.pptx CUDA Streams These notes will introduce the use of multiple CUDA.

CUDA Programming continued ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson Revised.

Programming with CUDA WS 08/09 Lecture 5 Thu, 6 Nov, 2008.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing.ppt Measuring Performance These notes will introduce: Timing Program.

Cuda Streams Presented by Savitha Parur Venkitachalam.

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.

1 Lecture 4: Threads Operating System Fall Contents Overview: Processes & Threads Benefits of Threads Thread State and Operations User Thread.

CHAPTER 2: COMPUTER-SYSTEM STRUCTURES Computer system operation Computer system operation I/O structure I/O structure Storage structure Storage structure.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

Recall: Three I/O Methods Synchronous: Wait for I/O operation to complete. Asynchronous: Post I/O request and switch to other work. DMA (Direct Memory.

General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.

CUDA Asynchronous Memory Usage and Execution Yukai Hung Department of Mathematics National Taiwan University Yukai Hung

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

Advanced / Other Programming Models Sathish Vadhiyar.

CIS 565 Fall 2011 Qing Sun

GPU Architecture and Programming

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013 Streams.pptx Page-Locked Memory and CUDA Streams These notes introduce the use.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013 Zero-Copy Host Memory These notes will introduce “zero-copy” memory. “Zero-copy”

CS6235 L17: Generalizing CUDA: Concurrent Dynamic Execution, and Unified Address Space.

OpenCL Sathish Vadhiyar Sources: OpenCL quick overview from AMD OpenCL learning kit from AMD.

CUDA Streams These notes will introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations. Also introduced is paged-locked.

Instructor Notes Discusses synchronization, timing and profiling in OpenCL Coarse grain synchronization covered which discusses synchronizing on a command.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt Synchronization These notes will introduce: Ways to achieve.

Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.

OpenCL Programming James Perry EPCC The University of Edinburgh.

1 Computer Systems II Introduction to Processes. 2 First Two Major Computer System Evolution Steps Led to the idea of multiprogramming (multiple concurrent.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Processes and Threads.

1 Lecture 1: Computer System Structures We go over the aspects of computer architecture relevant to OS design  overview  input and output (I/O) organization.

1 OS Review Processes and Threads Chi Zhang

1 ITCS 4/5010 GPU Programming, B. Wilkinson, Jan 21, CUDATiming.ppt Measuring Performance These notes introduce: Timing Program Execution How to.

Synchronization These notes introduce:

CS/EE 217 GPU Architecture and Parallel Programming Lecture 17: Data Transfer and CUDA Streams.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

Threads. Thread A basic unit of CPU utilization. An Abstract data type representing an independent flow of control within a process A traditional (or.

CUDA Odds and Ends Joseph Kider University of Pennsylvania CIS Fall 2011.

Martin Kruliš by Martin Kruliš (v1.0)1.

CUDA Compute Unified Device Architecture. Agent Based Modeling in CUDA Implementation of basic agent based modeling on the GPU using the CUDA framework.

My Coordinates Office EM G.27 contact time:

Operating System Concepts

Interrupts and Exception Handling. Execution We are quite aware of the Fetch, Execute process of the control unit of the CPU –Fetch and instruction as.

CUDA programming Performance considerations (CUDA best practices)

CS 179 Lecture 13 Host-Device Data Transfer 1. Moving data is slow So far we’ve only considered performance when the data is already on the GPU This neglects.

Lecture 9 Streams and Events Kyu Ho Park April 12, 2016 Ref:[PCCP]Professional CUDA C Programming.

CS 179: GPU Computing LECTURE 2: MORE BASICS. Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ ◦Separate CUDA.

1 ITCS 4/5145 Parallel Programming, B. Wilkinson, Nov 12, CUDASynchronization.ppt Synchronization These notes introduce: Ways to achieve thread synchronization.

CUDA C/C++ Basics Part 3 – Shared memory and synchronization

Single Instruction Multiple Threads

Multi-GPU Programming

Processes and threads.

GPU Computing CIS-543 Lecture 10: Streams and Events

CS427 Multicore Architecture and Parallel Computing

GPU Memory Details Martin Kruliš by Martin Kruliš (v1.1)

Heterogeneous Programming

Dynamic Parallelism Martin Kruliš by Martin Kruliš (v1.0)

More on GPU Programming

CUDA Execution Model – III Streams and Events

©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

Synchronization These notes introduce:

6- General Purpose GPU Programming

GPU Architectures and CUDA in More Detail

Presentation transcript:

Martin Kruliš by Martin Kruliš (v1.0)1

 GPU ◦ “Independent” device ◦ Controlled by host ◦ Used for “offloading”  Host Code ◦ Needs to be designed in a way that  Utilizes GPU(s) efficiently  Utilize CPU while GPU is working  CPU and GPU do not wait for each other by Martin Kruliš (v1.0)2

 Bad Example cudaMemcpy(..., HostToDevice); Kernel1 >>(...); cudaDeviceSynchronize(); cudaMemcpy(..., DeviceToHost);... cudaMemcpy(..., HostToDevice); Kernel2 >>(...); cudaDeviceSynchronize(); cudaMemcpy(..., DeviceToHost); by Martin Kruliš (v1.0)3 CPUGPU Device is working

 Overlapping CPU and GPU work ◦ Kernels  Started asynchronously  Can be waited for ( cudaDeviceSynchronize() )  A little more can be done with streams ◦ Memory transfers  cudaMemcpy() is synchronous and blocking  Alternatively cudaMemcpyAsync() starts the transfer and returns immediately  Can be synchronized the same way as the kernel by Martin Kruliš (v1.0)4

 Using Asynchronous Transfers cudaMemcpyAsync(HostToDevice); Kernel1 >>(...); cudaMemcpyAsync(DeviceToHost);... do_something_on_cpu();... cudaDeviceSynchronize(); by Martin Kruliš (v1.0)5 CPUGPU Workload balance becomes an issue

 CPU Threads ◦ Multiple CPU threads may use the GPU  GPU Overlapping Capabilities ◦ Multiple kernels may run simultaneously  Since Fermi architecture  cudaDeviceProp.concurrentKernels ◦ Kernel execution may overlap with data transfers  Or even multiple data transfers  cudaDeviceProp.asyncEngineCount by Martin Kruliš (v1.0)6

 Stream ◦ In-order GPU command queue (like in OpenCL)  Asynchronous GPU operations are registered in queue  Kernel execution  Memory data transfers  Commands in different streams may overlap  Provide means for explicit and implicit synchronization ◦ Default stream (stream 0)  Always present, does not have to be created  Global synchronization capabilities by Martin Kruliš (v1.0)7

 Stream Creation cudaStream_t stream; cudaStreamCreate(&stream);  Stream Usage cudaMemcpyAsync(dst, src, size, kind, stream); kernel >>(...);  Stream Destruction cudaStreamDestroy(stream); by Martin Kruliš (v1.0)8

 Synchronization ◦ Explicit  cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed  cudaStreamQuery(stream) – a non-blocking test whether the stream has finished ◦ Implicit  Operations in different streams cannot overlap if a special operation is issued between them  Memory allocation  A CUDA command to default stream  Switch between L1/shared memory configuration by Martin Kruliš (v1.0)9

 Overlapping Behavior ◦ Commands in different streams overlap if the hardware is capable running them concurrently ◦ Unless implicit/explicit synchronization prohibits so for (int i = 0; i < 2; ++i) { cudaMemcpyAsync(…HostToDevice, stream[i]); MyKernel >>(...); cudaMemcpyAsync(…DeviceToHost, stream[i]); } by Martin Kruliš (v1.0)10 May have many implicit synchronizations, depending on CC and hardware overlapping capabilities.

 Overlapping Behavior ◦ Commands in different streams overlap if the hardware is capable running them concurrently ◦ Unless implicit/explicit synchronization prohibits so for (int i = 0; i < 2; ++i) cudaMemcpyAsync(…HostToDevice, stream[i]); for (int i = 0; i < 2; ++i) MyKernel >>(...); for (int i = 0; i < 2; ++i) cudaMemcpyAsync(…DeviceToHost, stream[i]); by Martin Kruliš (v1.0)11 Much less opportunities for implicit synchronization

 Callbacks ◦ Callbacks are registered in streams by cudaStreamAddCallback(stream, fnc, data, 0); ◦ The callback function is invoked asynchronously after all preceding commands terminate ◦ Callback registered to the default stream is invoked after previous commands in all streams terminate ◦ Operations issued after registration start after the callback returns ◦ The callback looks like void CUDART_CB MyCallback(stream, errorStatus, userData) { by Martin Kruliš (v1.0)12

 Events ◦ Special markers that can be used for synchronization and performance monitoring ◦ The typical usage is  Waiting for all commands before the marker finishes  Explicit synchronization between selected streams  Measuring time between two events ◦ Example cudaEvent_t event; cudaEventCreate(&event); cudaEventRecord(event, stream); cudaEventSynchronize(event); by Martin Kruliš (v1.0)13

 Making a Good Use of Overlapping ◦ Split the work into smaller fragments ◦ Create a pipeline effect (load, process, store) by Martin Kruliš (v1.0)14

 Data Gather and Scatter Problem by Martin Kruliš (v1.0)15 Host Memory GPU Memory Host Memory Gather Scatter Kernel Execution Input Data Results Multiple cudaMemcpy() calls may be quite inefficient

 Gather and Scatter ◦ Reducing overhead ◦ Performed by CPU before/after cudaMemcpy by Martin Kruliš (v1.0)16 Main Thread Gather Scatter Kernel HtD copy DtH copy Gather Scatter Kernel HtD copy DtH copy Stream 0 Stream 1 … # of thread per GPU and # of streams per thread depends on the workload structure

 Page-locked (Pinned) Host Memory ◦ Host memory that is prevented from swapping ◦ Created/dismissed by cudaHostAlloc(), cudaFreeHost() cudaHostRegister(), cudaHostUnregister() ◦ Optionally with flags cudaHostAllocWriteCombined cudaHostAllocMapped cudaHostAllocPortable ◦ Copies between pinned host memory and device are automatically performed asynchronously ◦ Pinned memory is a scarce resource by Martin Kruliš (v1.0)17 Optimized for writing, not cached on CPU

 Device Memory Mapping ◦ Allowing GPU to access portions of host memory directly (i.e., without explicit copy operations)  For both reading and writing ◦ The memory must be allocated/registered with flag cudaHostAllocMapped ◦ The context must have cudaDeviceMapHost flag (set by cudaSetDeviceFlags() ) ◦ Function cudaHostGetDevicePointer() gets host pointer and returns corresponding device pointer by Martin Kruliš (v1.0)18

 Asynchronous Errors ◦ An error may occur outside the a CUDA call  In case of asynchronous memory transfers or kernel execution ◦ The error is reported by the following CUDA call ◦ To make sure all errors were reported, the device must synchronize ( cudaDeviceSynchronize() ) ◦ Error handling functions  cudaGetLastError()  cudaPeekAtLastError()  cudaGetErrorString(error) by Martin Kruliš (v1.0)19

by Martin Kruliš (v1.0)20