Host-Device Data Transfer

CS 179 Lecture 13

Moving data is slow
So far we've only considered performance when the data is already on the GPU. This neglects the slowest part of GPU programming: getting data onto and off of the GPU.

Moving data is important
Intelligently moving data lets us process datasets larger than GPU global memory (~6 GB). It is absolutely critical for real-time or streaming applications, which are common in computer vision, data analytics, and control systems.

Matrix transpose: another look

Time(%)   Time       Calls   Avg        Name
49.35%    29.581ms   1       29.581ms   [CUDA memcpy DtoH]
47.48%    28.462ms   1       28.462ms   [CUDA memcpy HtoD]
 3.17%    1.9000ms   1       1.9000ms   naiveTransposeKernel

Only 3% of the time is spent in the kernel! 97% of the time is spent moving data onto and off of the GPU!

Lecture Outline
- IO strategy
- CUDA streams
- CUDA events
- How it all works: virtual memory, command buffers
- Pinned host memory
- Managed memory

A common pattern

while (1) {
    cudaMemcpy(d_input, h_input, input_size, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_input, d_output);
    cudaMemcpy(output, d_output, output_size, cudaMemcpyDeviceToHost);
}

Throughput is limited by IO! How can we hide the latency?

Dreams & Reality Reality Dreams time HD 0 kernel 0 DH 0 HD 1 kernel 1 HD = host -> device DH = device -> host What to do we need to make the dream happen?

Turning dreams into reality
What do we need to make the dream happen?
- hardware that can run 2 transfers and 1 kernel in parallel
- 2 input buffers and 2 output buffers (easy, up to the programmer)
- asynchronous memcpy & kernel invocation

Latency hiding checklist
Hardware:
- maximum of 4, 16, or 32 concurrent kernels (depending on hardware) on CC >= 2.0
- 1 device→host copy engine and 1 host→device copy engine (2 copy engines only on newer hardware; some hardware has a single copy engine shared by both directions)
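
Putting the last two slides together, here is a hedged sketch of a double-buffered pipeline. The names (h_in, h_out, d_in, d_out, fill_input, process, n_chunks, chunk_size) are hypothetical, and the host buffers are assumed to be pinned so that cudaMemcpyAsync is truly asynchronous:

cudaStream_t streams[2];
for (int i = 0; i < 2; i++)
    cudaStreamCreate(&streams[i]);

for (int chunk = 0; chunk < n_chunks; chunk++) {
    int buf = chunk % 2;                     // alternate between the two buffer pairs
    cudaStreamSynchronize(streams[buf]);     // wait until this pair's previous work is done
    fill_input(h_in[buf], chunk);            // hypothetical host-side producer
    cudaMemcpyAsync(d_in[buf], h_in[buf], chunk_size,
                    cudaMemcpyHostToDevice, streams[buf]);
    process<<<grid, block, 0, streams[buf]>>>(d_in[buf], d_out[buf]);
    cudaMemcpyAsync(h_out[buf], d_out[buf], chunk_size,
                    cudaMemcpyDeviceToHost, streams[buf]);
}

for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

While chunk i is being processed in one stream, chunk i+1 can already be copied in through the other stream, so the copy engines and the SMs stay busy at the same time.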

Asynchrony
An asynchronous function returns as soon as it is called. There is generally an interface to check whether the function is done and to wait for completion. Kernel launches are asynchronous; cudaMemcpy is not.
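
As a small illustration (the kernel and do_cpu_work are hypothetical names):

kernel<<<grid, block>>>(d_data);    // returns immediately; the kernel runs asynchronously
do_cpu_work();                      // CPU work here overlaps with the kernel
cudaDeviceSynchronize();            // block the CPU until all queued GPU work has finished

cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // blocking: returns only when the copy is done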

cudaMemcpyAsync
Convenient asynchronous memcpy! It takes arguments similar to the normal cudaMemcpy, plus an optional stream.

while (1) {
    cudaMemcpyAsync(d_in, h_in, in_size, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_in, d_out);
    cudaMemcpyAsync(out, d_out, out_size, cudaMemcpyDeviceToHost);
}

Can anyone think of any issues with this code?

CUDA Streams
In the previous example, we need the cudaMemcpyAsync to finish before the kernel starts. Luckily, CUDA already handles this dependency for us. Streams are how we deal with asynchronous behavior: they let us enforce the ordering of operations and express dependencies such as "the copy must finish before the kernel can start". There is a useful blog post describing streams.
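
A minimal sketch of the ordering guarantee, reusing the buffers from the earlier loop (names hypothetical):

cudaStream_t s;
cudaStreamCreate(&s);

// Operations issued into the same stream execute in issue order on the GPU,
// so the kernel cannot start before the host -> device copy has finished.
cudaMemcpyAsync(d_in, h_in, in_size, cudaMemcpyHostToDevice, s);
kernel<<<grid, block, 0, s>>>(d_in, d_out);
cudaMemcpyAsync(h_out, d_out, out_size, cudaMemcpyDeviceToHost, s);

cudaStreamSynchronize(s);   // the CPU waits until everything queued in s is done
cudaStreamDestroy(s);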

The null / default stream
When no stream is specified, an operation only starts after all other GPU operations have finished, and only one null-stream operation runs on the GPU at a time. CPU code can still run concurrently with the default stream.

Stream example

cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);

for (int i = 0; i < 2; i++) {
    kernel<<<grid, block, shmem, s[i]>>>(d_outs[i], d_ins[i]);
    cudaMemcpyAsync(h_outs[i], d_outs[i], size, dir, s[i]);
}

for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(s[i]);
    cudaStreamDestroy(s[i]);
}

The kernels (and copies) in the two streams run in parallel! Here dir is the copy direction (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, etc.).

CUDA events
Streams synchronize work on the GPU (though we can synchronize the CPU with the GPU using cudaStreamSynchronize). Events are a simpler way to enforce CPU/GPU synchronization. They are also useful for timing!

Events example

#define START_TIMER() {                                     \
    gpuErrChk(cudaEventCreate(&start));                     \
    gpuErrChk(cudaEventCreate(&stop));                      \
    gpuErrChk(cudaEventRecord(start));                      \
}

#define STOP_RECORD_TIMER(name) {                           \
    gpuErrChk(cudaEventRecord(stop));                       \
    gpuErrChk(cudaEventSynchronize(stop));                  \
    gpuErrChk(cudaEventElapsedTime(&name, start, stop));    \
    gpuErrChk(cudaEventDestroy(start));                     \
    gpuErrChk(cudaEventDestroy(stop));                      \
}
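
One possible way to use these macros (a sketch: it assumes cudaEvent_t start, stop are declared where the macros can see them, and that gpuErrChk is the course's error-checking wrapper):

cudaEvent_t start, stop;       // used inside the macros above
float kernel_ms = -1.0f;

START_TIMER();
kernel<<<grid, block>>>(d_in, d_out);
STOP_RECORD_TIMER(kernel_ms);  // kernel_ms now holds the elapsed GPU time in milliseconds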

Events methods
cudaEventRecord - records that an event has occurred. Recording happens not at the time of the call but after all preceding operations on the GPU have finished.
cudaEventSynchronize - the CPU waits for the event to be recorded.
cudaEventElapsedTime - computes the time between the recordings of two events.

Other stream/event methods
cudaStreamAddCallback, cudaStreamWaitEvent, cudaStreamQuery, cudaEventQuery, cudaDeviceSynchronize.
Event recording can also be parameterized to happen only after all preceding operations complete in a given stream (rather than in all streams). cudaStreamAddCallback lets host code run after all preceding items in a stream have finished. cudaStreamWaitEvent causes a stream to pause until an event is recorded. cudaDeviceSynchronize waits for all scheduled GPU operations to finish.
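
For example, cudaStreamWaitEvent can express a dependency between two streams without blocking the CPU (produce, consume, and the stream names are hypothetical):

cudaEvent_t input_ready;
cudaEventCreate(&input_ready);

produce<<<grid, block, 0, stream_a>>>(d_buf);   // producer runs in stream_a
cudaEventRecord(input_ready, stream_a);         // recorded once produce() has finished

cudaStreamWaitEvent(stream_b, input_ready, 0);  // stream_b pauses here until the event is recorded...
consume<<<grid, block, 0, stream_b>>>(d_buf);   // ...so consume() sees produce()'s output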

CPU/GPU communication
How do the CPU and GPU communicate? How does a kernel get issued? How can they talk about memory addresses?

Virtual Memory
We could give a week of lectures on virtual memory. The key idea: the memory addresses used in programs do not correspond to physical locations in memory. A program deals solely in virtual addresses, and a table maps (process id, virtual address) to a physical address. It is an extremely important topic in systems, and a beautiful idea.

What does virtual memory give us?
Each process can act as if it is the only process running: the same virtual address in different processes can point to different physical addresses (and values). Each process can also use more than the total system memory, because pages of data can be stored on disk when there is no room in physical memory. The operating system can move pages around physical memory and disk as needed.

Unified Virtual Addressing
On a 64-bit OS with a GPU of CC >= 2.0, CPU and GPU pointers live in disjoint ranges of a single virtual address space, which makes it possible to figure out at runtime which memory an address lives in. NVIDIA calls this unified virtual addressing (UVA). One consequence: cudaMemcpy(dst, src, size, cudaMemcpyDefault) works, with no need to specify cudaMemcpyHostToDevice, etc.
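
A small sketch of what UVA allows (dst and src may each be host or device pointers; the names are hypothetical):

// The runtime can tell from the pointer values which side each buffer lives on,
// so cudaMemcpyDefault works for any host/device combination.
cudaMemcpy(dst, src, size, cudaMemcpyDefault);

// It is also possible to query where a pointer lives at runtime.
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, src);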

Virtual memory and the GPU
To move data from CPU to GPU, the GPU must access data on the host, but the GPU is given a virtual address. There are 2 options:
- for each word, have the CPU look up the physical address and then perform the copy. Slow!
- tell the OS to keep the page at a fixed physical location ("pinning") and directly access physical memory on the host from the GPU (direct memory access, a.k.a. DMA). Fast!

Memcpy
cudaMemcpy(Async) works in three steps:
1. Pin a host buffer in the driver.
2. Copy data from the user array into the pinned buffer.
3. Copy data from the pinned buffer to the GPU.

Command buffers (diagram courtesy of the CUDA Handbook)

[Diagram: the host writes commands into a circular buffer held in pinned host memory; the device reads them.]

Commands are communicated through a circular buffer: the host writes, the device reads.

Taking advantage of pinning
cudaMallocHost allocates pinned memory on the host; cudaFreeHost frees it.
Advantages:
- pointers to pinned host buffers can be dereferenced on the device! (though this generates lots of PCI-Express (PCI-E) traffic :( )
- cudaMemcpy is considerably faster when copying to/from pinned host memory.
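
A hedged sketch of the basic workflow (N, d_buf, and stream are hypothetical, and the kernel is omitted):

float *h_buf;
cudaMallocHost((void **) &h_buf, N * sizeof(float));   // pinned (page-locked) host memory

// ... fill h_buf on the CPU ...

// Copies to/from pinned memory avoid the extra staging copy and can be
// truly asynchronous when a stream is supplied.
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);

cudaStreamSynchronize(stream);   // make sure the copy is done before freeing
cudaFreeHost(h_buf);             // pinned memory must be freed with cudaFreeHost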

Pinned host memory use cases
- data that only needs to be loaded and stored once
- self-referential data structures that are not easy to copy (such as a linked list)
- delivering output as soon as possible (rather than waiting for kernel completion and a memcpy)
Note that you must still synchronize and wait for the kernel to finish before accessing its final result on the host.
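
One hedged sketch of the "deliver output as soon as possible" use case, using mapped pinned memory so the kernel can dereference a host buffer directly. This is only a sketch: producer_kernel and N are hypothetical, and on older setups cudaSetDeviceFlags(cudaDeviceMapHost) may need to be called before any other CUDA work:

int *h_result, *d_result_alias;
cudaHostAlloc((void **) &h_result, N * sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **) &d_result_alias, h_result, 0);

// The kernel writes straight into host memory over PCI-E, with no explicit memcpy,
// so results can become visible on the host as soon as they are written.
producer_kernel<<<grid, block>>>(d_result_alias);

cudaDeviceSynchronize();   // still must wait before trusting the final result
// h_result now holds the kernel's output
cudaFreeHost(h_result);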

Disadvantages of pinning
Pinned pages limit the freedom of OS memory management, so cudaMallocHost will fail (due to no memory being available) long before malloc does. Coalesced accesses are extra important when accessing pinned host memory from the device, and there are potentially tricky concurrency issues.

Unified (managed) memory
You can think of unified/managed memory as "smart pinned memory": the driver is allowed to cache the memory on the host or on any GPU. It is available on CC >= 3.0. Allocate with cudaMallocManaged and free with cudaFree.
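
A minimal sketch of the managed-memory workflow (N and the kernel are hypothetical):

float *data;
cudaMallocManaged((void **) &data, N * sizeof(float));   // one pointer, visible to CPU and GPU

for (int i = 0; i < N; i++)        // initialize directly on the host
    data[i] = (float) i;

kernel<<<grid, block>>>(data);     // the same pointer works on the device
cudaDeviceSynchronize();           // wait before touching data on the host again

// data[] now holds the kernel's results
cudaFree(data);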

Unified memory uses & advantages Same use cases as pinned host memory, but also very useful for prototyping (because it’s very easy). You’ll likely be able to output perform managed memory with tuned streams/async memcpy’s, but managed memory gives solid performance for very little effort. Future hardware support (NVLink, integrated GPUs)