Multi-GPU Programming

Presentation transcript:

Multi-GPU Programming
Martin Kruliš (v1.1), 05.01.2017

Multi-GPU Systems
Connecting multiple GPUs to one host raises two main issues:
- Workload division and management
- Sharing PCI-Express and host-memory throughput
[Figure: example NUMA host architecture; GPUs attached over PCIe to CPUs/chipsets that are interconnected by QPI, each CPU with its own memory]

Multi-GPU Systems
Detection and selection:
- cudaGetDeviceCount(), cudaSetDevice()
- Each device may be queried manually for its properties: cudaGetDeviceProperties(), cudaDeviceGetAttribute()
- A stream may be created for each device; the stream then determines which device an operation runs on
- Automatic selection of the most suitable device: cudaChooseDevice(&device, &props)
- Selecting devices by their physical layout: cudaDeviceGetByPCIBusId(&device, pciId), cudaDeviceGetPCIBusId()
Note: The devices visible to the application can be restricted by the CUDA_VISIBLE_DEVICES environment variable, which may contain a list of device indices. The application always enumerates the visible devices as 0…N-1, where N is the number of visible devices.
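The enumeration and per-device stream setup described above can be sketched as follows (a minimal host-side example; error checking omitted for brevity):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, dev);
        std::printf("Device %d: %s (CC %d.%d)\n",
                    dev, props.name, props.major, props.minor);
    }

    // One stream per device: a stream is permanently bound to the
    // device that was current when cudaStreamCreate() was called, so
    // later asynchronous operations on that stream run on that device.
    std::vector<cudaStream_t> streams(count);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaStreamCreate(&streams[dev]);
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaStreamDestroy(streams[dev]);
    }
    return 0;
}
```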

Workload Division
Task management is similar on GPU and CPU; e.g., each task must have sufficient size to amortize its dispatch overhead.
- Static task scheduling works only in special cases (e.g., all tasks have the same size and all GPUs are identical).
- Dynamic task scheduling relies on oversubscription (many more tasks than devices); tasks are dispatched to devices as the devices become available. It is more complex on GPUs, since the copy-work-copy pipeline must be maintained for each device.
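A minimal sketch of dynamic scheduling: one host thread per GPU pulls task indices from a shared atomic counter, so faster devices simply grab more tasks. The processTask routine is a hypothetical placeholder that would implement the copy-work-copy pipeline on the given stream.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-task routine: upload input, launch the kernel, and
// download the result, all asynchronously on the given stream.
void processTask(int taskIdx, cudaStream_t stream);

void runTasks(int taskCount) {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    std::atomic<int> nextTask{0};       // shared queue: oversubscription

    std::vector<std::thread> workers;
    for (int dev = 0; dev < devCount; ++dev) {
        workers.emplace_back([&, dev] {
            cudaSetDevice(dev);         // current device is per host thread
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            for (int t; (t = nextTask.fetch_add(1)) < taskCount; )
                processTask(t, stream); // idle devices fetch the next task
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        });
    }
    for (auto &w : workers) w.join();
}
```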

Peer-to-Peer Transfers
Copying memory between devices:
- Special methods copy memory directly between two devices: cudaMemcpyPeer(dst, dstDev, src, srcDev, size), cudaMemcpyPeerAsync(…, stream)
- The synchronous version is asynchronous with respect to the host, but it is synchronized with other asynchronous operations; it works as a barrier on both devices.
Portable memory allocation:
- Page-locked host memory to be used with multiple GPUs must be allocated with the cudaHostAllocPortable flag.
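A sketch of both points, assuming at least two devices: a direct device-to-device copy on a stream, plus a portable page-locked buffer that any device may use for staging. Sizes and usage are illustrative.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

void peerCopy(std::size_t size) {
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&src, size);
    cudaSetDevice(1);
    cudaMalloc(&dst, size);

    // Portable page-locked buffer: page-locked for all devices,
    // not only for the device that was current at allocation time.
    void *host = nullptr;
    cudaHostAlloc(&host, size, cudaHostAllocPortable);

    // Direct device-to-device copy, overlapping with other work
    // enqueued in different streams.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(dst, 1, src, 0, size, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(host);
    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
}
```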

Peer-to-Peer Memory Access
Direct inter-GPU data exchange, possibly without staging the data in host memory:
- Available since compute capability 2.0 (Tesla-class devices), in 64-bit processes only
- cudaDeviceCanAccessPeer(), cudaDeviceEnablePeerAccess()
Unified Virtual Addressing:
- Host and device buffers share one virtual address space
- The unifiedAddressing device property must be 1
- cudaPointerGetAttributes() identifies where a given pointer resides
- Devices can directly use cudaHostAlloc() pointers
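Enabling peer access is one-directional and must be requested from each side; a minimal sketch for devices 0 and 1:

```cpp
#include <cuda_runtime.h>

// Enable bidirectional peer access between devices 0 and 1,
// if the hardware topology permits it.
bool enablePeerAccess() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10)
        return false;  // e.g., GPUs under different PCIe root complexes

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    return true;  // kernels on either device may now dereference
                  // pointers allocated on the other device
}
```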

IPC
Inter-Process Communication:
- CUDA resources (device buffer pointers, events, …) are restricted to the process that created them
- Sharing across processes may nevertheless be inevitable, e.g., when integrating multiple CUDA applications
The IPC API allows sharing these resources:
- cudaIpcGetMemHandle(), cudaIpcGetEventHandle() return a cudaIpcMemHandle_t or cudaIpcEventHandle_t, respectively
- The handle can be transferred to another process via ordinary IPC mechanisms
- cudaIpcOpenMemHandle(), cudaIpcOpenEventHandle() open a handle passed on from another process
- cudaIpcCloseMemHandle() releases the mapping
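The two sides of memory-handle sharing can be sketched as follows; how the handle bytes travel between the processes (pipe, socket, shared memory) is left to the application:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Exporting process: allocate a device buffer and obtain a handle
// that another process can open.
cudaIpcMemHandle_t exportBuffer(void **devPtr, std::size_t size) {
    cudaMalloc(devPtr, size);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *devPtr);
    return handle;   // send these bytes to the peer process
}

// Importing process: map the remote allocation into this process.
// The returned pointer stays valid until cudaIpcCloseMemHandle().
void *importBuffer(cudaIpcMemHandle_t handle) {
    void *devPtr = nullptr;
    cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
    return devPtr;
}
```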

Discussion