Efficient Intranode Communication in GPU-Accelerated Systems

Presentation transcript:

Efficient Intranode Communication in GPU-Accelerated Systems
Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-chun Feng, Xiaosong Ma
Presented by: James Dinan, James Wallace Givens Postdoctoral Fellow, Argonne National Laboratory

Fields using GPU Accelerators at Argonne: Computed Tomography, Micro-tomography, Cosmology, Bioengineering
Bioengineering (ALCF INCITE): Irregular heartbeats are one of the leading causes of death around the world. The treatment and prevention of cardiac rhythm disorders is extremely challenging because the electrical signals that control the heart's rhythm are governed by a complex, multi-scale biological process. Jeff Fox (at Cornell) is conducting large-scale simulations to model this.
Microtomography: This is experimental data from the Advanced Photon Source (APS) at ANL. The data was generated by MSD researchers interested in understanding the internal structure of a foam material. It consists of a stack of 2D slices, 20 GB of data in total.
Cosmology (INCITE): This simulation follows the growth of density perturbations in both gas and dark matter components in a volume 1 billion light years on a side, beginning shortly after the Big Bang and evolving to half the present age of the universe. The animation illustrates the expansion of the universe over time and highlights how individual structures (such as galaxy clusters) collapse due to gravity while simultaneously being pushed farther and farther from each other.
Acknowledgement: Venkatram Vishwanath @ ANL

GPU-Based Supercomputers

GPU-Accelerated High Performance Computing
- GPUs are general-purpose, highly parallel processors with high FLOPs/Watt and FLOPs/$
- Unit of execution: the kernel
- Separate memory subsystem
- Programming models: CUDA, OpenCL, …
- Clusters with GPUs are becoming common: multiple GPUs per node, nonuniform node architecture, and node topology plays a role in performance
- New programmability and performance challenges for programming models and runtime systems
(Figure: Keeneland node architecture)

Heterogeneity and Intra-node GPU-GPU Transfers
(Figures: GPU-GPU transfer on the same I/O hub vs. across different I/O hubs)
- CUDA provides GPU-GPU DMA using CUDA IPC
- Same I/O hub: DMA is best
- Different I/O hubs: shared memory is best, due to a mismatch between PCIe and QPI ordering semantics
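The CUDA IPC path mentioned above can be illustrated with a short sketch: one process exports a handle to its device allocation, the peer maps it and issues a direct device-to-device copy. This is only a minimal sketch; the handle exchange (over MPI or shared memory) is elided, and the function names are illustrative, not MPI-ACC internals.

```c
/* Minimal CUDA IPC sketch: export a device buffer in one process, map it
 * and pull from it in another.  Error checking is omitted for brevity. */
#include <cuda_runtime.h>

/* Exporting process: create an IPC handle for a cudaMalloc'd buffer. */
cudaIpcMemHandle_t export_buffer(void *d_buf)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    return handle;   /* send this handle to the peer over any host channel */
}

/* Importing process: map the remote buffer and copy device-to-device. */
void pull_from_peer(void *d_dst, cudaIpcMemHandle_t handle, size_t nbytes)
{
    void *d_remote = NULL;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, nbytes, cudaMemcpyDeviceToDevice);  /* GPU-GPU DMA */
    cudaIpcCloseMemHandle(d_remote);
}
```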

MPI-ACC: Programmability and Performance
- GPU global memory: separate address space, manually managed
- Message Passing Interface: the most popular parallel programming model, but it handles host memory only
- Integrate accelerator awareness with MPI (ANL, NCSU, VT) for productivity and performance benefits
(Figure: node diagram with GPU and CPU multiprocessors, MPI ranks 0-3, shared memory, main memory, GPU global memory, PCIe/HT/QPI links, and the NIC)

Current MPI+GPU Programming
- MPI operates on data in host memory only
- Manual copies between host and GPU memory serialize the PCIe and interconnect transfers
- Applications can do better than this, but then incur the protocol overheads multiple times
- Productivity: manual data movement
- Performance: inefficient, unless a large, non-portable investment is made in tuning
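A minimal sketch of this manual pattern, assuming a staging buffer in host memory; buffer and function names are illustrative:

```c
/* Manual MPI+GPU pattern: stage the GPU buffer through host memory before
 * and after the MPI call.  The PCIe copy and the MPI transfer run back to
 * back, so the two links are never busy at the same time. */
#include <mpi.h>
#include <cuda_runtime.h>

void send_from_gpu(const void *d_buf, void *h_staging, size_t nbytes,
                   int dest, MPI_Comm comm)
{
    cudaMemcpy(h_staging, d_buf, nbytes, cudaMemcpyDeviceToHost); /* PCIe copy */
    MPI_Send(h_staging, (int)nbytes, MPI_BYTE, dest, 0, comm);    /* network/shared memory */
}

void recv_to_gpu(void *d_buf, void *h_staging, size_t nbytes,
                 int src, MPI_Comm comm)
{
    MPI_Recv(h_staging, (int)nbytes, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_staging, nbytes, cudaMemcpyHostToDevice); /* PCIe copy */
}
```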

MPI-ACC Interface: Passing GPU Buffers to MPI
(Figure: MPI-ACC interface cost relative to CPU-CPU; an MPI datatype carrying attributes such as BUFTYPE=OCL, CL_Context, CL_Mem, CL_Device_ID, CL_Cmd_queue)
- Unified Virtual Address (UVA) space: allows device pointers to be passed to MPI routines directly; currently supported only by CUDA and newer NVIDIA GPUs; the query cost is high and is added to every operation, including CPU-CPU
- Explicit interface: e.g., MPI_CUDA_Send(…), or overloading
- MPI datatypes: attach buffer attributes to the datatype; compatible with MPI and many accelerator models
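A minimal sketch of the UVA option, assuming a UVA-enabled, CUDA-aware MPI build (this behavior is a property of the MPI library, not of the MPI standard):

```c
/* UVA option: hand a cudaMalloc'd pointer straight to MPI_Send and let the
 * library query (at a cost) whether it is host or device memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void send_device_buffer(size_t nbytes, int dest, MPI_Comm comm)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, nbytes);
    /* ... launch a kernel that fills d_buf ... */

    /* Device pointer passed directly; no explicit staging copy in the app. */
    MPI_Send(d_buf, (int)nbytes, MPI_BYTE, dest, 0, comm);

    cudaFree(d_buf);
}
```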

MPI-ACC: Integrated, Optimized Data Movement
- Use MPI for all data movement
- Support multiple accelerators and programming models (CUDA, OpenCL, …)
- Allow applications to portably leverage system-specific optimizations
- Internode data movement [Aji HPCC'12]: pipelining to fully utilize PCIe and network links; GPUDirect (CUDA), where multi-device pinning eliminates data copying; handle caching (OpenCL) to avoid expensive command-queue creation
- Intranode data movement: a shared memory protocol [Ji ASHES'12], where sender and receiver drive independent DMA transfers, and a direct DMA protocol [Ji HPCC'12] using GPU-GPU DMA transfer (CUDA IPC)
- Both protocols are needed, because PCIe limitations serialize DMA across I/O hubs

Integrated Support for User-Defined Datatypes
- MPI_Send(buffer, datatype, count, to, …), MPI_Recv(buffer, datatype, count, from, …)
- What if the datatype is noncontiguous? CUDA doesn't support arbitrary noncontiguous transfers
- Pack data on the GPU [Jenkins '12]: flatten the datatype tree representation and use a packing kernel that can saturate the memory bus/banks
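A hedged sketch of GPU-side packing for the simplest noncontiguous case, a strided vector layout; the real design [Jenkins '12] handles arbitrary flattened datatype trees, and all names here are illustrative (compile with nvcc):

```c
/* Pack 'count' blocks of 'blocklen' bytes, separated by 'stride' bytes,
 * into a contiguous staging buffer on the GPU.  One thread per output
 * byte so that large messages can saturate the memory system. */
#include <cuda_runtime.h>

__global__ void pack_vector(char *dst, const char *src,
                            int count, int blocklen, int stride)
{
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long total = (long)count * blocklen;
    if (i < total) {
        long block  = i / blocklen;      /* which block             */
        long offset = i % blocklen;      /* byte within that block  */
        dst[i] = src[block * stride + offset];
    }
}

/* Launch helper: afterwards the contiguous buffer can be copied or sent
 * as a plain byte stream. */
void pack_on_gpu(char *d_dst, const char *d_src,
                 int count, int blocklen, int stride, cudaStream_t s)
{
    long total = (long)count * blocklen;
    int threads = 256;
    unsigned blocks = (unsigned)((total + threads - 1) / threads);
    pack_vector<<<blocks, threads, 0, s>>>(d_dst, d_src, count, blocklen, stride);
}
```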

Intranode Communication in MPICH2 (Nemesis)
(Figure: ranks 0-3 within a node communicating through shared memory)
- Short message transport: a shared message queue; a large, persistent queue; single-buffer transport
- Large message transport: a point-to-point shared buffer; a ring buffer allocated at first connection; multi-buffered transport

Non-integrated Intranode Communication
(Figure: Process 0 and Process 1 moving GPU data through host memory and the Nemesis shared-memory region)
- Communication without accelerator integration costs 2 PCIe data copies plus 2 main-memory copies
- The transfers are serialized

Performance Potential: Intranode Bandwidth
- Bandwidth measured when using manual data movement
- Theoretical node bandwidth: 6 GB/sec, achieved for host-host transfers
- Observed bandwidth: 1.6 – 2.8 GB/sec when one or two GPU buffers are involved (one or two extra copies)

Eliminating Extra Copies
(Figure: Process 0 and Process 1, with the GPU transferring directly into the Nemesis shared-memory region)
- Integration allows direct transfer into the shared memory buffer
- LMT: sender and receiver drive the transfer concurrently
- Pipelined data transfer fully utilizes the PCIe links

Copying Data Between Host and Device
(Figure: copy performance, host-to-device and host-to-host)
Three choices for selecting the right copy operation:
- UVA-default: use cudaMemcpy(…, cudaMemcpyDefault)
- Query-and-copy: query the buffer type via UVA, then dispatch memcpy or cudaMemcpy
- Parameterized-copy: pass a parameter describing each buffer
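A sketch of the query-and-copy choice, falling back to cudaMemcpyDefault when a device buffer is involved; the helper names are illustrative, and the version guard reflects the rename of the pointer-attribute field across CUDA releases:

```c
/* Dispatch between memcpy and cudaMemcpy based on where the buffers live.
 * Assumes a UVA-capable CUDA runtime; copy_host_or_device() is an
 * illustrative helper, not part of MPICH or MPI-ACC. */
#include <string.h>
#include <cuda_runtime.h>

static int is_device_ptr(const void *p)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        cudaGetLastError();   /* clear the error raised by unknown host pointers */
        return 0;             /* treat unrecognized pointers as host memory */
    }
#if CUDART_VERSION >= 11000
    return attr.type == cudaMemoryTypeDevice;
#else
    return attr.memoryType == cudaMemoryTypeDevice;
#endif
}

void copy_host_or_device(void *dst, const void *src, size_t len)
{
    if (is_device_ptr(dst) || is_device_ptr(src))
        cudaMemcpy(dst, src, len, cudaMemcpyDefault);  /* UVA-default path */
    else
        memcpy(dst, src, len);                          /* pure host path  */
}
```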

Large Message Transport Protocol
- A shared buffer is mapped between each pair of communicating processes
- Enables pipelined transfer: sender and receiver drive DMA concurrently
- Fixed-size ring buffer composed of a set of fixed-size partitions
- R: the receiver's pointer; S: the sender's pointer
- Partition size field: set to the data length by the sender, reset to zero by the receiver
(Figure: ring buffer with the R and S pointers)
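A data-structure sketch of the ring buffer described above, assuming a fixed partition count; the names are illustrative rather than the actual MPICH2/Nemesis identifiers:

```c
/* Layout of the LMT ring buffer as described on the slide. */
#include <stddef.h>

#define RB_NUM_PARTITIONS 16
#define RB_PARTITION_SIZE (32 * 1024)      /* default 32 KB host-only unit */

typedef struct {
    /* Set to the data length by the sender; reset to zero by the receiver
     * after the partition has been consumed.  len == 0 means "free". */
    volatile size_t len;
    char data[RB_PARTITION_SIZE];
} partition_t;

typedef struct {
    partition_t part[RB_NUM_PARTITIONS];   /* mapped into both processes */
} ring_buffer_t;

/* Each process keeps a private cursor into the ring: S is the sender's
 * next partition to fill, R is the receiver's next partition to drain. */
```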

Extending the LMT Protocol to Accelerators
- Sender and receiver issue asynchronous PCIe data transfers
- Add Ra and Sa pointers to mark the section of the R/S segment currently in use by PCIe transfers
- Proactively generate PCIe data transfers: move to the next partition, start a new PCIe copy, and repeat until the ring buffer is full (or empty, on the receiver side)
- Update R/S later, when the PCIe operations are checked for completion
(Figure: ring buffer with the Ra, Sa, and S pointers)
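A hedged sketch of how the sender side could drive this: issue asynchronous device-to-host copies into free partitions (advancing Sa), then later publish completed partitions and advance S. The partition layout mirrors the previous sketch, and none of the names are actual MPICH2 symbols:

```c
/* Sender-side extension: proactive asynchronous copies plus deferred
 * publication, as described on the slide. */
#include <cuda_runtime.h>

typedef struct {
    volatile size_t len;                 /* published length (0 == free)   */
    char data[256 * 1024];               /* 256 KB GPU-path partition size */
} partition_t;

typedef struct {
    partition_t *part;                   /* shared-memory ring buffer      */
    int          nparts;
    int          S;                      /* last published partition       */
    int          Sa;                     /* next partition to issue into   */
    cudaEvent_t *done;                   /* one event per partition        */
    size_t      *pending_len;            /* bytes in each in-flight copy   */
} sender_state_t;

/* Issue copies from the device buffer until the ring is busy or data runs out. */
static void issue_copies(sender_state_t *st, const char *dev_src,
                         size_t *offset, size_t total, cudaStream_t stream)
{
    while (*offset < total && st->part[st->Sa].len == 0 &&
           st->pending_len[st->Sa] == 0) {
        size_t chunk = total - *offset;
        if (chunk > sizeof st->part[st->Sa].data)
            chunk = sizeof st->part[st->Sa].data;
        cudaMemcpyAsync(st->part[st->Sa].data, dev_src + *offset, chunk,
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(st->done[st->Sa], stream);  /* marks this chunk done */
        st->pending_len[st->Sa] = chunk;
        *offset += chunk;
        st->Sa = (st->Sa + 1) % st->nparts;         /* advance only Sa here  */
    }
}

/* Retire finished copies in order: publish len and advance S. */
static void retire_copies(sender_state_t *st)
{
    while (st->pending_len[st->S] != 0 &&
           cudaEventQuery(st->done[st->S]) == cudaSuccess) {
        st->part[st->S].len = st->pending_len[st->S];  /* visible to receiver */
        st->pending_len[st->S] = 0;
        st->S = (st->S + 1) % st->nparts;
    }
}
```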

Experimental Evaluation
- Keeneland node architecture: 2x Intel Xeon X5660 CPUs, 24 GB memory, 3x NVIDIA M2070 GPUs
- Parameter optimization: shared ring buffer and message queue
- Communication benchmarking: OSU latency/bandwidth
- Stencil2D from the SHOC benchmark suite

Eager Parameters: Message Queue Element Size and Eager/LMT Threshold
- GPU-GPU transfers of varying size; "shared buf" is manual data transfer
- A large message queue can support more requests
- Eager up to 64 kB, LMT above 64 kB: the same threshold as the host-only case

LMT Parameter: Shared Ring Buffer Unit Size
- Host-to-host: default buffer unit size of 32 KB
- GPU involved: use 256 KB, since PCIe bandwidth favors larger messages
- The parameter choice requires knowledge of the buffer locations on both sender and receiver; this information is exchanged during the handshaking phase of the LMT protocol
(Figure: intranode bandwidth, two processes)

Latency & Bandwidth Improvement
- Latency: less impact in the D2D case, where PCIe latency is dominant; improvement of 6.7% (D2D), 15.7% (H2D), 10.9% (D2H)
- Bandwidth: discrepancy between the two PCIe bus directions; improvement of 56.5% (D2D), 48.7% (H2D), 27.9% (D2H); nearly saturates the 6 GB/sec peak in the D2H case

Application Performance: Stencil2D Kernel
- Nine-point stencil computation from the SHOC benchmark suite
- Halo exchange with neighboring processes benefits from the latency improvement
- The relative fraction of communication time decreases with problem size
- Average execution-time improvement of 4.3%
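For reference, a minimal sketch of the halo-exchange pattern in a 1-D decomposition, using nonblocking MPI calls; buffer names and the decomposition are illustrative, not taken from the SHOC code:

```c
/* Exchange one ghost row with each of the up/down neighbors.  Neighbors at
 * the domain boundary can be MPI_PROC_NULL.  With a GPU-integrated MPI
 * these rows could be device buffers; here they are shown as host buffers. */
#include <mpi.h>

void exchange_halos(double *top_send, double *top_recv,
                    double *bot_send, double *bot_recv,
                    int row_len, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Post receives for the ghost rows from both neighbors. */
    MPI_Irecv(top_recv, row_len, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(bot_recv, row_len, MPI_DOUBLE, down, 1, comm, &req[1]);

    /* Send boundary rows to the corresponding neighbors. */
    MPI_Isend(top_send, row_len, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(bot_send, row_len, MPI_DOUBLE, down, 0, comm, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```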

Conclusions
- Accelerators are ubiquitous and a moving target: exciting new opportunities for systems researchers, but they require evolution of the HPC software stack
- Integrate accelerator awareness with MPI: support multiple accelerators and programming models; the goals are productivity and performance
- Optimized intranode communication: eliminate extra main-memory copies, pipeline data flow in the ring buffer, and optimize data copying

Questions?

Backup Slides

Host-side Memcpy Bandwidth