DMA-Assisted, Intranode Communication in GPU-Accelerated Systems

Feng Ji*, Ashwin M. Aji†, James Dinan‡, Darius Buntinas‡, Pavan Balaji‡, Rajeev Thakur‡, Wu-chun Feng†, Xiaosong Ma*&

* Department of Computer Science, North Carolina State University
† Department of Computer Science, Virginia Tech
‡ Mathematics and Computer Science Division, Argonne National Laboratory
& Computer Science and Mathematics Division, Oak Ridge National Laboratory
Background: CPU-GPU Clusters

Graphics Processing Units (GPUs)
– Many-core architecture for high performance and efficiency (FLOPs, FLOPs/Watt, FLOPs/$)
– Programming models: CUDA, OpenCL, OpenACC
– Explicitly managed global memory and separate address spaces

CPU clusters
– Most popular parallel programming model: Message Passing Interface (MPI)
– Host memory only

Disjoint memory spaces!

[Figure: node architecture with MPI ranks on CPU cores, main memory, and a NIC, connected over PCIe and HT/QPI to a GPU with multiprocessors, shared memory, and global memory]
GPU-Accelerated High-Performance Computing

Clusters with GPUs are becoming common
– Multiple GPUs per node
– Non-uniform node architecture
– Node topology plays a role in performance

New programmability and performance challenges for programming models and runtime systems
Programming CPU-GPU Clusters (e.g., MPI+CUDA)

[Figure: two ranks, each with CPU main memory and GPU device memory; PCIe links CPU and GPU within a node, HT/QPI links the two CPUs]

Rank 0:
    if (rank == 0) {
        cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, …);
    }

Rank 1:
    if (rank == 1) {
        MPI_Recv(host_buf, …);
        cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);
    }
Current Limitations of Programming CPU-GPU Clusters (e.g., MPI+CUDA)

– Manual blocking copies between host and GPU memory serialize the PCIe and interconnect transfers
– Manual non-blocking copies are better, but incur the protocol overheads multiple times (see the sketch below)
– Programmability/productivity: manual data movement leads to complex, non-portable code
– Performance: inefficient and non-portable performance optimizations
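To make the cost of the manual approach concrete, the following is a minimal sender-side sketch (not from the slides) of the non-blocking, double-buffered pipelining an application would otherwise write by hand; the chunk size, buffer names, and function name are illustrative assumptions.

    /* Hypothetical sketch: manually pipelining a GPU send over MPI.
     * The application itself must manage chunking, pinned staging buffers,
     * and overlap of PCIe copies with network sends. */
    #include <stddef.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)    /* 1 MiB chunks: an illustrative choice */

    void send_gpu_buffer(const char *dev_buf, size_t size, int dest, MPI_Comm comm)
    {
        char *stage[2];
        cudaStream_t stream[2];
        MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
        int b = 0;

        for (int i = 0; i < 2; i++) {
            cudaMallocHost((void **)&stage[i], CHUNK);  /* pinned staging buffer */
            cudaStreamCreate(&stream[i]);
        }

        for (size_t off = 0; off < size; off += CHUNK, b ^= 1) {
            size_t len = (size - off < CHUNK) ? size - off : CHUNK;
            /* Make sure this staging buffer's previous send has finished. */
            MPI_Wait(&req[b], MPI_STATUS_IGNORE);
            /* The PCIe copy of this chunk overlaps the in-flight send of the other. */
            cudaMemcpyAsync(stage[b], dev_buf + off, len,
                            cudaMemcpyDeviceToHost, stream[b]);
            cudaStreamSynchronize(stream[b]);
            MPI_Isend(stage[b], (int)len, MPI_CHAR, dest, 0, comm, &req[b]);
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        for (int i = 0; i < 2; i++) {
            cudaStreamDestroy(stream[i]);
            cudaFreeHost(stage[i]);
        }
    }

The receiver must implement the mirror-image pipeline with matching chunk sizes and tags; this is exactly the system-specific, non-portable tuning that MPI-ACC is meant to absorb into the library.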
Goal of Programming CPU-GPU Clusters (e.g., MPI + any accelerator)

[Figure: the same two-rank diagram, but the MPI library moves data between GPU device memories directly]

Rank 0:
    if (rank == 0) {
        MPI_Send(any_buf, …);
    }

Rank 1:
    if (rank == 1) {
        MPI_Recv(any_buf, …);
    }
MPI-ACC: A Unified Communication Interface

MPIGPU_Send/Recv(user_buf, count, datatype, dest_rank, tag, comm, buftype);

– The buftype flag tells MPI whether to use a PCIe data copy or a host memcpy

Process 0:
    MPIGPU_Send(d_ptr, …, MPIGPU_BUF_GPU);
Process 1:
    MPIGPU_Recv(d_ptr, …, MPIGPU_BUF_GPU);
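A minimal end-to-end usage sketch of the interface above. MPIGPU_Send/Recv and MPIGPU_BUF_GPU are the names given on the slide; the message size, tag, and the assumption that the receive call takes the same argument list (with any status argument omitted) are illustrative.

    /* Sketch: exchanging a GPU-resident buffer with MPI-ACC's extended API.
     * This compiles only against the MPI-ACC prototype headers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        double *d_buf;
        const int N = 1 << 20;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_buf, N * sizeof(double));

        if (rank == 0)
            MPIGPU_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPIGPU_BUF_GPU);
        else if (rank == 1)
            MPIGPU_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPIGPU_BUF_GPU);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Note that no cudaMemcpy staging appears in the application: the library chooses internally between the shared-memory pipeline and the DMA-assisted path described later.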
MPI-ACC: Integrated and Optimized Data Movement

MPI-ACC integrates accelerator awareness with MPI for all data movement
– Programmability/productivity: supports multiple accelerators and programming models (CUDA, OpenCL)
– Performance: allows applications to portably leverage system-specific and vendor-specific optimizations

[Figure: CPUs and GPUs within and across nodes communicating through the network]
MPI-ACC: Integrated and Optimized Data Movement

MPI-ACC integrates accelerator awareness with MPI for all data movement
– "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems" [HPCC '12]

Intranode optimizations
– "DMA-Assisted, Intranode Communication in GPU-Accelerated Systems", Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Rajeev Thakur, Wu-chun Feng, and Xiaosong Ma [this paper]
– "Efficient Intranode Communication in GPU-Accelerated Systems", Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-chun Feng, and Xiaosong Ma [AsHES '12]

Noncontiguous datatypes
– "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments", John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, and Rajeev Thakur. Under review at IEEE Cluster.
Intranode Optimizations: Shared-Memory Protocol [AsHES '12]

– Copy data over PCIe directly into MPI's internal shared-memory buffers
– Chunking and pipelining are done transparently inside the library (see the sketch below)

[Figure: Process 0's GPU global memory is staged through an MPI shared-memory region in host main memory and copied into Process 1's GPU]

Process 0:
    MPIGPU_Send(d_ptr, …, MPIGPU_BUF_GPU);
Process 1:
    MPIGPU_Recv(d_ptr, …, MPIGPU_BUF_GPU);
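A simplified view of what the shared-memory protocol does inside the library, as a sketch only: shm_buf and shm_flag stand for a double-buffered region that the MPI runtime maps into both processes, and the real MPICH implementation differs in its bookkeeping and synchronization.

    /* Hypothetical sender-side sketch of the [AsHES '12] shared-memory path. */
    #include <stddef.h>
    #include <cuda_runtime.h>

    #define SHM_CHUNK (128 * 1024)   /* illustrative chunk size */

    void shm_send_from_gpu(const char *d_src, size_t size,
                           char shm_buf[2][SHM_CHUNK],
                           volatile int shm_flag[2])   /* 0 = empty, 1 = full */
    {
        int b = 0;
        for (size_t off = 0; off < size; off += SHM_CHUNK, b ^= 1) {
            size_t len = (size - off < SHM_CHUNK) ? size - off : SHM_CHUNK;
            while (shm_flag[b])
                ;                                      /* wait until the receiver drains it */
            /* The D2H copy of chunk i overlaps the receiver's H2D copy of chunk i-1. */
            cudaMemcpy(shm_buf[b], d_src + off, len, cudaMemcpyDeviceToHost);
            shm_flag[b] = 1;                           /* mark chunk ready */
        }
    }

The receiver runs the mirror image (wait for the flag, copy the chunk host-to-device, clear the flag); a real implementation also needs proper memory fences around the flags.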
GPUDirect + CUDA IPC

– GPUDirect: DMA-driven peer-to-peer GPU copy
– CUDA IPC: exports a GPU buffer to a different process through a memory handle

Process 0:
    cudaIpcGetMemHandle(&handle, d_ptr);
Process 1:
    cudaIpcOpenMemHandle(&d_ptr_src, handle, …);
    cudaMemcpy(d_ptr, d_ptr_src, …);    /* direct GPU-to-GPU copy */
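The shorthand above omits a few details of the CUDA IPC API that matter later: the open call takes a flag, the mapping must eventually be closed, and the opaque handle has to be shipped to the peer over some channel. A small sketch using the standard CUDA runtime calls; the helper names are illustrative.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Exporting process: obtain a handle for a cudaMalloc'd buffer. */
    cudaIpcMemHandle_t export_buffer(void *d_ptr)
    {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_ptr);
        return handle;   /* ship this opaque struct to the peer, e.g., over MPI */
    }

    /* Importing process: map the peer's buffer and pull it with one DMA copy. */
    void import_and_copy(void *d_dst, cudaIpcMemHandle_t handle, size_t nbytes)
    {
        void *d_src = NULL;
        cudaIpcOpenMemHandle(&d_src, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_src);   /* close before the owner frees the buffer */
    }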
DMA-Assisted Intranode GPU Data Transfer

Motivation
– Eliminate staging through MPI's host-side shared memory
– Reduce the complexity of doing the same thing at the application level
  – The application need not re-invent a synchronization mechanism

What the application would otherwise have to write (completed in the sketch below):

Process 0:
    cudaIpcGetMemHandle(&handle, d_ptr);
    MPI_Send(handle, …);
    MPI_Recv(Msg_done, …);

Process 1:
    MPI_Recv(handle, …);
    cudaIpcOpenMemHandle(&d_ptr_src, handle, …);
    cudaMemcpy(d_ptr, d_ptr_src, …);
    MPI_Send(Msg_done, …);
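Filling in the snippet above so the handshake is concrete: the pull direction and the completion message follow the slide, while the tags, byte counts, and omitted error handling are illustrative choices. This is the code MPI-ACC lets applications delete.

    #include <stddef.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    void ipc_transfer(int rank, void *d_ptr, size_t nbytes)
    {
        cudaIpcMemHandle_t handle;
        int done = 1;

        if (rank == 0) {                       /* owner of the source buffer */
            cudaIpcGetMemHandle(&handle, d_ptr);
            MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            /* Do not reuse or free d_ptr until the peer reports completion. */
            MPI_Recv(&done, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                /* destination: pulls the data */
            void *d_src = NULL;
            MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaIpcOpenMemHandle(&d_src, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(d_ptr, d_src, nbytes, cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(d_src);
            MPI_Send(&done, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }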
DMA-Assisted Intranode GPU Data Transfer

Challenges
– GPUDirect requires GPU peer accessibility
  – Same I/O hub: yes
  – Different I/O hubs: yes for AMD (HT); no for Intel (QPI)
– Overhead of handle open/close
– MPI is unaware of the GPU device topology (see the accessibility check below)

[Figure: GPU topologies of the two test nodes: an Intel node with QPI (GPU 0 on one I/O hub, GPUs 1 and 2 on another) and an AMD node with HT (GPU 0 and GPU 1 on different I/O hubs)]
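The peer-accessibility question can be answered at runtime with the CUDA API. A small sketch, assuming the checking process (the receiver, in the protocol that follows) knows both its own device number and the sender's:

    #include <cuda_runtime.h>

    /* Returns nonzero if the two GPUs can DMA into each other's memory
     * (same I/O hub, or across HT on the AMD node); otherwise the transfer
     * should fall back to the shared-memory path. */
    int gpus_are_peer_accessible(int dev_a, int dev_b)
    {
        int a_to_b = 0, b_to_a = 0;
        cudaDeviceCanAccessPeer(&a_to_b, dev_a, dev_b);
        cudaDeviceCanAccessPeer(&b_to_a, dev_b, dev_a);
        return a_to_b && b_to_a;
    }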
Extending the Large Message Transport (LMT) Protocol

– The LMT protocol supports PUT/GET/COOP modes
– Handles are carried in packets/cookies (a hypothetical cookie layout is sketched below)
– The sender passes GPU info (with the handle) to the receiver
  – The sender always tries to get the handle
  – Getting a handle is a lightweight operation (according to NVIDIA)
  – The receiver decides GPU peer accessibility
  – The receiver passes the decision back to the sender in the CTS
– Fallback option: via shared memory [AsHES '12]

[Figure: PUT mode: the sender gets the source handle and sends an RTS; the receiver decides accessibility, gets the destination handle, and sends a CTS; the sender opens the destination handle, PUTs the data, and sends Done]
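A hypothetical sketch of the extra GPU information carried in the RTS/CTS cookies and of the sender-side PUT; the struct, field, and function names are invented for illustration and are not MPICH's actual internal definitions.

    #include <stddef.h>
    #include <cuda_runtime.h>

    typedef struct {
        int                is_gpu_buf;   /* buffer lives in device memory */
        int                device_id;    /* CUDA device that owns the buffer */
        cudaIpcMemHandle_t handle;       /* obtained eagerly by the sender (cheap) */
    } gpu_lmt_cookie_t;

    /* Receiver, on RTS: decide whether the DMA path is usable; the result is
     * carried back to the sender in the CTS (0 means fall back to shared memory). */
    int choose_dma_path(const gpu_lmt_cookie_t *rts, int my_device)
    {
        int ok = 0;
        if (rts->is_gpu_buf)
            cudaDeviceCanAccessPeer(&ok, my_device, rts->device_id);
        return ok;
    }

    /* Sender, on CTS (PUT mode): open the destination handle and push directly.
     * The LMT layer then sends the Done packet (not shown); the mapping is left
     * open here in anticipation of the lazy-close optimization discussed later. */
    void lmt_put(const gpu_lmt_cookie_t *cts, const void *d_src, size_t nbytes)
    {
        void *d_dst = NULL;
        cudaIpcOpenMemHandle(&d_dst, cts->handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);
    }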
Extending the Large Message Transport (LMT) Protocol (cont.)

[Figure: GET mode: the sender gets the source handle and sends an RTS; the receiver decides accessibility, opens the source handle, GETs the data, and sends Done]

[Figure: COOP mode: the sender gets the source handle and sends an RTS; the receiver decides accessibility, gets the destination handle, and sends a CTS; the source and destination handles are opened and the transfer proceeds as PUT/GET/COOP, followed by Done]
IPC Open/Close Overhead

– Getting a handle is a lightweight operation; opening and closing a handle are NOT
– Do not close a handle as soon as one MPI_Send/Recv pair completes
  – Many MPI programs reuse buffers (including GPU buffers)
  – A lazy-close policy avoids the repeated overhead
– Cache opened handles and their addresses locally (sketched below)
  – On the next transfer, look the handle up in the local cache
  – If found, there is no need to reopen it; just use the cached address
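A hypothetical sketch of the lazy-close handle cache described above: opened handles are kept in a small table keyed by the handle bytes, so a reused GPU buffer never pays the open cost twice. The names, table size, and linear search are illustrative; a real implementation would also evict entries when the exporting process frees the buffer.

    #include <string.h>
    #include <cuda_runtime.h>

    #define CACHE_SLOTS 64

    typedef struct {
        int                used;
        cudaIpcMemHandle_t handle;
        void              *mapped_ptr;
    } ipc_cache_entry_t;

    static ipc_cache_entry_t cache[CACHE_SLOTS];

    void *open_handle_cached(cudaIpcMemHandle_t handle)
    {
        int free_slot = -1;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].used &&
                memcmp(&cache[i].handle, &handle, sizeof(handle)) == 0)
                return cache[i].mapped_ptr;            /* hit: reuse the mapping */
            if (!cache[i].used && free_slot < 0)
                free_slot = i;
        }
        /* Miss: open now, and keep it open after this MPI_Send/Recv completes
         * (lazy close) so the next transfer on the same buffer is a hit. */
        void *ptr = NULL;
        cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess);
        if (free_slot >= 0) {
            cache[free_slot].used = 1;
            cache[free_slot].handle = handle;
            cache[free_slot].mapped_ptr = ptr;
        }
        return ptr;
    }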
Evaluation

System                   Keeneland                   Magellan
CPU                      Intel                       AMD
NUMA nodes               2                           4
Interconnect             Intel QPI                   AMD HT
GPUs                     3                           2
GPU topology             GPU 0: N0; GPUs 1, 2: N1    GPU 0: N0; GPU 1: N3
GPU peer accessibility   Only GPU 1 ↔ GPU 2          Yes
GPU distance             Same I/O hub (Near)         2 PCIe + 1 HT (Far)

[Figure: GPU topologies of the Keeneland (GPU 0, GPU 1, GPU 2) and Magellan (GPU 0, GPU 1) nodes]
Near Case: Keeneland

– Bandwidth nearly reaches the peak bandwidth of the system

[Figure: intranode GPU-GPU bandwidth results on Keeneland (GPU 0, GPU 1, GPU 2)]
Far Case: Magellan

– In the far case it is better to fall back to the shared-memory approach

[Figure: intranode GPU-GPU bandwidth results on Magellan (GPU 0, GPU 1)]
Stencil2D (SHOC) on Keeneland

– Compared against the previous shared-memory based approach
  – Average improvement of 4.7% (single precision) and 2.3% (double precision)
– Computation grows as O(n²) with problem size, so communication's share of the runtime shrinks
Conclusion

Accelerators are becoming ubiquitous
– Exciting new opportunities for systems researchers
– Requires evolution of the HPC software stack and more openness in the GPU system stack

Integrated accelerator awareness with MPI
– Supports multiple accelerators and programming models
– Goals are productivity and performance

Optimized intranode communication
– Utilizes the GPU DMA engine
– Eliminates staging through MPI's main-memory buffer

Questions? Contact: Feng Ji, Ashwin Aji, Pavan Balaji
Backup Slides
Far Case: Explanation

[Figure: far-case data path between GPU 0 and GPU 1, showing the DMA engine and GPU 0's memory]