
MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems
Presented by: Ashwin M. Aji, PhD Candidate, Virginia Tech, USA (synergy.cs.vt.edu)
Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
James Dinan, Darius Buntinas, Pavan Balaji, Rajeev Thakur (Argonne National Lab., USA)
Keith R. Bisset (Virginia Bioinformatics Inst., USA)

Summary of the Talk
- We discuss the current limitations of data movement in accelerator-based systems (e.g., CPU-GPU clusters):
  – Programmability/productivity limitations
  – Performance limitations
- We introduce MPI-ACC, our solution for mitigating these limitations on a variety of platforms, including CUDA and OpenCL
- We evaluate MPI-ACC on benchmarks and a large-scale epidemiology application:
  – Improved end-to-end data transfer performance between accelerators
  – New data-related optimizations enabled for the application developer

Accelerator-Based Supercomputers (Nov 2011)

Accelerating Science via Graphics Processing Units (GPUs)
[Figure: GPU-accelerated science domains: computed tomography, micro-tomography, cosmology, bioengineering. Courtesy: Argonne National Lab]

Background: CPU-GPU Clusters
- Graphics Processing Units (GPUs):
  – Many-core architecture for high performance and efficiency (FLOPS, FLOPS/Watt, FLOPS/$)
  – Programming models: CUDA, OpenCL, OpenACC
  – Explicitly managed global memory and separate address spaces
- CPU clusters:
  – Most popular parallel programming model: the Message Passing Interface (MPI)
  – Host memory only
- Result: disjoint memory spaces!
[Figure: node architecture with MPI ranks 0-3 on the CPU side (main memory, NIC) and a GPU (global memory, shared memory, multiprocessors) attached over PCIe]
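The disjoint address spaces can be made concrete with a minimal CUDA/C sketch (buffer names and size are illustrative, not from the talk): the host and device allocations live in separate memories, and data must be staged explicitly across PCIe.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t n = 1 << 20;
        /* Host allocation: visible only to the CPU. */
        double *host_buf = (double *)calloc(n, sizeof(double));
        /* Device allocation: lives in the GPU's global memory. */
        double *dev_buf = NULL;
        cudaMalloc((void **)&dev_buf, n * sizeof(double));

        /* The two pointers are not interchangeable; data is staged explicitly over PCIe. */
        cudaMemcpy(dev_buf, host_buf, n * sizeof(double), cudaMemcpyHostToDevice);

        cudaFree(dev_buf);
        free(host_buf);
        return 0;
    }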

Programming CPU-GPU Clusters (e.g., MPI+CUDA)
[Figure: rank 0 and rank 1, each with CPU main memory and GPU device memory connected by PCIe, communicating over the network]

    if (rank == 0) {
        cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    }
    if (rank == 1) {
        MPI_Recv(host_buf, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);
    }

Goal of Programming CPU-GPU Clusters (e.g., MPI + any accelerator)
[Figure: same two-rank layout as before; the accelerator buffers are passed to MPI directly]

    if (rank == 0) {
        MPI_Send(any_buf, ....);
    }
    if (rank == 1) {
        MPI_Recv(any_buf, ....);
    }

Current Limitations of Programming CPU-GPU Clusters (e.g., MPI+CUDA)
- Manual blocking copies between host and GPU memory serialize the PCIe transfer and the interconnect transfer
- Manual non-blocking copies are better, but incur the communication protocol overheads multiple times
- Programmability/productivity: manual data movement leads to complex, non-portable code
- Performance: data-movement optimizations are inefficient and not portable across systems

MPI-ACC: Integrated and Optimized Data Movement
- MPI-ACC integrates accelerator awareness with MPI for all data movement:
  – Programmability/productivity: supports multiple accelerators and programming models (CUDA, OpenCL)
  – Performance: allows applications to portably leverage system-specific and vendor-specific optimizations
[Figure: GPUs and CPUs within nodes communicating over the network through MPI-ACC]

MPI-ACC: Integrated and Optimized Data Movement
- MPI-ACC integrates accelerator awareness with MPI for all data movement:
  – "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems" [this paper]
- Intra-node optimizations:
  – "DMA-Assisted, Intranode Communication in GPU-Accelerated Systems", Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Rajeev Thakur, Wu-chun Feng, and Xiaosong Ma [HPCC '12]
  – "Efficient Intranode Communication in GPU-Accelerated Systems", Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-chun Feng, and Xiaosong Ma [AsHES '12]
- Noncontiguous datatypes:
  – "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments", John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, and Rajeev Thakur (under review at IEEE Cluster)

MPI-ACC Application Programming Interface (API)
How to pass GPU buffers to MPI-ACC?
1. Explicit interfaces, e.g., MPI_CUDA_Send(...), MPI_OpenCL_Recv(...), etc.
2. MPI datatype attributes:
  – Can extend built-in datatypes: MPI_INT, MPI_CHAR, etc.
  – Can create new datatypes, e.g., MPI_Send(buf, count, new_datatype, ...)
  – Compatible with MPI and many accelerator models
[Figure: an MPI datatype carrying OpenCL attributes: CL_Context, CL_Mem, CL_Device_ID, CL_Cmd_queue, BUFTYPE=OCL]
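Under the datatype-attribute approach, attaching accelerator handles could be sketched with standard MPI attribute calls as below; only the MPI and OpenCL calls are real APIs, while the keyval scheme and function name are illustrative assumptions rather than MPI-ACC's actual implementation.

    #include <mpi.h>
    #include <CL/cl.h>

    /* Illustrative sketch: attach OpenCL handles to a duplicated MPI datatype so
       that an accelerator-aware MPI library could recognize the buffer as a
       cl_mem object. Not MPI-ACC's actual code. */
    static MPI_Datatype make_ocl_type(cl_context ctx, cl_command_queue queue, cl_mem buf)
    {
        static int initialized = 0;
        static int key_ctx, key_queue, key_mem;
        MPI_Datatype ocl_char;

        if (!initialized) {
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN, &key_ctx, NULL);
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN, &key_queue, NULL);
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN, &key_mem, NULL);
            initialized = 1;
        }
        MPI_Type_dup(MPI_CHAR, &ocl_char);   /* new datatype that carries the attributes */
        MPI_Type_set_attr(ocl_char, key_ctx, (void *)ctx);
        MPI_Type_set_attr(ocl_char, key_queue, (void *)queue);
        MPI_Type_set_attr(ocl_char, key_mem, (void *)buf);
        MPI_Type_commit(&ocl_char);
        return ocl_char;
    }

A send could then look like MPI_Send(ptr, count, ocl_type, dest, tag, comm), with the accelerator-aware runtime retrieving the handles via MPI_Type_get_attr; whether the buffer argument is an offset or a host shadow pointer is an MPI-ACC design detail not shown here.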

Optimizations within MPI-ACC
- Pipelining to fully utilize the PCIe and network links [generic]
- Dynamic choice of pipeline parameters based on NUMA and PCIe affinities [architecture-specific]
- Handle caching (OpenCL): avoid expensive command-queue creation [system/vendor-specific]
- Case study of an epidemiology simulation: optimize data marshaling in the faster GPU memory [application-specific optimization using MPI-ACC]
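The OpenCL handle-caching optimization can be sketched as follows: create a command queue once per (context, device) pair and reuse it for later transfers. The cache structure, its size, and the lookup policy are assumptions for illustration, not MPI-ACC's actual code.

    #include <CL/cl.h>

    #define CACHE_SLOTS 16

    struct queue_cache_entry {
        cl_context ctx;
        cl_device_id dev;
        cl_command_queue queue;   /* NULL when the slot is empty */
    };
    static struct queue_cache_entry cache[CACHE_SLOTS];

    static cl_command_queue get_cached_queue(cl_context ctx, cl_device_id dev)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].queue && cache[i].ctx == ctx && cache[i].dev == dev)
                return cache[i].queue;            /* hit: no queue creation */
        }
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i].queue) {                /* miss: create once, keep it */
                cl_int err;
                cache[i].ctx = ctx;
                cache[i].dev = dev;
                cache[i].queue = clCreateCommandQueue(ctx, dev, 0, &err);
                return cache[i].queue;
            }
        }
        return NULL;  /* cache full; a real implementation would evict */
    }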

Pipelined GPU-GPU Data Transfer (Send)
- The host-side buffer pool is instantiated during MPI_Init and destroyed during MPI_Finalize
  – OpenCL buffers are handled differently from CUDA buffers
- Classic n-buffering technique
- Intercepts the MPI progress engine
[Figure: GPU (device), CPU (host) with a host-side buffer pool, and the network; timelines comparing transfers without and with pipelining]
- Result: 29% better than manual blocking and 14.6% better than manual non-blocking transfers
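The n-buffering idea on the send side can be sketched as below, assuming a small pool of pinned host buffers created once (conceptually at MPI_Init time) and reused for every chunk; the pool size, chunk size, and function names are illustrative assumptions, not MPI-ACC's internal code.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define POOL_BUFS  2                   /* double buffering */
    #define POOL_BYTES (256 * 1024)        /* per-buffer pipeline granularity */

    static void *pool[POOL_BUFS];          /* pinned host buffers */
    static cudaStream_t pool_stream;

    void pipeline_init(void)               /* called alongside MPI_Init */
    {
        for (int i = 0; i < POOL_BUFS; i++)
            cudaMallocHost(&pool[i], POOL_BYTES);
        cudaStreamCreate(&pool_stream);
    }

    /* Stream a device buffer to `dest`, overlapping D2H copies with network sends. */
    void pipelined_send(const char *dev_buf, size_t bytes, int dest)
    {
        MPI_Request req[POOL_BUFS];
        for (int i = 0; i < POOL_BUFS; i++) req[i] = MPI_REQUEST_NULL;

        size_t off = 0;
        for (int slot = 0; off < bytes; slot = (slot + 1) % POOL_BUFS) {
            size_t len = (bytes - off < POOL_BYTES) ? bytes - off : POOL_BYTES;

            /* Reclaim this slot: its previous network send must have completed. */
            MPI_Wait(&req[slot], MPI_STATUS_IGNORE);

            cudaMemcpyAsync(pool[slot], dev_buf + off, len,
                            cudaMemcpyDeviceToHost, pool_stream);
            cudaStreamSynchronize(pool_stream);   /* chunk is now on the host */

            MPI_Isend(pool[slot], (int)len, MPI_BYTE, dest, 0, MPI_COMM_WORLD, &req[slot]);
            off += len;
        }
        MPI_Waitall(POOL_BUFS, req, MPI_STATUSES_IGNORE);
    }

Because the pinned pool is allocated once, the per-message cost is only the chunked copies and sends, and the D2H copy of chunk i+1 overlaps with the network send of chunk i.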

Explicit MPI+CUDA Communication in an Epidemiology Simulation
[Figure: PE 1 ... PE i-1 make asynchronous MPI calls to PE i (host CPU) over the network; PE i packs the received data in CPU memory (H-H) and then copies the packed data to GPU i (H-D)]

Communication with MPI-ACC
[Figure: PE 1 ... PE i-1 make asynchronous MPI-ACC calls over the network; data is moved to GPU i with pipelined transfers and packed directly in GPU memory (D-D)]
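A device-side packing step of the kind described above might look like the following CUDA sketch; the data layout (an index-based gather) and all names are assumptions for illustration, not the simulation's actual marshaling code.

    /* Gather scattered elements into a contiguous region directly in GPU memory,
       replacing the host-side (H-H) packing step. */
    __global__ void pack_on_gpu(const float *src, const int *indices,
                                float *packed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            packed[i] = src[indices[i]];   /* D-D gather in fast device memory */
    }

    /* Host side: launch the gather instead of packing in main memory. */
    void pack_received_data(const float *d_src, const int *d_indices,
                            float *d_packed, int n)
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        pack_on_gpu<<<blocks, threads>>>(d_src, d_indices, d_packed, n);
        cudaDeviceSynchronize();
    }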

MPI-ACC Comparison

    Step                      MPI+CUDA   MPI-ACC
    D-D copy (packing)                   Yes
    GPU receive buffer init              Yes
    H-D copy (transfer)       Yes
    H-H copy (packing)        Yes
    CPU receive buffer init   Yes

Data packing and initialization move from the CPU's main memory to the GPU's device memory.
[Figure: MPI+CUDA path (H-H packing, H-D copy) versus MPI-ACC path (D-D packing, pipelined transfer) between PE i (host CPU) and GPU i (device)]

Experimental Platform
- Hardware: 4-node GPU cluster
  – CPU: dual oct-core AMD Opteron (Magny-Cours) per node, 32 GB memory
  – GPU: NVIDIA Tesla C2050, 3 GB global memory
- Software:
  – CUDA v4.0 (driver v )
  – OpenCL v1.1

Evaluating the Epidemiology Simulation with MPI-ACC
[Figure: epidemiology simulation performance with MPI-ACC]
- The GPU's memory is two orders of magnitude faster
- MPI-ACC enables new application-level optimizations

Conclusions
- Accelerators are becoming mainstream in HPC:
  – Exciting new opportunities for systems researchers
  – Require evolution of the HPC software stack
- MPI-ACC: integrated accelerator awareness with MPI
  – Supported multiple accelerators and programming models
  – Achieved productivity and performance improvements
- Optimized internode communication:
  – Pipelined data movement; NUMA- and PCIe-affinity-aware and OpenCL-specific optimizations
- Optimized an epidemiology simulation using MPI-ACC:
  – Enabled new optimizations that were impossible without MPI-ACC
Questions? Contact: Ashwin Aji, Pavan Balaji

Backup Slides

MPI-ACC Application Programming Interface (API)
How to pass GPU buffers to MPI-ACC?
1. Explicit interfaces, e.g., MPI_CUDA_Send(...), MPI_OpenCL_Recv(...), etc.
2. MPI datatype attributes
3. Retain the existing MPI interface itself:
  – Automatic detection of device buffers via CUDA's Unified Virtual Addressing (UVA)
  – Currently supported only by CUDA and newer NVIDIA GPUs
  – The query cost is high and is added to every operation (even CPU-CPU)
[Figure: MPI-ACC interface cost for CPU-CPU communication]
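The UVA-based detection mentioned above can be sketched with the CUDA runtime's pointer query; error handling is simplified, and the field name follows the CUDA 4.x API used in this work (newer toolkits expose attr.type instead).

    #include <cuda_runtime.h>

    /* Returns nonzero if buf is a device pointer. This per-operation query is
       the cost the slide refers to. */
    static int is_device_pointer(const void *buf)
    {
        struct cudaPointerAttributes attr;
        cudaError_t err = cudaPointerGetAttributes(&attr, buf);
        if (err != cudaSuccess) {
            cudaGetLastError();   /* clear the error: plain host memory */
            return 0;
        }
        return attr.memoryType == cudaMemoryTypeDevice;
    }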

Results: Impact of Pipelined Data Transfers
[Figure: impact of pipelined data transfers]
- 29% better than manual blocking transfers
- 14.6% better than manual non-blocking transfers

MPI-ACC: Integrated and Optimized Data Movement
- MPI-ACC integrates accelerator awareness with MPI for all data movement:
  – Programmability/productivity: supports multiple accelerators and programming models (CUDA, OpenCL)
  – Performance: allows applications to portably leverage system-specific and vendor-specific optimizations
- Intra-node data movement optimizations [in collaboration with NCSU, ANL, VT]:
  – Shared-memory protocol [Ji, AsHES '12]
  – Direct DMA protocol [Ji, HPCC '12]
[Figure: GPUs and CPUs within nodes communicating over the network through MPI-ACC]

Accelerator-Based Supercomputers (June 2012)

Accelerator-Based Supercomputers
Accelerator vendors/models:
- NVIDIA 2050/2070/2090
- IBM PowerXCell 8i
- AMD/ATI GPU
- ClearSpeed CSX600
- Intel MIC

Background: CPU-GPU Clusters
- CPU-GPU clusters:
  – Programming models: MPI+CUDA, MPI+OpenCL, and so on
  – Multiple GPUs per node
  – Non-uniform node architecture and topology
- These pose new programmability and performance challenges for programming models and runtime systems
[Figure: Keeneland node architecture]
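As a small illustration of reasoning about such non-uniform nodes, the following sketch enumerates the GPUs in a node and prints their PCIe bus IDs, the kind of information an affinity-aware runtime could combine with NUMA placement when choosing pipeline parameters; the output format is illustrative.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int d = 0; d < ndev; d++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            /* PCIe location hints at which NUMA domain / IOH the GPU hangs off. */
            printf("GPU %d: %s, PCIe %04x:%02x:%02x\n",
                   d, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
        }
        return 0;
    }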