High-Level, One-Sided Models on MPI: A Case Study with Global Arrays and NWChem
James Dinan, Pavan Balaji, Jeff R. Hammond (ANL); Sriram Krishnamoorthy (PNNL); and Vinod Tipparaju (AMD)

Global Arrays (GA) is a popular high-level parallel programming model that provides data and computation management facilities to the NWChem computational chemistry suite. GA's global-view data model is supported by the ARMCI partitioned global address space runtime system, which is traditionally implemented natively on each supported platform in order to provide the best performance. The industry-standard Message Passing Interface (MPI) also provides one-sided functionality and is available on virtually every supercomputing system. We present the first high-performance, portable implementation of ARMCI using MPI one-sided communication. We interface the existing GA infrastructure with ARMCI-MPI and demonstrate that this approach reduces the amount of resources consumed by the runtime system, provides comparable performance, and enhances portability for applications like NWChem.

Global Arrays

GA allows the programmer to create large, multidimensional shared arrays that span the memory of multiple nodes in a distributed-memory system. The programmer interacts with GA through asynchronous, one-sided Get, Put, and Accumulate data movement operations as well as through high-level mathematical routines provided by GA. Accesses to a global array are specified through high-level array indices; GA translates these high-level operations into low-level communication operations that may target multiple nodes depending on the underlying data distribution. GA's low-level communication is managed by the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication runtime system. Porting and tuning of GA is accomplished through ARMCI and traditionally involves natively implementing its core operations. Because of its sophistication, creating an efficient and scalable implementation of ARMCI for a new system requires expertise in that system and significant effort. In this work, we present the first implementation of ARMCI on MPI's portable one-sided communication interface.

MPI Global Memory Regions

MPI global memory regions (GMR) align MPI's remote memory access (RMA) window semantics with ARMCI's PGAS model.

[Figure: GMR translation table indexed by allocation ID (0, 1, ..., N) and ARMCI absolute process id, mapping each pair to a base address (0x6b7, 0x7af, 0x0c1, ...). Each allocation carries metadata: MPI_Win window; int size[nproc]; ARMCI_Group grp; ARMCI_Mutex rmw; ...]

GMR provides the following functionality:

1. Allocation and mapping of shared segments.
2. Translation between ARMCI shared addresses and MPI windows and window displacements.
3. Translation between the ARMCI absolute ranks used in communication and MPI window ranks.
4. Management of MPI RMA semantics for direct load/store accesses to shared data. Direct load/store access to shared regions in MPI must be protected with MPI_Win_lock/unlock operations (see the sketch after this list).
5. Guarding of shared buffers: when a shared buffer is provided as the local buffer for an ARMCI communication operation, GMR must perform appropriate locking and copy operations in order to respect several MPI rules, including the constraint that a window cannot be locked for more than one target at any given time.
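As an illustration of items 2-4 above, here is a minimal sketch, in C with MPI, of how a GMR-style runtime might translate an ARMCI shared address and absolute process rank into a window rank and displacement, and guard direct load/store access with MPI_Win_lock/unlock. This is not the ARMCI-MPI source: the gmr_t structure and the gmr_translate and gmr_local_read functions are hypothetical names introduced only for illustration.

    /* Hypothetical sketch of GMR address translation and guarded local access.
     * gmr_t, gmr_translate, and gmr_local_read are illustrative names, not the
     * actual ARMCI-MPI API. */
    #include <mpi.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct {
        MPI_Win   window;   /* MPI window backing this allocation              */
        void    **bases;    /* base address of the segment on each window rank */
        size_t   *sizes;    /* segment size (bytes) on each window rank        */
        int      *ranks;    /* map: ARMCI absolute rank -> MPI window rank     */
    } gmr_t;

    /* Translate (ARMCI absolute rank, shared address) into
     * (MPI window rank, window displacement). Returns 0 on success. */
    static int gmr_translate(const gmr_t *g, int abs_rank, const void *addr,
                             int *win_rank, MPI_Aint *disp)
    {
        int r = g->ranks[abs_rank];
        ptrdiff_t off = (const char *)addr - (const char *)g->bases[r];
        if (off < 0 || (size_t)off >= g->sizes[r])
            return -1;              /* address does not fall in this allocation */
        *win_rank = r;
        *disp = (MPI_Aint)off;      /* assumes the window's disp_unit is 1 byte */
        return 0;
    }

    /* Guarded direct load from the local portion of a shared segment: under
     * MPI-2 RMA semantics, load/store access to window memory must occur
     * inside a lock epoch (shown here as an exclusive self-lock). */
    static void gmr_local_read(gmr_t *g, int my_win_rank, void *dst,
                               const void *src, size_t len)
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_win_rank, 0, g->window);
        memcpy(dst, src, len);
        MPI_Win_unlock(my_win_rank, g->window);
    }

A contiguous ARMCI put or get to a remote process would use the same translation and then issue MPI_Put or MPI_Get inside a lock/unlock epoch on the target's window rank.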
MPI One-Sided Communication

MPI provides an interface for one-sided communication and encapsulates shared regions of memory in window objects. Windows are used to guarantee data consistency and portability across a wide range of platforms, including those with noncoherent memory access. This is accomplished through public and private window copies, which are synchronized through access epochs. To ensure portability, windows have several important properties:

1. Concurrent, conflicting accesses are erroneous.
2. All accesses to the window data must occur within an epoch.
3. Put, get, and accumulate operations are nonblocking and occur concurrently within an epoch.
4. Only one epoch is allowed per window at any given time.

[Figure: one-sided communication between Rank 0 and Rank 1; Put(X) and Get(Y) are issued within a Lock/Unlock epoch, and completion at Unlock synchronizes each window's public and private copies.]

Noncontiguous Communication

GA operations frequently access array regions that are not contiguous in memory. These operations are translated into ARMCI strided and generalized I/O vector communication operations. We have provided four implementations of ARMCI's strided interface:

[Figure: translation of a GA_Put on an array patch into an ARMCI_PutS strided operation and then into MPI one-sided communication.]

Direct: MPI subarray datatypes are generated to describe the local and remote buffers; a single operation is issued (see the sketch at the end of this page).
IOV-Conservative: the strided transfer is converted to an I/O vector, and each operation is issued within its own epoch.
IOV-Batched: the strided transfer is converted to an I/O vector, and N operations are issued within an epoch.
IOV-Direct: the strided transfer is converted to an I/O vector, and an MPI indexed datatype is generated to describe both the local and remote buffers; a single operation is issued.

Performance Evaluation

Experiments were conducted on a cross-section of leadership-class architectures:

- Mature implementations of ARMCI on BG/P, InfiniBand, and XT (GA 5.0.1).
- Development implementation of ARMCI on the Cray XE6.
- NWChem performance study: computational modeling of a water pentamer.

System               Nodes   Cores   Memory  Interconnect  MPI
IBM BG/P (Intrepid)  40,960  1 x 4   2 GB    3D Torus      IBM MPI
InfiniBand (Fusion)  320     2 x 4   36 GB   IB QDR        MVAPICH2 1.6
Cray XT (Jaguar)     18,688  2 x 6   16 GB   Seastar 2+    Cray MPI
Cray XE6 (Hopper)    6,392   2 x 12  32 GB   Gemini        Cray MPI

Communication Performance

[Figures: contiguous and noncontiguous transfer bandwidth on the InfiniBand cluster and the Cray XE.]

- ARMCI-MPI performance is comparable to native, but tuning is needed, especially on the Cray XE.
- Contiguous performance of MPI RMA is nearly twice that of native ARMCI on the Cray XE.
- Direct strided is best, except on BG/P, where batched is better due to pack/unpack throughput.

NWChem Performance

[Figures: NWChem performance and scaling on Blue Gene/P and the Cray XE.]

- Performance and scaling are comparable on most architectures; MPI tuning would improve overheads.
- Performance of ARMCI-MPI on the Cray XE6 is 30% higher, and scaling is better.
- Native ARMCI on the Cray XE6 is a work in progress; ARMCI-MPI enables early use of GA/NWChem on such new platforms.

Additional Information: J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication," Preprint ANL/MCS-P, April. Online:
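To make the Direct strided implementation (Noncontiguous Communication, above) concrete, here is a minimal sketch assuming a 2-D double-precision patch, row-major storage, and a target window rank and byte displacement already obtained from GMR. The function direct_strided_put and its argument names are hypothetical, not the ARMCI-MPI API: matching subarray datatypes describe the local and remote buffers so the whole patch moves in a single MPI_Put.

    /* Hypothetical sketch of the "Direct" strided transfer: describe both the
     * local and remote 2-D patches with MPI subarray datatypes and issue a
     * single MPI_Put inside one lock/unlock epoch. Assumes cols <= src_ld,
     * cols <= dst_ld, and a window created with disp_unit = 1 (byte offsets). */
    #include <mpi.h>

    static void direct_strided_put(const double *src, int src_ld, /* local patch start, leading dim (elements) */
                                   MPI_Aint dst_disp, int dst_ld, /* remote patch byte displacement, leading dim */
                                   int rows, int cols,            /* patch shape */
                                   int target, MPI_Win win)       /* window rank and window from GMR */
    {
        MPI_Datatype src_type, dst_type;
        int src_full[2] = { rows, src_ld };   /* local rows with stride src_ld  */
        int dst_full[2] = { rows, dst_ld };   /* remote rows with stride dst_ld */
        int sub[2]      = { rows, cols };     /* the patch actually transferred */
        int start[2]    = { 0, 0 };           /* patch begins at src / dst_disp */

        MPI_Type_create_subarray(2, src_full, sub, start, MPI_ORDER_C,
                                 MPI_DOUBLE, &src_type);
        MPI_Type_create_subarray(2, dst_full, sub, start, MPI_ORDER_C,
                                 MPI_DOUBLE, &dst_type);
        MPI_Type_commit(&src_type);
        MPI_Type_commit(&dst_type);

        /* One epoch, one operation: both sides are fully described by datatypes. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put((void *)src, 1, src_type, target, dst_disp, 1, dst_type, win);
        MPI_Win_unlock(target, win);

        MPI_Type_free(&src_type);
        MPI_Type_free(&dst_type);
    }

The IOV variants instead convert the patch into a vector of contiguous pieces and issue them as separate operations (or one indexed-datatype operation), which, as noted under Communication Performance, performs better on BG/P where pack/unpack throughput favors batched transfers.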