High-Level, One-Sided Models on MPI: A Case Study with Global Arrays and NWChem
James Dinan, Pavan Balaji, Jeff R. Hammond (ANL); Sriram Krishnamoorthy (PNNL); Vinod Tipparaju (AMD)


Abstract

Global Arrays (GA) is a popular high-level parallel programming model that provides data and computation management facilities to the NWChem computational chemistry suite. GA's global-view data model is supported by the ARMCI partitioned global address space (PGAS) runtime system, which traditionally is implemented natively on each supported platform in order to provide the best performance. The industry-standard Message Passing Interface (MPI) also provides one-sided functionality and is available on virtually every supercomputing system. We present the first high-performance, portable implementation of ARMCI using MPI one-sided communication. We interface the existing GA infrastructure with ARMCI-MPI and demonstrate that this approach reduces the amount of resources consumed by the runtime system, provides comparable performance, and enhances portability for applications like NWChem.

Global Arrays

GA allows the programmer to create large, multidimensional shared arrays that span the memory of multiple nodes in a distributed-memory system. The programmer interacts with GA through asynchronous, one-sided Get, Put, and Accumulate data movement operations, as well as through high-level mathematical routines provided by GA. Accesses to a global array are specified through high-level array indices; GA translates these high-level operations into low-level communication operations that may target multiple nodes, depending on the underlying data distribution.

GA's low-level communication is managed by the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication runtime system. Porting and tuning of GA is accomplished through ARMCI and traditionally involves natively implementing its core operations. Because of its sophistication, creating an efficient and scalable implementation of ARMCI for a new system requires expertise in that system and significant effort. In this work, we present the first implementation of ARMCI on MPI's portable one-sided communication interface.

MPI Global Memory Regions

[Figure: GMR translation table, indexed by allocation ID (0-3) and ARMCI absolute process ID (0, 1, ..., N), giving the base address of each process's segment (e.g., 0x6b7, 0x7af, ..., 0x0c1).]

MPI global memory regions (GMR) align MPI's remote memory access (RMA) window semantics with ARMCI's PGAS model. Each shared allocation is described by allocation metadata holding the backing MPI window (MPI_Win window), the per-process segment sizes (int size[nproc]), the owning process group (ARMCI_Group grp), and a mutex used for read-modify-write operations (ARMCI_Mutex rmw). GMR provides the following functionality:
1. Allocation and mapping of shared segments.
2. Translation between ARMCI shared addresses and MPI windows and window displacements.
3. Translation between the ARMCI absolute ranks used in communication and MPI window ranks.
4. Management of MPI RMA semantics for direct load/store accesses to shared data. Direct load/store access to shared regions in MPI must be protected with MPI_Win_lock/unlock operations.
5. Guarding of shared buffers: when a shared buffer is provided as the local buffer for an ARMCI communication operation, GMR must perform appropriate locking and copy operations in order to preserve MPI's rules, including the constraint that a window cannot be locked for more than one target at any given time.

MPI One-Sided Communication

[Figure: a lock/unlock access epoch between Rank 0 and Rank 1, with Put(X) and Get(Y) issued inside the epoch, completion at unlock, and synchronization of each rank's public and private window copies.]

MPI provides an interface for one-sided communication and encapsulates shared regions of memory in window objects. Windows are used to guarantee data consistency and portability across a wide range of platforms, including those with noncoherent memory access. This is accomplished through public and private copies, which are synchronized through access epochs. To ensure portability, windows have several important properties:
1. Concurrent, conflicting accesses are erroneous.
2. All accesses to the window data must occur within an epoch.
3. Put, get, and accumulate operations are nonblocking and occur concurrently within an epoch.
4. Only one epoch is allowed per window at any given time.
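The following is a minimal, self-contained C/MPI-2 sketch of the window and epoch mechanics that GMR manages. It is illustrative code, not the ARMCI-MPI source; the segment size and the neighbor-exchange pattern are chosen only for the example. Each process exposes one segment through a window, touches its own segment inside a local lock/unlock epoch (cf. item 4 above), and then reads a neighbor's segment with a one-sided MPI_Get inside another epoch.

    /* Sketch of the MPI-2 RMA window/epoch pattern underlying GMR (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    #define N 8   /* elements per local segment, chosen for the example */

    int main(int argc, char **argv)
    {
        int      me, nproc, target;
        double  *segment;      /* this process's piece of the shared allocation */
        double   buf[N];
        MPI_Win  win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Item 1: each rank allocates a segment, and all ranks collectively
         * create a window that exposes those segments for RMA.             */
        MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &segment);
        MPI_Win_create(segment, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Item 4: even local load/store access to window memory is wrapped
         * in an epoch, since the public and private copies may differ.     */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, me, 0, win);
        for (int i = 0; i < N; i++) segment[i] = me;
        MPI_Win_unlock(me, win);
        MPI_Barrier(MPI_COMM_WORLD);

        /* One-sided get from the right neighbor: lock opens the access
         * epoch; the get is nonblocking and completes at unlock.           */
        target = (me + 1) % nproc;
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Get(buf, N, MPI_DOUBLE, target, 0 /* displacement */, N, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);

        printf("rank %d read %g from rank %d\n", me, buf[0], target);

        MPI_Win_free(&win);
        MPI_Free_mem(segment);
        MPI_Finalize();
        return 0;
    }

ARMCI-MPI hides this lock/operation/unlock sequence behind the ARMCI Put, Get, and Accumulate interface, with GMR mapping an ARMCI shared address and absolute rank to the corresponding window, window rank, and displacement.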
Noncontiguous Communication

[Figure: a GA_Put on a two-dimensional patch spanning processes 0-3 is translated into ARMCI_PutS strided operations, one per target process.]

GA operations frequently access array regions that are not contiguous in memory. These operations are translated into ARMCI strided and generalized I/O vector communication operations. We have provided four implementations of ARMCI's strided interface:
Direct: MPI subarray datatypes are generated to describe the local and remote buffers; a single operation is issued (a sketch appears at the end of this transcript).
IOV-Conservative: the strided transfer is converted to an I/O vector and each operation is issued within its own epoch.
IOV-Batched: the strided transfer is converted to an I/O vector and N operations are issued within an epoch.
IOV-Direct: the strided transfer is converted to an I/O vector and an MPI indexed datatype is generated to describe both the local and remote buffers; a single operation is issued.

Performance Evaluation

Experiments were conducted on a cross-section of leadership-class architectures:
Mature implementations of ARMCI on BG/P, InfiniBand, and XT (GA 5.0.1).
Development implementation of ARMCI on Cray XE6.
NWChem performance study: computational modeling of a water pentamer.

System                 Nodes    Cores    Memory   Interconnect   MPI
IBM BG/P (Intrepid)    40,960   1 x 4    2 GB     3D Torus       IBM MPI
InfiniBand (Fusion)    320      2 x 4    36 GB    IB QDR         MVAPICH2 1.6
Cray XT (Jaguar)       18,688   2 x 6    16 GB    Seastar 2+     Cray MPI
Cray XE6 (Hopper)      6,392    2 x 12   32 GB    Gemini         Cray MPI

Communication Performance

[Figures: contiguous and noncontiguous transfer bandwidth on the InfiniBand cluster and the Cray XE.]

ARMCI-MPI performance is comparable to native, but tuning is needed, especially on the Cray XE.
Contiguous performance of MPI RMA is nearly twice that of native ARMCI on the Cray XE.
Direct strided is best, except on BG/P, where batched is better due to pack/unpack throughput.

NWChem Performance

[Figures: NWChem performance and scaling on Blue Gene/P and the Cray XE.]

Performance and scaling are comparable on most architectures; MPI tuning would improve overheads.
Performance of ARMCI-MPI on the Cray XE6 is 30% higher, and scaling is better.
Native ARMCI on the Cray XE6 is a work in progress; ARMCI-MPI enables early use of GA/NWChem on such new platforms.

Additional Information

J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication," Preprint ANL/MCS-P1878-0411, April 2011. Online: http://www.mcs.anl.gov/publications/paper_detail.php?id=1535
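As a companion to the direct strided method described under Noncontiguous Communication above, the following is a minimal sketch in which both the local and the remote patch are described with MPI subarray datatypes, so that a single MPI_Put inside one lock/unlock epoch moves the whole noncontiguous region. This is illustrative code rather than the ARMCI-MPI implementation; the function name, the two-dimensional double-precision patch, and the assumption that the target window's displacement unit is sizeof(double) are mine.

    /* Sketch of the "Direct" strided method: one datatype per side, one
     * RMA operation, one access epoch (illustrative only).                */
    #include <mpi.h>

    /* Put a patch[0] x patch[1] region of a 2-D array of doubles.
     *   src                 : base address of the FULL local array
     *   loc_dims, loc_start : full local array extents and patch origin
     *   rem_dims, rem_start : full remote array extents and patch origin
     *   win_disp            : displacement of the remote array's base in
     *                         the target's window (disp_unit assumed to be
     *                         sizeof(double) in this sketch)              */
    void direct_strided_put(double *src, int loc_dims[2], int loc_start[2],
                            int patch[2], int target, MPI_Aint win_disp,
                            int rem_dims[2], int rem_start[2], MPI_Win win)
    {
        MPI_Datatype loc_type, rem_type;

        /* Each subarray datatype captures the whole noncontiguous patch,
         * so one operation replaces one put per contiguous row.           */
        MPI_Type_create_subarray(2, loc_dims, patch, loc_start,
                                 MPI_ORDER_C, MPI_DOUBLE, &loc_type);
        MPI_Type_create_subarray(2, rem_dims, patch, rem_start,
                                 MPI_ORDER_C, MPI_DOUBLE, &rem_type);
        MPI_Type_commit(&loc_type);
        MPI_Type_commit(&rem_type);

        /* One access epoch, one operation: unlock completes the transfer. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(src, 1, loc_type, target, win_disp, 1, rem_type, win);
        MPI_Win_unlock(target, win);

        MPI_Type_free(&loc_type);
        MPI_Type_free(&rem_type);
    }

The IOV variants described above trade this single large operation for many smaller contiguous operations: one per epoch (IOV-Conservative), N per epoch (IOV-Batched), or combined into an MPI indexed datatype (IOV-Direct).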

