Combining Shared and Distributed Memory Models: Approach and Evolution of the Global Arrays Toolkit
Jarek Nieplocha, Robert Harrison, Manoj Kumar Krishnan, Bruce Palmer, Vinod Tipparaju, Harold Trease
Pacific Northwest National Laboratory
Overview
- Background
- Programming Model
- Core Capabilities
- Recent Work
- Conclusions
Global Address Space and One-Sided Communication
- Global address space: the collection of address spaces of all processes in a parallel job, with remote data named by an (address, pid) pair, e.g., (0xf5670, P0) or (0xf32674, P5)
- Message passing: P1 sends and P0 receives; both sides must participate in every transfer
- One-sided communication: P1 puts data directly into P0's address space with no matching receive on P0, as in SHMEM, ARMCI, and MPI-2 one-sided
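A minimal C sketch of the contrast, written against the ARMCI API on top of MPI; the buffer length and the exact ARMCI_Malloc/ARMCI_Put usage are illustrative assumptions rather than a definitive example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <armci.h>

    #define N 8   /* illustrative buffer length */

    int main(int argc, char **argv) {
        int me, nproc;
        MPI_Init(&argc, &argv);
        ARMCI_Init();                  /* one-sided layer on top of MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* every process exposes a segment; ptrs[p] is the address valid on p */
        void **ptrs = malloc(nproc * sizeof(void *));
        ARMCI_Malloc(ptrs, N * sizeof(double));

        double src[N];
        for (int i = 0; i < N; i++) src[i] = me + 0.1 * i;

        /* an (address, pid) pair fully names remote data:
           P1 writes into P0's memory and P0 posts no receive */
        if (me == 1)
            ARMCI_Put(src, ptrs[0], N * sizeof(double), 0);

        ARMCI_Barrier();
        if (me == 0)
            printf("P0: first element written by P1 = %f\n",
                   ((double *)ptrs[0])[0]);

        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }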
Global Arrays Data Model
- Shared memory model in the context of distributed dense arrays
- Complete environment for parallel code development
- Compatible with MPI
- Data locality control similar to the distributed memory/message passing model
- Extensible
- Single, shared data structure with global indexing: e.g., A(4,3) rather than buf(7) on task 2
- Data is physically distributed across processes
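A short sketch of the global-index view in the GA C interface (which is 0-based); the array size and MA_init limits below are illustrative choices, and C_DBL/MA_init come from the GA and MA headers:

    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);   /* stack/heap sizes: illustrative */

        int dims[2]  = {8, 8};
        int chunk[2] = {-1, -1};            /* let GA choose the distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
        GA_Zero(g_a);

        /* global indexing: every task names the element as A(4,3),
           regardless of which task physically owns that patch */
        int lo[2] = {4, 3}, hi[2] = {4, 3}, ld[1] = {1};
        double val = 3.14;
        if (GA_Nodeid() == 0) NGA_Put(g_a, lo, hi, &val, ld);
        GA_Sync();

        double x;
        NGA_Get(g_a, lo, hi, &x, ld);       /* any process can read it back */
        printf("P%d sees A(4,3) = %f\n", GA_Nodeid(), x);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }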
Global Array Model of Computations
- get: copy a section of the shared object into local memory (one-sided communication)
- compute/update using only local memory
- put: copy the result from local memory back to the shared object (one-sided communication)
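One cycle of that model in the GA C interface, continuing the previous sketch; the 4x4 patch bounds are an illustrative assumption:

    /* get - compute - put on a 2-D patch of g_a */
    int lo[2] = {0, 0}, hi[2] = {3, 3}, ld[1] = {4};
    double patch[4][4];

    NGA_Get(g_a, lo, hi, patch, ld);    /* shared object -> local memory */

    for (int i = 0; i < 4; i++)         /* compute entirely on local data */
        for (int j = 0; j < 4; j++)
            patch[i][j] *= 2.0;

    NGA_Put(g_a, lo, hi, patch, ld);    /* local memory -> shared object */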
Example: Matrix Multiply
Blocks of the global arrays representing the matrices are fetched into local buffers on the processor with ga_get, multiplied locally with dgemm, and the partial products are accumulated back into the result array with ga_acc.
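A hedged sketch of one block update from that algorithm in the GA C interface; the blocking scheme is an assumption, and a naive triple loop stands in for the local BLAS dgemm call:

    /* C-block += A-block * B-block, each block nb x nb */
    void block_update(int g_a, int g_b, int g_c,
                      int alo[2], int ahi[2], int blo[2], int bhi[2],
                      int clo[2], int chi[2], int nb) {
        double abuf[nb * nb], bbuf[nb * nb], cbuf[nb * nb];
        int ld[1] = { nb };
        double one = 1.0;

        NGA_Get(g_a, alo, ahi, abuf, ld);   /* ga_get: fetch A block */
        NGA_Get(g_b, blo, bhi, bbuf, ld);   /* ga_get: fetch B block */

        /* local multiply; in practice this is a dgemm call */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double s = 0.0;
                for (int k = 0; k < nb; k++)
                    s += abuf[i * nb + k] * bbuf[k * nb + j];
                cbuf[i * nb + j] = s;
            }

        NGA_Acc(g_c, clo, chi, cbuf, ld, &one);  /* ga_acc: atomic += */
    }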
Comparison to other models
Structure of GA
- Application interfaces: Fortran 77, C, C++, Python
- Distributed arrays layer: memory management, index translation
- Message passing layer: process creation, run-time environment
- ARMCI: portable one-sided communication (put, get, locks, etc.)
- System-specific interfaces: LAPI, GM/Myrinet, threads, VIA, ...
Core Capabilities
- Distributed array library
  - dense arrays of 1-7 dimensions
  - four data types: integer, real, double precision, double complex
  - global rather than per-task view of data structures
  - user control over data distribution: regular and irregular (sketched below)
- Collective and shared-memory style operations
- Interfaces to third-party parallel numerical libraries
  - PeIGS, ScaLAPACK, SUMMA, TAO (under development)
  - example: to solve a linear system using LU factorization, call

        call ga_lu_solve(g_a, g_b)

    instead of

        call pdgetrf(n, m, locA, p, q, dA, ind, info)
        call pdgetrs(trans, n, mb, locA, p, q, dA, dB, info)
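As a sketch of the irregular-distribution control mentioned above, the GA C interface takes a per-dimension block count and a map of block boundaries; this example assumes 6 processes (one block each) and 0-based indices:

    /* an 8x8 array split into 3 uneven row-blocks and 2 uneven column-blocks */
    int dims[2]  = {8, 8};
    int block[2] = {3, 2};        /* 3 blocks along dim 0, 2 along dim 1 */
    int map[5]   = {0, 2, 6,      /* dim-0 blocks start at rows 0, 2, 6 */
                    0, 5};        /* dim-1 blocks start at cols 0, 5 */
    int g_a = NGA_Create_irreg(C_DBL, 2, dims, "A_irreg", block, map);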
Performance
Performance model for "shared memory" data access:
- array index translation: e.g., 1.2 µs on Linux/PIII
- overhead of one or more ARMCI put/get/... calls:
  - direct mapping to native RMA calls (e.g., 3 µs on the Cray T3E), or
  - simple shared memory access (e.g., 0.3 µs on Linux/PIII), or
  - more complex, Active Message style implementations: e.g., 12 µs (put) and 37 µs (get) on Linux/PIII with Myrinet
Application Areas
- thermal flow simulation
- visualization and image analysis
- electronic structure / material sciences
- glass flow simulation
- molecular dynamics
- others: financial security forecasting, astrophysics, geosciences, biology
Major Milestones
- 1994 - 1st public release of GA
- 1995 - metacomputing (grid) extensions of GA
- 1996 - DRA, parallel I/O for GA programs, developed
- 1997 - development of ARMCI started
- 1998 - GA rewritten to use ARMCI
- 1999 - GA 3.0 released, n-dimensional arrays
- 2000 - periodic one-sided operations
- 2001 - support for sparse data management
- 2002 - ghost cell operations, n-dim DRA
Ghost Cell Operations
- NGA_Create_ghosts - creates an array with ghost cells (vs. a normal global array)
- GA_Update_ghosts - updates ghost cells with data from adjacent processors
- NGA_Access_ghosts - provides access to "local" ghost cell elements
- Embedded synchronization, controlled by the user
- Multi-protocol implementation to match platform characteristics, e.g., MPI + shared memory on the IBM SP, SHMEM on the Cray T3E
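A brief sketch of these calls in the GA C interface; the grid size and ghost width are illustrative, and the NGA_Access_ghosts output parameters follow my reading of the API and should be checked against the documentation:

    /* a 2-D grid with one layer of ghost cells in each dimension */
    int dims[2]  = {64, 64};
    int width[2] = {1, 1};                 /* ghost-cell width per dimension */
    int chunk[2] = {-1, -1};
    int g_a = NGA_Create_ghosts(C_DBL, 2, dims, width, "grid", chunk);

    GA_Update_ghosts(g_a);                 /* fill ghosts from adjacent processors */

    int gdims[2], ld[1];
    double *ptr;
    NGA_Access_ghosts(g_a, gdims, &ptr, ld);
    /* ptr now spans the local patch plus its ghost border, so a stencil
       can sweep the interior without further communication */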
Update Algorithms
- Standard algorithm: 3^D - 1 messages in D dimensions, one per neighbor (e.g., 26 messages for D = 3)
- Shift algorithm: 2D messages, shifting data along one dimension per phase (e.g., 6 messages for D = 3)
Disk Resident Arrays
- Extend the GA model to disk
  - system similar to Panda (UIUC), but with higher-level APIs
- Provide easy transfer of data between N-dimensional arrays stored on disk and in memory
- Use when:
  - arrays are too big to store in core
  - checkpoint/restart
  - out-of-core solvers (e.g., an image processing application moving data between a global array and a disk resident array)
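A checkpoint/restart sketch with the DRA C bindings; the call names follow the DRA documentation, but the argument lists here are reconstructed from memory and should be treated as assumptions:

    /* write a global array g_a to disk, then read it back */
    int d_a, req;
    int dims[2]    = {1024, 1024};
    int reqdims[2] = {1024, 1024};     /* typical request (access) shape */

    DRA_Init(4, 1e8, 1e9, 1e8);        /* max arrays, array size, disk, memory */
    NDRA_Create(C_DBL, 2, dims, "snapshot", "snapshot.dra",
                DRA_RW, reqdims, &d_a);

    NDRA_Write(g_a, d_a, &req);        /* asynchronous: returns a request */
    DRA_Wait(req);                     /* block until the transfer completes */

    NDRA_Read(g_a, d_a, &req);         /* restart: disk -> global array */
    DRA_Wait(req);

    DRA_Close(d_a);
    DRA_Terminate();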
Scalable Performance of DRA
[Figure: data streams from processors through I/O buffers on each SMP node to multiple file systems.]
Common Component Architecture
- A component model specifically designed for HPC
  - three parts: components, ports, and frameworks
- Components
  - are peers
  - interact through well-defined interfaces (ports)
    - in an OO language, a port is a class
    - in Fortran, a port is a bunch of subroutines
  - a component may provide a port, i.e., implement the class/subroutines
  - another component may use that port, i.e., call methods in the port
- The framework holds the components and composes them into "applications"
- Advantages: reusable functionality, well-defined interfaces, etc.
Global Array CCA Component
The GA component registers two provides-ports with its CCA Services object: port instance "ga" of port class GlobalArrayPort, and port instance "dadf" of port class DADFPort (via addProvidesPort("ga") and addProvidesPort("dadf")). An application component registers the matching uses-ports (registerUsesPort("ga"), registerUsesPort("dadf")) and retrieves the interfaces with getPort("ga") and getPort("dadf").
CCA Elements
Within CCAFFEINE (the CCA framework), the GA component and the application component each connect to CCA Services through well-known ports: the GA component supplies the GlobalArrayPort and DADFPort, and the application component drives execution through a GoPort.
GA++
- GA++ is a C++ class library for Global Arrays
- GA++ classes: GAServices and GlobalArray
  - GAServices: initialization, termination, inter-process synchronization, etc.
  - GlobalArray: one-sided (get/put), collective array, and utility operations

    GAServices gs;
    gs.initialize();
    GlobalArray *ga = gs.createGA(...);
    ... do work ...
    ga->destroy();
    gs.terminate();
Sparse Data Management
- Sparse arrays can be implemented with:
  - 1-dimensional global arrays
    - nonzero elements plus row and/or index arrays
  - a set of new operations that follow Thinking Machines' CMSSL:
    - enumerate
    - pack/unpack
    - binning (NxM mapping)
    - 2-key binning/sorting functions
    - scatter_with_OP, where OP = {+, min, max} (see the sketch below)
    - segmented_scan_with_OP, where OP = {+, min, max, copy}
- Adopted in the NWPhys/NWGrid AMR package
- Next step: an explicit sparse format
  - needs more application experience; too many degrees of freedom
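As a sketch of the scatter_with_OP idea for OP = +, the GA C interface offers a scatter-accumulate; the nonzero values and indices below are made up, and the NGA_Scatter_acc usage is an assumption made to illustrate the pattern:

    /* a sparse vector held as parallel value/index lists, accumulated
       into a dense 1-D global array g_x at the listed positions */
    #define NNZ 4
    double vals[NNZ] = {1.0, 2.0, 3.0, 4.0};
    int    idx[NNZ]  = {3, 17, 42, 99};
    int   *subs[NNZ];
    double one = 1.0;

    for (int k = 0; k < NNZ; k++)
        subs[k] = &idx[k];             /* one 1-D subscript per element */

    NGA_Scatter_acc(g_x, vals, subs, NNZ, &one);   /* scatter with OP = + */
    GA_Sync();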
Summary and Future
- The basic idea has proven successful
  - efficient on a wide range of architectures
    - core operations tuned for high performance
  - library substantially extended, but all original (1994) APIs preserved
  - increasing number of application areas
- Ongoing and future work
  - latency hiding on low-end cluster networks through relaxed memory consistency and replication
  - advanced data structures: sparse arrays and hash tables
  - increased support for HPC community standards: ESI, CCA