OFA Openframework WG SHMEM/PGAS Developer Community Feedback 2/2014

This material was assembled with the help of the following organizations and people:
Los Alamos National Lab – Latchesar Ionkov, Ginger Young
Oak Ridge National Lab – Steve Poole, Pavel Shamis
Sandia National Lab – Brian Barrett
Intel – David Addison, Charles Archer, Sayantan Sur
Mellanox – Liran Liss
Cray – Monika ten Bruggencate, Howard Pritchard (scribe)
Input was also obtained from Paul Hargrove (LBL), Jeff Hammond (ANL), and others.

Outline
Brief SHMEM and PGAS intro
Feedback from developers concerning API requirements:
– Endpoint considerations
– Memory registration
– Remote memory references
– Collectives
– Active messages

(Open)SHMEM
[Figure: per-PE virtual address space (PE 0 … NPES-1) showing the data segments and the symmetric heap]
Library based one-sided program model
All ranks (PEs) in the job run the same program (SPMD)
Only objects in symmetric regions – data segment(s), symmetric heap – are guaranteed to be remotely accessible
Various vendor specific variations/extensions
Somewhat archaic interface (think Cray T3D), being modernized as part of the OpenSHMEM effort

UPC
Compiler based program model – 'C' with extensions
Each thread has affinity to a certain chunk of shared memory
Objects declared as shared are allocated out of shared memory
Objects can be distributed across the chunks of shared memory in various ways – blocked, round robin, etc. (a short example appears below)
Collective operations, locks, etc.
Like MPI, evolving over time
A thread may map to a SHMEM PE in the rank enumeration
[Figure: shared and private address spaces for threads 0 … THREADS-1, showing the blocked layout of an array declared as shared [3] int x[(T*3)+3], where T = THREADS; thread i holds x[i*3], x[i*3+1], x[i*3+2], and the final block x[T*3], x[(T*3)+1], x[(T*3)+2] wraps back to thread 0]
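A minimal UPC sketch (not from the original slides) of a blocked shared array like the one in the figure; the block size of 3 is kept, but the array is simplified to THREADS * 3 elements so the declaration stays legal under dynamic THREADS. upc_forall gives each iteration affinity to the owning thread:

#include <upc_relaxed.h>
#include <stdio.h>

/* One block of 3 elements per thread (simplified from the slide's
   x[(T*3)+3] declaration). */
shared [3] int x[THREADS * 3];

int main(void)
{
    int i;

    /* Each iteration executes on the thread with affinity to &x[i]. */
    upc_forall (i = 0; i < THREADS * 3; i++; &x[i])
        x[i] = MYTHREAD;

    upc_barrier;

    if (MYTHREAD == 0)
        for (i = 0; i < THREADS * 3; i++)
            printf("x[%d] has affinity to thread %d\n", i, x[i]);

    return 0;
}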

Fortran 2008 (CoArrays or CAF)
Also a compiler based one-sided program model
The CoArray construct is part of the Fortran 2008 standard
Like SHMEM, currently supports only the SPMD model
Address space model is closer to SHMEM than to UPC
But there is a significant complication with Fortran 2008 (and likely in future versions of UPC): there is not a clean separation between "shared" and "private" address spaces. See next slide.

UPC/Fortran 2008 (2)

This is not currently legal UPC code, but the Fortran 2008 equivalent is; UPC is used here only to avoid writing the example in Fortran.

#include <upc.h>
#include <stdlib.h>
#include <assert.h>

/* Array of shared pointers to private long (likely preregistered with NIC) */
long *shared sh_ptr[THREADS];

int main(void)
{
    /* private pointers to private long */
    long *my_mallocd_array = NULL;
    long *thread_zeros_array = NULL;

    if (MYTHREAD == 0) {
        my_mallocd_array = (long *)malloc(1000 * sizeof(long));
        sh_ptr[0] = my_mallocd_array;
    }

    upc_barrier(100);

    if (MYTHREAD == 1) {
        /* Compiler translates this into an underlying RDMA get (load),
           including retrieval of memory registration info, etc. */
        thread_zeros_array = sh_ptr[0];
        for (int i = 0; i < 10; i++)
            /* Compiler translates this into an RDMA put */
            thread_zeros_array[i] = 0xdeadbeef;
    }

    upc_barrier(101);

    if (MYTHREAD == 0) {
        for (int i = 0; i < 10; i++)
            assert(my_mallocd_array[i] == 0xdeadbeef);
    }
    return 0;
}

Different Views of PGAS – implementing and using
Implementer viewpoints: active message based implementations vs. pure (almost) one-sided implementations
User viewpoints:
– Productivity more important than performance
– Expect SHMEM, etc. to beat MPI on performance
– Want to use SHMEM, etc. inside an MPI app for performance reasons

Some other animals in the one-sided program model zoo
Charm++ (proxy)
ARMCI (proxy)

Community Feedback

SHMEM/PGAS API Requirements – caveats and disclaimers
Assume a significant degree of overlap with the needs of the MPI community, e.g.:
– Similar needs with respect to supporting fork
– Similar needs concerning munmap if memory registration/deregistration needs to be managed explicitly by the SHMEM/PGAS runtime
What is covered here are the API requirements more particular to SHMEM/PGAS and similar one-sided program models.

API Endpoint Considerations
Endpoint memory usage needs to be scalable
Low overhead mechanism for enumeration of endpoints, i.e. "ranks" rather than LIDs, etc.
It would be great to have connectionless (yet reliable) style endpoints and interfaces; e.g. this method in struct fi_ops_rma looks promising:
    int (*writemsg)(fid_t fid, const struct fi_msg_rma *msg, uint64_t flags);
If connected-style endpoints are the only choice, methods to do both on-demand and full wire-up efficiently would be great, e.g. good performance of both the struct fi_ops_av insert method and the struct fi_ops_cm connect/listen/accept methods
An API that supports a "thread hot" thread safety model for endpoints
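A rough sketch of the "ranks rather than LIDs" request, using today's libfabric address vector API (which differs in naming from the 2014 draft struct fi_ops_av mentioned above): with an FI_AV_TABLE address vector, fi_av_insert() assigns fi_addr_t values in insertion order, so a runtime that inserts peer addresses in rank order can use the PE rank directly as the destination of a put or get.

#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Open an address vector sized for the job; with FI_AV_TABLE the fi_addr_t
   returned by fi_av_insert() is simply the insertion index, so inserting
   peer addresses in rank order makes fi_addr_t == PE rank. */
static struct fid_av *open_rank_av(struct fid_domain *domain, size_t npes)
{
    struct fi_av_attr attr;
    struct fid_av *av = NULL;

    memset(&attr, 0, sizeof attr);
    attr.type  = FI_AV_TABLE;
    attr.count = npes;

    if (fi_av_open(domain, &attr, &av, NULL))
        return NULL;
    return av;
}

/* Later, for all peers at once (raw addresses gathered out of band, e.g. via
   the job launcher):
       fi_av_insert(av, addrs, npes, NULL, 0, NULL);
   after which rank r is addressed as (fi_addr_t) r in fi_write()/fi_read(). */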

Memory Registration API Requirements (1)
A necessary evil?
– Some say yes, if needed for good performance
Scalability of memory registration is a very important property
– The API should be designed to allow for cases where only local memory registration info is required to access remote memory
– Does this lead to security issues? There has to be some protection mechanism. One idea: let the application supply the "r_key" values to use; this idea is already in the libfabric API (the struct fi_ops_mr mr_reg method) – see the sketch below
On-demand paging option requirements
– Want the flexibility to do on-demand paging when requested, but may also want a "pinned" pages method, specified by the application
– A PGAS compiler has to have control of memory allocation to avoid needing this – but often the compiler does have control of the heap allocator, etc.
– Fortran 2008, and possibly future versions of UPC, may still find this useful
– May also be helpful for library based one-sided models, especially if restrictions on what memory is remotely accessible become more relaxed
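A rough sketch of the application-supplied key idea, using today's libfabric fi_mr_reg() rather than the 2014 draft struct fi_ops_mr referenced above (SYMM_HEAP_KEY is a made-up constant): if every PE registers its symmetric heap with the same application-chosen key, a peer can compute the rkey locally and no registration information has to be exchanged, which is the scalability property being asked for here. Providers running in FI_MR_PROV_KEY mode may override the requested key, so this only works where application-supplied keys are honored.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

#define SYMM_HEAP_KEY 0x5348u   /* hypothetical well-known key, same on every PE */

static struct fid_mr *register_symmetric_heap(struct fid_domain *domain,
                                              void *heap_base, size_t heap_len)
{
    struct fid_mr *mr = NULL;
    int rc = fi_mr_reg(domain, heap_base, heap_len,
                       FI_REMOTE_READ | FI_REMOTE_WRITE,
                       0 /* offset */, SYMM_HEAP_KEY /* requested key */,
                       0 /* flags */, &mr, NULL);
    return rc ? NULL : mr;
}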

Memory Registration API Requirements (2)
Fortran 2008 (CoArrays) in particular could benefit from the ability to register large amounts of virtual memory that may be only sparsely populated
– The Portals4 non-matching interface has an option which supports this, if the implementation supports PTL_TARGET_BIND_INACCESSIBLE
A compromise functionality: a large VM region can be "pre-registered" to prep the IOMMU and device driver, but without pinning pages; a second function would then be called to pin/register segments of the region when remote memory access is needed
Growable (both down and up) registrations would also be useful (the mmap example)

Small remote memory reference API requirements – performance and operations
PGAS compilers implemented directly on top of native RDMA functionality need an API that can deliver high performance for small non-blocking PUTs (stores) and GETs (loads); typical remote memory accesses are much smaller than for MPI programs, and there are many more of them
Atomic memory ops are important (see later slide)
Put with various completion notification mechanisms:
– A "signaling" put, where a flag value is written by hardware to a given location at the target after all of the put data is visible in target memory (see the sketch below)
– An atomic-update option rather than a simple flag write for remote notification
– Sender side notification for local completion (safe to reuse the buffer) and global completion (data has at least reached the global ordering point at the target)
– It was pointed out that global completion (not just the global ordering point) can also be ascertained by the initiator via any RDMA transaction that results in a non-posted request (RDMA get) at the target, at least for PCIe based hardware
Small partially blocking put requirement (block until it is safe to reuse the local buffer)
– SHMEM on top of Portals4 non-blocking put semantics is an example – what is the cost of implementing partially blocking on top of non-blocking?
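A sketch (mine, not from the slides) of what the "signaling" put bullet costs a runtime when it has to be synthesized from standard OpenSHMEM calls: put the payload, fence so the data is ordered before the flag, then put the flag the target waits on. A native signaling put in the fabric API would collapse this into a single operation.

#include <shmem.h>

static long payload[1000];   /* symmetric: lives in the data segment on every PE */
static long ready_flag = 0;  /* symmetric notification flag */

/* Initiator side: data put, ordering fence, then the notification put. */
void signaling_put(const long *src, int target_pe)
{
    shmem_put64(payload, src, 1000, target_pe);
    shmem_fence();                            /* order data before the flag */
    shmem_long_p(&ready_flag, 1, target_pe);  /* the "signal" */
}

/* Target side: wait for the flag, after which payload is valid locally. */
void wait_for_signal(void)
{
    shmem_long_wait_until(&ready_flag, SHMEM_CMP_EQ, 1);
}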

Small remote memory reference API requirements – ordering
PGAS compilers in particular have a special ordering requirement: many PGAS compilers can benefit greatly from hardware which provides protection against WAW, WAR, and RAW hazards for remote memory references from a given initiator to a given target and address in the target's address space.

#include <upc.h>
#include <stdio.h>

void update_shared_array(shared long *g_array, long *local_data,
                         int *g_idx, int nupdates)
{
    int i;

    /* The app knows g_idx has no overlap between different threads, but a
       possible repeat of indices for one thread, so no need for locks, etc.
       in the update loop. */
    for (i = 0; i < nupdates; i++) {
        /* The problem for the compiler is here: what if for some n, m it is
           the case that g_idx[n] == g_idx[m]?  See notes. */
        g_array[g_idx[i]] += local_data[i];
    }

    upc_barrier(0);

    if (MYTHREAD == 0) {
        printf("Done with work\n");
    }
}

Small remote memory reference API requirements – atomic memory ops
A rich set of AMOs is also useful – need more than FADD and CSWAP
Multi-element AMOs for active message support, etc.
Nice to have AMOs that can support the MCS lock algorithm at scale – this implies possibly needing 128 bit AMOs (see the sketch below)
At least two kinds of AMO performance characteristics:
– Low latency but reliable (either fail or succeed, with the result reported back to the initiator); this allows the use of locks, queues, etc. without giving up on resilience to transient network errors
– "At memory" computation, where only a good enough answer is needed; throughput is more important than reliability (example: GUPS)
32/16 bit granularity ops would be useful in addition to 64 bit and possibly 128 bit
AMO cache (on the NIC) coherency issues? May need functionality in the API for this – maybe a sync method on the domain object could be used?
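One reading (my illustration, not from the slides) of why MCS locking at scale points to 128 bit AMOs: the lock's tail word has to identify both the PE at the tail of the queue and the remote address of that PE's queue node, and once those are packed together the value no longer fits in 64 bits, so enqueue/release need an atomic swap or compare-and-swap wider than 64 bits.

#include <stdint.h>

/* Hypothetical packed tail word for a distributed MCS lock.  Swapping this
   atomically (enqueue) or compare-and-swapping it against "empty" (release)
   requires a 128-bit AMO once the node address alone consumes 64 bits. */
typedef struct {
    uint64_t pe;         /* rank of the PE currently at the tail of the queue */
    uint64_t node_addr;  /* address of that PE's queue node in its own memory */
} mcs_tail_t;            /* 16 bytes */

typedef struct {
    uint64_t   locked;   /* spun on locally; cleared by the previous holder */
    mcs_tail_t next;     /* successor PE/node, filled in by the next waiter */
} mcs_qnode_t;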

Small remote memory reference API requirements – AMO survey

Operation        IBM BG/Q   Cray XC    Quadrics SHMEM   IB
swap             x          x          x                -
comp & swap      x          x          x                x
masked swap      x          x (AFAX)   x                -
add              x          x          x                x
bitwise or       x          x          -                -
bitwise and      x          x          -                -
comp & or        x          -          -                -
comp & add       x          -          -                -
comp & and       x          -          -                -
comp & xor       x          -          -                -
min              -          x          -                -
max              -          x          -                -

Remote memory reference API requirements – completion notification
Lightweight completion notification is very desirable, especially for PUTs
– Does an FI_EC_COUNTER type event collector object fit here? (see the sketch below)
Ideally, allow for batching of groups of PUT/GET requests with a single completion notification at the initiator
Get completion information at the target of the get operation:
– Data in the get buffer has been read and is heading over the wire, i.e. the target can reuse the buffer
– Data has arrived in the initiator's memory
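A sketch of the counter-based batching idea using today's libfabric counter API (the FI_EC_COUNTER name above belongs to the 2014 draft): bind a completion counter to the endpoint for RMA writes, issue a batch of puts, then make one fi_cntr_wait() call for the whole batch.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Open a completion counter and bind it to a (not yet enabled) endpoint so
   that every completed RMA write bumps the counter. */
static int setup_put_counter(struct fid_domain *domain, struct fid_ep *ep,
                             struct fid_cntr **cntr_out)
{
    struct fi_cntr_attr attr = {
        .events   = FI_CNTR_EVENTS_COMP,
        .wait_obj = FI_WAIT_UNSPEC,
    };
    int rc;

    rc = fi_cntr_open(domain, &attr, cntr_out, NULL);
    if (rc)
        return rc;
    /* FI_WRITE: count completions of RMA write operations issued on ep. */
    return fi_ep_bind(ep, &(*cntr_out)->fid, FI_WRITE);
}

/* After issuing `nputs` fi_write()/fi_writemsg() operations, a single call
   covers the whole batch:
       fi_cntr_wait(cntr, nputs, -1);
*/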

Large RDMA transfers
Not so different from MPI
Need RDMA read and write
Option for fences between transactions
Per-transfer network ordering options would be great
Notification requirements are similar to those for small remote memory references
For RDMA write: piggyback message data (more than 32 bits of immediate data) coming along with the bulk data – e.g. a GASNet request with handler invocation

Collectives
It would be nice not to have to reinvent the wheel for multiple, often concurrently used program models, e.g. an app using both SHMEM and MPI
– A lower level common interface for frequently used collective operations – barrier, reductions, coalesce (allgather), alltoall
– Flat would be better: no need to do special on-node (within a cache coherent domain) operations within the SHMEM/PGAS implementation
// TODO: nothing defined in libfabric yet for collectives

Active Message Support
Two general types of uses:
– Lower performance: RPC-like capabilities implied by some types of operations, e.g. upc_global_alloc
– Higher performance: needed for PGAS languages implemented using an active message paradigm, as well as other active message based program models like Charm++
The API should support sending requests to pre-initialized queues, for which the target has registered callback functions to process the data sent from the initiator; the payload can be restricted to a small size, ~256 bytes or less (see the sketch below)
The initiator of a message should get a response back that the message has arrived, and optionally that the message has been consumed by the callback function
The implementation needs to be able to handle transient network errors; a message is delivered once and only once to the target
Some applications may require ordering of messages from a given initiator to a given target; it would be good to be able to specify this at queue initialization
Flow control is important
PAMI is a good example of an interface that meets many of these requirements
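A purely hypothetical interface sketch (none of these names exist in libfabric or any other library; they only restate the requirements above in C): the target registers a handler on a queue with a bounded payload size and ordering/flow-control attributes, and the initiator sends small payloads addressed by rank and queue id, with completion meaning the handler has consumed the message.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Handler the target registers per queue; invoked once per delivered message. */
typedef void (*am_handler_t)(const void *payload, size_t len,
                             uint64_t src_rank, void *handler_arg);

struct am_queue_attr {
    size_t max_payload;      /* e.g. 256 bytes or less, per the slide */
    bool   ordered;          /* preserve per-initiator ordering? */
    bool   flow_controlled;  /* back-pressure initiators when the queue fills */
};

/* Hypothetical calls -- declarations only, to show the shape of the API. */
int am_queue_create(const struct am_queue_attr *attr, am_handler_t handler,
                    void *handler_arg, uint64_t *queue_id_out);

/* Delivered exactly once despite transient network errors; "completed"
   means the target's handler has consumed the payload. */
int am_send(uint64_t dest_rank, uint64_t queue_id,
            const void *payload, size_t len);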

References
IBM Parallel Environment Runtime Edition Version 1 Release 2: PAMI Programming Guide (SA )
Using the GNI and DMAPP APIs (Cray)

Thank You #OFADevWorkshop

Backup Material

(Open)SHMEM – example

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

/* Symmetric address – array in the data segment */
long target_array[1000];

int main(int argc, char **argv)
{
    int i, my_pe, n_pes, r_neighbor, l_neighbor;
    int n_elems;
    long *source_array;

    shmem_init();

    n_elems = sizeof(target_array)/sizeof(long);
    my_pe = shmem_my_pe();
    n_pes = shmem_n_pes();
    r_neighbor = (my_pe + 1) % n_pes;
    l_neighbor = (my_pe + n_pes - 1) % n_pes;

    source_array = (long *)malloc(sizeof(long) * n_elems);
    for (i = 0; i < n_elems; i++)
        source_array[i] = (long) my_pe;

    /* Barrier – process and memory synchronization */
    shmem_barrier_all();

    shmem_put64(target_array, source_array, 1000, r_neighbor);

    shmem_barrier_all();

    for (i = 0; i < n_elems; i++)
        if (target_array[i] != l_neighbor)
            printf("something's wrong\n");

    /* vendor specific (at the time of these slides) */
    shmem_finalize();
    return (0);
}

UPC
Compiler based program model – 'C' with extensions
Each thread (PE in the SHMEM model) has affinity to a certain chunk of shared memory
Objects declared as shared are allocated out of shared memory
Objects can be distributed across the chunks of shared memory in various ways – blocked, round robin, etc.
Collective operations, locks, etc.
Like MPI, evolving over time