OFA OpenFramework WG SHMEM/PGAS Feedback Worksheet, 1/27/14

Purpose of meeting
Gather API requirements for SHMEM/PGAS, and more generally for one-sided parallel programming models, in areas such as:
– Memory registration
– Remote memory references
– Collectives
– Active message implementation considerations
– Others (e.g. topology)?

What is the OpenFramework WG Charter?
– Develop a more extensible and more generalized framework that allows applications to make the best use of current and future RDMA network technologies.
– Avoid fragmentation of the OFED software ecosystem into multiple vendor-specific components – this is already happening to some extent with PSM, MXM, FCA, etc.

Memory Registration API Requirements (1)
– Necessary evil? Some say yes, if it is needed for good performance.
– Scalability is a very important property: the API should be designed to allow cases where only local memory registration information is required to access remote memory (see the sketch below).
– On-demand paging is nice, but not essential (see next slide).
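To make the scalability point concrete, here is a minimal C sketch, assuming a symmetric-heap model in which every PE maps its heap at the same virtual base and the provider assigns one well-known registration key. Under those assumptions the initiator can compute the remote address and key from purely local information, with no per-peer key exchange. The function name and the fixed key value are illustrative only, not part of any existing API.

/* Sketch: deriving a remote (address, key) pair from local info only. */
#include <stdint.h>
#include <stdio.h>

#define SYMMETRIC_RKEY 0x1234u   /* assumed well-known / provider-assigned key */

static uintptr_t sym_base;       /* local symmetric heap base; same on all PEs */

/* Compute the target of a remote access to peer 'pe' for the memory that
 * corresponds to 'local_ptr', without exchanging any keys. */
static void remote_target(const void *local_ptr, int pe,
                          uint64_t *remote_addr, uint64_t *rkey)
{
    uint64_t offset = (uintptr_t)local_ptr - sym_base;
    *remote_addr = sym_base + offset;  /* remote base equals local base by assumption */
    *rkey = SYMMETRIC_RKEY;            /* same key is valid at every PE */
    (void)pe;                          /* 'pe' selects the endpoint, not the key */
}

int main(void)
{
    static char heap[4096];            /* stand-in for the symmetric heap */
    sym_base = (uintptr_t)heap;
    uint64_t raddr, rkey;
    remote_target(heap + 128, /*pe=*/3, &raddr, &rkey);
    printf("target PE 3 at 0x%llx, key 0x%llx\n",
           (unsigned long long)raddr, (unsigned long long)rkey);
    return 0;
}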

Memory Registration API Requirements (2)
– Fortran 2008 (coarrays) in particular could benefit from the ability to register large amounts of virtual memory that may be only sparsely populated.
– The Portals4 non-matching interface has an option that supports this, provided the implementation supports PTL_TARGET_BIND_INACCESSIBLE.
– Compromise functionality: a large VM region can be "pre-registered" to prep the I/O MMU and device driver without pinning pages; a second function is then called to pin segments of the region when remote memory access is needed (see the sketch below).
– Growable registrations (both down and up) would also be useful.
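One possible shape for that two-phase registration, sketched as hypothetical C prototypes (none of these names come from an existing OFA/OFI header):

/* Hypothetical two-phase registration sketch. */
#include <stddef.h>
#include <stdint.h>

typedef struct prereg_region prereg_region_t;   /* opaque provider handle */

/* Phase 1: describe the whole (possibly sparse) region up front so the
 * I/O MMU and driver can be prepared; no pages are pinned yet. */
int mr_prereg(void *base, size_t len, uint64_t access_flags,
              prereg_region_t **region_out);

/* Phase 2: pin, and make remotely accessible, only a sub-range on demand,
 * e.g. when a coarray segment is first touched by a remote PE. */
int mr_pin_segment(prereg_region_t *region, size_t offset, size_t len);

/* Grow the registered region downward or upward without re-registering,
 * matching the "growable" requirement above. */
int mr_grow(prereg_region_t *region, size_t new_lower, size_t new_upper);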

Small remote memory reference API requirements – perf and ops
– PGAS compilers implemented directly on top of native RDMA functionality need an API that can deliver high performance for small non-blocking PUTs (stores) and GETs (loads).
– Typical remote memory accesses are much smaller than for MPI programs, and there are many more of them (see the sketch below).
– Atomic memory ops are important (see later slide).
– PUT with a remote completion event.
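The access pattern the fabric API has to serve well looks roughly like this OpenSHMEM fragment (assuming an OpenSHMEM 1.2-or-later library for shmem_init/shmem_finalize): many 4-byte PUTs scattered across PEs, with a single completion point for the whole batch rather than a wait per transfer.

/* Many tiny PUTs, one completion point. */
#include <shmem.h>

#define N 1024
static int table[N];               /* symmetric: exists on every PE */

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Many 4-byte stores scattered across the machine. */
    for (int i = 0; i < N; i++) {
        int target = (me + i) % npes;
        shmem_int_put(&table[i], &i, 1, target);   /* tiny PUT */
    }

    shmem_quiet();          /* single completion point: all PUTs remotely visible */
    shmem_barrier_all();
    shmem_finalize();
    return 0;
}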

Small remote memory reference API requirements – ordering
PGAS compilers in particular have special ordering requirements:
– Most users (writers of UPC and Fortran 2008 coarray apps) expect strict WAW and WAR ordering to a given target from a given initiator unless they explicitly request relaxed ordering via macros, etc.
– Point-to-point ordering is required between remote memory transactions from a given initiator to the same address at a given target.
– The Portals4 spec (excerpted in the backup material) describes in a nutshell the kind of ordering needed by typical PGAS compilers for small puts/gets/AMOs.
– Ideally this type of ordering could be controlled on a per-transaction basis, since language extensions do allow users to specify relaxed ordering (see the sketch below).
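The cost of weak ordering shows up in the classic data-then-flag pattern. In OpenSHMEM terms (sketch below, assuming OpenSHMEM 1.2+ and at least two PEs), the runtime must insert an explicit shmem_fence between two PUTs to the same target; an API with per-transaction ordering control would let a compiler request this only where the source program actually needs it.

/* Why per-initiator ordering matters: data PUT must precede flag PUT. */
#include <shmem.h>

static long data = 0;     /* symmetric */
static long flag = 0;     /* symmetric */

int main(void)
{
    shmem_init();
    if (shmem_my_pe() == 0) {
        long v = 42, ready = 1;
        shmem_long_put(&data, &v, 1, 1);      /* PUT #1: payload          */
        shmem_fence();                        /* order PUT #1 before #2   */
        shmem_long_put(&flag, &ready, 1, 1);  /* PUT #2: ready flag       */
    } else if (shmem_my_pe() == 1) {
        shmem_long_wait_until(&flag, SHMEM_CMP_EQ, 1);
        /* With ordering guaranteed, data == 42 here. */
    }
    shmem_finalize();
    return 0;
}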

Small remote memory reference API requirements – completion notification
– Lightweight completion notification is very desirable, especially for PUTs (stores); it is not as important for GETs (loads).
– Ideally allow batching of groups of PUT/GET requests with a single local completion notification (see the sketch below).
– Existing examples of this type of functionality:
  – Non-blocking implicit operations that complete with a single gsync call, which returns only once all outstanding PUTs are globally visible (Quadrics SHMEM, Cray DMAPP).
  – The Portals4 counter-like concept, where a simple counter can be used to track completion of RDMA operations at the source. That API also has the option to deliver a full event in the error case to better help with recovery.
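One way such lightweight, batched completion could look is the counter-style sketch below; the xput/xcntr names are hypothetical and only illustrate the Portals4/DMAPP-like pattern described above.

/* Hypothetical counter-based completion sketch. */
#include <stdint.h>
#include <stddef.h>

typedef struct xcntr xcntr_t;     /* opaque lightweight completion counter */

int xcntr_open(xcntr_t **c);
/* Block until the counter reaches 'threshold'; on error, return a code and
 * optionally a full event describing the failed operation (per the
 * Portals4-like recovery requirement above). */
int xcntr_wait(xcntr_t *c, uint64_t threshold);

/* Non-blocking PUT bound to a counter instead of a per-operation event. */
int xput(int target_pe, uint64_t remote_addr,
         const void *src, size_t len, xcntr_t *c);

/* Usage pattern a SHMEM/PGAS runtime would follow:
 *
 *   xcntr_t *c; xcntr_open(&c);
 *   for (i = 0; i < n; i++)
 *       xput(pe[i], raddr[i], buf[i], len[i], c);   // no per-op event
 *   xcntr_wait(c, n);                               // one wait for the batch
 */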

Small remote memory reference API requirements – Atomic memory ops
– A rich set of AMOs is useful – more than FADD and CSWAP is needed; e.g. multi-element AMOs for active message support.
– It would be nice to have AMOs that can support the MCS lock algorithm at scale, which implies possibly needing 128-bit AMOs (see the sketch below).
– Two kinds of AMO performance characteristics:
  – Low latency but reliable (either fail or succeed, with the result reported back to the initiator). This allows locks, queues, etc. to be used without giving up resilience to transient network errors.
  – "In memory" computation where only a good-enough answer is needed; throughput matters more than reliability. Example: Table Toy.
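Why 128 bits: in a distributed MCS lock the tail has to name a remote queue node, i.e. a (PE, address) pair, which does not fit in a 64-bit AMO at scale. Below is a sketch of the acquire path assuming hypothetical amo_swap128/put128 remote primitives (not an existing API).

/* MCS lock acquire over hypothetical 128-bit remote atomics. */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t pe; uint64_t addr; } qnode_ref_t;   /* 128-bit value */

struct qnode {
    volatile uint64_t locked;   /* spun on locally; cleared by the predecessor */
    qnode_ref_t next;           /* successor's (pe, addr), written remotely    */
};

/* Assumed (hypothetical) remote primitives:
 *   amo_swap128: atomically swap a 128-bit value at (pe, raddr), return the old value
 *   put128:      store a 128-bit value at (pe, raddr)                                 */
void amo_swap128(int pe, uint64_t raddr, qnode_ref_t newval, qnode_ref_t *oldval);
void put128(int pe, uint64_t raddr, qnode_ref_t val);

void mcs_acquire(int lock_pe, uint64_t tail_addr, int my_pe, struct qnode *me)
{
    me->locked = 1;
    me->next = (qnode_ref_t){ 0, 0 };

    qnode_ref_t self = { (uint64_t)my_pe, (uint64_t)(uintptr_t)me };
    qnode_ref_t prev;
    amo_swap128(lock_pe, tail_addr, self, &prev);   /* become the new tail */

    if (prev.addr != 0) {
        /* Queue was non-empty: publish ourselves as the predecessor's
         * successor, then spin on local memory until it hands over the lock. */
        put128((int)prev.pe, prev.addr + offsetof(struct qnode, next), self);
        while (me->locked)
            ;   /* predecessor clears 'locked' with a remote store on release */
    }
}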

Large RDMA transfers
– Not so different from MPI?
– Need RDMA read and write.
– Option for fences between transactions.
– Per-transfer network ordering options would be great.
– Notification of both local completion (the buffer can be reused in the case of a write) and global completion (see the sketch below).
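For reference, the libfabric interface that eventually came out of this working group exposes these knobs roughly as follows. This is only a fragment (endpoint setup, memory registration, and full error handling are omitted) and the helper name is illustrative.

/* Fenced RDMA write with delivery-complete (global) completion. */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

ssize_t rdma_write_fenced(struct fid_ep *ep, struct fid_cq *cq,
                          const struct iovec *iov, void **desc,
                          fi_addr_t dest, const struct fi_rma_iov *rma,
                          void *ctx)
{
    struct fi_msg_rma msg = {
        .msg_iov = iov, .desc = desc, .iov_count = 1,
        .addr = dest, .rma_iov = rma, .rma_iov_count = 1,
        .context = ctx,
    };
    /* FI_FENCE orders this transfer after previously issued ones;
     * FI_DELIVERY_COMPLETE requests global rather than merely local
     * completion before the CQ entry is generated. */
    ssize_t rc = fi_writemsg(ep, &msg, FI_FENCE | FI_DELIVERY_COMPLETE);
    if (rc)
        return rc;

    struct fi_cq_entry comp;
    while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
        ;                                   /* poll for the completion */
    return 0;
}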

Collectives
– It would be nice not to have to reinvent the wheel for multiple, frequently concurrently used programming models, e.g. an app using both SHMEM and MPI.
– A lower-level common interface for frequently used collective operations: barrier, reductions, coalesce (allgather), alltoall (see the sketch below).
– Flat would be better: the SHMEM/PGAS implementation should not have to treat on-node peers (within a cache-coherent domain) specially.
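Such a shared lower layer might look like the hypothetical prototypes below (illustrative names only); both an MPI library and a SHMEM library could map their communicators or teams onto the same group object, leaving topology-aware and on-node optimizations to the provider.

/* Hypothetical low-level collective interface sketch. */
#include <stddef.h>

typedef struct coll_group coll_group_t;   /* opaque; maps to a set of endpoints */

typedef enum { COLL_SUM, COLL_MIN, COLL_MAX, COLL_BAND, COLL_BOR } coll_op_t;
typedef enum { COLL_INT64, COLL_DOUBLE } coll_dtype_t;

int coll_group_create(const int *members, size_t nmembers, coll_group_t **grp);
int coll_barrier(coll_group_t *grp);
int coll_allreduce(coll_group_t *grp, const void *sendbuf, void *recvbuf,
                   size_t count, coll_dtype_t dtype, coll_op_t op);
int coll_allgather(coll_group_t *grp, const void *sendbuf, size_t len,
                   void *recvbuf);          /* "coalesce" */
int coll_alltoall(coll_group_t *grp, const void *sendbuf, size_t len,
                  void *recvbuf);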

Active Message Support
Two general types of uses:
– Lower performance: RPC-like capabilities implied by some types of operations, e.g. upc_global_alloc.
– Higher performance: PGAS languages implemented using an active message paradigm, as well as other active-message-based programming models such as Charm++.
API requirements (see the sketch below):
– The API should support sending requests to pre-initialized queues for which the target has registered callback functions to process the data sent by the initiator. The payload can be restricted to a small size, ~256 bytes or less.
– The initiator of a message should get a response back indicating that the message has arrived, and optionally that the message has been consumed by the callback function.
– The implementation needs to be able to handle transient network errors; a message is delivered once and only once to the target.
– Some applications may require ordering of messages from a given initiator to a given target; it would be good to be able to specify this at queue initialization.
– Flow control is important.
– PAMI is a good example of an interface that meets many of these requirements.
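A queue-plus-callback interface matching the requirements above could be sketched as follows; the names are hypothetical (this is not PAMI or any existing API), with ordering and flow-control credits fixed at queue creation and arrival/consumption notifications requested per send.

/* Hypothetical active-message API sketch. */
#include <stddef.h>

#define AM_MAX_PAYLOAD 256           /* small-payload restriction from above */

typedef struct am_queue am_queue_t;  /* opaque, pre-initialized queue */

/* Callback invoked at the target exactly once per delivered message. */
typedef void (*am_handler_t)(int src_pe, const void *payload, size_t len,
                             void *queue_ctx);

/* Queue creation: per-initiator ordering and the flow-control window are
 * chosen up front, per the requirements above. */
int am_queue_create(am_handler_t handler, void *queue_ctx,
                    int ordered, size_t credits, am_queue_t **q);

/* Send a small payload; 'delivered' fires when the message arrives at the
 * target, 'consumed' (optional, may be NULL) fires after the target
 * callback has run. Exactly-once delivery across transient network faults
 * is the provider's responsibility. */
int am_send(am_queue_t *q, int target_pe,
            const void *payload, size_t len,
            void (*delivered)(void *), void (*consumed)(void *), void *ctx);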

References
– IBM Parallel Environment Runtime Edition Version 1 Release 2: PAMI Programming Guide (SA )
– Using the GNI and DMAPP APIs

Backup Material

Portals 4 Spec (excerpt on message ordering)
The default ordering semantics for Portals messages differ for short and long messages. The threshold between "short" and "long" is defined by two parameters, the maximum write-after-write size (max_waw_ordered_size) and the maximum write-after-read size (max_war_ordered_size). Both parameters are controlled by the desired and actual arguments of PtlNIInit(). Note that replies and acknowledgments do not require ordering.

When one message that stores data (put, atomic) is followed by a message that stores or retrieves data (put, atomic, get) from the same initiator to the same target, and both messages are less than max_waw_ordered_size in length, a byte from the second message that targets the same offset within the same LE (or ME) as a byte from the first message will perform its access after the byte from the first message. Similarly, when one message that retrieves data (get) is followed by a second message that stores data (put, atomic) from the same initiator to the same target, and both messages are less than max_war_ordered_size in length, a byte from the second message that targets the same offset within the same LE (or ME) as a byte from the first message will perform its access after the byte from the first message.

The order in which individual bytes of a single message are delivered is always unspecified. The order in which non-overlapping bytes of two different messages are delivered is not specified unless the implementation sets the PTL_TOTAL_DATA_ORDERING option in the actual features limits field. When total data ordering is provided and the short-message constraints are met, the first message must be entirely delivered before any part of the second message is delivered. Support for the ordering of bytes between messages is an optional feature, since some implementations may be unable to provide such strict ordering semantics.