Advancing the “Persistent” Working Group MPI-4 Tony Skjellum June 5, 2013.

Presentation transcript:


Persistence Goals
– Cross-cutting use of "planned transfers"
  – Point to point
  – Collective
  – Generalized requests
– Allow the program to express
  – Locality, static use of resources
  – Connectivity of sends/receives
– Purpose: improve speed and scalability for regular, static communication

Concepts review
– Drew analogies from MPI/RT point-to-point
– Channels vs. half-channels
– Slack-1: "bind a send" to "a receive"
– Goal: cut the number of instructions in the critical path
– Reuses "Send_init" and "Recv_init"
– Open issues
  – Multiple slack channels
  – Rebinding when some arguments change

Concepts review
– For every non-blocking collective, we can define a "persistent non-blocking collective"
– The approach in MPI/RT was to have collective channels as well as point-to-point channels
– Again, the goal is to allow static resource management (memory, network, etc.), algorithm selection in advance, and single-mode performance of operations

Additional – Differentiated Service
– Each communicator is a separate "MPI fabric": an independently ordered, all-to-all overlay network
– Allow communicators to offer differentiated service
  – No tag matching
  – No wildcards
  – Guarantee of posting (F mode), S mode, …
  – Strong, weak, or no progress
  – Independent buffering rules, and even short/long cutoffs
  – Degree of correctness (e.g., OK to make data errors, vs. MPI-1-compliant "all correct")
  – QoS concepts could be added as well
  – A usable subset of MPI
– Goal: allow each communicator fabric to operate at optimal performance and scalability, and even to make error/fault choices to support FT where appropriate
– Use MPI_Comm_dup and MPI_Comm_create as prototypes for generating such functions

Steps planned for March (now being done, in part, in June)
– Revive the proposals covering the pt2pt-specific calls and the general framework for collectives
– See if anyone else is interested in helping on the committee
– Get a straw indication on the differentiated-service concept
– Review the status of generalized requests and ideas left over from MPI-3; see if persistent generalized requests are useful
– Entertain any other ideas for making MPI pt2pt and collectives faster by using planned transfers

May 2011 Document
– We have a 12-page document from 2011 that we held back while MPI-3 was finishing
– It contains the basic proposals
– We'd like to refine these and update the May 2011 document after this working group meeting today
– We'd like to flesh out any concerns not yet fully understood
– So: first a couple of slides, then we go through the document and capture action items, ideas, and next steps!

Getting back to a specific pt2pt proposal (#1)
– Goal: a very small number of instructions to launch a persistent send/receive pair
  – MPI_Send_init()
  – MPI_Recv_init()
– These are not enough on their own, because they don't lock down all the resources (they are "half" channels)
– Motivation: the "Darius Buntinas measurement" of 250+ instructions in shared memory; this must be reduced

Requirements
– The minimum number of instructions in the critical path to launch a send for the "Nth" time after "setup"
– Arguments frozen after the first time
  – All, or
  – Most (all except starting address and length)
– Want "Start" to translate directly to "kick the DMA" or "kick the transfer offload engine": one OS-bypass / application-bypass instruction
– Makes regular programs finer-grain in overhead and latency
– Provides its own communication space/order/matching (outside the communicator)
– Use all the features you can use in an MPI_Isend; simple datatypes may be faster in practice
– Easy changes to existing MPI programs that are already static/data-parallel (like 2D Life)

Where we got into the weeds before
– How do we create the channel? (One is easy; what about several?)
– How many messages in flight? At least 1; more than 1 are needed to fill the virtual channel
– How do we avoid extra signaling that makes this behave like stop-and-wait at the LLC layer?
– Let's go through the document to refresh and decide next steps…

Q&A