MPI-3 Persistent WG (Point-to-point Persistent Channels) February 8, 2011 Tony Skjellum, Coordinator MPI Forum.

Purpose of the Persistent WG
More performance
More scalability
Impact pt2pt positively
Impact collectives positively
Even faster than one-sided (RMA) if possible
A small number of new functions and concepts
Exploit temporal locality and program regularity; exploit the opportunity for early binding
Example: we want to cut the critical path for send/receive to far below 500 instructions

What we discussed in Chicago, October Meeting
Ability to accomplish a lot with only 1 or 2 new API calls (MPI_Bind(), MPI_Rebind()); a few more for additional functionality
Ability to enhance MPI scalability with such functions for regular communications
Setup/teardown
Number of messages in flight (1)
Reactivating the next message in flight
Range of persistence: fully early to fully late, with the ability to rebind
Extensions to collectives are obvious and we want to include them (but let's formalize the pt2pt version first)
No slides from October (computer crash)

What we discussed in San Jose, December Meeting
Principles: sends only send, receives only receive (Brightwell)
We want starts to be as simple as DMA kicks
–Start = DMA kick-off with a predefined descriptor, in an optimal implementation
Understanding outstanding channel issues is important; we need clarity on:
–Starting a channel
–Completing a channel
–Starting it again
–Using multiple channels of the same sort to fill a pipeline
–Each channel is stop-and-wait (slack or credit of 1)
We want the very fastest performance possible for the pair of processes
The communicator underlying the channel is irrelevant once the channel is set up
Lifecycle of requests (MPI_Send_init -> Request, MPI_Bind -> Request; what is the relationship?)
Comparisons to Ready Send for optimality

Plan of Action
Formal proposal: still needed
Implementation work: need to decide who, when, and how (Tony and volunteers?)
Define the implementation
Define a goal for when we are ready for a first reading
Should the implementation prove out the performance gain?
–Not required, per Rich
–Required, per Tony (we won't get votes if it is not fast)

Today
Figure out the exact API we want for the formal proposal
Address any semantic questions we did not resolve in December
See who will help bring this into chapter/section form
Plan a meeting in 2 weeks or so to follow up, so we are ready for the next MPI Forum with a formal proposal
Work out details on the board and write them up

Today II
Can we agree on the Ready Send format?
Can we agree on how to make a channel start as simple as a DMA kick-off?
Other caveats? Other ideas?
Can we use bundles of channels?
Do we need the auto-incrementing mod-N format for circular buffer transmissions (both ends)… active datatypes?

MPI_Bind/Rebind (2 functions)
Purpose: reduce the critical path for regular, repetitive transmissions as much as possible
Build persistent channels
Define syntax and semantics for out-of-band (OOB) point-to-point transmissions that pinch off from communicators
Two functions, plus one variant:
–MPI_Bind(), MPI_Rebind()
–Recommended nonblocking variants too: MPI_Ibind(), MPI_Irebind()
Enhanced semantics of Start, Startall, Wait, Waitall, etc. for these kinds of requests

MPI_Bind starts with standard persistent requests
These don't change (including all variants Ssend, Rsend, but no Bsend):
–int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *S_request)
–int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *R_request)
These first-stage requests will be transformed into persistent channel requests
These first-stage requests can be reused
Both sides must do the bind within a bounded time of each other; matching follows the sequential post order per process.
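As a baseline, the standard persistent-request lifecycle that MPI_Bind builds on looks like the following. This uses only standard MPI calls (it needs an MPI environment, e.g. mpicc/mpirun, to build and run); the buffer size and tag are illustrative:

```c
#include <mpi.h>

/* Baseline: standard MPI persistent point-to-point requests.
 * The proposed MPI_Bind would take such a request as its input. */
int main(int argc, char **argv)
{
    int rank, buf[4] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Send_init(buf, 4, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Recv_init(buf, 4, MPI_INT, 0, /*tag=*/7, MPI_COMM_WORLD, &req);

    if (rank <= 1) {
        for (int i = 0; i < 3; i++) {    /* the same request is reused */
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    }
    MPI_Finalize();
    return 0;
}
```

The argument lists above match MPI_Send_init/MPI_Recv_init exactly, which is why the proposal can reuse them unchanged as the first stage of channel creation.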

Protocol and Rendezvous
There is a bind protocol:
–Send side: MPI_Bind(S_Request, &PS_Request, info, comm);
–Recv side: MPI_Bind(R_Request, &PR_Request, info, comm);
This is OOB from other communication on the communicator comm, and matches exactly as the original pt2pt send/recv would match
Once bound, the communication space is independent of comm, and no ordering guarantees exist between this pair and other communication on comm
S_Request and R_Request still exist at this point and are not related to the bind; they can be reused for normal communication on the communicator, or as input to further bind calls
Info: anything special we want to indicate in terms of algorithmic issues, specialized network hints, etc.
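Putting the bind protocol together, the intended sender-side call sequence might look like the sketch below. MPI_Bind is the function proposed by this working group, not part of any MPI standard, so this will not compile against an existing MPI library; buf, count, dest, tag, and comm stand for whatever the application already uses:

```c
/* Sketch of the proposed bind protocol (MPI_Bind is NOT standard MPI).
 * The sender first creates a standard persistent request, then binds
 * it into an out-of-band persistent channel. */
MPI_Request s_req;    /* first-stage (standard persistent) request */
MPI_Request ps_req;   /* bound channel request produced by MPI_Bind */
MPI_Info info;
MPI_Info_create(&info);   /* hints: algorithm/network specifics, if any */

MPI_Send_init(buf, count, MPI_DOUBLE, dest, tag, comm, &s_req);
MPI_Bind(s_req, &ps_req, info, comm);   /* proposed API call */

/* From here on, ps_req drives the channel. s_req is untouched by the
 * bind and remains reusable for normal communication or further binds.
 * The receiver mirrors this with MPI_Recv_init + MPI_Bind. */
```

The matching rule is the key point: the bind rendezvous matches exactly as the original send/recv would, but afterwards the channel is decoupled from comm's ordering.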

Protocol and Rendezvous, 2
Nonblocking variants are possible too:
–Send side: MPI_Ibind(S_Request, &PS_Request, info, comm, &EphReq);
–Recv side: MPI_Ibind(R_Request, &PR_Request, info, comm, &EphReq);
Blocking and nonblocking binds can match
The ephemeral requests are simply waited on for completion, like any other pt2pt operation.

Using the bound requests
When the sender wants to send, it calls
–MPI_Start(PS_Request, &status);
When the receiver wants to receive, it calls
–MPI_Start(PR_Request, &status);
You can wait/test (in the future, mprobe)
Sender side: you cannot Start() again until the request has been waited on
Receiver side: the data must have been received before a restart
Only one message in flight
Rationale: a sender-side test should be lightweight (no polling) to allow a cheap check for completion
Receive side: Waitsome or Test should allow lightweight (non-polling or short-polling) approaches to allow a quick check for completion
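Under the stop-and-wait (credit-of-1) rule above, a channel is driven in a strict start/wait cycle. A sketch of the intended sender loop, again against the proposed API (the two-argument MPI_Start shown on the slide belongs to the proposal's enhanced semantics, not standard MPI; fill_buffer and nsteps are hypothetical application-side names):

```c
/* Sketch: driving a bound channel (proposed semantics, not standard MPI).
 * Credit of 1: each Start must complete before the next Start is legal. */
MPI_Status status;
for (int step = 0; step < nsteps; step++) {
    fill_buffer(buf, step);             /* hypothetical app routine */
    MPI_Start(PS_Request, &status);     /* kick: ideally one predefined
                                           DMA descriptor launch */
    MPI_Wait(&PS_Request, &status);     /* required before restarting */
}
```

This is exactly the pattern whose per-iteration cost the WG wants far below 500 instructions; the bandwidth cost of stop-and-wait is what the bundles proposal below addresses.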

Auto-Addressing Buffers: managing abstractions like autoincrement send/receive
A special form of early binding allows the buffer address to be incremented by a fixed length, mod N, so that a circular send and/or receive buffer is enabled with a single channel {N = stride to the next buffer, not the buffer length}
For a DMA-based implementation, this would be a modify-and-kick instead of just a kick
It can still be quite fast compared to a full-weight send/receive, particularly because the channel could set this up algorithmically in advance
This use case requires adding more syntax, but it is very valuable to capture
FYI: a new datatype approach is a clever way to get this feature. The full concept is still needed, but the use case is valuable.

Bundles of pairwise transfers
To allow multiple messages in flight (parallel channels), applications could make sets of bound requests for a given {sender, receiver} pair
A set of requests is an enumerated list of persistent requests made by bind
They allow a user program (e.g., with ascending tags) to order a set of transmissions round-robin, and therefore ensure full-bandwidth transmission between pairs, rather than the stop-and-wait behavior a single in-flight channel would imply
We could leave all of that up to the user, with each request bound individually; optionally, we could provide something to support bundles

Bundles, cont'd
Because there is a relatively high cost for bind, we can offer an additional API, MPI_Mbind():
–Send side: MPI_Mbind(S_Request, &PS_Request_Array, N_requests, info, comm);
–Recv side: MPI_Mbind(R_Request, &PR_Request_Array, N_requests, info, comm);
–This array of requests can be used in any way by the sender/receiver, with matching ordered pairwise
–Right now, we require N_requests to be the same on sender and receiver; we will explore non-one-to-one matching later
Not to be overly lugubrious, but we could offer only this API and use N_requests=1, to keep the number of functions tiny
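Round-robin use of a bundle to keep the pipeline full might look like this sketch. MPI_Mbind is the proposed function, and the loop structure and names (N, nsteps, s_req) are illustrative only:

```c
/* Sketch: N parallel channels driven round-robin (proposed API, not
 * standard MPI). With N messages in flight, the per-channel
 * stop-and-wait credit of 1 no longer caps bandwidth. */
MPI_Request ps[N];
MPI_Status status;
MPI_Mbind(s_req, ps, N, info, comm);      /* proposed: bind N channels */

for (int step = 0; step < nsteps; step++) {
    int c = step % N;                     /* pick the next channel */
    MPI_Wait(&ps[c], MPI_STATUS_IGNORE);  /* reclaim it; completes
                                             immediately if inactive */
    MPI_Start(&ps[c], &status);           /* launch the next message */
}
```

With ascending tags per channel, pairwise-ordered matching keeps the c-th send landing in the c-th receive, which is what makes the round-robin safe.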

Rebind Protocol and Rendezvous
There is a rebind protocol too:
–int MPI_Rebind(void *buf, int count, MPI_Datatype datatype, int dest, int tag, [need enum of xfer protocol], MPI_Comm comm, MPI_Info info, INOUT MPI_Request *S_request)
–int MPI_Rebind(void *buf, int count, MPI_Datatype datatype, int source, int tag, [need enum of xfer protocol], MPI_Comm comm, MPI_Info info, INOUT MPI_Request *R_request)
Use the same comm as in the original bind
This is OOB from other communication on the communicator comm, and matches exactly as the original pt2pt send/recv would match
Once bound, the communication space is independent of comm, and no ordering guarantees exist between this pair and other communication on comm
This should be cheaper than a Bind(), but not necessarily
Mrebind() on subsets of the original set of channels made by Mbind() is allowed too.

Clean Up
MPI_Request_free() is not our favorite way of ending channels
Recommended instead:
–MPI_Bind_free(MPI_Request *req, int N_requests)
–These can be any N channels made over any communicators with any pairs, and the call is loosely collective between the pairs only
–MPI_Ibind_free() is suggested also! We always prefer nonblocking :-)
The original communicator or communicators used to form the channel could have already been destroyed before this happens; that's OK

MPI_Info in Bind/Rebind