Towards MPI progression layer elimination with TCP and SCTP Brad Penoff and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada Distributed Systems Group HIPS 2006 April 25
Will my application run? Portability aspect of parallel processing integration: the MPI API provides an interface for portable parallel applications, independent of the MPI implementation.
[Diagram: User Code → MPI API → any MPI Implementation → Resources]
Will my application perform well? Portability aspect of parallel processing integration: the MPI API provides an interface for portable parallel applications, independent of the MPI implementation; the MPI middleware provides the glue for the variety of underlying components required by a complex parallel runtime environment, independent of component implementation.
[Diagram: User Code → MPI Middleware → any MPI Implementation → Resources]
MPI Middleware Glues together components: a Job Scheduler Component, a Process Manager Component, and a Message Progression / Communication Component sitting above the transport, operating system, and network.
Message Progression / Communication Component Maintains the necessary state between MPI calls; an MPI call is not a simple library function. Manages the underlying communication, either through the OS (e.g. TCP) or through direct low-level interaction with the interconnect (e.g. InfiniBand).
Communication Requirements Common: portability by supporting all potential interconnects. In this work: portability by eliminating this component altogether by assuming IP! Push MPI functionality down onto IP-based transports, and learn what MPI implementation design changes are necessary.
Component Elimination [Diagram: the Message Progression / Communication Component is removed from the MPI middleware/library, leaving the Job Scheduler and Process Manager Components; the middleware then relies directly on the operating system's transport and the network]
Elimination Motivation Common approach: exploit specific features for all potential interconnects; the middleware does transport-layer “things”; sequencing and flow control complicate the middleware; implemented differently, MPI implementations are incompatible. Our approach here: assume IP; leverage mainstream commodity networking advances; simplify the middleware; increase MPI implementation interoperability (perhaps?).
Elimination Approach View MPI as a protocol, from a networking point of view. MPI: message matching, expected/unexpected queues, short/long protocol. Networking: demultiplexing, storage/buffering, flow control. Design MPI with elimination as a goal.
MPI Implementation Designs TCP SCTP
TCP Socket Per TRC General scheme: a socket per MPI message stream (tag-rank-context, or TRC), plus a control port. MPI_Send calls connect (MPI_Recv could be a wildcard); the resulting socket is stored in a table attached to the communicator object.
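As a rough illustration of the per-TRC scheme above, the C sketch below opens one TCP connection per tag-rank-context and caches it; trc_table, trc_socket, and MAX_TRC are hypothetical names for this sketch, not identifiers from the actual implementation, and the peer address and port are assumed to have been learned via the control port.

/* Sketch: one TCP socket per (tag, rank, context) stream; trc_table is a
 * hypothetical per-communicator table, not a name from the implementation. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_TRC 1024

static int trc_table[MAX_TRC];                 /* one cached socket per TRC */

static int trc_socket(int trc, const char *peer_ip, uint16_t peer_port)
{
    if (trc_table[trc] > 0)                    /* connection already set up */
        return trc_table[trc];

    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(peer_port);        /* port advertised via the control port */
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    if (connect(sd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(sd);
        return -1;
    }
    trc_table[trc] = sd;                       /* remember it with the communicator */
    return sd;
}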
TCP-MPI as a Protocol Matching: select() fd sets for wildcards. Queues: unexpected = socket buffer with flow control; expected = more local, attached to receive handles. Short/long: no distinction, rely on TCP flow control.
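A hedged sketch of how a wildcard receive could be served with select() over the per-TRC sockets; the socks array stands in for the per-communicator socket table from the previous sketch.

/* Sketch: wait for data on any per-TRC socket (wildcard receive). */
#include <sys/select.h>

static int wait_any_trc(const int *socks, int nsocks)
{
    fd_set readfds;
    FD_ZERO(&readfds);

    int maxfd = -1;
    for (int i = 0; i < nsocks; i++) {         /* one fd per TRC socket */
        FD_SET(socks[i], &readfds);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }

    /* Block until some socket is readable; rebuilding and scanning these
     * fd sets is the part that does not scale with many TRC sockets. */
    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
        return -1;

    for (int i = 0; i < nsocks; i++)
        if (FD_ISSET(socks[i], &readfds))
            return i;                          /* index of a ready TRC socket */
    return -1;
}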
TCP per TRC critique The design achieves elimination, but… the number of sockets hits OS per-user limits; system calls are expensive (context switches, copying); select() doesn't scale; flow control; and there is a mismatch: the transport/OS is event-driven while the MPI application is control-driven.
SCTP-based design
What is SCTP? The Stream Control Transmission Protocol: a general-purpose unicast transport protocol for IP network data communications, recently standardized by the IETF. It can be used anywhere TCP is used.
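As a small, hedged illustration of "can be used anywhere TCP is used": with a stack that supports the one-to-one socket style, only the protocol argument to socket() changes.

/* Sketch: a one-to-one style SCTP socket as a near drop-in for TCP. */
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>

int open_sctp_one_to_one(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
    /* bind(), listen(), accept(), connect(), send() and recv() are then
     * used just as they would be with a TCP socket. */
    return sd;
}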
Available SCTP stacks BSD / Mac OS X; LKSCTP – Linux kernel 2.4.23 and later; Solaris 10; HP OpenCall SS7; OpenSS7; other implementations listed on sctp.org for Windows, AIX, VxWorks, etc.
Relevant SCTP features Multistreaming One-to-many socket style Multihoming Message-based
Logical View of Multiple Streams in an Association Flow control per association (not stream)
Using SCTP for MPI TRC-to-stream map matches MPI semantics
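A hedged sketch of one possible TRC-to-stream mapping on the sending side, folding the MPI tag onto a stream of the association that corresponds to the destination rank; NUM_STREAMS, the modulo scheme, and send_on_trc are illustrative assumptions, not the LAM/SCTP implementation.

/* Sketch: map an MPI tag onto an SCTP stream number and send the message. */
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <stdint.h>
#include <sys/socket.h>

#define NUM_STREAMS 10        /* streams negotiated per association (assumed) */

static int send_on_trc(int sd, struct sockaddr *to, socklen_t tolen,
                       const void *buf, size_t len, int tag)
{
    uint16_t stream_no = (uint16_t)(tag % NUM_STREAMS);   /* tag -> stream */

    /* Messages with the same TRC share a stream, so MPI ordering is kept;
     * messages on different streams can be delivered independently,
     * avoiding head-of-line blocking. */
    return sctp_sendmsg(sd, buf, len, to, tolen,
                        0 /* ppid */, 0 /* flags */, stream_no,
                        0 /* time-to-live */, 0 /* context */);
}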
SCTP-MPI as a protocol Matching – required, since one cannot receive from a particular stream; sctp_recvmsg() behaves like ANY_RANK + ANY_TAG; avoids select() through the one-to-many socket. Queues – globally required for matching. Short/long – required; flow control is not per stream.
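A hedged sketch of the receive side on a one-to-many socket: sctp_recvmsg() hands back the next message from any association and any stream (the ANY_RANK + ANY_TAG behaviour above), and the stream number is read from the ancillary sctp_sndrcvinfo so the middleware can do its own matching. The buffer size is an assumption, and the sketch assumes the data-io event has been enabled via setsockopt(SCTP_EVENTS) so that sinfo is populated.

/* Sketch: receive from a one-to-many SCTP socket (SOCK_SEQPACKET) and
 * report which stream the message arrived on; matching against the
 * expected/unexpected queues happens in the middleware, not the kernel. */
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MSG_BUF_SIZE 65536            /* assumed maximum short-message size */

static ssize_t recv_any(int sd, char *buf, uint16_t *stream_out)
{
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);
    struct sctp_sndrcvinfo sinfo;
    int flags = 0;

    memset(&sinfo, 0, sizeof(sinfo));
    ssize_t n = sctp_recvmsg(sd, buf, MSG_BUF_SIZE,
                             (struct sockaddr *)&from, &fromlen,
                             &sinfo, &flags);
    if (n >= 0)
        *stream_out = sinfo.sinfo_stream;   /* stream (i.e. TRC class) it arrived on */
    return n;
}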
SCTP and elimination SCTP thins the middleware, but the component cannot be eliminated: we would need flow control per stream, the ability to receive from a particular stream, and the ability to query which streams have data ready.
Conclusions The TCP design achieves elimination but doesn't scale; SCTP scales but only thins the component. The SCTP one-to-many socket style requires additional features for elimination: flow control per stream, the ability to receive from a particular stream, and the ability to query which streams have data ready.
Thank you! More information about our work is at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/ Or Google “sctp mpi”
Upcoming annual SCTP Interop July 30 – Aug 4, 2006, to be held at UBC. Vendors and implementers test their stacks for performance and interoperability.
Extra slides
MPI Point-to-Point MPI_Send(msg,cnt,type,dst-rank,tag,context) MPI_Recv(msg,cnt,type,src-rank,tag,context) Message matching is done based on Tag, Rank and Context (TRC). Variants include combinations of blocking, non-blocking, synchronous, asynchronous, buffered, and unbuffered. Wildcards can be used for receives.
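A small self-contained example (in C, compiled with mpicc and run with at least two processes) of TRC matching with wildcards: the receiver matches any source rank and any tag within the communicator's context and recovers the actual values from the status object.

/* Example: blocking point-to-point with wildcard matching on the receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1 /* dst-rank */, 7 /* tag */, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Wildcards: match any source rank and any tag within this context. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("got %d from rank %d with tag %d\n",
               value, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}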
MPI Messages Using Same Context, Two Processes Out-of-order messages with the same tag violate MPI semantics.
Associations and Multihoming [Diagram: an SCTP association between multihomed Endpoint X (NIC 1, NIC 2) and Endpoint Y (NIC 3, NIC 4), spanning network 207.10.x.x (IPs 207.10.3.20 and 207.10.40.1) and network 168.1.x.x (IPs 168.1.10.30 and 168.1.140.10)]
SCTP Key Similarities Reliable in-order delivery, flow control, full duplex transfer. TCP-like congestion control. Selective ACK is built into the protocol.
SCTP Key Differences Message-oriented. Added security. Multihoming, use of associations. Multiple streams within an association.
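A hedged sketch of the multihoming difference, binding two local addresses to one SCTP endpoint with sctp_bindx(); the two addresses are borrowed from the earlier association diagram, and their assignment to a single endpoint is assumed for illustration.

/* Sketch: a multihomed SCTP endpoint bound to two local addresses. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

static int bind_multihomed(uint16_t port)
{
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);  /* one-to-many style */
    if (sd < 0)
        return -1;

    struct sockaddr_in addrs[2];
    memset(addrs, 0, sizeof(addrs));
    for (int i = 0; i < 2; i++) {
        addrs[i].sin_family = AF_INET;
        addrs[i].sin_port   = htons(port);
    }
    inet_pton(AF_INET, "207.10.3.20", &addrs[0].sin_addr);   /* NIC on 207.10.x.x */
    inet_pton(AF_INET, "168.1.10.30", &addrs[1].sin_addr);   /* NIC on 168.1.x.x */

    /* Both addresses now belong to the same endpoint, so an association can
     * switch paths if one of the two networks fails. */
    if (sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR) < 0)
        return -1;
    return sd;
}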
MPI over SCTP LAM and MPICH2 are two popular open source implementations of the MPI library. We redesigned LAM to use SCTP and take advantage of its additional features. Future plans include SCTP support within MPICH2.
How can SCTP help MPI? A redesign for SCTP thins the MPI middleware’s communication component. Use of the one-to-many socket style scales well. SCTP adds resilience to MPI programs: it avoids unnecessary head-of-line blocking with streams, increases fault tolerance in the presence of multihomed hosts, and provides built-in security features and improved congestion control. Full Results Presented @
Partially Ordered User Messages Sent on Different Streams The messages can be received in the same order as they were sent (the order TCP would require).
Partially Ordered User Messages Sent on Different Streams Delivery constraints: A must be before C and C must be before D
MPI Middleware [Diagram: the MPI middleware expanded as a set of components]