1
Towards MPI progression layer elimination with TCP and SCTP
Brad Penoff and Alan Wagner, Department of Computer Science, University of British Columbia, Vancouver, Canada. Distributed Systems Group. HIPS 2006, April 25.
2
Will my application run?
Portability is one aspect of parallel processing integration: the MPI API provides an interface for portable parallel applications, independent of the MPI implementation.
3
[Layer diagram: User Code → MPI API → any MPI Implementation → Resources]
4
Will my application perform well?
Portability is one aspect of parallel processing integration. The MPI API provides an interface for portable parallel applications, independent of the MPI implementation. MPI middleware provides the glue for a variety of underlying components required for a complex parallel runtime environment, independent of the component implementations.
5
[Layer diagram: User Code → MPI Middleware → any MPI Implementation → Resources]
6
MPI Middleware: glues together components
[Diagram: User Code → MPI Middleware (Job Scheduler Component, Process Manager Component, Message Progression / Communication Component) → Transport → Operating System → Network]
7
Message Progression Communication Component
Maintains the necessary state between MPI calls; its calls are not simple library functions. Manages the underlying communication, either through the OS (e.g., TCP) or via direct low-level interaction (e.g., InfiniBand).
[Diagram: User Code → MPI Middleware → Message Progression / Communication Component → Transport → OS → Network]
8
Communication Requirements
Common: portability by supporting all potential interconnects. In this work: portability by eliminating this component entirely, by assuming IP! Push MPI functionality down onto IP-based transports, and learn what MPI implementation design changes that requires.
9
Component Elimination
[Diagram: User Code → MPI Middleware/Library (Job Scheduler Component, Process Manager Component; Message Progression / Communication Component eliminated) → Operating System → Transport → Network]
10
Elimination Motivation
Common approach: exploit specific features of all potential interconnects; the middleware does transport-layer "things"; sequencing and flow control complicate the middleware; and because these are implemented differently, MPI implementations are incompatible. Our approach here: assume IP; leverage mainstream commodity networking advances; simplify the middleware; and (perhaps) increase MPI implementation interoperability.
11
Elimination Approach
View MPI as a protocol, from a networking point of view.
MPI side: message matching; expected/unexpected queues; short/long protocol.
Networking side: demultiplexing; storage/buffering; flow control.
Design MPI with elimination as a goal. (See the matching sketch below.)
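To make the protocol view concrete, here is a minimal C sketch of TRC matching with wildcard handling; the struct and names are illustrative, not from any particular MPI implementation. An arriving message is checked against the expected (posted-receive) queue first and buffered on the unexpected queue if nothing matches.

```c
/* Illustrative sketch: matching an incoming message against a posted
 * receive on (tag, rank, context), with MPI-style wildcards. */
#define ANY_TAG  (-1)
#define ANY_RANK (-1)

struct envelope {
    int tag, rank, context;   /* the TRC triple */
};

/* Does a posted receive (possibly wildcarded) match an incoming message? */
static int trc_match(const struct envelope *recv, const struct envelope *msg)
{
    return recv->context == msg->context &&
           (recv->tag  == ANY_TAG  || recv->tag  == msg->tag) &&
           (recv->rank == ANY_RANK || recv->rank == msg->rank);
}
```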
12
MPI Implementation Designs
Two designs: TCP-based and SCTP-based.
13
TCP: Socket per TRC
General scheme: one socket per MPI message stream, i.e., per tag-rank-context (TRC) triple. A control port accepts connections: MPI_Send calls connect() (an MPI_Recv could use wildcards), and the resulting socket is stored in a table attached to the communicator object. (See the sketch below.)
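A minimal sketch of that scheme, assuming a hypothetical helper name and IPv4 setup (this is not the authors' code): the first send on a TRC opens a dedicated TCP connection to the destination's control port, and the caller records the socket in the TRC-indexed table on the communicator.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open the dedicated TCP connection for one (tag, rank, context) triple. */
int open_trc_socket(const char *dest_ip, uint16_t ctrl_port)
{
    struct sockaddr_in dst;
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(ctrl_port);
    inet_pton(AF_INET, dest_ip, &dst.sin_addr);

    if (connect(sd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        close(sd);
        return -1;
    }
    return sd;   /* caller stores sd in the communicator's TRC table */
}
```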
14
TCP-MPI as a Protocol
Matching: select() over fd sets handles wildcards. Queues: the unexpected queue is the socket buffer itself, with TCP flow control; the expected queue is more local, attached to receive handles. Short/long: no distinction; rely on TCP flow control.
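A sketch of the wildcard path, with illustrative names: a wildcarded receive cannot know which per-TRC socket will produce the message, so the middleware must select() over all candidates. The linear fd-set setup and scan is one reason this stops scaling as the socket count grows.

```c
#include <stddef.h>
#include <sys/select.h>

/* Block until any of the candidate per-TRC sockets has data; return it. */
int wait_any_trc(const int *socks, int nsocks)
{
    fd_set rset;
    int i, maxfd = -1;

    FD_ZERO(&rset);
    for (i = 0; i < nsocks; i++) {
        FD_SET(socks[i], &rset);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }
    if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
        return -1;

    for (i = 0; i < nsocks; i++)
        if (FD_ISSET(socks[i], &rset))
            return socks[i];   /* first ready candidate */
    return -1;
}
```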
15
TCP per TRC critique: the design achieves elimination, but…
The number of sockets runs into OS per-user limits. System calls are expensive (context switches, copying). select() doesn't scale. Flow control. And a mismatch: the transport/OS is event-driven, while the MPI application is control-driven.
16
SCTP-based design
17
What is SCTP? Stream Control Transmission Protocol
A general-purpose unicast transport protocol for data communication over IP networks. Recently standardized by the IETF (RFC 2960). Can be used anywhere TCP is used.
18
Available SCTP stacks
BSD / Mac OS X; LKSCTP (in the Linux kernel, 2.6 and later); Solaris 10; HP OpenCall SS7; OpenSS7. Other implementations for Windows, AIX, VxWorks, etc. are listed on sctp.org.
19
Relevant SCTP features
Multistreaming. One-to-many socket style. Multihoming. Message-based. (See the socket sketch below.)
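For reference, a minimal sketch of the one-to-many socket style (standard SCTP sockets API; the helper name is illustrative): one SOCK_SEQPACKET socket carries many associations, where TCP would need one socket per connection.

```c
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Create a one-to-many SCTP socket; listen() enables inbound associations. */
int open_one_to_many(uint16_t port)
{
    struct sockaddr_in addr;
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
    if (sd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sd, 1) < 0)
        return -1;
    return sd;   /* sctp_sendmsg()/sctp_recvmsg() multiplex associations */
}
```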
20
Logical View of Multiple Streams in an Association
Flow control per association (not stream)
21
Using SCTP for MPI: a TRC-to-stream map matches MPI semantics, since SCTP orders messages only within a stream, just as MPI orders only messages with the same TRC. (See the mapping sketch below.)
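One possible shape of that map, as a sketch: the hash and helper names here are hypothetical (the actual mapping in the authors' implementation may differ), but sctp_sendmsg() and its stream argument are the standard API. Messages sharing a TRC share a stream, so SCTP orders exactly the messages MPI requires to stay ordered.

```c
#include <netinet/sctp.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

#define NUM_STREAMS 16   /* negotiated at association setup */

/* Hypothetical mapping: same TRC -> same stream. */
static uint16_t trc_to_stream(int tag, int rank, int context)
{
    return (uint16_t)(((unsigned)tag ^ (unsigned)rank ^ (unsigned)context)
                      % NUM_STREAMS);
}

/* Send one MPI message on its TRC's stream over a one-to-many socket. */
ssize_t mpi_sctp_send(int sd, const void *buf, size_t len,
                      struct sockaddr *to, socklen_t tolen,
                      int tag, int rank, int context)
{
    return sctp_sendmsg(sd, buf, len, to, tolen,
                        (uint32_t)context,  /* one possible use of the ppid */
                        0,                  /* flags */
                        trc_to_stream(tag, rank, context),
                        0, 0);              /* time-to-live, context cookie */
}
```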
22
SCTP-MPI as a protocol
Matching: still required, since one cannot receive from a particular stream; sctp_recvmsg() behaves as ANY_RANK + ANY_TAG. Avoids select() through the one-to-many socket. Queues: globally required for matching. Short/long: required, since flow control is not per stream. (See the receive sketch below.)
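A receive-side sketch under the same assumptions (match_or_enqueue is a hypothetical hook standing in for the queue logic): sctp_recvmsg() hands back whatever arrived next on any stream, and the sctp_sndrcvinfo tells the middleware which stream, and hence which messages, it must now match.

```c
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical hook: match against the expected queue or park the
 * message on the unexpected queue. */
void match_or_enqueue(const void *msg, size_t len,
                      uint16_t stream, uint32_t ppid);

void recv_and_match(int sd)
{
    char buf[65536];
    struct sctp_sndrcvinfo sinfo;
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);
    int flags = 0;

    /* Cannot target one stream: this receives from any sender, any stream. */
    ssize_t n = sctp_recvmsg(sd, buf, sizeof(buf),
                             (struct sockaddr *)&from, &fromlen,
                             &sinfo, &flags);
    if (n <= 0)
        return;

    match_or_enqueue(buf, (size_t)n,
                     sinfo.sinfo_stream,   /* stream the sender chose */
                     sinfo.sinfo_ppid);    /* ppid, per the send sketch */
}
```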
23
SCTP and elimination
SCTP thins the middleware, but the component cannot be eliminated. Missing pieces: flow control per stream; the ability to receive from a particular stream; the ability to query which streams have data ready.
24
Conclusions
The TCP design achieves elimination but doesn't scale. SCTP scales but only thins the component. SCTP's one-to-many socket style would need additional features before elimination becomes possible: flow control per stream, the ability to receive from a particular stream, and the ability to query which streams have data ready.
25
Thank you! More information about our work is available online, or Google "sctp mpi".
26
Upcoming annual SCTP Interop
July 30 – August 4, 2006, to be held at UBC. Vendors and implementers test their stacks for performance and interoperability.
27
Extra slides
28
MPI Point-to-Point
MPI_Send(msg, cnt, type, dst-rank, tag, context)
MPI_Recv(msg, cnt, type, src-rank, tag, context)
Message matching is based on tag, rank, and context (TRC). Calls come in combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, and unbuffered. Receives may use wildcards; a small example follows.
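A small runnable example of the standard API, showing receive-side wildcards (the value and tag are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Wildcards: match any sender and any tag within this context. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("got %d from rank %d, tag %d\n",
               value, status.MPI_SOURCE, status.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}
```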
29
MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tags violate MPI semantics.
31
Associations and Multihoming
[Diagram: Endpoints X and Y form a single association; each endpoint is multihomed with two NICs, connected across two networks: 207.10.x.x (endpoint addresses 207.10.3.20 and 207.10.40.1) and 168.1.x.x (endpoint addresses 168.1.10.30 and 168.1.140.10).]
32
SCTP Key Similarities
Reliable in-order delivery, flow control, and full-duplex transfer. TCP-like congestion control. Selective ACK is built into the protocol.
33
SCTP Key Differences
Message-oriented. Added security. Multihoming and the use of associations. Multiple streams within an association.
34
MPI over SCTP
LAM and MPICH2 are two popular open-source implementations of the MPI library. We redesigned LAM to use SCTP and take advantage of its additional features. Future plans include SCTP support within MPICH2.
35
How can SCTP help MPI?
A redesign for SCTP thins the MPI middleware's communication component, and the one-to-many socket style scales well. SCTP also adds resilience to MPI programs: it avoids unnecessary head-of-line blocking through streams, increases fault tolerance in the presence of multihomed hosts, has built-in security features, and improves congestion control.
36
Partially Ordered User Messages Sent on Different Streams
[Animation: messages (A, B, C, D) sent on different streams can be delivered in any order consistent with per-stream ordering. One valid delivery order is the same order in which they were sent, the only order TCP allows.]
Delivery constraints: A must be delivered before C, and C must be delivered before D. (See the stream sketch below.)
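A sketch of how those constraints could map onto streams, assuming A, C, and D share stream 0 while B uses stream 1 (the slides do not show the exact assignment; sctp_sendmsg() is the standard call). SCTP then orders A, C, and D among themselves while B may be delivered at any point, which is exactly the partial order above.

```c
#include <netinet/sctp.h>
#include <sys/socket.h>

/* A, C, D on stream 0 are delivered in order; B on stream 1 is
 * unordered relative to them. */
void send_partial_order(int sd, struct sockaddr *to, socklen_t tolen)
{
    sctp_sendmsg(sd, "A", 1, to, tolen, 0, 0, /*stream=*/0, 0, 0);
    sctp_sendmsg(sd, "B", 1, to, tolen, 0, 0, /*stream=*/1, 0, 0);
    sctp_sendmsg(sd, "C", 1, to, tolen, 0, 0, /*stream=*/0, 0, 0);
    sctp_sendmsg(sd, "D", 1, to, tolen, 0, 0, /*stream=*/0, 0, 0);
}
```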