Towards MPI progression layer elimination with TCP and SCTP

Presentation transcript:

Towards MPI progression layer elimination with TCP and SCTP
Brad Penoff and Alan Wagner, Distributed Systems Group, Department of Computer Science, University of British Columbia, Vancouver, Canada
HIPS 2006, April 25

Will my application run?
Portability is one aspect of parallel processing integration. The MPI API provides an interface for portable parallel applications, independent of the MPI implementation.

Diagram labels: User Code, MPI API, any MPI Implementation, Resources.

Will my application perform well?
Portability is one aspect of parallel processing integration. The MPI API provides an interface for portable parallel applications, independent of the MPI implementation. The MPI middleware provides the glue for a variety of underlying components required for a complex parallel runtime environment, independent of the component implementation.

Diagram labels: User Code, MPI Middleware, any MPI Implementation, Resources.

MPI Middleware
Glues together components.
Diagram labels: User Code, MPI Middleware, Job Scheduler Component, Process Manager Component, Message Progression Communication Component, Operating System, Transport, Network.

Message Progression Communication Component
Maintains the necessary state between MPI calls, so MPI calls are not simple library functions.
Manages the underlying communication, either through the OS (e.g. TCP) or by direct low-level interaction (e.g. InfiniBand).

Communication Requirements
Common: portability by having support for all potential interconnects.
In this work: portability by eliminating this component by assuming IP! Push MPI functionality down onto IP-based transports, and learn about the necessary MPI implementation design changes.

Component Elimination
Diagram labels: User Code, MPI Middleware/Library, Job Scheduler Component, Process Manager Component, Message Progression Communication Component, Operating System, Transport, Network.

Elimination Motivation
Common approach: exploit specific features for all potential interconnects. The middleware does transport-layer “things”; sequencing and flow control complicate the middleware; and because these are implemented differently, MPI implementations are incompatible.
Our approach here: assume IP. Leverage mainstream commodity networking advances, simplify the middleware, and increase MPI implementation interoperability (perhaps?).

Elimination Approach
View MPI as a protocol, from a networking point of view. Each MPI mechanism has a networking counterpart: message matching corresponds to demultiplexing, the expected/unexpected queues to storage/buffering, and the short/long protocol to flow control.
Design MPI with elimination as a goal.
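The MPI side of this correspondence can be sketched in a few lines of Python. This is an illustrative model only, not code from LAM or any MPI library: the names `Envelope`, `Matcher`, and the `ANY` wildcard are hypothetical. It shows message matching on tag, rank, and context, with an unexpected queue for early messages and an expected queue for early receives.

```python
from dataclasses import dataclass

ANY = -1  # hypothetical wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


@dataclass
class Envelope:
    tag: int
    rank: int
    context: int
    payload: bytes = b""


def matches(posted, incoming):
    """Context must be equal; tag and rank must be equal or wildcarded."""
    return (posted.context == incoming.context
            and posted.tag in (ANY, incoming.tag)
            and posted.rank in (ANY, incoming.rank))


class Matcher:
    def __init__(self):
        self.expected = []    # receives posted before their message arrived
        self.unexpected = []  # messages that arrived before a receive was posted

    def on_arrival(self, msg):
        """Demultiplex an incoming message; return the matched receive or None."""
        for i, posted in enumerate(self.expected):
            if matches(posted, msg):
                del self.expected[i]
                posted.payload = msg.payload
                return posted
        self.unexpected.append(msg)  # buffer until a matching receive is posted
        return None

    def post_recv(self, recv):
        """Post a receive; return the oldest buffered matching message or None."""
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return msg
        self.expected.append(recv)
        return None
```

Both queues are scanned oldest-first, so two messages with the same tag, rank, and context are matched in arrival order.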

MPI Implementation Designs: TCP and SCTP

TCP Socket Per TRC
General scheme: a control port, plus one socket per MPI message stream (tag-rank-context (TRC)).
Control port: MPI_Send calls connect (MPI_Recv could wildcard). The resulting socket is stored in a table attached to the communicator object.
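A minimal sketch of the per-communicator socket table, assuming a lazily populated dictionary keyed by TRC. The `Communicator` class and the injected `connect` callable are hypothetical; in the design described above, the connect step would go through the peer's control port.

```python
class Communicator:
    """Hypothetical stand-in for an MPI communicator object."""

    def __init__(self, context, connect):
        self.context = context
        self._connect = connect  # assumed callable: (rank, tag, context) -> socket
        self.sockets = {}        # table: (tag, rank, context) -> socket

    def socket_for(self, tag, rank):
        """Return the socket for this TRC, connecting on first use."""
        trc = (tag, rank, self.context)
        if trc not in self.sockets:
            self.sockets[trc] = self._connect(rank, tag, self.context)
        return self.sockets[trc]
```

Each distinct TRC costs one connection, which is why the number of open sockets grows with the number of distinct tags an application uses.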

TCP-MPI as a Protocol
Matching: select() fd sets for wildcards.
Queues: unexpected = socket buffer with flow control; expected = more local, attached to handles.
Short/long: no distinction, rely on TCP flow control.
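The wildcard case can be illustrated with Python's standard `select` module. This is a sketch under stated assumptions, not the authors' implementation: the function name `recv_wildcard` and the table layout (TRC tuple mapped to a connected socket) are hypothetical. A wildcard receive builds the set of candidate sockets for one context and waits until any of them is readable.

```python
import select
import socket


def recv_wildcard(table, context, timeout=1.0):
    """Wait on every socket of one context.

    table maps (tag, rank, context) -> connected socket.
    Returns (trc, data) for the first readable socket, or None on timeout.
    """
    candidates = {sock: trc for trc, sock in table.items() if trc[2] == context}
    if not candidates:
        return None
    readable, _, _ = select.select(list(candidates), [], [], timeout)
    if not readable:
        return None
    sock = readable[0]
    return candidates[sock], sock.recv(4096)
```

The unexpected queue is implicit here: data that nobody has asked for yet simply sits in the kernel socket buffer, where TCP flow control bounds it.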

TCP per TRC critique
The design achieves elimination, but: number of sockets (OS user limits); expense of system calls (context switch, copying); select() doesn’t scale; flow control; mismatch: the transport/OS is event-driven, whereas the MPI application is control-driven.

SCTP-based design

What is SCTP?
Stream Control Transmission Protocol: a general-purpose unicast transport protocol for IP network data communications. Recently standardized by the IETF. Can be used anywhere TCP is used.

Available SCTP stacks
BSD / Mac OS X; LKSCTP (Linux kernel 2.4.23 and later); Solaris 10; HP OpenCall SS7; OpenSS7. Other implementations are listed on sctp.org for Windows, AIX, VxWorks, etc.

Relevant SCTP features: multistreaming, one-to-many socket style, multihoming, message-based.

Logical View of Multiple Streams in an Association
Flow control is per association (not per stream).

Using SCTP for MPI TRC-to-stream map matches MPI semantics
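One way such a map could look is sketched below. The specific choices are assumptions made for illustration, not details taken from the slides: the association is keyed by context and peer rank, the stream number is the tag modulo an assumed stream count, and the names `NUM_STREAMS` and `trc_to_stream` are hypothetical.

```python
NUM_STREAMS = 10  # assumed number of streams available in each association


def trc_to_stream(tag, rank, context, num_streams=NUM_STREAMS):
    """Map a TRC to (association key, stream number).

    Assumption: one association per (context, peer rank), with the tag
    folded onto a fixed pool of streams.
    """
    association = (context, rank)
    stream = tag % num_streams
    return association, stream
```

Messages with the same TRC always land on the same stream, so SCTP's in-order delivery within a stream preserves MPI's ordering for them. Messages with different tags usually land on different streams and can overtake each other. When two tags collide on one stream, the result is only extra ordering, never a violation.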

SCTP-MPI as a protocol
Matching: required, since one cannot receive from a particular stream; sctp_recvmsg() = ANY_RANK + ANY_TAG. Avoids select() through the one-to-many socket.
Queues: globally required for matching.
Short/long: required; flow control is not per stream.

SCTP and elimination
SCTP thins the middleware, but the component cannot be eliminated. Elimination would need flow control per stream, the ability to receive from a particular stream, and the ability to query which streams have data ready.

Conclusions
The TCP design eliminates the component but doesn’t scale. The SCTP design scales but only thins the component. The SCTP one-to-many socket style requires additional features for elimination: flow control per stream, the ability to receive from a particular stream, and the ability to query which streams have data ready.

Thank you! More information about our work is at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/ (or Google “sctp mpi”).

Upcoming annual SCTP Interop
July 30 – Aug 4, 2006, to be held at UBC. Vendors and implementers test their stacks for performance and interoperability.

Extra slides

MPI Point-to-Point
MPI_Send(msg,cnt,type,dst-rank,tag,context)
MPI_Recv(msg,cnt,type,src-rank,tag,context)
Message matching is done based on Tag, Rank and Context (TRC). Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, and unbuffered are possible. Wildcards can be used for receive.

MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tags violate MPI semantics.

Associations and Multihoming
Figure: Endpoint X and Endpoint Y are joined by a single association that spans NIC 1 to NIC 4 over two networks. Network 207.10.x.x carries IP = 207.10.3.20 and IP = 207.10.40.1; network 168.1.x.x carries IP = 168.1.10.30 and IP = 168.1.140.10.

SCTP Key Similarities
Reliable in-order delivery, flow control, full-duplex transfer. TCP-like congestion control. Selective ACK is built into the protocol.

SCTP Key Differences
Message-oriented; added security; multihoming and use of associations; multiple streams within an association.

MPI over SCTP
LAM and MPICH2 are two popular open source implementations of the MPI library. We redesigned LAM to use SCTP and take advantage of its additional features. Future plans include SCTP support within MPICH2.

How can SCTP help MPI?
A redesign for SCTP thins the MPI middleware’s communication component. Use of the one-to-many socket style scales well. SCTP adds resilience to MPI programs: it avoids unnecessary head-of-line blocking with streams, increases fault tolerance in the presence of multihomed hosts, and offers built-in security features and improved congestion control.
Full Results Presented @

Partially Ordered User Messages Sent on Different Streams

Partially Ordered User Messages Sent on Different Streams Can be received in the same order as it was sent (required in TCP).

Partially Ordered User Messages Sent on Different Streams Delivery constraints: A must be before C and C must be before D
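The partial ordering can be modelled with a small per-stream reordering buffer. This is a simulation for illustration only; a real SCTP stack does this inside the transport. Because the figure did not survive in this transcript, the stream assignment is an assumption: A, C, and D share one stream and B travels on another. The class name `StreamReceiver` is hypothetical.

```python
from collections import defaultdict


class StreamReceiver:
    """Delivers messages in order within a stream, independently across streams."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # next sequence number expected per stream
        self.held = defaultdict(dict)     # stream -> {seq: message} waiting on a gap
        self.delivered = []               # messages handed to the application

    def on_chunk(self, stream, seq, message):
        """Accept one arriving message and deliver whatever is now in order."""
        self.held[stream][seq] = message
        while self.next_seq[stream] in self.held[stream]:
            self.delivered.append(self.held[stream].pop(self.next_seq[stream]))
            self.next_seq[stream] += 1


rx = StreamReceiver()
# Assumed arrival order on the wire: C, B, D, A.
rx.on_chunk(0, 1, "C")  # held: A has not arrived on stream 0 yet
rx.on_chunk(1, 0, "B")  # delivered at once: stream 1 is not blocked by stream 0
rx.on_chunk(0, 2, "D")  # held behind C
rx.on_chunk(0, 0, "A")  # fills the gap and releases A, C, D in order
print(rx.delivered)
```

B overtakes the stalled stream, which is the head-of-line blocking that a single TCP connection cannot avoid, while A, C, and D still reach the application in their sending order.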

MPI Middleware { } ← Components
