MPI Requirements of the Network Layer
Presented to the OpenFabrics libfabric Working Group, January 21, 2014
Community feedback assembled by Jeff Squyres, Cisco Systems

Presentation transcript:

Slide 1: MPI Requirements of the Network Layer. Presented to the OpenFabrics libfabric Working Group, January 21, 2014. Community feedback assembled by Jeff Squyres, Cisco Systems.

Slide 2: Many thanks to the contributors (in no particular order)
- ETH Zurich: Torsten Hoefler
- Sandia National Labs: Ron Brightwell, Brian Barrett, Ryan Grant
- IBM: Chulho Kim, Carl Obert, Michael Blocksome, Perry Schmidt
- Cisco Systems: Jeff Squyres, Dave Goodell, Reese Faucette, Cesare Cantu, Upinder Malhi
- Oak Ridge National Labs: Scott Atchley, Pavel Shamis
- Argonne National Labs: Jeff Hammond

Slide 3: Many thanks to the contributors (in no particular order)
- Intel: Sayantan Sur, Charles Archer
- Cray: Krishna Kandalla
- Mellanox: Devendar Bureddy
- SGI: Michael Raymond
- AMD: Brad Benton
- Microsoft: Fab Tillier
- U. Edinburgh / EPCC: Dan Holmes
- U. Alabama Birmingham: Tony Skjellum, Amin Hassani, Shane Farmer

Slide 4: Quick MPI overview
- High-level abstraction API (see the sketch below)
- No concept of a connection
- All communication:
  - is reliable
  - has some ordering rules
  - is comprised of typed messages
- Peer address is a (communicator, integer) tuple, i.e., virtualized
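A minimal sketch of this addressing model in plain, standard MPI (nothing assumed beyond the spec): the peer is named purely by a (communicator, rank) tuple, the message is typed, and no connection is ever set up:

```c
/* Minimal sketch: MPI addresses a peer as a (communicator, rank) tuple.
 * There is no connect() step; the message is typed (MPI_INT) and tagged. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Peer address: (MPI_COMM_WORLD, 1) -- virtualized, no endpoint handle */
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```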

Slide 5: Quick MPI overview
- Communication modes (non-blocking mode sketched below):
  - blocking and non-blocking (polled completion)
  - point-to-point: two-sided and one-sided
  - collective operations: broadcast, scatter, reduce, etc.
  - ...and others, but those are the big ones
- Asynchronous progress is required / strongly desired
- Message buffers are provided by the application; they are not "special" (e.g., registered)
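To make "non-blocking (polled completion)" concrete, a small sketch using standard MPI calls: initiation (MPI_Irecv) and completion (MPI_Test) are separate operations, which is what lets the application overlap communication with computation:

```c
#include <mpi.h>

/* Sketch: non-blocking receive with polled completion. Initiation and
 * completion are separate calls, so work can proceed between polls. */
void poll_for_message(int *buf)
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(buf, 1, MPI_INT, MPI_ANY_SOURCE, /*tag=*/0,
              MPI_COMM_WORLD, &req);
    while (!done) {
        /* ... application work here (communication/computation overlap) ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* polled completion */
    }
}
```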

Slide 6: Quick MPI overview
- The MPI specification:
  - governed by the MPI Forum standards body
  - currently at MPI-3.0
- MPI implementations:
  - software + hardware implementations of the spec
  - some are open source, some are closed source
  - generally don't care about interoperability (e.g., wire protocols)

Slide 7: MPI is a large community
- Community feedback represents the union of:
  - different viewpoints
  - different MPI implementations
  - different hardware perspectives
- ...and not all agree with each other. For example...

Slide 8: Different MPI camps
- Those who want high-level interfaces:
  - do not want to see memory registration
  - want tag matching, e.g., PSM (see the sketch below)
  - trust the network layer to do everything well under the covers
- Those who want low-level interfaces:
  - want a good memory registration infrastructure
  - want direct access to hardware capabilities
  - want to fully implement the MPI interfaces themselves
  - or, the MPI implementers are the kernel / firmware / hardware developers
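To illustrate what the tag-matching request means: if the network layer does not match tags (as PSM does), the MPI library must do it in software on every incoming message. A hypothetical sketch of that matching loop follows; all names here are illustrative, not taken from any real implementation:

```c
#include <stddef.h>

/* Hypothetical posted-receive queue entry; -1 acts as a wildcard
 * (standing in for MPI_ANY_SOURCE / MPI_ANY_TAG). */
struct posted_recv {
    int source;
    int tag;
    void *buf;
    size_t len;
    struct posted_recv *next;
};

/* Match an incoming (source, tag) header against the posted-receive queue;
 * unlink and return the first hit, or NULL (an unexpected message). */
struct posted_recv *pr_match(struct posted_recv **queue, int source, int tag)
{
    for (struct posted_recv **p = queue; *p != NULL; p = &(*p)->next) {
        if (((*p)->source == source || (*p)->source == -1) &&
            ((*p)->tag == tag || (*p)->tag == -1)) {
            struct posted_recv *hit = *p;
            *p = hit->next;    /* each posted receive matches exactly once */
            return hit;
        }
    }
    return NULL;
}
```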

Slide 9: Be careful what you ask for...
- ...because you just got it
- Members of the MPI Forum would like to be involved in the libfabric design on an ongoing basis
- Can we get an MPI libfabric listserv?

Slide 10: Basic things MPI needs
- Messages (not streams)
- An efficient API:
  - allow for low latency / high bandwidth
  - low number of instructions in the critical path
  - enable "zero copy"
- Separation of local action initiation and completion (sketched below)
- One-sided (including atomics) and two-sided semantics
- No requirement for communication buffer alignment
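For "separation of local action initiation and completion": this is how verbs expresses it today. A hedged sketch, assuming the QP and CQ were created elsewhere, with most error handling omitted:

```c
#include <infiniband/verbs.h>

/* Sketch: post a send (initiation returns immediately), then reap the
 * completion later by polling the CQ. */
int send_and_complete(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_sge *sge)
{
    struct ibv_send_wr wr = {0}, *bad_wr;
    struct ibv_wc wc;
    int n;

    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (ibv_post_send(qp, &wr, &bad_wr))   /* initiation */
        return -1;

    do {
        n = ibv_poll_cq(cq, 1, &wc);       /* completion, polled later */
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```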

Slide 11: Basic things MPI needs
- Asynchronous progress independent of API calls
  - preferably via dedicated hardware
- Scalable communications with millions of peers (see the arithmetic sketch below)
  - think of MPI as a fully-connected model (even though it usually isn't implemented that way)
  - today, MPI runs with 3 million processes in a job
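Why "fully-connected" and "millions of peers" collide: a back-of-envelope sketch with assumed, purely illustrative numbers (real per-connection state varies by provider):

```c
#include <stdio.h>

/* If every peer pair held a reliable connection, and each connection cost
 * on the order of 1 KiB of QP/driver state (an assumption, not a measured
 * figure), per-process state alone would be enormous. */
int main(void)
{
    const double peers = 3e6;            /* 3 million MPI processes */
    const double bytes_per_conn = 1024;  /* assumed per-connection state */

    printf("per-process connection state: ~%.1f GiB\n",
           peers * bytes_per_conn / (1024.0 * 1024 * 1024));
    /* ~2.9 GiB per process: hence the interest in connectionless
     * (UD-style) or shared-state transports at this scale. */
    return 0;
}
```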

Slide 12: Things MPI likes in verbs
- (all the basic needs from the previous slide)
- Different modes of communication:
  - reliable vs. unreliable
  - scalable connectionless communications (i.e., UD)
- Specify peer read/write address (i.e., RDMA)
- RDMA write with immediate (*) (sketched below)
  - (*) ...but we want more (more on this later)
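A sketch of the "RDMA write with immediate" operation using the standard verbs work-request fields; the remote address and rkey are assumed to have been exchanged with the peer beforehand:

```c
#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Post an RDMA write with immediate data. The immediate is the one part
 * of a write that surfaces as a completion at the target (consuming a
 * posted receive there). Error handling omitted. */
void post_rdma_write_imm(struct ibv_qp *qp, struct ibv_sge *sge,
                         uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr = {0}, *bad_wr;

    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.imm_data            = htonl(0x1234);   /* 32 bits visible at target */
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_post_send(qp, &wr, &bad_wr);
}
```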

Slide 13: Things MPI likes in verbs
- Ability to re-use (short/inline) buffers immediately
- Polling and OS-native/fd-based blocking QP modes (blocking mode sketched below)
- Discover devices, ports, and their capabilities (*)
  - (*) ...but let's not tie this to a specific hardware model
- Scatter/gather lists for sends
- Atomic operations (*)
  - (*) ...but we want more (more on this later)
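The "OS-native/fd-based blocking" mode uses the standard verbs completion-channel calls: arm the CQ for notification, then sleep in the kernel instead of spinning (the channel's fd can also be handed to poll()/epoll()). A sketch assuming the channel and CQ already exist:

```c
#include <infiniband/verbs.h>

/* Block until the CQ has a completion, instead of busy-polling. */
int wait_for_completion(struct ibv_comp_channel *channel, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    if (ibv_req_notify_cq(cq, 0))          /* arm: next completion wakes us */
        return -1;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))   /* sleeps on the fd */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    /* ... now drain ev_cq with ibv_poll_cq(), then re-arm ... */
    return 0;
}
```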

Slide 14: Things MPI likes in verbs
- Can have multiple consumers in a single process
  - API handles are independent of each other

Slide 15: Things MPI likes in verbs
- Verbs does not:
  - require collective initialization across multiple processes
  - require peers to have the same process image
  - restrict completion order vs. delivery order
  - restrict the source/target address region (stack, data, heap)
  - require a specific wire protocol (*)
    - (*) ...but it does impose limitations, e.g., the 40-byte GRH UD header

Slide 16: Things MPI likes in verbs
- Ability to connect to "unrelated" peers
- Cannot access peer (memory) without permission
- Cleans up everything upon process termination
  - e.g., kernel and hardware resources are released

Slide 17: Other things MPI wants (described as verbs improvements)
- MTU is an int (not an enum)
- Specify timeouts on connection requests
  - ...or have a CM that completes connections asynchronously
- All operations need to be non-blocking, including:
  - address handle creation
  - communication setup / teardown
  - memory registration / deregistration (today's blocking call sketched below)
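For context on the registration bullet: ibv_reg_mr() is the real, and blocking, verbs call today; it pins the pages and programs the NIC before returning, which is exactly what the slide wants an asynchronous variant of. A minimal sketch:

```c
#include <infiniband/verbs.h>

/* Today's blocking registration path. For large buffers this can take
 * milliseconds, all spent synchronously inside this call. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);   /* blocks until pinned */
}
```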

Slide 18: Other things MPI wants (described as verbs improvements)
- Specify buffer/length as function parameters (see the sketch below)
  - specifying them via a struct requires extra memory accesses
  - ...more on this later
- Ability to query how many credits are currently available in a QP
  - to support actions that consume more than one credit
- Remove the concept of a "queue pair"; have standalone send channels and receive channels
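The contrast behind the buffer/length bullet, as a sketch: verbs describes a send through structs that must be written to and read from memory on the critical path, whereas the wished-for style passes buffer and length as plain arguments. The x_send prototype below is hypothetical, purely to illustrate the shape being asked for:

```c
#include <stdint.h>
#include <sys/types.h>
#include <infiniband/verbs.h>

/* Today (verbs): the work request is assembled in memory via structs. */
static int post_via_structs(struct ibv_qp *qp, void *buf, size_t len,
                            uint32_t lkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr;

    wr.opcode  = IBV_WR_SEND;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Wished for (hypothetical prototype, not a real API): buffer and length
 * passed directly, so they can travel in registers. */
ssize_t x_send(void *endpoint, const void *buf, size_t len, void *context);
```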

Slide 19: Other things MPI wants (described as verbs improvements)
- Completion at the target for an RDMA write
- Ability to query whether loopback communication is supported
- Clearly delineate what functionality must be supported vs. what is optional
  - example: MPI provides (almost) the same functionality everywhere, regardless of hardware/platform, while verbs functionality differs wildly from provider to provider

Stop for today
- Still consolidating more MPI community feedback