OpenFabrics 2.0
Sean Hefty, Intel Corporation

Presentation transcript:

OpenFabrics 2.0 Sean Hefty Intel Corporation

Claims
Verbs is a poor semantic match for industry standard APIs (MPI, PGAS, ...)
  – Want to minimize software overhead
ULPs continue to desire additional functionality
  – Difficult to integrate into existing infrastructure
OFA is seeing fragmentation
  – Existing interfaces are constraining features
  – Vendor specific interfaces

Proposal
Evolve the verbs framework into a more generic open fabrics framework
  – Fold in RDMA CM interfaces
  – Merge kernel interfaces under one umbrella
Give users a fully stand-alone library
  – Design to be redistributable
Design in extensibility
  – Based on verbs extension work
  – Allow for vendor-specific extensions
Export low-level fabric services
  – Focus on abstracted hardware functionality

Analysis
A “Brief” Look at API Requirements
  – Datagram – streaming
  – Connected – unconnected
  – Client-server – point to point
  – Multicast
  – Tag matching
  – Active messages
  – Reliable datagram
  – Strided transfers
  – One-sided reads/writes
  – Send-receive transfers
  – Triggered transfers
  – Atomic operations
  – Collective operations
  – Synchronous – asynchronous transfers
  – QoS
  – Ordering – flow control
But, wait, there’s more!

Observations
A single API cannot meet all requirements and still be usable
Any particular app is likely to need only a small subset of such a large API
Extensions will still be required
  – There is no correct API!
We need more than an updated API – we need an updated infrastructure

Proposed OpenFabrics Framework
[Diagram labels: Fabric Framework, Fabric Interfaces, OFA Provider, IB Verbs, Verbs, Verbs Provider]
Transition from providing verbs API to providing fabric interfaces

Architecture
[Diagram labels: FI Framework, Fabric Interfaces, OFA Provider, Vendor Provider, Dynamic Provider]
Usable as a stand-alone library
Can support external providers
Provides core functionality needed by providers
Exports control interface used to discover supported fabric interfaces
Defines fabric interfaces

Fabric Interfaces
Fabric Interfaces (examples only): Message Queue, Control Interface, RDMA, Atomics, Active Messaging, Tag Matching, Collective Operations, CM Services
Fabric Provider Implementation: Message Queue, CM Services, RDMA, Collective Operations, Control Interface
Framework defines multiple interfaces
Vendors provide optimized implementations

Fabric Interfaces
Defines philosophy for interfaces and extensions
Exports a minimal API
  – Control interface
Providers built into library
  – Support external providers
Design to be redistributable
  – Define guidelines for vendor distribution
  – Allow for application optimized build
Includes initial objects and interface definitions

Philosophy: Agile Interface
Extensibility
  – Easy to add functionality to existing or new APIs
  – Ability to extend structures
Expose primitive network and fabric services
  – Strike balance between exposing the bare metal, versus trying to be the high level API
  – Enable provider innovation without exposing details to all applications
  – Allow more innovation to occur without applications needing to change

Philosophy
Performance
  – ≥ existing solutions
  – Minimize control data to/from the library
  – Allow for optimized usage models
  – Asynchronous operation

Thoughts
What possibilities are there if we move from 1.x to 2.0?
What if we don’t constrain ourselves?
  – Remove full compatibility as a requirement
Work from a more ideal solution backwards
  – See where we end up and take aim at compatibility from there

Sending Using Verbs
For a simple asynchronous send, apps need to provide this: (I can’t read it either)

    struct ibv_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
    };

    struct ibv_send_wr {
        uint64_t            wr_id;
        struct ibv_send_wr *next;
        struct ibv_sge     *sg_list;
        int                 num_sge;
        enum ibv_wr_opcode  opcode;
        int                 send_flags;
        uint32_t            imm_data;
        union {
            struct {
                uint64_t remote_addr;
                uint32_t rkey;
            } rdma;
            struct {
                uint64_t remote_addr;
                uint64_t compare_add;
                uint64_t swap;
                uint32_t rkey;
            } atomic;
            struct {
                struct ibv_ah *ah;
                uint32_t       remote_qpn;
                uint32_t       remote_qkey;
            } ud;
        } wr;
    };

Verbs asks for this
Union supports other operations
More than a semantic mismatch

Sending Using Verbs

    struct ibv_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
    };

    struct ibv_send_wr {
        uint64_t            wr_id;
        struct ibv_send_wr *next;
        struct ibv_sge     *sg_list;
        int                 num_sge;
        enum ibv_wr_opcode  opcode;
        int                 send_flags;
        uint32_t            imm_data;
        ...
    };

Application request
Must link to separate SGL and initialize count
Requests may be linked - next must be set to NULL
3 x 8 = 24 bytes of data needed
SGE + WR = 88 bytes allocated
App must set and provider must switch on opcode
Must clear flags
28 additional bytes initialized
Significant SW overhead
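The overhead listed on this slide can be made concrete with a small sketch. The `demo_*` types below are local mirrors of the layouts shown on the slide, not the libibverbs header, and `demo_prepare_send` and `demo_request_footprint` are hypothetical helpers introduced only for illustration. The 88-byte figure assumes a typical LP64 platform with 4-byte enums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the slide's layouts; NOT the real libibverbs types. */
enum demo_wr_opcode { DEMO_WR_SEND = 0 };

struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

struct demo_ah; /* opaque, only used through a pointer */

struct demo_send_wr {
    uint64_t             wr_id;
    struct demo_send_wr *next;
    struct demo_sge     *sg_list;
    int                  num_sge;
    enum demo_wr_opcode  opcode;
    int                  send_flags;
    uint32_t             imm_data;
    union {
        struct { uint64_t remote_addr; uint32_t rkey; } rdma;
        struct { uint64_t remote_addr; uint64_t compare_add;
                 uint64_t swap; uint32_t rkey; } atomic;
        struct { struct demo_ah *ah; uint32_t remote_qpn;
                 uint32_t remote_qkey; } ud;
    } wr;
};

/* Hypothetical helper: everything an app must set up for one plain send. */
static void demo_prepare_send(struct demo_send_wr *wr, struct demo_sge *sge,
                              const void *buf, uint32_t len, uint32_t lkey,
                              uint64_t wr_id)
{
    assert(wr != NULL && sge != NULL);

    /* The buffer is described in a separate scatter-gather entry. */
    sge->addr = (uint64_t)(uintptr_t)buf;
    sge->length = len;
    sge->lkey = lkey;

    /* Clears next, send_flags, imm_data and the unused union. */
    memset(wr, 0, sizeof(*wr));
    wr->wr_id = wr_id;
    wr->sg_list = sge;          /* link to the separate SGL */
    wr->num_sge = 1;            /* and initialize its count */
    wr->opcode = DEMO_WR_SEND;  /* provider later switches on this */
}

/* Bytes the app allocates to describe one send request. */
static size_t demo_request_footprint(void)
{
    return sizeof(struct demo_sge) + sizeof(struct demo_send_wr);
}
```

Only the buffer address, length, and request identifier carry information the application actually cares about; the rest is bookkeeping imposed by the interface.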

Alternative Model?
What about an asynchronous socket model?

    (*send)(fid, buf, len, flags, context);
    (*sendto)(fid, buf, len, flags, dest_addr, addrlen, context);
    (*sendmsg)(fid, *fi_msg, flags);
    (*write)(fid, buf, count, context);
    (*writev)(fid, iov, iovcnt, context);

Define extensible collection of interfaces suitable for sending and receiving messages
Optimized interfaces
Socket APIs have held up well against evolving networks
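A minimal sketch of how such an asynchronous, socket-like `send` might behave is shown here. Only the argument list `(fid, buf, len, flags, context)` comes from the slide. The `demo_*` names, the return convention (0 when queued, -1 when the queue is full), and the loopback "provider" that completes every request immediately are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h> /* ssize_t */

typedef struct demo_fid *demo_fid_t;

/* Message interface: a table of function pointers filled in by a provider. */
struct demo_ops_msg {
    ssize_t (*send)(demo_fid_t fid, const void *buf, size_t len,
                    uint64_t flags, void *context);
};

#define DEMO_QUEUE_DEPTH 4

/* Hypothetical endpoint object with a tiny ring of completed contexts. */
struct demo_fid {
    const struct demo_ops_msg *msg;
    void  *completed[DEMO_QUEUE_DEPTH];
    size_t head; /* next completion to report */
    size_t tail; /* next free slot */
};

/* Loopback provider: accept the request and record its context. */
static ssize_t demo_send(demo_fid_t fid, const void *buf, size_t len,
                         uint64_t flags, void *context)
{
    (void)buf; (void)len; (void)flags;
    assert(fid != NULL);
    if (fid->tail - fid->head == DEMO_QUEUE_DEPTH)
        return -1; /* queue full, caller must poll first */
    fid->completed[fid->tail % DEMO_QUEUE_DEPTH] = context;
    fid->tail++;
    return 0; /* queued; completion is reported later */
}

static const struct demo_ops_msg demo_loopback_ops = { demo_send };

static void demo_fid_init(struct demo_fid *fid)
{
    fid->msg = &demo_loopback_ops;
    fid->head = 0;
    fid->tail = 0;
}

/* Returns the context of the oldest completed request, or NULL if none. */
static void *demo_poll(demo_fid_t fid)
{
    void *context;
    if (fid->head == fid->tail)
        return NULL;
    context = fid->completed[fid->head % DEMO_QUEUE_DEPTH];
    fid->head++;
    return context;
}
```

The application passes three words of real information (buffer, length, context) and gets the context back on completion, with no work request or scatter-gather structure to build.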

Sending Using Verbs

    union {
        struct {
            uint64_t remote_addr;
            uint32_t rkey;
        } rdma;
        struct {
            uint64_t remote_addr;
            uint64_t compare_add;
            uint64_t swap;
            uint32_t rkey;
        } atomic;
        struct {
            struct ibv_ah *ah;
            uint32_t       remote_qpn;
            uint32_t       remote_qkey;
        } ud;
    } wr;

Other operations handled similarly
Define RDMA and atomic specific interfaces
Allow apps to ‘connect’ UD socket to specific destination

Verbs Completions

    struct ibv_wc {
        uint64_t           wr_id;
        enum ibv_wc_status status;
        enum ibv_wc_opcode opcode;
        uint32_t           vendor_err;
        uint32_t           byte_len;
        uint32_t           imm_data;
        uint32_t           qp_num;
        uint32_t           src_qp;
        int                wc_flags;
        uint16_t           pkey_index;
        uint16_t           slid;
        uint8_t            sl;
        uint8_t            dlid_path_bits;
    };

Provider must fill out all fields, even if app ignores some
Developer must determine if fields apply to their QP
Single structure is 48 bytes – likely to cross cacheline boundary
App must check both return code and status to determine if a request completed successfully

Verbs Completions
(same struct ibv_wc as on the previous slide)
Let application identify needed data
Report unexpected errors ‘out of band’
Separate addressing data from completion data
Use compact structures with only needed data exchanged across interface
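The size argument can be checked with a short sketch. `struct demo_wc` mirrors the field layout shown on the slide and is not the libibverbs type. `struct demo_comp_context`, `demo_compact`, and `demo_entries_per_cacheline` are hypothetical, and the 64-byte cache line is an assumption about the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64 /* assumed cache line size in bytes */

enum demo_wc_status { DEMO_WC_SUCCESS = 0, DEMO_WC_ERROR = 1 };
enum demo_wc_opcode { DEMO_WC_SEND = 0, DEMO_WC_RECV = 1 };

/* Mirror of the full completion layout shown on the slide. */
struct demo_wc {
    uint64_t            wr_id;
    enum demo_wc_status status;
    enum demo_wc_opcode opcode;
    uint32_t            vendor_err;
    uint32_t            byte_len;
    uint32_t            imm_data;
    uint32_t            qp_num;
    uint32_t            src_qp;
    int                 wc_flags;
    uint16_t            pkey_index;
    uint16_t            slid;
    uint8_t             sl;
    uint8_t             dlid_path_bits;
};

/* Hypothetical compact format: only the context of a successful request. */
struct demo_comp_context {
    void *context;
};

/* Reduce a full completion to the one field a simple app needs. */
static struct demo_comp_context demo_compact(const struct demo_wc *wc)
{
    struct demo_comp_context c;
    assert(wc != NULL);
    c.context = (void *)(uintptr_t)wc->wr_id;
    return c;
}

/* How many whole entries of a given size fit in one cache line. */
static size_t demo_entries_per_cacheline(size_t entry_size)
{
    assert(entry_size > 0);
    return DEMO_CACHELINE / entry_size;
}
```

Under these assumptions one full entry fits per cache line, while several context-only entries share a line, which is the motivation for letting the application pick the completion format.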

Proposal Summary
Merge existing APIs into a cohesive interface
Abstract above the hardware
  – Enable optimizations to reduce memory writes, decrease allocated buffer space, minimize cache footprint, and avoid code branches
Focus APIs on the semantics and services offered by the hardware and not the implementation
  – Message queues and RDMA, versus QPs
  – Minimize API churn for every hardware feature

Moving Forward
Critical to have wide support and shared ownership
  – General agreement on approach
Define control interfaces and object models
  – Effectively instantiate the framework
Describe fabric interfaces
Success ultimately depends on adoption – vendors AND users
Use open source processes

Open Fabrics libfabric - Proposal

Path Forward
Framework must efficiently support existing HW
  – Compelling adoption and migration story
  – Some legacy elements
Move focus from HW to application semantics
  – Make the users happy
Provide clear path for moving applications and providers forward

Path Forward
Reach agreement on framework infrastructure
  – Control interfaces and basic objects
Define a couple of simple API sets
  – Derived from current usage models
  – E.g. CM and message queue APIs
Design application tuned APIs
Proposed time-driven release schedule
  – Target initial release within 12 months

Philosophy
Administrator configured
  – Based on Linux networking options
  – Simplify application use
  – Provider defined defaults with administrator control

Architecture
[Diagram labels: libfabric, Fabric Interfaces, OFA Provider, Vendor Provider, Dynamic Provider]

Control Interface
fi_getinfo
  – Discover fabric providers and services
  – Identify resources and addressing
fi_socket
  – Allocate fabric communication portal
fi_open
  – Open resource domain and interfaces
fi_register
  – Dynamic providers publish control interfaces
[Diagram: FI Framework exporting fi_getinfo, fi_freeinfo, fi_socket, fi_open, fi_register]

Object Model
[Diagram labels: Resource Domain (Protection Domain, Shared Receive Queues, Event Collectors, Address Vectors), Fabric Socket, Unbound Interfaces, Fabric Interfaces, Provider I/F, Kernel uAPI]
Boundary of resource sharing
Binds to resources
Identified by name
Helper interfaces and provider specific capabilities

Fabric Interface Descriptors
Based on object-oriented programming
Derived objects define interfaces
  – New interfaces exposed
  – Define behavior of inherited interfaces
  – Optimize implementation
FID
  – Base object identifier
  – Control interfaces

Fabric Socket Interfaces
Properties: Type, Protocol, Address
Interfaces: Base, Socket API, CM, Message Transfers, RDMA, Tagged, Atomics, Collectives
Evolution of RDMA CM & QP
Interfaces enabled based on protocol
Interface implementation optimized based on socket properties

Event Collectors
Properties: Format, Wait Object, Domain
Interface Details
  – Format: Context only, Data, Tagged, Addressing, CM, Error
  – Wait object: None, fd, mwait
Common abstraction for asynchronous events
User specified wait object
Optimized event data
Optimize interface around reporting successful operations

Address Vectors
Properties: Format
Interface Details
  – Format: INET, INET6, IB, FI Address, AV index
Maps network addresses to fabric specific addressing
Encapsulates fabric specific requirements
  – Address resolution
  – Route resolution
  – Address handles
Can be referenced for group communication
Configure resource domain to use specific address formats

Compatibility
Support migration path for apps
  – Allow software to evolve to new framework selectively
  – Goal: increase adoption rate
Define ‘compatibility’ mode
  – Not all features may be supportable
  – Restricts implementation
  – Goal: fully compatible

Adjacent Interfaces
Using fabric interfaces with adjacent interfaces
[Diagram labels: libfabric, Fabric Interfaces, OFA Provider, Dual-Provider Library, Adjacent Interface]
FI calls go directly to provider
Provider library must understand both interfaces
Provider exports adjacent interface

Mapping Between Interfaces
[Diagram labels: libfabric, Fabric Interfaces, OFA Provider, Dual-Provider Library, Adjacent Interface]
Separate object domains
Mapping dependent on underlying implementation
Define mappings and interfaces to map objects between domains

Moving Forward
Collect, analyze, and discuss proposals
Involve key users and contributors
Consider alternates
  – Identify commonalities and differences
  – Resolve issues
Discuss and refine details
  – Moving in the desired direction

Fabric Information

    struct fi_info {
        struct fi_info      *next;
        size_t               size;
        uint64_t             flags;
        uint64_t             type;
        uint64_t             protocol;
        enum fi_iov_format   iov_format;
        enum fi_addr_format  addr_format;
        enum fi_addr_format  info_addr_format;
        size_t               src_addrlen;
        size_t               dst_addrlen;
        void                *src_addr;
        void                *dst_addr;
        size_t               auth_keylen;
        void                *auth_key;
        int                  shared_fd;
        char                *domain_name;
        size_t               datalen;
        void                *data;
    };
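The `next` pointer suggests that discovery results form a linked list. The sketch below assumes that reading and walks such a list to select an entry. `struct demo_info` is a trimmed, hypothetical copy that keeps only four of the slide's fields, and `demo_find_info` is an illustrative helper, not part of the proposed API. The slides do not give the signature of `fi_getinfo`, so the sketch does not call it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trimmed, hypothetical copy of the slide's struct fi_info. */
struct demo_info {
    struct demo_info *next;
    uint64_t          type;
    uint64_t          protocol;
    char             *domain_name;
};

/*
 * Return the first entry with the requested type. If domain_name is
 * non-NULL the entry must also name that resource domain.
 */
static const struct demo_info *demo_find_info(const struct demo_info *list,
                                              uint64_t type,
                                              const char *domain_name)
{
    const struct demo_info *cur;

    for (cur = list; cur != NULL; cur = cur->next) {
        if (cur->type != type)
            continue;
        if (domain_name != NULL &&
            (cur->domain_name == NULL ||
             strcmp(cur->domain_name, domain_name) != 0))
            continue;
        return cur;
    }
    return NULL; /* nothing matched */
}
```

An application would typically take the first match and hand its addressing fields to the endpoint creation call.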

Base Fabric Descriptor

    struct fi_ops {
        size_t size;
        int (*close)(fid_t fid);
        int (*bind)(fid_t fid, struct fi_resource *fids, int nfids);
        int (*sync)(fid_t fid, uint64_t flags, void *context);
        int (*control)(fid_t fid, int command, void *arg);
    };

    struct fid {
        int            fclass;
        int            size;
        void          *context;
        struct fi_ops *ops;
    };
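One plausible use of the leading `size` field is to let callers detect whether a provider's operations table is new enough to contain a given entry. That interpretation is an assumption; the slides only state that structures should be extensible. The sketch uses a reduced table with hypothetical `demo_*` names and also shows a derived object embedding the base descriptor as its first member.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_fid *demo_fid_t;

/* Reduced operations table; size records how much the provider filled in. */
struct demo_ops {
    size_t size;
    int (*close)(demo_fid_t fid);
    int (*control)(demo_fid_t fid, int command, void *arg);
};

struct demo_fid {
    int              fclass;
    int              size;
    void            *context;
    struct demo_ops *ops;
};

/* True when the table is large enough to hold `member` and it is set. */
#define DEMO_OPS_HAS(ops, member)                                        \
    ((ops)->size >= offsetof(struct demo_ops, member) +                  \
                        sizeof((ops)->member) &&                         \
     (ops)->member != NULL)

/* Derived object: the base descriptor is the first member. */
struct demo_socket {
    struct demo_fid fid;
    int             closed;
};

static int demo_socket_close(demo_fid_t fid)
{
    /* Valid cast because fid is the first member of demo_socket. */
    struct demo_socket *sock = (struct demo_socket *)fid;
    sock->closed = 1;
    return 0;
}

static int demo_socket_control(demo_fid_t fid, int command, void *arg)
{
    (void)fid; (void)arg;
    return command; /* echo the command so callers can observe the call */
}

/* Generic wrappers: dispatch only if the provider supplies the entry. */
static int demo_close(demo_fid_t fid)
{
    assert(fid != NULL && fid->ops != NULL);
    return DEMO_OPS_HAS(fid->ops, close) ? fid->ops->close(fid) : -1;
}

static int demo_control(demo_fid_t fid, int command, void *arg)
{
    assert(fid != NULL && fid->ops != NULL);
    return DEMO_OPS_HAS(fid->ops, control)
               ? fid->ops->control(fid, command, arg) : -1;
}
```

With this pattern, entries appended to the table in a later release stay invisible to callers when an older provider reports a smaller `size`.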

FI - Communication

    enum fid_type {
        FID_UNSPEC,   /* pick better name */
        FID_MSG,
        FID_STREAM,
        FID_DGRAM,
        FID_RAW,
        FID_RDM,
        FID_PACKET,
        FID_MAX
    };

    #define FID_TYPE_MASK       0xFF

    enum fi_proto {
        FI_PROTO_UNSPEC,
        FI_PROTO_IB_RC,
        FI_PROTO_IWARP,
        FI_PROTO_IB_UC,
        FI_PROTO_IB_UD,
        FI_PROTO_IB_XRC,
        FI_PROTO_RAW,
        FI_PROTO_MAX
    };

    #define FI_PROTO_MASK       0xFF
    #define FI_PROTO_MSG        (1ULL << 8)
    #define FI_PROTO_RDMA       (1ULL << 9)
    #define FI_PROTO_TAGGED     (1ULL << 10)
    #define FI_PROTO_ATOMICS    (1ULL << 11)
    /* Multicast uses MSG ops */
    #define FI_PROTO_MULTICAST  (1ULL << 12)
    /*#define FI_PROTO_COLLECTIVES (1ULL << 13)*/
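The mask and shift values suggest that a single 64-bit protocol word carries the `enum fi_proto` value in its low byte and capability flags in the bits above it. The sketch below assumes that layout. The enum and macros are copied from the slide, while the `demo_*` helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the slide. */
enum fi_proto {
    FI_PROTO_UNSPEC,
    FI_PROTO_IB_RC,
    FI_PROTO_IWARP,
    FI_PROTO_IB_UC,
    FI_PROTO_IB_UD,
    FI_PROTO_IB_XRC,
    FI_PROTO_RAW,
    FI_PROTO_MAX
};

#define FI_PROTO_MASK      0xFF
#define FI_PROTO_MSG       (1ULL << 8)
#define FI_PROTO_RDMA      (1ULL << 9)
#define FI_PROTO_TAGGED    (1ULL << 10)
#define FI_PROTO_ATOMICS   (1ULL << 11)
#define FI_PROTO_MULTICAST (1ULL << 12)

/* Combine a protocol identifier (low byte) with capability flags. */
static uint64_t demo_pack_protocol(enum fi_proto proto, uint64_t caps)
{
    assert(proto < FI_PROTO_MAX);
    return ((uint64_t)proto & FI_PROTO_MASK) |
           (caps & ~(uint64_t)FI_PROTO_MASK);
}

/* Extract the protocol identifier from the low byte. */
static enum fi_proto demo_protocol_id(uint64_t protocol)
{
    return (enum fi_proto)(protocol & FI_PROTO_MASK);
}

/* Non-zero when every requested capability bit is present. */
static int demo_protocol_has(uint64_t protocol, uint64_t caps)
{
    return (protocol & caps) == caps;
}
```

For example, packing `FI_PROTO_IB_RC` with `FI_PROTO_MSG | FI_PROTO_RDMA` describes a reliable-connected InfiniBand endpoint that exposes the message and RDMA interfaces but not tagged or atomic operations.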

FI – Communication - MSG

    struct fi_ops_msg {
        size_t  size;
        ssize_t (*recv)(fid_t fid, void *buf, size_t len, void *context);
        ssize_t (*recvmem)(fid_t fid, void *buf, size_t len,
                           uint64_t mem_desc, void *context);
        ssize_t (*recvv)(fid_t fid, const void *iov, size_t count,
                         void *context);
        ssize_t (*recvfrom)(fid_t fid, void *buf, size_t len,
                            const void *src_addr, void *context);
        ssize_t (*recvmemfrom)(fid_t fid, void *buf, size_t len,
                               uint64_t mem_desc, const void *src_addr,
                               void *context);
        ssize_t (*recvmsg)(fid_t fid, const struct fi_msg *msg,
                           uint64_t flags);
        /* corresponding send calls */
    };

FI – Communication

    struct fid_socket {
        struct fid            fid;
        struct fi_ops_sock   *ops;
        struct fi_ops_msg    *msg;
        struct fi_ops_cm     *cm;
        struct fi_ops_rdma   *rdma;
        struct fi_ops_tagged *tagged;
        /* struct fi_ops_atomics *atomic; */
    };