Advancing open fabrics interfaces

Slides:



Advertisements
Similar presentations
MPI Message Passing Interface
Advertisements

Middleware Support for RDMA-based Data Transfer in Cloud Computing Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi Department of Electrical.
System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.
KOFI Stan Smith Intel SSG/DPD January, 2015 Kernel OpenFabrics Interface.
VIA and Its Extension To TCP/IP Network Yingping Lu Based on Paper “Queue Pair IP, …” by Philip Buonadonna.
Socket Programming.
Socket Addresses. Domains Internet domains –familiar with these Unix domains –for processes communicating on the same hosts –not sure of widespread use.
An overview of Infiniband Reykjavik, June 24th 2008 R E Y K J A V I K U N I V E R S I T Y Dept. Computer Science Center for Analysis and Design of Intelligent.
COM S 614 Advanced Systems Novel Communications U-Net and Active Messages.
Stan Smith Intel SSG/DPD June, 2015 Kernel Fabric Interface KFI Framework.
New Direction Proposal: An OpenFabrics Framework for high-performance I/O apps OFA TAC, Key drivers: Sean Hefty, Paul Grun.
Open Fabrics Interfaces Architecture Introduction Sean Hefty Intel Corporation.
Signature Verbs Extension Richard L. Graham. Data Integrity Field (DIF) Used to provide data block integrity check capabilities (CRC) for block storage.
OpenFabrics 2.0 Sean Hefty Intel Corporation. Claims Verbs is a poor semantic match for industry standard APIs (MPI, PGAS,...) –Want to minimize software.
1 March 2010 A Study of Hardware Assisted IP over InfiniBand and its Impact on Enterprise Data Center Performance Ryan E. Grant 1, Pavan Balaji 2, Ahmad.
LWIP TCP/IP Stack 김백규.
High Performance User-Level Sockets over Gigabit Ethernet Pavan Balaji Ohio State University Piyush Shivam Ohio State University.
LWIP TCP/IP Stack 김백규.
© 2012 MELLANOX TECHNOLOGIES 1 The Exascale Interconnect Technology Rich Graham – Sr. Solutions Architect.
OpenFabrics 2.0 or libibverbs 1.0 Sean Hefty Intel Corporation.
Scalable Fabric Interfaces Sean Hefty Intel Corporation OFI software will be backward compatible.
2006 Sonoma Workshop February 2006Page 1 Sockets Direct Protocol (SDP) for Windows - Motivation and Plans Gilad Shainer Mellanox Technologies Inc.
OFI SW - Progress Sean Hefty - Intel Corporation.
Fabric Interfaces Architecture Sean Hefty - Intel Corporation.
Windows Network Programming ms-help://MS.MSDNQTR.2004JAN.1033/winsock/winsock/windows_sockets_start_page_2.htm 井民全.
Scalable RDMA Software Solution Sean Hefty Intel Corporation.
NFD Permanent Face Junxiao Shi, Outline what is a permanent face necessity and benefit of having permanent faces guarantees provided by.
ISCSI Extensions for RDMA (iSER) draft-ko-iwarp-iser-02 Mike Ko IBM August 2, 2004.
Fabric Interfaces Architecture Sean Hefty - Intel Corporation.
Stan Smith Intel SSG/DPD February, 2015 Kernel OpenFabrics Interface Initialization.
Intel Research & Development ETA: Experience with an IA processor as a Packet Processing Engine HP Labs Computer Systems Colloquium August 2003 Greg Regnier.
IB Verbs Compatibility
OFI SW Sean Hefty - Intel Corporation. Target Software 2 Verbs 1.x + extensions 2.0 RDMA CM 1.x + extensions 2.0 Fabric Interfaces.
OpenFabrics Interface WG A brief introduction Paul Grun – co chair OFI WG Cray, Inc.
AMQP, Message Broker Babu Ram Dawadi. overview Why MOM architecture? Messaging broker like RabbitMQ in brief RabbitMQ AMQP – What is it ?
OpenFabrics 2.0 rsockets+ requirements Sean Hefty - Intel Corporation Bob Russell, Patrick MacArthur - UNH.
Open Fabrics Interfaces Software Sean Hefty - Intel Corporation.
Stan Smith Intel SSG/DPD June, 2015 Kernel Fabric Interface Kfabric Framework.
PART1 Data collection methodology and NM paradigms 1.
SC’13 BoF Discussion Sean Hefty Intel Corporation.
Advisor: Hung Shi-Hao Presenter: Chen Yu-Jen
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to threads
Software and Communication Driver, for Multimedia analyzing tools on the CEVA-X Platform. June 2007 Arik Caspi Eyal Gabay.
TLDK overview Konstantin Ananyev 05/08/2016.
Exposing Link-Change Events to Applications
Chapter 4: Threads.
Agenda About us Why para-virtualize RDMA Project overview Open issues
Operating Systems : Overview
developing to the openfabrics interfaces libfabric
LWIP TCP/IP Stack 김백규.
z/Ware 2.0 Technical Overview
A Brief Introduction to OpenFabrics Interfaces - libfabric
Fabric Interfaces Architecture – v4
OpenStorage API part II
Persistent memory support
Distributed Mobility Management (DMM) WG DMM Work Item: Forwarding Path & Signaling Management (FPSM) draft-ietf-dmm-fpc-cpdp-01.txt IETF93, Prague.
OpenFabrics Interfaces: Past, present, and future
HyperLoop: Group-Based NIC Offloading to Accelerate Replicated Transactions in Multi-tenant Storage Systems Daehyeok Kim Amirsaman Memaripour, Anirudh.
Chapter 4: Threads.
Request ordering for FI_MSG and FI_RDM endpoints
OpenFabrics Interfaces Working Group Co-Chair Intel November 2016
OpenFabrics Alliance An Update for SSSI
Operating Systems : Overview
Application taxonomy & characterization
Introduction to Operating Systems
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Chapter 4: Threads.
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
NVMe.
Presentation transcript:

Advancing open fabrics interfaces Sean Hefty, OFIWG Co-Chair Intel Corporation March, 2017

Moving libfabric forward Apps Users Providers Deferred and new features Enhancements Developer feedback API adjustments Provider implementation Performance optimizations Many features require migration to API 1.5 Existing apps work unmodified

Lower the barrier to adoption Many concepts are generic New! OFI Developer guide What a novel concept! Broad coverage of libfabric Design motivations, architecture, usage guidance Assumes minimal background Reviews sockets API for context Analyzes HPC requirements and their impact on an API Memory copies, network buffering, asynchronous operations, interrupts, direct HW access, ... Lower the barrier to adoption Many concepts are generic

Then it goes into detail about OFI https://github.com/ofiwg/ofi-guide New! OFI Developer guide Owl read this later Examines API trade-offs Call setup, branches, memory footprint, … Defines communication primitives Command queues, memory buffers, completion notification, progress models, data and message ordering, … Then it goes into detail about OFI https://github.com/ofiwg/ofi-guide

Old! architecture Capabilities Modes

Isolate traffic between virtual sets of peers Job Keys Authorization keys Isolate traffic between virtual sets of peers Associated with endpoints and memory regions Only endpoints with the same key may communicate Memory regions are only accessible through endpoints with matching keys

Multicast Datagram endpoint capability At Long Last Multicast Datagram endpoint capability New fi_join call to connect to group Supports send-only joins Returns fabric address (fi_addr_t) for group Uses existing data transfer APIs Only message queue interface currently defined Multicast sends require use of FI_MULTICAST flag Distinguish between unicast and multicast address Received messages also marked with FI_MULTICAST

Apps/middleware designed around socket semantics Beyond HPC New endpoint types Synchronous completions No completion queue Application owns buffers Apps/middleware designed around socket semantics Enterprise / Cloud sockets rsockets ZeroMQ nanomsg … libfabric CM AV EQ EP:SOCK_DGRAM OFI Provider EP:SOCK_STREAM fd fd Enable provider specific optimizations RMA, circular receive buffers Minimize data copies, use offload Associated with file descriptor Support for select/poll

RMA key (not address) identifies the buffer Free: 100% More Expanding ATOMICs RMA key (not address) identifies the buffer 0001 RMA Key (RMA) Atomic 11__ Tagged Atomic Message Tag 100_ FI_TAGGED op flag - Replace key with tag fi_atomic_query() - Domain level call - Also reports datatype size

Deferred work queues Generalizes triggered operations Experimental Deferred work queues Conceptual model only Generalizes triggered operations work queue ep mem cq 3 5 7 counter work request Allows (theoretical) associations between any domain level object Definition is limited Conceptual domain level work queue Primitives to develop and evaluate collective APIs e.g. directed acyclic graphs

optimizing completions Reducing Overhead optimizing completions rx/tx operation completion FI_RESTRICTED_COMP Only similar endpoints will share CQs / counters Eliminates post-processing FI_NOTIFY_FLAGS_ONLY - Drops operation flag - Retains notification flags: Remote CQ data Multi-recv buffer freed Avoids op save/restore cq New mode bits transmit receive

Completion Data Reporting errors FI_SOURCE_ERR capability App Request Multi-Threading Completion Data Reporting errors Provider specific data reported for debugging Data is opaque to application Storage managed by provider Report maximum size of error data Application provides buffer Improves multi-threaded completion processing error data cq address FI_SOURCE_ERR capability - Validate source address against local AV - Report raw fabric address if not found EADDRNOTAVAIL completion error Pushes ‘connect’ address exchange to provider receive

Feature Grab bag ? New attributes Communication scope Feedback ? Feature Grab bag Fabric: api version Domain: counter limits, memory region iov limits New attributes Local: within the local node Shared memory, most NICs Remote: with external nodes Any NIC Evading more refined definitions Communication scope FI_ADDR_STR address format Generic string format for any address Application does not need to interpret Well-known format for sockaddr Pass directly to fi_getinfo() or AV insertion String based addressing

Provider: “udp;ofi-rxd” Simplify Development Provider updates Endpoints DGRAM MSG RDM  …  Exported X Core EP type Capabilities Provider: “udp;ofi-rxd” OFI ofi-rxd ofi-rxm ofi-mxr Utility Providers    …  …    …    …  …    …    …  …    …  … OFI Utility must be layered over core. Well-known prefix name. Only slight differences in internal handling.  …     …  …    …  …     …  …    …  …    …  …    …  …    …  …    …  … Sockets (TCP) UDP Cisco usNIC * Verbs Intel OPA PSM ® Intel OPA PSM2 ® Cray GNI * IBM Blue Gene * Mellanox Mlx * New Core Providers * Other names and brands may be claimed as the property of others

Memory registration FI_MR_BASIC FI_MR_SCALABLE FI_MR_NEITHER Exposing the Pain Memory registration FI_MR_BASIC Traditional RDMA fabrics FI_MR_SCALABLE Application driven capability FI_MR_NEITHER What other fabrics support

MR Mode bits Redefine domain attribute mr_mode Exposing the Pain MR Mode bits Redefine domain attribute mr_mode enum  integer flags Divide FI_MR_BASIC into specific restrictions FI_MR_SCALABLE implied if mr_mode = 0 FI_MR_BASIC, FI_MR_SCALABLE kept for compatibility Behavior dependent on selected API version

  ‘Basic’ MR Mode bits FI_MR_VIRT_ADDR FI_MR_ALLOCATED FI_MR_LOCAL Exposing the Pain ‘Basic’ MR Mode bits FI_MR_VIRT_ADDR Target offset is virtual address 0001 virtual address: 0x8000 1001 RMA 1001 @ 0x8100 0010  1010 FI_MR_ALLOCATED Backed by physical pages 0011  1100 FI_MR_LOCAL Locally accessed buffers must be registered FI_MR_PROV_KEY Provider sets key

Provider selects base address It’s only a flesh wound Raw Memory Regions FI_MR_RAW More generic version of ‘basic’ registration Provider selects base address 0001 base address: ? 10011001 RMA raw key @ base + offset > 64-bit keys 0010 fi_mr_raw_attr() Retrieve MR attributes 10101010 0011 11001100 fi_mr_map_raw() Peer sets up MR

non-omnipotent providers Annoying itch non-omnipotent providers FI_MR_MMU_NOTIFY App notifies provider when physical pages backing MR change I.e. almost scalable, but NIC not linked into MMU Only those addresses accessed need to be backed RMA event 5 6 8 8 8 FI_MR_RMA_EVENT App indicates if MR will be used with RMA event counting MRs must be enabled after resource binding

Looking Back Apps Users Providers Balance application needs, developer requests, and provider capabilities New features for expanding use cases Developer assistance Sensible performance optimizations

Sean Hefty, OFIWG Co-Chair THANK YOU Sean Hefty, OFIWG Co-Chair Intel