Scalable Fabric Interfaces - Sean Hefty, Intel Corporation. OFI software will be backward compatible.

Presentation transcript:

Scalable Fabric Interfaces - Sean Hefty, Intel Corporation. OFI software will be backward compatible.

OFI WG Charter

Develop an extensible, open source framework and interfaces, aligned with ULP and application needs, for high-performance fabric services.

Enable:
- Minimal footprint: reduced cache and memory footprint
- High performance: optimized software path to hardware, independent of hardware interface, version, and features
- App-centric: application-focused APIs
- Extensible: more agile development; time-boxed, iterative development; adaptable

Analyze application needs, then implement them in a coherent, concise, high-performance manner.

Minimal footprint

How can an API affect application scalability? I'm glad I asked.

Communication

    struct rdma_route {
        struct rdma_addr addr;
        struct ibv_sa_path_rec *path_rec;
        ...
    };

    struct rdma_cm_id { ... };

    rdma_create_id()
    rdma_resolve_addr()
    rdma_resolve_route()
    rdma_connect()

- Src/dst addresses stored per endpoint: 456 bytes per endpoint
- Path record per endpoint
- Resolves a single address and path at a time
- All-to-all connected model for best performance: reliable data transfers, zero copies, to thousands of processes
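To make the per-endpoint cost concrete, here is a minimal sketch, assuming the standard librdmacm API, of the resolution sequence an all-to-all application repeats for every peer (passing a NULL event channel puts each ID into synchronous mode, so the calls block until complete; error handling omitted):

    /* Sketch: one rdma_cm_id, with its embedded rdma_route, per peer. */
    #include <rdma/rdma_cma.h>

    void resolve_all_peers(struct sockaddr **peers, struct rdma_cm_id **ids, int n)
    {
        for (int i = 0; i < n; i++) {
            /* ~456 bytes of address/path state accumulate per endpoint */
            rdma_create_id(NULL, &ids[i], NULL, RDMA_PS_TCP);
            /* one address, then one path record, resolved at a time */
            rdma_resolve_addr(ids[i], NULL, peers[i], 2000 /* ms */);
            rdma_resolve_route(ids[i], 2000 /* ms */);
        }
    }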

Scalable Communication

Application-driven communication models:
- Reliable unconnected transfers: abstract hardware features (SRQ, XRC, dynamic connections)
- Optimize addressing: resolve multiple resolution requests at once; compact address data storage (compressed address ranges, path data)
- Support multiple resolution mechanisms, optimized for different topologies and fabric sizes

SFI - Address Vectors

- Store addresses/host names: insert a range of addresses with a single call; share between processes
- Enable provider optimization techniques: greatly reduce storage requirements
- Reference entries by handle or index: the handle may be an encoded fabric address
- Reference the vector for group communication
(Example only)
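As a concrete illustration of this model, here is a hedged sketch using the address vector API that later shipped in libfabric (fi_av_open/fi_av_insert); the exact SFI interface proposed in this talk may have differed, and domain setup and error handling are omitted:

    /* Sketch: insert a whole range of peer addresses with a single call and
     * get back compact fi_addr_t handles; with FI_AV_TABLE each handle is
     * just an index, and the provider may compress the stored addresses. */
    #include <stdlib.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>

    fi_addr_t *map_peers(struct fid_domain *domain, void *addrs, size_t count)
    {
        struct fi_av_attr attr = {
            .type  = FI_AV_TABLE,   /* index-based references */
            .count = count,
        };
        struct fid_av *av;
        fi_addr_t *handles = calloc(count, sizeof(*handles));

        fi_av_open(domain, &attr, &av, NULL);
        fi_av_insert(av, addrs, count, handles, 0, NULL);  /* one call, many peers */
        return handles;
    }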

High performance

Can API changes unlock higher performance? Just a guess, but is the answer "yes"?

Application Send

    struct ibv_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
    };

    struct ibv_send_wr {
        uint64_t             wr_id;
        struct ibv_send_wr  *next;
        struct ibv_sge      *sg_list;
        int                  num_sge;
        enum ibv_wr_opcode   opcode;
        int                  send_flags;
        uint32_t             imm_data;
        ...
    };

- Application request: 3 x 8 = 24 bytes of data needed; SGE + WR = 88 bytes allocated
- Requests may be linked: next must be set to NULL
- Must link to a separate SGL and initialize the count
- App must set the opcode, and the provider must switch on it
- Must clear flags: 28 additional bytes initialized
- Significant SW overhead
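To show where those bytes and writes come from, here is a minimal sketch, assuming the stock verbs API, of what an application must write just to post a single-buffer send (error handling omitted):

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int post_one_send(struct ibv_qp *qp, void *buf, uint32_t len,
                      uint32_t lkey, uint64_t wr_id)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t) buf, .length = len, .lkey = lkey
        };
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));  /* flags and unused fields must be cleared */
        wr.wr_id   = wr_id;
        wr.next    = NULL;           /* list terminator, even for one request */
        wr.sg_list = &sge;           /* separate SGL linked in, count initialized */
        wr.num_sge = 1;
        wr.opcode  = IBV_WR_SEND;    /* provider will switch on this */
        return ibv_post_send(qp, &wr, &bad_wr);
    }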

Provider Send

For each work request (most often 1; overlap operations):
- Check for available queue space
- Check SGL size (often 1 or 2, fixed in source)
- Check valid opcode
- Check flags x 2 (artifact of the API)
- Check specific opcode
- Switch on QP type (QP type usually fixed in source)
- Switch on opcode
- Check flags (flags may be fixed, or the app may already have taken the branches)
For each SGE:
- Check size
- Loop over length
- Check flags
Check for last request; other checks and branches, including loops. 100+ lines of C code on the path to HW.

Scalable Transfer Interfaces

Application-optimized code paths based on usage model:
- Optimize call(s) for a single work request: single data buffer or 2-entry SGL; still support more complex WR lists/SGLs
- Per-endpoint send/receive operations: separate RMA function calls
- Pre-configure data transfer flags: known before the post request; select the software path through the provider

SFI – Send Message

Generic send call: allocate WR; allocate SGE; format SGE (3 writes); format WR (6 writes); loop 1 checks (9 branches); loop 2 check; loop 3 checks (3 branches).
Optimized send call: direct call (3 writes); checks (2 branches); 25-30 lines of C code. (Sketched below.)
- Reduce setup cost: tighter data
- Eliminate loops and branches: remaining branches are predictable
- Selective optimization paths to HW: manual function expansion
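For contrast, a sketch of what the optimized path boils down to, using the fi_send call that libfabric later adopted (the signature at the time of this talk may have differed):

    /* Sketch: buffer, length, destination, and context go straight into the
     * call; no WR/SGE structures to allocate or clear, and transfer flags
     * were pre-configured when the endpoint was opened. */
    #include <rdma/fi_endpoint.h>

    ssize_t send_one(struct fid_ep *ep, const void *buf, size_t len,
                     void *desc, fi_addr_t dest, void *ctx)
    {
        return fi_send(ep, buf, len, desc, dest, ctx);
    }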

Completions

    struct ibv_wc {
        uint64_t            wr_id;
        enum ibv_wc_status  status;
        enum ibv_wc_opcode  opcode;
        uint32_t            vendor_err;
        uint32_t            byte_len;
        uint32_t            imm_data;
        uint32_t            qp_num;
        uint32_t            src_qp;
        int                 wc_flags;
        uint16_t            pkey_index;
        uint16_t            slid;
        uint8_t             sl;
        uint8_t             dlid_path_bits;
    };

- Only a few fields are application accessed, yet the provider must fill out all fields, even those ignored by the app
- The developer must determine which fields apply to their QP
- The app must check both the return code and the status field to determine whether a request completed successfully
- The single structure is 48 bytes, likely to cross a cacheline boundary
- The provider must handle all types of completions from any QP

Scalable Completion Interfaces

Application-optimized code paths based on usage model:
- Use compact data structures: only needed data is exchanged across the interface, limited to the fields required by the application; separate addressing from completion data; report errors 'out of band'
- Per-CQ operations: support multiple wait objects; allow the provider to optimize event signaling

SFI – Events

- read CQ (generic completion): Send: +4-6 writes, +2 branches; Recv: additional writes, +4 branches
- optimized CQ (op context): +1 write, +0 branches
- App selects the completion structure
- Support the provider updating counters
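As a hedged sketch of an app-selected completion structure, here is what the compact path looks like in the CQ API libfabric later standardized (format and call names are from that release; process() is a hypothetical app handler):

    #include <rdma/fi_domain.h>

    extern void process(void *op_context);  /* hypothetical app handler */

    void poll_contexts(struct fid_domain *domain)
    {
        struct fi_cq_attr attr = {
            .format = FI_CQ_FORMAT_CONTEXT,  /* only the field the app needs */
            .size   = 1024,
        };
        struct fid_cq *cq;
        struct fi_cq_entry entry;            /* just { void *op_context; } */

        fi_cq_open(domain, &attr, &cq, NULL);
        /* a successful read implies success; errors are reported out of
         * band via fi_cq_readerr() */
        while (fi_cq_read(cq, &entry, 1) == 1)
            process(entry.op_context);
    }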

App-centric

Is there anything else behind this proposal? I have two more puzzle pieces.

Application Interface Mismatch

- MVAPICH2-2.0rc1 (latest) code is used with the default configuration options (CH3:mrail)
- All userspace instructions are counted for a full execution of MPI_Isend; memory copies and locks are included in the component that uses them
- MVAPICH2 lib compile flags: '-O3 -DNDEBUG -ipo'
- App compile flags: '-O3 -DNDEBUG -ipo -finline-limit=… -no-inline-factor … -inline-max-per-routine=… -inline-max-per-compile=… -Bstatic -lmpich -Bdynamic -lopa -lmpl -libverbs -libumad -libmad -lrdmacm -lrt -lpthread'
- Instructions retired in MPI_Isend: connection lookup, memory registration checks, request formatting, etc.

Application-Centric Interfaces

- Collect application requirements
- Identify common, fast path usage models: too many use cases to optimize them all
- Build primitives around fabric services, not a device-specific interface

Reducing instruction count requires a better application impedance match.

Application-Centric Interfaces

Myth: app-centric interfaces imply more overhead.
- Poor implementations result in poor performance
- Difficult-to-use APIs are likely to result in poor implementations
- The provider knows the best method for accessing its HW
- These are still low-level interfaces (C), just not device interfaces (assembly)

Application Configured Interfaces

[Diagram] The app specifies its communication model: endpoint, communication type, capabilities, and data transfer flags (small message: inline send, send; large message RMA: write, read). The provider directs the app to the best API sets (message queue ops, RMA ops) for the NIC.

Extensible

What's the purple piece representing again?

Extensible Framework

Take growth into consideration.
- Reduce the effort to incorporate new application features: addition of new interfaces, structures, or fields; modification of existing functions
- Allow time to design new interfaces correctly: support prototyping interfaces prior to integration

Focus on longer-lived interfaces: software leading hardware.

Future Extensions

- Design the framework and APIs with anticipated capabilities: stage the delivery of features
- Documentation defines the supported usage models
- Use static inline calls to simplify application interactions with objects: convert the object-oriented model to a procedural model (sketched below)
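A sketch of that static-inline pattern, converting object-oriented dispatch into what reads as a procedural call (this mirrors how libfabric was eventually built; the member names here are illustrative assumptions):

    #include <stddef.h>
    #include <sys/types.h>

    struct endpoint;  /* forward declaration for the ops table */

    struct ep_msg_ops {
        ssize_t (*send)(struct endpoint *ep, const void *buf,
                        size_t len, void *ctx);
        /* new operations can be appended in later versions */
    };

    struct endpoint {
        struct ep_msg_ops *msg;  /* filled in by the provider at open time */
    };

    /* Apps call a plain function; the compiler inlines away the object. */
    static inline ssize_t ep_send(struct endpoint *ep, const void *buf,
                                  size_t len, void *ctx)
    {
        return ep->msg->send(ep, buf, len, ctx);
    }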

Claim

These concepts are necessary, not revolutionary: communication addressing, optimized data transfers, app-centric interfaces, future looking.
We want a solution where the pieces fit tightly together.

Thank you!