RDMA Bonding Liran Liss Mellanox Technologies. Agenda Introduction Transport-level bonding RDMA bonding design Recovering from failure Implementation.

Slides:



Advertisements
Similar presentations
Middleware Support for RDMA-based Data Transfer in Cloud Computing Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi Department of Electrical.
Advertisements

OFED TCP Port Mapper Proposal June 15, Overview Current NE020 Linux OFED driver uses host TCP/IP stack MAC and IP address for RDMA connections Hardware.
Chorus and other Microkernels Presented by: Jonathan Tanner and Brian Doyle Articles By: Jon Udell Peter D. Varhol Dick Pountain.
PlanetLab Operating System support* *a work in progress.
RDS and Oracle 10g RAC Update Paul Tsien, Oracle.
Fibre Channel over InfiniBand Dror Goldenberg Mellanox Technologies.
IWARP Update #OFADevWorkshop.
RoCEv2 Update from the IBTA
VIA and Its Extension To TCP/IP Network Yingping Lu Based on Paper “Queue Pair IP, …” by Philip Buonadonna.
Research Agenda on Efficient and Robust Datapath Yingping Lu.
Embedded Transport Acceleration Intel Xeon Processor as a Packet Processing Engine Abhishek Mitra Professor: Dr. Bhuyan.
COS 461: Computer Networks
 The Open Systems Interconnection model (OSI model) is a product of the Open Systems Interconnection effort at the International Organization for Standardization.
Virtual LANs. VLAN introduction VLANs logically segment switched networks based on the functions, project teams, or applications of the organization regardless.
IB ACM InfiniBand Communication Management Assistant (for Scaling) Sean Hefty.
New Direction Proposal: An OpenFabrics Framework for high-performance I/O apps OFA TAC, Key drivers: Sean Hefty, Paul Grun.
Windows Internet Connection Sharing Dave Eitelbach Program Manager Networking And Communications Microsoft Corporation.
SRP Update Bart Van Assche,.
Windows RDMA File Storage
Signature Verbs Extension Richard L. Graham. Data Integrity Field (DIF) Used to provide data block integrity check capabilities (CRC) for block storage.
NetworkProtocols. Objectives Identify characteristics of TCP/IP, IPX/SPX, NetBIOS, and AppleTalk Understand position of network protocols in OSI Model.
INSTALLING MICROSOFT EXCHANGE SERVER 2003 CLUSTERS AND FRONT-END AND BACK ‑ END SERVERS Chapter 4.
1 March 2010 A Study of Hardware Assisted IP over InfiniBand and its Impact on Enterprise Data Center Performance Ryan E. Grant 1, Pavan Balaji 2, Ahmad.
ACM 511 Chapter 2. Communication Communicating the Messages The best approach is to divide the data into smaller, more manageable pieces to send over.
LWIP TCP/IP Stack 김백규.
Slide 1 DESIGN, IMPLEMENTATION, AND PERFORMANCE ANALYSIS OF THE ISCSI PROTOCOL FOR SCSI OVER TCP/IP By Anshul Chadda (Trebia Networks)-Speaker Ashish Palekar.
High Performance Computing & Communication Research Laboratory 12/11/1997 [1] Hyok Kim Performance Analysis of TCP/IP Data.
LWIP TCP/IP Stack 김백규.
Introduction to Networks CS587x Lecture 1 Department of Computer Science Iowa State University.
Chapter Three Network Protocols By JD McGuire ARP Address Resolution Protocol Address Resolution Protocol The core protocol in the TCP/IP suite that.
InfiniBand Routing Solution Approach Yaron Haviv, CTO, Voltaire
Scalable name and address resolution infrastructure -- Ira Weiny/John Fleck #OFADevWorkshop.
 TCP (Transport Control Protocol) is a connection-oriented protocol that provides a reliable flow of data between two computers.  TCP/IP Stack Application.
High Availability through the Linux bonding driver
Scalable RDMA Software Solution Sean Hefty Intel Corporation.
Storage Interconnect Requirements Chen Zhao, Frank Yang NetApp, Inc.
Reconsidering Internet Mobility Alex C. Snoeren, Hari Balakrishnan, M. Frans Kaashoek MIT Laboratory for Computer Science.
SW REVERSE JEOPARDY Chapter 1 CCNA2 SW Start-up Routing table Routing table Router parts Router parts Choosing a path Choosing a path Addressing Pot.
Infiniband and RoCEE Virtualization with SR-IOV
InfiniBand support for Socket- based connection model by CM Arkady Kanevsky November 16, 2005 version 4.
Basic Routing Principles V1.2. Objectives Understand the function of router Know the basic conception in routing Know the working principle of router.
LRPC Firefly RPC, Lightweight RPC, Winsock Direct and VIA.
GLOBAL EDGE SOFTWERE LTD1 R EMOTE F ILE S HARING - Ardhanareesh Aradhyamath.
iSER update 2014 OFA Developer Workshop Eyal Salomon
OpenFabrics Interface WG A brief introduction Paul Grun – co chair OFI WG Cray, Inc.
ISER on InfiniBand (and SCTP). Problem Statement Currently defined IB Storage I/O protocol –SRP (SCSI RDMA Protocol) –SRP does not have a discovery or.
Internet Protocol Storage Area Networks (IP SAN)
Renesas Electronics America Inc. © 2010 Renesas Electronics America Inc. All rights reserved. Overview of Ethernet Networking A Rev /31/2011.
© 2008 Cisco Systems, Inc. All rights reserved.Cisco ConfidentialPresentation_ID 1 Chapter 6: Static Routing Routing Protocols.
STORAGE ARCHITECTURE/ MASTER): Where IP and FC Storage Fit in Your Enterprise Randy Kerns Senior Partner The Evaluator Group.
J. Liebeher (modified by M. Veeraraghavan) 1 Introduction Complexity of networking: An example Layered communications The TCP/IP protocol suite.
On Demand Paging (ODP) Update Liran Liss Mellanox Technologies.
Reliable Multicast (RMC) Liran Liss Mellanox Technologies Inc.
© 2007 EMC Corporation. All rights reserved. Internet Protocol Storage Area Networks (IP SAN) Module 3.4.
Under the Hood with NVMe over Fabrics
Represented BY:- Allauddin Ahmad.  What it is?  OSI model.  History.  Objectives.  Encapsulation and decapsulation.  Multiplexing and demultiplexing.
SC’13 BoF Discussion Sean Hefty Intel Corporation.
Agenda About us Why para-virtualize RDMA Project overview Open issues
Overview QEMU VM VMWare pvrdma Func1: RoCE Func0: ETH ETH
OSI model vs. TCP/IP MODEL
LWIP TCP/IP Stack 김백규.
Scaling the Network: The Internet Protocol
Fabric Interfaces Architecture – v4
Introduction to Networks
Virtual LANs.
Storage Virtualization
RoCEE in OFED Update Liran Liss, Mellanox Technologies March 15, 2010
CS703 - Advanced Operating Systems
Scaling the Network: The Internet Protocol
Ch 17 - Binding Protocol Addresses
Presentation transcript:

RDMA Bonding Liran Liss Mellanox Technologies

Agenda Introduction Transport-level bonding RDMA bonding design Recovering from failure Implementation March 30 – April 2, 2014#OFADevWorkshop2

Bonding (Link Aggregation) Bond together multiple physical links into a single aggregate logical link Motivation –Aggregate bandwidth (active-active) Distribute communication flows across all active links –High availability (active-backup) If a link goes down, reassign traffic to remaining links Can we do the same for HCAs? March 30 – April 2, 2014#OFADevWorkshop3

Link-level Bonding Example: Ethernet link aggregation Typically accomplished by a “Bonding” pseudo network interface Placed between the L3/4 stack and physical interfaces –Multiplexes packets across stateless network interfaces –Transparent to higher levels of the stack –Transport is implemented in SW RDMA challenge –Transport implemented at stateful network interfaces (HCAs) March 30 – April 2, 2014#OFADevWorkshop4 netdev1 netdev2 Bonding IP TCP UDP Sockets Application subnet1 Packets

Session-level Bonding Example: iSCSI Initiator establishes a session with Target –Session may comprise multiple TCP flows Connections are completely encapsulated within the iSCSI session –OS issues SCSI commands Alternatively, multiple sessions may be created to the same target/LUN –May be presented as single logical LUN by multi-path SW RDMA challenge –Transport connections visible to ULPs –Multiple RDMA consumers March 30 – April 2, 2014#OFADevWorkshop5 SCSI subsystem iSCSI HBA I-T Session TCP1 TCP2 netdev1 netdev2 SCSI CMDs subnet1

Idea: Transport-level Bonding Provided by a pseudo-HCA (vHCA) Applications open virtual resources –vPDs, vQPs, vSRQs, vCQs, vMRs –Mapped to physical resources by vHCA Namespace translated on the fly –Similar to transparent RDMA migration IBM/OSU “Nomad” paper VMware vRDMA Oracle live-migration prototype Link aggregation –Distribute QPs across HCAs –Optionally bond different HCA types –Upon failover Reconnect over a different device/port Continue traffic from the point of failure –Transparent migration is a special HA case March 30 – April 2, 2014#OFADevWorkshop6 subnet2subnet1 RDMA HAL and services Application IB HCA RoCE HCA Bonding HCA driver Verbs

Requirements Support aggregate across different physical HCAs –Optionally even different device types HW independent Bonding driver Strict semantics –Adhere to transport message ordering guarantees –Global visibility of all IO operations Transparent to consumers –Including failover events High performance March 30 – April 2, 2014#OFADevWorkshop7

Design User-space solution –Bond driver is a Verbs provider –Uses RDMACM internally To open connections Negotiate state using private data IP addressing –GID = IP –QPN = Port number –HCA identity = alias IP 1:1 virtual  physical QP mapping –Leverage HW ordering guarantees –Zero copy messages Fast path done in app context –Post_Send(), Post_Recv(), PollCQ() March 30 – April 2, 2014#OFADevWorkshop8 Vendor driver2 libibvers Vendor driver1 RDMA bond Kernel drivers rdmacm Application U K

RDMA Bond Object Relations (Example) March 30 – April 2, 2014#OFADevWorkshop9 vMR1 vPD1 HCA2 PD24 QP9 vQP1 vRQ Connection RDMA ID Connection RDMA ID Listener RDMA ID Listener RDMA ID MR2 HCA1 PD83 CQ69 MR2 vSQ vMR2 vQP2 vRQ Connection RDMA ID Connection RDMA ID Listener RDMA ID Listener RDMA ID vSQ QP3 CQ17 vCQ1 vRKeyRKey vRKeyRKey

Posting WRs If vQP is not in a suitable state or virtual queue is full –Return immediate error Enqueue WR in virtual Queue If associated HW Send / Receive queue is full –Return with success For Sends –If connection is not active Schedule (re)connection and return with success –For UD Resolve AH and remote QPN (if not already cached) –For RDMA Resolve RKey (if not already cached) For Receives –If connection is not active, return with success Translate local SGE Post to HW March 30 – April 2, 2014#OFADevWorkshop10

Polling Completions Poll next HW CQ associated with vCQ If not empty, process according to status –Case IBV_WC_RETRY_EXC_ERR Schedule reconnection for associated vQP Ignore completion –Case IBV_WC_WR_FLUSH_ERR Ignore completion –Case IBV_WC_SUCCESS Report successful completion –Default (any other error) Modify vQP to error Report erroneous completion Add corresponding virtual Queue to CQ error list Poll next virtual queue on error list If it has in-flight WQEs –Generate ERROR_FLUSH for next WQE Report CQ empty if none of the above applies March 30 – April 2, 2014#OFADevWorkshop11

RC Failure Recovery Re-establish connection –Over any active link and device Negotiate last committed operations –Generate corresponding completions Rewind physical queues –Resume operation March 30 – April 2, 2014#OFADevWorkshop12 Send Queue Receive Queue Receive Queue virtual producer Virtual consumer Physical producer virtual producer Virtual consumer Physical producer

RC Failure Recovery March 30 – April 2, 2014#OFADevWorkshop13 Send Queue Receive Queue Receive Queue virtual producer Virtual consumer Physical producer virtual producer Virtual consumer Physical producer Re-establish connection –Over any active link and device Negotiate last committed operations –Generate corresponding completions Rewind physical queues –Resume operation

RC Failure Recovery March 30 – April 2, 2014#OFADevWorkshop14 Send Queue Receive Queue Receive Queue virtual producer Virtual consumer virtual producer Virtual consumer Re-establish connection –Over any active link and device Negotiate last committed operations –Generate corresponding completions Rewind physical queues –Resume operation

RC Failure Recovery March 30 – April 2, 2014#OFADevWorkshop15 Send Queue Receive Queue Receive Queue virtual producer Virtual consumer virtual producer Virtual consumer Re-establish connection –Over any active link and device Negotiate last committed operations –Generate corresponding completions Rewind physical queues –Resume operation

RC Failure Recovery March 30 – April 2, 2014#OFADevWorkshop16 Send Queue Receive Queue Receive Queue virtual producer Virtual consumer Physical producer virtual producer Virtual consumer Physical producer Re-establish connection –Over any active link and device Negotiate last committed operations –Generate corresponding completions Rewind physical queues –Resume operation

Implementation (Ongoing) Current status –POC implementation –Supported objects CQs PDs RC QPs MRs –Supported operations Resource manipulation Send-receive data traffic –QPs limited to single link Tackle transient link failure Next steps –Complete Verbs coverage –RDMACM integration –Multi-link recovery Continuously negotiate active links –Aggregation schemes HA RR Static load balancing Dynamic load balancing March 30 – April 2, 2014#OFADevWorkshop17

Summary Bonding solution for stateful RDMA devices –HW agnostic –Aggregates ports from different devices –Communicating peers must run the Bonding driver Out-of-band protocol via CM MADs Supports –High availability –Aggregate BW –Load balancing –Transparent migration Efficient user-space implementation –Could be extended to the kernel in a similar manner March 30 – April 2, 2014#OFADevWorkshop18

#OFADevWorkshop Thank You