www.openfabrics.org Reliable Multicast (RMC) Liran Liss Mellanox Technologies Inc.

2 Agenda
- Introduction
- Model
- ConnectX RMC Implementation
- Semantics
- API
- Setup and operation
- Scalability
- Future work

3 Introduction
- RMC is a model that establishes multicast communication using the reliable connection (RC) service in InfiniBand fabrics
- Guarantees reliable, in-order delivery of multi-packet messages
- Currently defined for channel semantics (send-receive)
  - Can be enhanced to support RDMA-W
- Example applications:
  - Distributed analysis of massive amounts of data
  - Scaling online trading, live news and video distribution
  - Speeding up high-performance MPI collective operations

4 Model
- Single sender / multiple receivers
  - Multiple receivers can exist on the same host
  - Multiple senders achieved using multiple RMC groups
  - Does not provide total-ordering
- RMC group members are fixed
  - Not a complex group-communication protocol
- Main idea
  - RC transport with an MGID destination
    - Standard Send packet
  - Sent packets are duplicated by switches
  - Acks are aggregated by the sender
  - No changes in switch behavior

5 Model – continued
[Figure: one sender, a switch, and three responders. The sender (LID 0) holds RMC Parent QPa, which owns the SQ, plus RMC Child QPs b, c and d. RMC Responder QPs x, y and z, each with its own RQ, sit at LIDs 1, 2 and 3. Parent QPa addresses RQP:0xffffff with DLID:MLID. The child QPs address RQP:x DLID:1, RQP:y DLID:2 and RQP:z DLID:3. The responder QPs address RQP:b, RQP:c and RQP:d, all with DLID:0.]
- Each RMC group requires a unique MGID
- RMC Parent allows an 0xffffff RQP
- RMC responder skips DestQP match

6 ConnectX RMC Implementation
- RMC Parent QP
  - Owns the SQ
  - Aggregates acks from children in HW
  - Reports SEND completions
  - Retries sends on timeout
    - Normal RC behavior
- Child QP
  - Provides a context for receiving acks from a single responder
  - Reports acks to parent
- Responder QP
  - Virtually connected
  - Accepts MC packets and sends RC acks as usual
  - Reports RECV completions

7 Semantics
- Send WQEs are completed only if all responders have acknowledged
- Receive WQEs are completed as usual
  - Messages are delivered independently of other responders
- Any single responder that ceases to reply will eventually cause the sender QP to transition into error state
  - All posted WQEs that have not completed will be flushed
  - A subset of these WQEs may have been delivered to some of the responders
  - This subset is not reported
  - Active responders are not notified
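The completion rule above can be pictured with a small software model. This is only a sketch of the bookkeeping, assuming one outstanding send and at most 64 responders; the slides say the aggregation is done in ConnectX hardware, and every name below (`rmc_send_state`, `rmc_child_ack`, and so on) is hypothetical rather than part of any driver or library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMC_MAX_CHILDREN 64  /* hypothetical limit for this toy model */

/* Toy model of a parent QP tracking a single outstanding send WQE. */
struct rmc_send_state {
    int      num_children;  /* N responders, one child QP each */
    uint64_t acked_mask;    /* bit i is set once child i has seen an ack */
    bool     in_error;      /* set when some responder stopped replying */
};

void rmc_send_init(struct rmc_send_state *s, int num_children)
{
    assert(num_children > 0 && num_children <= RMC_MAX_CHILDREN);
    s->num_children = num_children;
    s->acked_mask = 0;
    s->in_error = false;
}

/* Child i reports an ack to the parent. Duplicate acks are harmless. */
void rmc_child_ack(struct rmc_send_state *s, int child)
{
    assert(child >= 0 && child < s->num_children);
    s->acked_mask |= (uint64_t)1 << child;
}

/* The send completes only when every responder has acknowledged. */
bool rmc_send_complete(const struct rmc_send_state *s)
{
    uint64_t all = (s->num_children == 64)
                 ? UINT64_MAX
                 : (((uint64_t)1 << s->num_children) - 1);
    return !s->in_error && s->acked_mask == all;
}

/* Retries ran out for some responder: the sender QP enters the error
 * state and the outstanding WQE is flushed. */
void rmc_retry_exhausted(struct rmc_send_state *s)
{
    s->in_error = true;
}

/* True if a flushed send was nevertheless acknowledged by at least one
 * responder. The model can see this; per the slide, the application is
 * not told which responders these were. */
bool rmc_partially_delivered(const struct rmc_send_state *s)
{
    return s->in_error && s->acked_mask != 0;
}
```

With three responders, acks from children 0 and 2 leave the send incomplete; it completes only once child 1 also reports. If child 1 never replies, the send is flushed even though two responders already received the message.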

8 API
- Userspace only (at the moment)
- That’s it!

    --- libibverbs.orig/include/infiniband/verbs.h
    +++ libibverbs/include/infiniband/verbs.h
    @@ -401,7 +401,9 @@ enum ibv_qp_type {
         IBV_QPT_RC = 2,
         IBV_QPT_UC,
         IBV_QPT_UD,
    -    IBV_QPT_XRC
    +    IBV_QPT_XRC,
    +    IBV_QPT_RMC_PAR,
    +    IBV_QPT_RMC_CHILD,
    +    IBV_QPT_RMC_RESP
     };

     struct ibv_qp_cap {
    @@ -421,6 +423,8 @@ struct ibv_qp_init_attr {
         enum ibv_qp_type        qp_type;
         int                     sq_sig_all;
         struct ibv_xrc_domain  *xrc_domain;
    +    int                     num_rmc_children;
    +    uint32_t                rmc_par_qp_num;
     };

9 RMC setup
- Assume MGID ‘M’ and ‘N’ responders
- Sender
  - Create parent QP and modify to RTS
    - QP type: IBV_QPT_RMC_PAR
    - num_rmc_children: N
  - Create child QPs (one per responder)
    - QP type: IBV_QPT_RMC_CHILD
    - rmc_par_qp_num:
  - Join (create) M
- Responder(s)
  - Create responder QP and modify to RTR
    - QP type: IBV_QPT_RMC_RESP
    - Initial PSN must match sender
  - Attach responder QP and join M
- End-to-end flow control must be disabled on all QPs
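As a rough illustration of how the new init-attribute fields might be filled in for each QP role, the sketch below uses stand-in types that mirror only the fields named in the patch. They are deliberately not the real libibverbs definitions, since the RMC QP types exist only in the proposed patch, and the `sketch_` names and helper functions are hypothetical. Setting `rmc_par_qp_num` to the parent's QP number is an assumption based on the field name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins mirroring the patched verbs.h; NOT the real libibverbs types. */
enum sketch_qp_type {
    SKETCH_QPT_RMC_PAR,    /* mirrors IBV_QPT_RMC_PAR */
    SKETCH_QPT_RMC_CHILD,  /* mirrors IBV_QPT_RMC_CHILD */
    SKETCH_QPT_RMC_RESP    /* mirrors IBV_QPT_RMC_RESP */
};

struct sketch_qp_init_attr {
    enum sketch_qp_type qp_type;
    int                 num_rmc_children;
    uint32_t            rmc_par_qp_num;
};

/* Sender side: the parent QP declares how many children it aggregates,
 * one per responder (N on the slide). */
struct sketch_qp_init_attr rmc_parent_attr(int num_responders)
{
    struct sketch_qp_init_attr a;
    assert(num_responders > 0);
    memset(&a, 0, sizeof a);
    a.qp_type = SKETCH_QPT_RMC_PAR;
    a.num_rmc_children = num_responders;
    return a;
}

/* Sender side: each child QP refers back to its parent.
 * Assumption: the reference is the parent's QP number. */
struct sketch_qp_init_attr rmc_child_attr(uint32_t parent_qp_num)
{
    struct sketch_qp_init_attr a;
    memset(&a, 0, sizeof a);
    a.qp_type = SKETCH_QPT_RMC_CHILD;
    a.rmc_par_qp_num = parent_qp_num;
    return a;
}

/* Responder side: only the QP type differs from an ordinary QP here. */
struct sketch_qp_init_attr rmc_responder_attr(void)
{
    struct sketch_qp_init_attr a;
    memset(&a, 0, sizeof a);
    a.qp_type = SKETCH_QPT_RMC_RESP;
    return a;
}
```

In a real program these values would presumably be set on `struct ibv_qp_init_attr` before the QP is created, with the parent created first so that its QP number is known when the children are created.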

10 RMC Operation
- Initialization
  - Set up parent and child QPs
  - Set up responder QPs
  - Prepost receive WQEs to responder QPs
    - Flow control is application responsibility (E2E credits are disabled)
  - Synchronize between sender and responder(s)
- Sender
  - Post Send WQEs to parent QP (ibv_post_send)
  - Detect completions on CQ associated with parent QP
- Receiver
  - Post Receive WQEs to responder QPs (ibv_post_recv)
  - Detect completions on associated CQs
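Because end-to-end credits are disabled, the application has to make sure every responder has a receive WQE posted before the sender posts a send. The slides do not prescribe a scheme; one possible approach, sketched below, is a per-responder credit counter kept by the sender and replenished through some out-of-band channel. All names (`fc_window`, `fc_on_credit_update`, and so on) and the bound on the number of responders are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define FC_MAX_RESPONDERS 16  /* hypothetical bound for this sketch */

/* Sender-side view of how many receive WQEs each responder has posted. */
struct fc_window {
    int num_responders;
    int credits[FC_MAX_RESPONDERS];
};

/* Every responder preposts the same number of receive WQEs at setup. */
void fc_init(struct fc_window *w, int num_responders, int preposted)
{
    assert(num_responders > 0 && num_responders <= FC_MAX_RESPONDERS);
    assert(preposted >= 0);
    w->num_responders = num_responders;
    for (int i = 0; i < num_responders; i++)
        w->credits[i] = preposted;
}

/* A multicast send consumes a receive WQE at every responder, so the
 * slowest responder bounds the sender. */
bool fc_can_send(const struct fc_window *w)
{
    for (int i = 0; i < w->num_responders; i++)
        if (w->credits[i] == 0)
            return false;
    return true;
}

/* Call before posting a send WQE; returns false if the send must wait. */
bool fc_on_send(struct fc_window *w)
{
    if (!fc_can_send(w))
        return false;
    for (int i = 0; i < w->num_responders; i++)
        w->credits[i]--;
    return true;
}

/* A responder reports, out of band, that it reposted receive WQEs. */
void fc_on_credit_update(struct fc_window *w, int responder, int reposted)
{
    assert(responder >= 0 && responder < w->num_responders);
    assert(reposted >= 0);
    w->credits[responder] += reposted;
}
```

The sender would call `fc_on_send` before each `ibv_post_send` on the parent QP, and each receiver would report how many buffers it reposted after calling `ibv_post_recv`.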

11 Scalability
- Resource utilization
  - Each MC tree uses a unique GID
  - Each MC tree uses N QPs at the sender
    - Can be alleviated using an MC tree hierarchy
- All-to-all RMC
  - N RMC trees
    - Each host handles N 2 QPs and N MGIDs
    - Suitable for small groups only
  - Hierarchical RMC trees
  - Single-sender
    - Dedicated node dispatches MC messages on behalf of others

12 Future Work
- Abstract setup and connection establishment
  - CMA support
- Extend to All-to-all (multiple RMC setup)
- Expose to kernel API
- Add RDMA-W support

13 Summary
- RMC is an efficient mechanism for distributing large amounts of data to multiple hosts
  - Efficient network utilization (switch replication)
  - Minimal SW overheads
- Supported by IB architecture with minor host-side modifications
  - Implemented in ConnectX HW
  - API patches to be submitted for review soon
