Towards Stateless RNIC for Data Center Networks

Towards Stateless RNIC for Data Center Networks
Pulin Pan, Guo Chen, Xizheng Wang, Huichen Dai, Bojie Li, Binzhang Fu, Kun Tan
Hunan University / Huawei

Hello everyone, my name is Guo Chen. I'm an Associate Professor at Hunan University. Today I'm going to talk about Towards Stateless RNIC for Data Center Networks. This is joint work with my collaborators at Hunan University and Huawei.

RDMA background

RDMA is becoming prevalent in DCN:
Low and stable latency, e.g., < 10us
High throughput, e.g., 100Gbps
Low CPU overhead
Widely deployed in companies such as Microsoft, Alibaba, ByteDance

Network stack in NIC:
Processing in dedicated NIC hardware
Bypass kernel
Zero copy

RDMA is becoming prevalent in DCN. Compared to TCP/IP, RDMA offers very low and stable latency, high throughput, and very low CPU utilization. With such high performance, RDMA has been widely deployed in DCNs, for example at Microsoft. The ultra-high performance of RDMA comes from implementing the whole network stack in the NIC. Stack processing is done by dedicated high-performance NIC hardware, without involving the CPU or OS kernel (which contribute most of the performance variance), and data is sent to and received from application buffers directly through the NIC.

Uniqueness of RDMA stack

Network stack on RDMA NIC (RNIC):
Maintain massive connection-related states on RNIC
Memory-access related: page size, WQ length, …
Networking related: congestion window, recv next, IP address, …
e.g., MLX maintains 256B of state for each RDMA connection (/include/linux/mlx4/qp.h, mlx4_qp_context)
Still growing…

The side effect of offloading the network stack onto the NIC is that massive state must be maintained on the RDMA NIC for stack processing. Specifically, for each RDMA connection, the NIC needs to maintain memory-access related states, including the page size, WQ length, etc., and networking related states, such as the congestion window, the next expected receive sequence, and the IP address. The slide illustrates the states that an RNIC must maintain to process one connection. In practice, an RDMA connection is operated through a queue pair, and its states are wrapped in a data structure called the queue pair context. This queue pair context keeps growing as stack functionality becomes more complicated; for mlx4, the QP context is 256B.
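For a concrete picture, here is a minimal sketch of the two classes of per-connection state listed above. This is NOT the actual mlx4_qp_context layout; the struct and field names are hypothetical, chosen only to mirror the slide's examples:

```c
#include <stdint.h>

/* Illustrative sketch only -- not the real mlx4_qp_context. Field names
 * are hypothetical; they show the two classes of state described above. */
struct rnic_conn_state {
    /* Memory-access related */
    uint32_t page_size;       /* page size of registered memory regions */
    uint32_t sq_len, rq_len;  /* work queue (WQ) lengths */
    uint64_t mem_table_base;  /* address-translation table for this QP */

    /* Networking related */
    uint32_t cwnd;            /* congestion window */
    uint32_t rcv_nxt;         /* next expected receive sequence number */
    uint32_t snd_nxt;         /* next sequence number to send */
    uint32_t peer_ip;         /* remote IP address */
    uint32_t peer_qpn;        /* remote queue pair number */
    /* ... real contexts hold much more: mlx4 keeps 256B per QP,
     *     and the size keeps growing as stack features are added. */
};
```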

States on RNIC limit the scalability

[Figure: connection states are cached in RNIC on-chip memory and backed by host memory; on a state miss the RNIC fetches the state through PCIe while receiving data/ACKs and sending packets out.]

NIC on-chip memory is scarce (e.g., several MB):
Serves as a cache for connection states
State miss on RNIC → fetch states from host memory through PCIe
Performance drops when the # of concurrent connections grows; PCIe latency becomes the bottleneck

While offloading the network stack gives RDMA ultra-high performance, the connection states limit the RNIC's scalability. Unlike host DRAM, NIC on-chip memory is very scarce, so it serves as a cache for connection states, and the full states of all connections are stored in host memory once there are too many connections for the NIC memory. Consequently, RNIC performance drops sharply under many concurrent connections: states miss on the NIC, and when sending or receiving data belonging to a connection whose state is not on the NIC, the NIC has to fetch the state from host memory, typically over the PCIe bus. Until the state arrives, stack processing stalls, since the NIC does not know how to process the packet; PCIe latency thus becomes the performance bottleneck. Our measurements on an MLX4 100G RNIC confirm this. We used multiple client RNICs WRITEing to one server RNIC over many concurrent connections: once the number of connections grows beyond 280, overall throughput drops dramatically and converges to only about 1/4 of the maximum. Meanwhile, we monitored the server's PCIe bus and found that the number of PCIe read events suddenly grows, verifying that the NIC is frequently fetching states from host memory.
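To make the stall concrete, here is a toy, hedged model of the state-cache lookup just described. The structures and the memcpy standing in for a PCIe DMA read are illustrative assumptions, not a real driver or hardware API:

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 280   /* roughly where throughput collapsed in our test */
struct conn_state { uint32_t qpn; uint8_t ctx[256]; }; /* 256B, as on mlx4 */

static struct conn_state nic_cache[CACHE_SLOTS]; /* toy NIC on-chip memory */
static struct conn_state host_mem[65536];        /* toy host DRAM */

/* Look up a connection's state (QPNs assumed nonzero in this toy model).
 * A hit is served at line rate from NIC memory. A miss stalls the pipeline
 * while the 256B context is fetched from host memory over PCIe -- modeled
 * here by a memcpy, costing about one PCIe round trip (~1us) per miss. */
struct conn_state *get_conn_state(uint32_t qpn)
{
    struct conn_state *slot = &nic_cache[qpn % CACHE_SLOTS];
    if (slot->qpn == qpn)
        return slot;                                     /* cache hit */
    memcpy(slot, &host_mem[qpn % 65536], sizeof *slot);  /* "PCIe" fetch */
    slot->qpn = qpn;                                     /* retag the slot */
    return slot;
}
```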

Can we directly solve this RNIC scalability problem?

Current status: applications require high performance under high concurrency, e.g.,
Distributed machine learning: parameter servers exchange parameters with many worker nodes
Web-search back-end services: result aggregators aggregate results from many document lookupers

Existing works try to avoid or mitigate the impact of the RNIC scalability issue, requiring careful and constrained usage by applications: using large memory pages [1], connection grouping [2], or unreliable datagram [3], …

We now know that the RNIC has a performance problem under high concurrency, caused by connection state misses. But current applications do require high RNIC performance under high concurrency. To name a few: in distributed machine learning systems, which are very popular now, parameter servers need to frequently exchange parameters and gradients with many worker nodes, and communication speed is critical to the whole system's performance. In web-search back-end services, a search task is distributed to many document-lookup nodes, which search concurrently; the results are then collected and aggregated by a few aggregator nodes and rendered to the user, so an aggregator must communicate efficiently with many lookup nodes. When applying RDMA to these applications, developers have to be very careful and often use RDMA verbs in very constrained ways to avoid or mitigate the RNIC scalability issue — for example, using large memory pages, connection grouping, or even unreliable datagrams. But until now, the RNIC scalability problem has not been solved. In this work we try to tackle it directly.

[1] FaRM: Fast Remote Memory, NSDI 2014
[2] Scalable RDMA RPC on Reliable Connection with Efficient Resource Sharing, EuroSys 2019
[3] FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs, OSDI 2016

StaR: Towards stateless RNIC

Moving states to the other communication side:
Maintain zero connection-related states, while all the RDMA data-plane processing is still done by NIC hardware

Utilizing the asymmetric communication pattern in DCN: often only one side has huge fan-in/fan-out traffic while the other side has only a few connections
Parameter servers and workers in distributed machine learning systems
Result aggregators and document lookupers in web-search back-end services
…

We solve this dilemma with StaR, which keeps the network stack offloaded on the RNIC yet makes the RNIC stateless. This may sound impossible at first; our trick is to move the stack states to the other side of the communication.

StaR overview

[Figure: the stateless (server) side NIC processes each packet based only on information carried in the packet, receiving packets/ACKs and DMAing data or notifying apps; the stateful (client) side maintains, alongside its own Conn/SQ/RQ/CQ and local net states and mem table, the server's SQ/RQ/CQ, remote net states, and mem table, and embeds the server's states into every packet/ACK it sends.]
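As an illustration of "embed server states", here is a hypothetical sketch of what a StaR packet header might carry so the stateless server NIC can process each packet in isolation. The field names and layout are assumptions for this sketch, not the paper's wire format:

```c
#include <stdint.h>

/* Hypothetical StaR packet header. The stateful client, which tracks the
 * server's queues, net states, and mem table on the server's behalf,
 * embeds everything the stateless server NIC needs for this one packet. */
struct star_pkt_hdr {
    /* DMA info: where on the server this payload should land */
    uint64_t dst_addr;   /* server buffer address, from the client-held mem table */
    uint32_t rkey;       /* memory protection key */
    uint32_t length;     /* payload length */

    /* Network transmission info: echoed back in the ACK */
    uint32_t psn;        /* packet sequence number */
    uint32_t ack_psn;    /* cumulative ACK for the reverse direction */

    /* Completion info: where to place WQE/CQE updates (WQEP/CQEP) */
    uint64_t cqe_addr;   /* host address for the completion entry */
};
```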

Stateless NIC processing — Example #1: Client SEND, Server RECV

[Figure: the client app posts a SEND; the server app posts a RECV. The stateful client NIC gets the DMA info from its copy of the server's mem table and WQE pointer (WQEP), and encapsulates both the DMA info and the network transmission info into the data packet. The stateless server NIC processes the packet using only the embedded info: it DMAs the payload, places the completion via the embedded CQE pointer (CQEP), and generates the ACK from the packet's header info.]
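A hedged sketch of the server-side flow just described, continuing the star_pkt_hdr sketch above. The stub functions stand in for NIC hardware engines and are illustrative names only:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t host_dram[1 << 20];      /* toy model of server host memory */

static void dma_to_host(uint64_t addr, const void *src, uint32_t len)
{
    memcpy(&host_dram[addr], src, len); /* stand-in for the NIC DMA engine */
}

static void post_cqe(uint64_t cqe_addr, uint32_t len)
{
    /* Place the completion at the host address (CQEP) carried in the
     * packet, so the app's RECV completes. */
    printf("CQE @%llu: %u bytes\n", (unsigned long long)cqe_addr, len);
}

static void send_ack(uint32_t psn)
{
    printf("ACK psn=%u\n", psn);        /* ACK built from header info only */
}

/* Stateless receive path: every decision comes from fields the stateful
 * client embedded; the server NIC consults no per-connection state. */
void star_server_recv(const struct star_pkt_hdr *h, const void *payload)
{
    dma_to_host(h->dst_addr, payload, h->length);
    post_cqe(h->cqe_addr, h->length);
    send_ack(h->psn);
}
```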

Stateless NIC processing — Example #2: Server WRITE

[Figure: the server app posts a WRITE, but the stateless server NIC holds no states with which to build the packet. The stateful client NIC gets the DMA info via its copy of the server's mem table and WQE pointer (WQEP), encapsulates the DMA info and network transmission info into a GD packet, and sends it to the server. The server NIC takes the header info from the GD packet and emits the data packet.]
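A similar hedged sketch of the WRITE flow. struct star_gd_hdr, its fields, and the stubs are assumptions continuing the sketches above; the slide itself only names the GD packet:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed header for the GD packet in Example #2: the stateful client,
 * which tracks the server's WQEs, tells the stateless server NIC where
 * to read and how to stamp the outgoing data packet. */
struct star_gd_hdr {
    uint64_t src_addr;  /* server-local buffer to read from */
    uint32_t length;
    uint32_t psn;       /* network transmission info for the data packet */
};

static uint8_t server_mem[1 << 20];   /* toy model of server host memory */

static void send_data_packet(uint32_t psn, const void *buf, uint32_t len)
{
    printf("DATA psn=%u len=%u\n", psn, len);  /* stub for the TX engine */
    (void)buf;
}

/* The server app posted a WRITE; the NIC itself holds no state, so it
 * simply reads the local data and emits the packet using only the
 * header info carried by the GD packet. */
void star_server_write(const struct star_gd_hdr *gd)
{
    send_data_packet(gd->psn, &server_mem[gd->src_addr], gd->length);
}
```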

Security issue

Without any states, the RNIC cannot conduct security checks on received packets, which may access illegal memory addresses or trigger malicious traffic.

⇒ Conduct the security check on the client side!

[Figure: on the stateful (client) side, the network stack generates a white list for a security module that checks every packet before it is sent; only trustable packets reach the stateless (server) side for stateless processing.]
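A minimal sketch of such a client-side security module, assuming the white list is a set of permitted server memory regions; the names and structure are illustrative, not the paper's design:

```c
#include <stdbool.h>
#include <stdint.h>

struct mem_region { uint64_t base, len; };

/* White list of server memory regions this client may touch, generated
 * by the network stack (e.g., when server memory is registered). */
static struct mem_region white_list[128];
static int white_list_cnt;

/* Check an outgoing packet's DMA target before transmission. Only
 * packets that pass reach the stateless server, which trusts them. */
bool security_check(uint64_t addr, uint32_t len)
{
    for (int i = 0; i < white_list_cnt; i++) {
        const struct mem_region *r = &white_list[i];
        if (addr >= r->base && addr + len <= r->base + r->len)
            return true;  /* trustable: falls inside an allowed region */
    }
    return false;         /* drop: would access illegal server memory */
}
```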

Stateless processing vs. normal RNIC

An RNIC should saturate the link bandwidth. Both the StaR RNIC and a normal RNIC require a short data-packet buffer to fill the pipeline, covering the delay of processing one packet. However, a normal RNIC additionally requires a connection state buffer, large enough to store the connection states of all packets in the pipeline — which consumes a lot of memory when data packets are small. The StaR RNIC requires no connection state buffer!
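A back-of-the-envelope sizing example of this difference, using the talk's 100Gbps link plus assumed numbers (1us processing delay, 64B small packets, 256B per-connection context):

```c
#include <stdio.h>

/* Illustrative buffer sizing: how much buffering each design needs to
 * keep a 100Gbps pipeline full across a 1us per-packet processing
 * delay. The delay and packet size are assumptions for this sketch. */
int main(void)
{
    double link_bps  = 100e9;  /* 100 Gbps link */
    double delay_s   = 1e-6;   /* assumed pipeline/processing delay */
    double pkt_bytes = 64;     /* small data packets (worst case) */
    double state_b   = 256;    /* per-connection context, as on mlx4 */

    double inflight_bytes = link_bps / 8 * delay_s;     /* 12.5 KB */
    double inflight_pkts  = inflight_bytes / pkt_bytes; /* ~195 packets */
    double state_bytes    = inflight_pkts * state_b;    /* ~50 KB */

    printf("packet buffer:     %.1f KB (both designs)\n",
           inflight_bytes / 1e3);
    printf("conn state buffer: %.1f KB (normal RNIC only; StaR: 0)\n",
           state_bytes / 1e3);
    return 0;
}
```

With small packets, the state buffer is roughly 4x the packet buffer, which is exactly the memory StaR eliminates.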

Performance evaluation

Preliminary simulation in NS3. Scenario #1: stress test.
Multiple clients continuously WRITE 8B data to the server (stateless side), 1 outstanding WRITE at any moment
100Gbps link, 12us RTT, 1us PCIe latency, NIC memory holding 300 connection states

Result: 160x throughput improvement.

Performance evaluation

Scenario #2: RPC application.
Multiple clients continuously call a remote procedure on one server (stateless side): a 2.8KB RPC request (through SEND/RECV), then a 1.4KB RPC response (through SEND/RECV) after receiving the request
100Gbps link, 12us RTT, 1us PCIe latency, NIC memory holding 300 connection states

Result: 4x throughput improvement.

Implementation (ongoing)

FPGA-based smart NIC: Xilinx FPGA board, 4x SFP+ (10Gbps), PCIe 3.0 x8.


Wrap-up

RDMA achieves high performance by offloading the network stack onto the RNIC, but the states on the RNIC limit its scalability.
StaR makes the RNIC stateless by moving states to the other side:
Utilizing the asymmetric traffic pattern in DCN
Tracking application operations (WQEs) and transmission states on the other side
Ensuring security of the stateless side by controlling the traffic sent out on the stateful side
The StaR RNIC breaks the scalability limitation, which may enable cooler RDMA applications.

Q&A Thanks!