RDMA over Commodity Ethernet at Scale

Slides:



Advertisements
Similar presentations
Logically Centralized Control Class 2. Types of Networks ISP Networks – Entity only owns the switches – Throughput: 100GB-10TB – Heterogeneous devices:
Advertisements

Evolution Series Edge Release R3A00 Bridging Societies.
OpenFlow overview Joint Techs Baton Rouge. Classic Ethernet Originally a true broadcast medium Each end-system network interface card (NIC) received every.
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers Chuanxiong Guo1, Guohan Lu1, Dan Li1, Haitao Wu1, Xuan Zhang2,
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers Chuanxiong Guo1, Guohan Lu1, Dan Li1, Haitao Wu1, Xuan Zhang2,
Chuanxiong Guo, Haitao Wu, Kun Tan,
CS335 Networking & Network Administration Tuesday, April 20, 2010.
03/12/08Nuova Systems Inc. Page 1 TCP Issues in the Data Center Tom Lyon The Future of TCP: Train-wreck or Evolution? Stanford University
ICTCP: Incast Congestion Control for TCP in Data Center Networks∗
Microsoft Virtual Academy Module 4 Creating and Configuring Virtual Machine Networks.
Practical TDMA for Datacenter Ethernet
The NE010 iWARP Adapter Gary Montry Senior Scientist
Srihari Makineni & Ravi Iyer Communications Technology Lab
VL2: A Scalable and Flexible Data Center Network Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David.
Ethernet. Ethernet standards milestones 1973: Ethernet Invented 1983: 10Mbps Ethernet 1985: 10Mbps Repeater 1990: 10BASE-T 1995: 100Mbps Ethernet 1998:
Switching Topic 2 VLANs.
Cisco Confidential © 2013 Cisco and/or its affiliates. All rights reserved. 1 Cisco Networking Training (CCENT/CCT/CCNA R&S) Rick Rowe Ron Giannetti.
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo,
IEEE Inter Networking Shortest Path Bridging Security Audio/Video Bridging DCB (802.1) Task Group PFC 802.1Qbb ETS 802.1Qaz DCBX 802.1Qaz QCN 802.1Qau.
© 2006 Cisco Systems, Inc. All rights reserved.Cisco Public 1 VLANs.
An open source user space fast path TCP/IP stack and more…
Basic Edge Core switch Training for Summit Communication.
Advisor: Hung Shi-Hao Presenter: Chen Yu-Jen
ICTCP: Incast Congestion Control for TCP in Data Center Networks By: Hilfi Alkaff.
Ethernet Packet Filtering – Part 2 Øyvind Holmeide 10/28/2014 by.
Ethernet Packet Filtering - Part1 Øyvind Holmeide Jean-Frédéric Gauvin 05/06/2014 by.
Network Virtualization Ben Pfaff Nicira Networks, Inc.
VL2: A Scalable and Flexible Data Center Network
Advanced Network Tap application for
COS 561: Advanced Computer Networks
Ready-to-Deploy Service Function Chaining for Mobile Networks
RDMA Deployments: From Cloud Computing to Machine Learning
Youngstown State University Cisco Regional Academy
CCNA Practice Exam Questions
HULA: Scalable Load Balancing Using Programmable Data Planes
5/3/2018 3:51 AM Memory Efficient Loss Recovery for Hardware-based Transport in Datacenter Yuanwei Lu1,2, Guo Chen2, Zhenyuan Ruan1,2, Wencong Xiao2,3,
Link Layer 5.1 Introduction and services
Selecting Unicast or Multicast Mode
ETHANE: TAKING CONTROL OF THE ENTERPRISE
Advanced Computer Networks
HyGenICC: Hypervisor-based Generic IP Congestion Control for Virtualized Data Centers Conference Paper in Proceedings of ICC16 By Ahmed M. Abdelmoniem,
Scaling the Network: The Internet Protocol
Network Performance - Theory
Network Data Plane Part 2
Chapter 6: Network Layer
Chapter 4 Data Link Layer Switching
Transport Protocols over Circuits/VCs
Addressing: Router Design
Braindumps4IT Braindumps Ream Exam Questions Answers
Multi-PCIe socket network device
Final Review CS144 Review Session 9 June 4, 2008 Derrick Isaacson
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers Chuanxiong Guo1, Guohan Lu1, Dan Li1, Haitao Wu1, Xuan Zhang2,
What’s “Inside” a Router?
11/13/ :11 PM Memory Efficient Loss Recovery for Hardware-based Transport in Datacenter Yuanwei Lu1,2, Guo Chen2, Zhenyuan Ruan1,2, Wencong Xiao2,3,
Routing and Switching Essentials v6.0
Chuanxiong Guo, Haitao Wu, Kun Tan,
Toward Effective and Fair RDMA Resource Sharing
Switching, routing, and flow control in interconnection networks
Bridges and Extended LANs
NTHU CS5421 Cloud Computing
Open vSwitch HW offload over DPDK
Scaling the Network: The Internet Protocol
LAN switching and Bridges
Requirements Definition
Ch 17 - Binding Protocol Addresses
Myrinet 2Gbps Networks (
Bridges Neil Tang 10/10/2008 CS440 Computer Networks.
Towards Hyperscale HPC & RDMA
Towards Stateless RNIC for Data Center Networks
Chapter 4: outline 4.1 Overview of Network layer data plane
Presentation transcript:

RDMA over Commodity Ethernet at Scale Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn ACM SIGCOMM 2016 August 24 2016

Outline RDMA/RoCEv2 background DSCP-based PFC Safety challenges RDMA transport livelock PFC deadlock PFC pause frame storm Slow-receiver symptom Experiences and lessons learned Related work Conclusion

RDMA/RoCEv2 background RDMA addresses TCP’s latency and CPU overhead problems RDMA: Remote Direct Memory Access RDMA offloads the transport layer to the NIC RDMA needs a lossless network RoCEv2: RDMA over commodity Ethernet DCQCN for connection-level congestion control PFC for hop-by-hop flow control RDMA app RDMA app User DMA RDMA verbs RDMA verbs TCP/IP TCP/IP Kernel NIC driver NIC driver RDMA transport RDMA transport DMA Hardware IP IP Ethernet Ethernet Lossless network

Priority-based flow control (PFC) Egress port Ingress port Data packet p0 Hop-by-hop flow control, with eight priorities for HOL blocking mitigation The priority in data packets is carried in the VLAN tag PFC pause frame to inform the upstream to stop p0 p1 p7 p0 p1 p7 PFC pause frame p1 XOFF threshold Data packet PFC pause frame

DSCP-based PFC TOR Switch Issues of VLAN-based PFC DSCP-based PFC It breaks PXE boot No standard way for carrying VLAN tag in L3 networks DSCP-based PFC DSCP field for carrying the priority value No change needed for the PFC pause frame Supported by major switch/NIC venders Trunk mode NIC No VLAN tag when PXE boot Data packet PFC pause frame

Outline RDMA/RoCEv2 background DSCP-based PFC Safety challenges RDMA transport livelock PFC deadlock PFC pause frame storm Slow-receiver symptom Experiences and lessons learned Related work Conclusion

RDMA transport livelock Sender Receiver Receiver Sender Go-back-N RDMA Send 0 RDMA Send 1 RDMA Send N+1 NAK N RDMA Send N RDMA Send N+2 RDMA Send 0 RDMA Send 1 RDMA Send N+1 NAK N RDMA Send 2 RDMA Send N+2 Go-back-0 Switch Pkt drop rate 1/256 Sender Receiver

PFC deadlock Our data centers use Clos network Packets first travel up then go down No cyclic buffer dependency for up-down routing -> no deadlock But we did experience deadlock! Spine Podset Leaf Pod ToR Servers

PFC deadlock Preliminaries ARP table: IP address to MAC address mapping MAC table: MAC address to port mapping If MAC entry is missing, packets are flooded to all ports Input ARP table IP MAC TTL IP0 MAC0 2h IP1 MAC1 1h MAC table Dst: IP1 MAC Port TTL MAC0 Port0 10min MAC1 - Output

PFC deadlock Path: {S1, T0, La, T1, S3} La Lb Path: {S4, T1, Lb, T0, S2} PFC pause frames 2 4 3 1 Congested port p2 p3 p3 p4 Ingress port T0 T1 p0 p1 Egress port p0 p1 p2 Dead server Packet drop PFC pause frames Server S1 S2 S3 S4 S5

PFC deadlock The PFC deadlock root cause: the interaction between the PFC flow control and the Ethernet packet flooding Solution: drop the lossless packets if the ARP entry is incomplete Recommendation: do not flood or multicast for lossless traffic Call for action: more research on deadlocks

NIC PFC pause frame storm A malfunctioning NIC may block the whole network PFC pause frame storms caused several incidents Solution: watchdogs at both NIC and switch sides to stop the storm Spine layer Podset 0 Podset 1 Leaf layer ToRs 1 2 3 4 5 6 7 servers Malfunctioning NIC

The slow-receiver symptom Server ToR to NIC is 40Gb/s, NIC to server is 64Gb/s But NICs may generate large number of PFC pause frames Root cause: NIC is resource constrained Mitigation Large page size for the MTT (memory translation table) entry Dynamic buffer sharing at the ToR CPU DRAM PCIe Gen3 8x8 64Gb/s QSFP 40Gb/s MTT WQEs QPC Pause frames ToR NIC

Outline RDMA/RoCEv2 background DSCP-based PFC Safety challenges RDMA transport livelock PFC deadlock PFC pause frame storm Slow-receiver symptom Experiences and lessons learned Related work Conclusion

Latency reduction RoCEv2 deployed in Bing world-wide for one and half years Significant latency reduction Incast problem solved as no packet drops

RDMA throughput Using two podsets each with 500+ servers 5Tb/s capacity between the two podsets Achieved 3Tb/s inter-podset throughput Bottlenecked by ECMP routing Close to 0 CPU overhead

Latency and throughput tradeoff us L0 T0 L1 T1 S0,0 S0,23 S1,0 S1,23 RDMA latencies increase as data shuffling started Low latency vs high throughput Before data shuffling During data shuffling

Lessons learned Deadlock, livelock, PFC pause frames propagation and storm did happen Be prepared for the unexpected Configuration management, latency/availability, PFC pause frame, RDMA traffic monitoring NICs are the key to make RoCEv2 work Loss vs lossless: Is lossless needed?

Related work Infiniband iWarp Deadlock in lossless networks TCP perf tuning vs. RDMA

Conclusion RoCEv2 has been running safely in Microsoft data centers for one and half years DSCP-based PFC which scales RoCEv2 from L2 to L3 Various safety issues/bugs (livelock, deadlock, PFC pause storm, PFC pause propagation) can all be addressed Future work RDMA for inter-DC communications Understanding of deadlocks in data centers Lossless, low-latency and high-throughput networking Applications adoption