Memory Efficient Loss Recovery for Hardware-based Transport in Datacenter
Yuanwei Lu1,2, Guo Chen2, Zhenyuan Ruan1,2, Wencong Xiao2,3, Bojie Li1,2, Jiansong Zhang2, Yongqiang Xiong2, Peng Cheng2, Enhong Chen1
1USTC, 2Microsoft Research, 3BUAA

Existing Hardware-based Transports
RDMA (RoCEv2): high throughput, low latency, and zero CPU overhead; uses Go-Back-N loss recovery and requires PFC.
Light-weight Transport Layer (LTL) [1]: inter-FPGA communication.
[1] Adrian M. Caulfield, et al. A cloud-scale acceleration architecture. MICRO 2016.

Loss is Not Uncommon in DCs
Loss sources:
Congestion loss: can be eliminated by PFC.
Failure loss: hard to mitigate at large scale.
Statistics from real DCs [2]:
A persistent 0.2% silent random drop that lasted for 2 days.
The loss ratio of some network switches can be up to 2%.
[2] Chuanxiong Guo, et al. Pingmesh: A large-scale system for data center network latency measurement and analysis. ACM SIGCOMM 2015.

Inefficient Loss Recovery: Go-Back-N
With Go-Back-N, a single loss forces retransmission of the entire window that follows it, so throughput drops to near zero once the loss ratio exceeds 0.1%.
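To make the Go-Back-N penalty concrete, here is a toy Monte Carlo comparison (not from the talk; it ignores timeouts and rate control, and the window size and loss model are assumptions) that counts how many wire transmissions each scheme needs per delivered packet under random loss:

```cpp
#include <algorithm>
#include <cstdio>
#include <random>

int main() {
    const long long N = 1'000'000;  // packets to deliver
    const int W = 1250;             // ~one BDP of packets (100G * 100us, 1KB MTU)
    const double p = 0.001;         // 0.1% i.i.d. random loss
    std::mt19937_64 rng(42);
    std::bernoulli_distribution lost(p);

    // Go-Back-N: the receiver only accepts the in-order prefix of each burst;
    // everything after the first loss is sent again from scratch.
    long long gbn_tx = 0, delivered = 0;
    while (delivered < N) {
        long long burst = std::min<long long>(W, N - delivered);
        long long ok = 0;
        while (ok < burst && !lost(rng)) ++ok;  // packets before the first loss get through
        gbn_tx += burst;                        // the whole burst was put on the wire
        delivered += ok;
    }

    // Selective retransmission: each packet is simply resent until it arrives.
    long long sel_tx = 0;
    for (long long i = 0; i < N; ++i) {
        do { ++sel_tx; } while (lost(rng));
    }

    std::printf("Go-Back-N:  %.3f transmissions per delivered packet\n", double(gbn_tx) / N);
    std::printf("Selective:  %.3f transmissions per delivered packet\n", double(sel_tx) / N);
    return 0;
}
```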

How Can We Do Better?
SACK TCP is the state of the art in loss recovery: it recovers multiple losses within a sending window and performs well even under high packet loss ratios, e.g. 1%.
Question: can we extend selective retransmission to hardware-based transport?
Answer: MELO (Memory-Efficient LOss recovery), which implements selective retransmission for hardware transport.

Challenges
Hardware on-chip memory is limited; it usually acts as a cache for off-chip memory, and swapping between on-chip and off-chip memory is expensive.
Therefore, loss recovery for hardware transport should be on-chip memory-efficient.
Figure borrowed from: Anuj Kalia, et al. Design Guidelines for High Performance RDMA Systems. ATC 2016.

Extra Memory is Required
#1: A re-ordering buffer is required to store loss-induced out-of-order data.
#2: Metadata is required for tracking out-of-order packets.

Challenge #1: Re-Ordering Buffer
The necessity of a re-ordering buffer: RDMA requires in-order message placement, and LTL requires in-order packet delivery.
Size of out-of-order data: up to one BDP, 100G * 100us ~ 1.25MB, which is too much for on-chip memory.
A re-ordering buffer is therefore required before placing out-of-order data into the application buffer.
(Slide figure: example with packets 1-5 arriving out of order.)
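As a worked check of the 1.25MB figure (arithmetic only, using the numbers on the slide):

$$\text{BDP} = 100\,\text{Gbps} \times 100\,\mu\text{s} = 10^{11}\ \text{bit/s} \times 10^{-4}\ \text{s} = 10^{7}\ \text{bits} = 1.25\times 10^{6}\ \text{B} \approx 1.25\,\text{MB}$$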

Challenge #2: Metadata
Selective retransmission bitmap size: one BDP, 100G * 100us with a 1KB MTU ~ 150B per connection.
This is too much for on-chip memory: on a Mellanox CX3 Pro NIC, the per-connection transport state is only 248B.
A per-connection bitmap would therefore take too much on-chip memory.
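The 150B figure follows from the same BDP expressed in packets (arithmetic only):

$$\frac{1.25\ \text{MB}}{1\ \text{KB/packet}} = 1250\ \text{packets} \;\Rightarrow\; \frac{1250\ \text{bits}}{8\ \text{bits/B}} \approx 156\ \text{B} \approx 150\ \text{B}$$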

MELO
Separates data and metadata storage:
Re-ordering buffer in off-chip memory, which is large (usually GBs).
Metadata in on-chip memory, which is fast for frequent queries and updates.
Uses a shared bits pool structure to reduce on-chip metadata size.

Off-Chip Re-Ordering Buffer
Off-chip memory is usually GBs, so a BDP worth of buffer is allocated to each connection; only the buffer base address is stored in on-chip memory.
An extra copy is required to move re-ordered data into the application buffer: for RDMA, this trades some CPU for the re-ordering buffer; for LTL, the FPGA performs the copy. A sketch of this split is given below.
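A minimal sketch of how this split might look, with a BDP-sized re-ordering buffer per connection in off-chip memory and only a small fixed record on-chip; all names and field choices here are hypothetical, not taken from the paper:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed size: one BDP of buffering per connection (~1.25 MB at 100G * 100us).
constexpr std::size_t kBdpBytes = 1'250'000;

// Off-chip (DRAM) side: the actual out-of-order payload bytes. Allocated once
// per connection; only its base address is kept on-chip.
struct OffChipReorderBuffer {
    uint8_t data[kBdpBytes];  // payload staged here until the sequence hole is filled
};

// On-chip (NIC/FPGA SRAM) side: small, fixed-size per-connection state.
struct OnChipConnState {
    uint64_t reorder_buf_base;   // DMA address of this connection's OffChipReorderBuffer
    uint32_t rcv_nxt;            // next expected sequence number (left edge of the bitmap)
    uint32_t rcv_high;           // highest sequence number received so far (right edge)
    uint16_t bitmap_head_block;  // first block of this connection's bitmap in the shared bits pool
    uint16_t bitmap_tail_block;  // last block, where new out-of-order arrivals set bits
};
```

The extra copy mentioned on the slide would then move bytes from OffChipReorderBuffer into the application buffer once RCV.NXT catches up, done by a CPU core for RDMA or by the FPGA for LTL.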

On-Chip Shared Bits Pool
Intuition: the bitmaps of all connections are never fully occupied at the same time. The packets that can arrive within one RTT are bounded by the ingress bandwidth, so the total number of occupied bits is bounded by a single link's BDP rather than growing with the number of connections: O(n) → O(1).
Instead of a per-connection bitmap, we can use a shared bits pool.
(Slide figure: ingress bandwidth is the bottleneck.)

Shared Bits Pool Management
The bits pool is divided into equal-sized blocks, which eliminates external fragmentation; the allocation unit is 1 byte, which minimizes internal fragmentation.
Linked list: a connection's blocks are linked together to form its bitmap.
A stack maintains the available blocks; blocks are allocated from and returned to the top of the stack.
Management overhead: 2 * Bits_Pool_Size; in total, 23B of extra data per connection.
A sketch of this allocator is given below.
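A toy software model of the shared bits pool allocator described on this slide; the block size, pool size, and all identifiers are assumptions for illustration, not the paper's implementation:

```cpp
#include <array>
#include <cstdint>
#include <utility>

// Toy shared bits pool: 128 one-byte blocks (= 1 Kb of bitmap), a free-block
// stack, and per-block next pointers that chain a connection's blocks together.
class SharedBitsPool {
public:
    static constexpr int kBlocks = 128;
    static constexpr uint8_t kNil = 0xFF;  // end-of-chain / allocation-failure marker

    SharedBitsPool() : top_(kBlocks) {
        for (int i = 0; i < kBlocks; ++i) stack_[i] = static_cast<uint8_t>(i);
        next_.fill(kNil);
        bits_.fill(0);
    }

    // Pop a free block from the top of the stack; returns kNil if the pool is empty.
    uint8_t alloc_block() {
        if (top_ == 0) return kNil;
        uint8_t b = stack_[--top_];
        next_[b] = kNil;
        bits_[b] = 0;
        return b;
    }

    // Push a fully acknowledged block back onto the free stack.
    void free_block(uint8_t b) { stack_[top_++] = b; }

    // Chain block b after block prev in a connection's bitmap.
    void link(uint8_t prev, uint8_t b) { next_[prev] = b; }
    uint8_t next(uint8_t b) const { return next_[b]; }

    // Set / test the bit for packet offset `off` (in packets) from the chain head.
    // The chain walk stays cheap because accesses cluster at RCV.NXT and RCV.HIGH.
    void set_bit(uint8_t head, int off) {
        auto [b, i] = locate(head, off);
        bits_[b] |= static_cast<uint8_t>(1u << i);
    }
    bool test_bit(uint8_t head, int off) const {
        auto [b, i] = locate(head, off);
        return (bits_[b] >> i) & 1u;
    }

private:
    // Walk the chain to the block holding offset `off`; assumes the chain is long enough.
    std::pair<uint8_t, int> locate(uint8_t head, int off) const {
        uint8_t b = head;
        while (off >= 8) { b = next_[b]; off -= 8; }
        return {b, off};
    }

    std::array<uint8_t, kBlocks> bits_;   // the bitmap payload (the pool itself)
    std::array<uint8_t, kBlocks> next_;   // per-block chain pointer (the 2x management overhead)
    std::array<uint8_t, kBlocks> stack_;  // stack of free block indices
    int top_;                             // avail_ptr: number of free blocks on the stack
};
```

In this layout the management overhead is one next pointer plus one free-stack slot per pool byte, matching the 2 * Bits_Pool_Size overhead noted on the slide.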

Shared Bits Pool Animation
(Slide animation: the bits pool's blocks 0-127, each with a nxt pointer, the available stack with its avail_ptr, and blocks being chained to Connection #1 and Connection #2.)

Access Pattern to Bitmap
A linked list is not friendly to random access; fortunately, the access pattern is not random:
RCV.HIGH: new data.
RCV.NXT: retransmitted data.
Other parts: out-of-order retransmitted data → ignore and drop.
A sketch of this receive-side classification is given below.
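A minimal sketch of the receive-side classification this slide implies; the names and the exact comparisons are assumptions for illustration:

```cpp
#include <cstdint>

// Only two positions of the bitmap are ever touched: the head (RCV.NXT) and
// the tail (RCV.HIGH); everything else is dropped without walking the chain.
enum class RxAction { FillHead, AppendTail, Drop };

struct ConnSeqState {
    uint32_t rcv_nxt;   // next expected sequence number (head of the bitmap)
    uint32_t rcv_high;  // highest sequence number received so far (tail of the bitmap)
};

RxAction classify(const ConnSeqState& c, uint32_t seq) {
    if (seq == c.rcv_nxt)  return RxAction::FillHead;    // retransmitted data: fills the hole, advances RCV.NXT
    if (seq >  c.rcv_high) return RxAction::AppendTail;  // new data: extends the bitmap at RCV.HIGH
    return RxAction::Drop;                               // out-of-order retransmission: ignore and drop
}
```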

Evaluation Setup
Simulator: NS3, with MELO integrated into DCQCN.
Topology: leaf-spine, 40G-100G links, RTT 16us, MTU 1KB, bits pool size 1Kb.
Loss: random loss injected at switch L0.

Preliminary Results
Throughput: 2.14X improvement under a 0.1% loss ratio.
Tail latency: reduced by 2.11X ~ 3.22X.

Micro Benchmarks
CPU overhead: 30% of a core at a 0.2% loss ratio.
Bits pool usage: always < 100%.
Interesting result: more concurrent connections lead to lower bits pool usage.

Future Implementation
FPGA implementation: the FPGA captures the sequence numbers of all outgoing/incoming traffic and implements MELO.
An FPGA with MELO can be deployed in DCs to enhance RDMA NIC loss recovery: plug and play.

Future Work
Further optimization: consecutive 1s or 0s in the bitmap can be compressed.
Remove PFC: MELO + end-to-end congestion control → remove PFC?

Conclusion
MELO implements selective retransmission for hardware-based transport.
MELO architecturally separates data and metadata storage.
MELO leverages a shared bits pool structure to reduce on-chip memory size.

Thank you!

Q & A

Backup slides