Per-packet Load-balanced, Low-Latency Routing for Clos-based Data Center Networks. Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, Dave Maltz. December 10, 2013, Santa Barbara, California
Outline: Background; DRB for load balancing and low latency; DRB for 100% bandwidth utilization; DRB latency modeling; Routing design and failure handling; Evaluations; Related work; Conclusion
Clos-based DCN: background. Topology; routing with equal-cost multi-path (ECMP). Given a spine switch, there is only one path from a source to a destination in a fat-tree.
Clos-based DCN: issues. Low network utilization, due to flow-based hash collisions in ECMP. High network latency: the latency tail results in high user-perceived latency, and many DC applications use thousands or more TCP connections.
Network latency measurement. [Measurement figure; annotated latencies: 400us, 1.5ms, 2ms.] Network latency has a long tail; busy servers do not contribute to the long latency tail; the server network stack increases latency by several hundred us.
Where the latency tail comes from: a (temporarily) congested switch port can use several MB for packet buffering; a 1MB buffer introduces about 1ms of latency on a 10G link; for a three-layer DCN, intra-DC communications take up to 5 hops.
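As a back-of-the-envelope check of the 1MB / 1ms figure (my own arithmetic, not from the deck): $\frac{1\,\mathrm{MB}}{10\,\mathrm{Gbps}} = \frac{8\times10^{6}\,\mathrm{bits}}{10^{10}\,\mathrm{bits/s}} = 0.8\,\mathrm{ms} \approx 1\,\mathrm{ms}$ per congested port, so a packet crossing up to 5 hops can in the worst case accumulate several milliseconds of queuing.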
The challenge: given a full bisection bandwidth Clos network, achieve 100% bandwidth utilization and 0 in-network latency. There are many ways to improve, but none addresses the challenge fully, e.g., traffic engineering for better bandwidth utilization, or ECN for latency mitigation. Our answer: DRB.
Digit-reversal bouncing (DRB). It is the right time for per-packet routing: regular Clos topologies, server software stacks under our control, and switches becoming open and programmable. DRB achieves 100% bandwidth utilization by per-packet routing, achieves small queuing delay through its "digit-reversal" algorithm, and can be readily implemented.
Achieve 100% bandwidth utilization. Sufficient condition for 100% utilization: in a fat-tree network, given an arbitrary feasible traffic matrix, if a routing algorithm evenly spreads the traffic $a_{i,j}$ from server $i$ to server $j$ among all the possible uplinks at every layer, then no link, including any downlink, is overloaded. The condition implies: oblivious load-balancing (no need for the traffic matrix); packet bouncing (only uplinks need to be load-balanced); load-balancing per source-destination pair instead of per flow.
DRB for fat-tree. Many ways to meet the sufficient condition: RB (random bouncing), RRB (round-robin bouncing), and DRB. DRB bouncing switch selection (the digit-reversed packet sequence number picks the spine switch):
Seq  Digit-reversal  Spine switch
00   00              3.0
01   10              3.2
10   01              3.1
11   11              3.3
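A minimal sketch of the digit-reversal selection, reproducing the table above (illustrative code, not from the deck; the per source-destination packet counter and the base/digit parameters are assumptions drawn from the example with 4 spine switches):

```python
def digit_reverse(seq, base, num_digits):
    """Reverse the base-`base` digits of `seq` (e.g. binary 01 -> 10)."""
    rev = 0
    for _ in range(num_digits):
        seq, digit = divmod(seq, base)
        rev = rev * base + digit
    return rev

class DrbSelector:
    """Keeps a per source-destination packet counter; the digit-reversed
    counter value picks the spine (bouncing) switch for the next packet."""
    def __init__(self, base, num_digits):
        self.base = base
        self.num_digits = num_digits
        self.num_spines = base ** num_digits
        self.counters = {}  # (src, dst) -> next packet sequence number

    def next_spine(self, src, dst):
        seq = self.counters.get((src, dst), 0)
        self.counters[(src, dst)] = (seq + 1) % self.num_spines
        return digit_reverse(seq, self.base, self.num_digits)

# Example matching the table (4 spine switches, 2 binary digits):
sel = DrbSelector(base=2, num_digits=2)
print([sel.next_spine("A", "B") for _ in range(4)])  # [0, 2, 1, 3] -> spines 3.0, 3.2, 3.1, 3.3
```

The digit reversal spreads consecutive packets across spine switches as far apart as possible, which is what keeps the per-port queues short compared with plain round-robin.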
Queuing latency modeling. [Figures: first-hop queue length vs. traffic load with 24-port switches; first-hop queue length vs. switch port number at traffic load 0.95.] DRB and RRB achieve bounded queue lengths as the load approaches 100%. The queue length of RRB grows in proportion to $n^2$ (where $n$ is the switch port number), while the queue length of DRB stays very small (2-3 packets).
DRB for VL2. Given a spine switch, there are multiple paths between a source and a destination in VL2, so DRB splits each spine switch into multiple "virtual spine switches", each corresponding to a single path, and applies digit-reversal bouncing over the virtual spine switches.
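A tiny sketch of the splitting idea (my own illustration of what a "virtual spine switch" could be, assuming one virtual spine per distinct upward path through a physical spine; names are hypothetical):

```python
def virtual_spines(spine_to_aggs):
    """One virtual spine per (physical spine, aggregation switch) pair,
    i.e. per distinct upward path to that spine."""
    return [(spine, agg) for spine, aggs in spine_to_aggs.items() for agg in aggs]

# Hypothetical VL2 slice: two intermediate switches, each reachable via two aggregation switches.
print(virtual_spines({"I1": ["A1", "A2"], "I2": ["A1", "A2"]}))
# -> [('I1', 'A1'), ('I1', 'A2'), ('I2', 'A1'), ('I2', 'A2')]: four virtual spines for DRB to bounce through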
DRB routing and failure handling. Servers choose the bouncing switch for each packet; switches use static routing. Switches are programmed to maintain an up-to-date network topology, and they leverage the topology to minimize broadcast messages.
Simulation: network utilization. Simulation setup: packet-level simulation with NS3; three-layer fat-tree and VL2 topologies with 3000+ servers; permutation traffic pattern; TCP as the transport protocol with a 256KB buffer size; a resequencing buffer for out-of-order packet arrivals.
Simulation: queuing delay. RRB results in large queuing delay at the first and fourth hops. DRB achieves the smallest queuing delay even though its throughput is the highest.
Simulations: out-of-order arrivals. Resequencing delay is defined as the time a packet stays in the resequencing buffer. RB's resequencing delay is the worst; resequencing delay is not directly related to queuing delay. DRB achieves a very small number of out-of-order packet arrivals.
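A minimal sketch of a receive-side resequencing buffer, to make the delay definition concrete (illustrative only, assuming each packet carries a per-source sequence number; not the deck's actual implementation). Packets are held until all lower sequence numbers have been delivered, and the resequencing delay is the time spent waiting in the buffer:

```python
import heapq

class ResequencingBuffer:
    """Holds out-of-order packets and releases them in sequence-number order."""
    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.heap = []      # (seq, arrival_time, packet) for packets that arrived early

    def on_arrival(self, seq, packet, now):
        delivered = []
        if seq == self.expected:
            delivered.append((packet, 0.0))          # in order: zero resequencing delay
            self.expected += 1
            # Drain any buffered packets that are now in order.
            while self.heap and self.heap[0][0] == self.expected:
                _, t_arr, p = heapq.heappop(self.heap)
                delivered.append((p, now - t_arr))   # resequencing delay = wait in buffer
                self.expected += 1
        elif seq > self.expected:
            heapq.heappush(self.heap, (seq, now, packet))
        return delivered

buf = ResequencingBuffer()
print(buf.on_arrival(1, "p1", now=0.0))  # out of order -> held, nothing delivered
print(buf.on_arrival(0, "p0", now=0.3))  # delivers p0 (delay 0.0) and p1 (delay 0.3)
```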
Implementation and testbed. Servers: perform IP-in-IP packet encapsulation for each source-destination pair at the sending side, and packet re-sequencing at the receiving side. Switches: IP-in-IP packet decapsulation and topology maintenance. Testbed: a three-layer fat-tree with 54 servers; each switch has 6 ports.
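A rough illustration of the server-side IP-in-IP encapsulation using Scapy (an assumption-based sketch, not the deck's kernel implementation; all addresses are hypothetical): the outer destination is the bouncing switch chosen by DRB for this packet, and the inner header carries the real destination. The switch then strips the outer header and forwards the inner packet.

```python
from scapy.all import IP, TCP, send

def drb_encap(inner_pkt, bounce_switch_ip):
    """Wrap the original packet in an outer IP header addressed to the bouncing switch (IP-in-IP)."""
    return IP(dst=bounce_switch_ip) / inner_pkt

inner = IP(src="10.0.1.2", dst="10.0.3.4") / TCP(dport=80)   # hypothetical source and destination
send(drb_encap(inner, bounce_switch_ip="10.3.0.1"))          # bounce via a chosen spine switch (hypothetical address)
```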
Experiments: queuing delay. RB results in large queue lengths (250KB per port). DRB and RRB perform similarly since each switch has only 3 uplinks. DRB's queue length is only 2-3 packets, consistent with the queue modeling and simulation results.
Related work. Random-based per-packet routing: Random Packet Spraying (RPS), random per-packet VLB. Flowlet-based approaches; LocalFlow. DeTail (lossless link layer + per-packet adaptive routing). Flow-level deadline-based approaches: D3, D2TCP, PDQ.
Conclusion. DRB achieves 100% bandwidth utilization, almost 0 queuing delay, and few out-of-order packet arrivals. DRB can be readily implemented: servers perform packet encapsulation and switches perform packet decapsulation.
Q & A. This is the end of my presentation. Thank you. Any questions?