1
Per-packet Load-balanced, Low-Latency Routing for Clos-based Data Center Networks
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, Dave Maltz. December 2013, Santa Barbara, California
2
Outline
Background
DRB for load balancing and low latency
DRB for 100% bandwidth utilization
DRB latency modeling
Routing design and failure handling
Evaluations
Related work
Conclusion
3
Clos-based DCN: background
Topology
Routing: equal-cost multi-path (ECMP)
Given a spine switch, there is only one path from a source to a destination in a fat-tree
4
Clos-based DCN: issues
Low network utilization, due to flow-based hash collisions in ECMP
High network latency: the latency tail results in high user-perceived latency, and many DC applications use thousands or more TCP connections
5
Network latency measurement
[Figure: measured network latency distribution, with annotated values at 400us, 1.5ms, and 2ms]
Network latency has a long tail
Busy servers do not contribute to the long latency tail
The server network stack increases latency by several hundred microseconds
6
Where the latency tail comes from
A (temporarily) congested switch port can use several MB for packet buffering
A 1MB buffer introduces ~1ms of latency on a 10G link
In a three-layer DCN, intra-DC communications take up to 5 hops
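As a back-of-the-envelope check (our arithmetic, not on the slide): draining a full 1MB buffer over a 10Gbps link takes $\frac{1\,\mathrm{MB} \times 8\,\mathrm{bits/byte}}{10\,\mathrm{Gbps}} = \frac{8\times10^{6}\,\mathrm{bits}}{10^{10}\,\mathrm{bits/s}} = 0.8\,\mathrm{ms} \approx 1\,\mathrm{ms}$, so each congested hop can add on the order of a millisecond, and a 5-hop path several milliseconds.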
7
The challenge
Given a full-bisection-bandwidth Clos network, achieve 100% bandwidth utilization and 0 in-network latency
Many approaches improve on this, but none addresses the challenge fully; e.g., traffic engineering improves bandwidth utilization, and ECN mitigates latency
Our answer: DRB
8
Digit-reversal bouncing (DRB)
Why now is the right time for per-packet routing:
Regular Clos topology
Server software stack under our control
Switches becoming open and programmable
DRB:
Achieves 100% bandwidth utilization by per-packet routing
Achieves small queuing delay by its "digit-reversal" algorithm
Can be readily implemented
9
Achieve 100% bandwidth utilization
Sufficient condition for 100% utilization: in a fat-tree network, given an arbitrary feasible traffic matrix, if a routing algorithm evenly spreads the traffic $a_{i,j}$ from server i to server j among all the possible uplinks at every layer, then no link, including the downlinks, is overloaded
The condition implies:
Oblivious load-balancing: no need for the traffic matrix
Packet bouncing: only the uplinks need to be load-balanced
Load-balance per source-destination pair instead of per flow
10
DRB for fat-tree
DRB bouncing switch selection: digit-reversal of the per-pair packet sequence number

Seq  Digit-reversal  Spine switch
00   00              3.0 (00)
01   10              3.2 (10)
10   01              3.1 (01)
11   11              3.3 (11)

Many ways to meet the sufficient condition: RB (random bouncing), RRB (round-robin bouncing)
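A minimal sketch of digit-reversal in Python (the function name and the radix/digit parameters are our own; the slides only give the binary example in the table above):

    def digit_reverse(seq, radix, num_digits):
        # Reverse the base-`radix` digits of `seq`, e.g. binary 01 -> 10,
        # so that consecutive packets of one source-destination pair are
        # spread maximally far apart across the spine switches.
        rev = 0
        for _ in range(num_digits):
            rev = rev * radix + seq % radix
            seq //= radix
        return rev

    # Reproduces the table above (4 spine switches, radix 2, 2 digits):
    # seq 00 -> 00 (3.0), 01 -> 10 (3.2), 10 -> 01 (3.1), 11 -> 11 (3.3)
    assert [digit_reverse(s, 2, 2) for s in range(4)] == [0, 2, 1, 3]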
11
Queuing latency modeling
[Figures: first-hop queue length vs. traffic load (24-port switches), and first-hop queue length vs. switch port number (traffic load 0.95)]
DRB and RRB achieve bounded queue length as the load approaches 100%
RRB's queue length grows in proportion to $n^2$ (n = switch port number)
DRB's queue length is very small (2-3 packets)
12
DRB for VL2
Given a spine switch, there are multiple paths between a source and a destination in VL2
DRB splits each spine switch into multiple "virtual spine switches", so that each virtual switch again corresponds to a single path
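A hypothetical illustration of the splitting (names and parameters are ours, not from the slides):

    def virtual_spines(spines, paths_per_spine):
        # One virtual spine switch per (physical spine, path) pair, so the
        # fat-tree DRB selection logic applies to VL2 unchanged.
        return [(s, p) for s in spines for p in range(paths_per_spine)]

    # e.g. 2 physical spines with 2 paths each -> 4 virtual spine switches,
    # indexed by the same digit-reversed sequence as in the fat-tree case
    print(virtual_spines(["3.0", "3.1"], 2))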
13
DRB routing and failure handling
Servers choose a bouncing switch for each packet
Switches use static routing
Switches are programmed to maintain an up-to-date network topology
Leverage the network topology to minimize broadcast messages
14
Simulation: network utilization
Simulation setup:
Packet-level simulation with NS-3
Three-layer fat-tree and VL2 topologies with servers
Permutation traffic pattern
TCP as the transport protocol, with 256KB buffer size
Resequencing buffer for out-of-order packet arrivals
15
Simulation: queuing delay
RRB results in large queuing delay at the first and fourth hops
DRB achieves the smallest queuing delay even though its throughput is the highest
16
Simulations: out-of-order arrivals
Resequencing delay is defined as the time a packet stays in the resequencing buffer
RB's resequencing delay is the worst
Resequencing delay is not directly related to queuing delay
DRB achieves a very small number of out-of-order packet arrivals
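A minimal sketch of a resequencing buffer of the kind described here, assuming per-pair sequence numbers are available (our own illustration, not the paper's code):

    class Resequencer:
        def __init__(self):
            self.next_seq = 0   # next in-order sequence number to deliver
            self.buf = {}       # out-of-order packets held back, keyed by seq

        def push(self, seq, pkt):
            # Buffer the arrival, then release the longest in-order run;
            # resequencing delay is the time a packet spends in self.buf.
            self.buf[seq] = pkt
            out = []
            while self.next_seq in self.buf:
                out.append(self.buf.pop(self.next_seq))
                self.next_seq += 1
            return out  # packets now deliverable in order

    r = Resequencer()
    assert r.push(1, "p1") == []            # arrived early, held back
    assert r.push(0, "p0") == ["p0", "p1"]  # gap filled, both released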
17
Implementation and testbed
Servers: perform IP-in-IP packet encapsulation for each source-destination pair at the sending side, and packet resequencing at the receiving side
Switches: IP-in-IP packet decapsulation; topology maintenance
Testbed
A three-layer fat-tree with 54 servers
Each switch has 6 ports
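A minimal sketch of the sender-side encapsulation using Scapy (the addresses, the spine list, and the reuse of digit_reverse from the earlier sketch are our assumptions, not the authors' implementation):

    from scapy.all import IP, TCP, send

    SPINES = ["10.3.0.1", "10.3.1.1", "10.3.2.1", "10.3.3.1"]  # hypothetical spine IPs
    seq = {}  # per source-destination-pair packet counter

    def drb_encap(src, dst, payload):
        # Pick the bouncing switch by digit-reversing this pair's packet
        # sequence number (digit_reverse as defined in the earlier sketch).
        i = seq.get((src, dst), 0)
        seq[(src, dst)] = i + 1
        bounce = SPINES[digit_reverse(i % len(SPINES), 2, 2)]
        # The outer header steers the packet to the bouncing switch, which
        # decapsulates and forwards the inner packet toward dst.
        return IP(src=src, dst=bounce) / IP(src=src, dst=dst) / payload

    send(drb_encap("10.0.0.2", "10.1.1.2", TCP(dport=80)))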
18
Experiments: queuing delay
RB results in large queue length (250KB per port)
DRB and RRB perform similarly, since each switch has only 3 uplinks
DRB's queue length is only 2-3 packets
Consistent with the queue modeling and simulation results
19
Related work
Random-based per-packet routing: Random Packet Spraying (RPS), per-packet VLB
Flowlet-based approaches
LocalFlow
DeTail (lossless link layer + per-packet adaptive routing)
Flow-level deadline-based approaches: D3, D2TCP, PDQ
20
Conclusion
DRB achieves:
100% bandwidth utilization
Almost 0 queuing delay
Few out-of-order packet arrivals
DRB can be readily implemented: servers perform packet encap and switches perform packet decap
21
Q & A
This is the end of my presentation. Thank you. Any questions?