1
Per-packet Load-balanced, Low-Latency Routing for Clos-based Data Center Networks
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, Dave Maltz. December 2013, Santa Barbara, California
2
Outline
Background
DRB for load balancing and low latency
DRB for 100% bandwidth utilization
DRB latency modeling
Routing design and failure handling
Evaluations
Related work
Conclusion
3
Clos-based DCN: background
Topology
Routing: equal-cost multi-path (ECMP)
Given a spine switch, there is only one path from a source to a destination in a fat-tree
4
Clos-based DCN: issues
Low network utilization, due to flow-based hash collisions in ECMP
High network latency: the latency tail results in high user-perceived latency, and many DC applications use thousands or more TCP connections
5
Network latency measurement
[Figure: measured network latency distribution, with annotated values at 400us, 1.5ms, and 2ms]
Network latency has a long tail
Busy servers do not contribute to the long latency tail
The server network stack increases latency by several hundred microseconds
6
Where the latency tail comes from
A (temporarily) congested switch port can use several MB for packet buffering
A 1MB buffer introduces ~1ms of latency on a 10G link
In a three-layer DCN, intra-DC communications take up to 5 hops
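As a back-of-the-envelope check (our arithmetic, not on the slide): draining a full 1MB buffer over a 10Gbps link takes $\frac{1\,\mathrm{MB} \times 8\,\mathrm{bits/byte}}{10\,\mathrm{Gbps}} = \frac{8\times10^{6}\,\mathrm{bits}}{10^{10}\,\mathrm{bits/s}} = 0.8\,\mathrm{ms} \approx 1\,\mathrm{ms}$, so each congested hop can add on the order of a millisecond, and a 5-hop path several milliseconds.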
7
The challenge
Given a full-bisection-bandwidth Clos network, achieve 100% bandwidth utilization and 0 in-network latency
Many approaches improve on this, but none addresses the challenge fully; e.g., traffic engineering improves bandwidth utilization, and ECN mitigates latency
Our answer: DRB
8
Digit-reversal bouncing (DRB)
Why now is the right time for per-packet routing:
Regular Clos topology
Server software stack under our control
Switches becoming open and programmable
DRB:
Achieves 100% bandwidth utilization by per-packet routing
Achieves small queuing delay by its "digit-reversal" algorithm
Can be readily implemented
9
Achieve 100% bandwidth utilization
Sufficient condition for 100% utilization: in a fat-tree network, given an arbitrary feasible traffic matrix, if a routing algorithm evenly spreads the traffic $a_{i,j}$ from server i to server j among all the possible uplinks at every layer, then no link, including the downlinks, is overloaded
The condition implies:
Oblivious load-balancing: no need for the traffic matrix
Packet bouncing: only the uplinks need to be load-balanced
Load-balance per source-destination pair instead of per flow
10
DRB for fat-tree
DRB bouncing switch selection: digit-reversal of the per-pair packet sequence number

Seq  Digit-reversal  Spine switch
00   00              3.0 (00)
01   10              3.2 (10)
10   01              3.1 (01)
11   11              3.3 (11)

Many ways to meet the sufficient condition: RB (random bouncing), RRB (round-robin bouncing)
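A minimal sketch of digit-reversal in Python (the function name and the radix/digit parameters are our own; the slides only give the binary example in the table above):

    def digit_reverse(seq, radix, num_digits):
        # Reverse the base-`radix` digits of `seq`, e.g. binary 01 -> 10,
        # so that consecutive packets of one source-destination pair are
        # spread maximally far apart across the spine switches.
        rev = 0
        for _ in range(num_digits):
            rev = rev * radix + seq % radix
            seq //= radix
        return rev

    # Reproduces the table above (4 spine switches, radix 2, 2 digits):
    # seq 00 -> 00 (3.0), 01 -> 10 (3.2), 10 -> 01 (3.1), 11 -> 11 (3.3)
    assert [digit_reverse(s, 2, 2) for s in range(4)] == [0, 2, 1, 3]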
11
Queuing latency modeling
[Figures: first-hop queue length vs. traffic load (24-port switches), and first-hop queue length vs. switch port number (traffic load 0.95)]
DRB and RRB achieve bounded queue length as the load approaches 100%
RRB's queue length grows in proportion to $n^2$ (n = switch port number)
DRB's queue length is very small (2-3 packets)
12
DRB for VL2
Given a spine switch, there are multiple paths between a source and a destination in VL2
DRB splits each spine switch into multiple "virtual spine switches", so that each virtual switch again corresponds to a single path
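A hypothetical illustration of the splitting (names and parameters are ours, not from the slides):

    def virtual_spines(spines, paths_per_spine):
        # One virtual spine switch per (physical spine, path) pair, so the
        # fat-tree DRB selection logic applies to VL2 unchanged.
        return [(s, p) for s in spines for p in range(paths_per_spine)]

    # e.g. 2 physical spines with 2 paths each -> 4 virtual spine switches,
    # indexed by the same digit-reversed sequence as in the fat-tree case
    print(virtual_spines(["3.0", "3.1"], 2))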
13
DRB routing and failure handling
Servers choose a bouncing switch for each packet
Switches use static routing
Switches are programmed to maintain an up-to-date network topology
Leverage the network topology to minimize broadcast messages
14
Simulation: network utilization
Simulation setup:
Packet-level simulation with NS-3
Three-layer fat-tree and VL2 topologies with servers
Permutation traffic pattern
TCP as the transport protocol, with 256KB buffer size
Resequencing buffer for out-of-order packet arrivals
15
Simulation: queuing delay
RRB results in large queuing delay at the first and fourth hops
DRB achieves the smallest queuing delay even though its throughput is the highest
16
Simulations: out-of-order arrivals
Resequencing delay is defined as the time a packet stays in the resequencing buffer
RB's resequencing delay is the worst
Resequencing delay is not directly related to queuing delay
DRB achieves a very small number of out-of-order packet arrivals
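A minimal sketch of a resequencing buffer of the kind described here, assuming per-pair sequence numbers are available (our own illustration, not the paper's code):

    class Resequencer:
        def __init__(self):
            self.next_seq = 0   # next in-order sequence number to deliver
            self.buf = {}       # out-of-order packets held back, keyed by seq

        def push(self, seq, pkt):
            # Buffer the arrival, then release the longest in-order run;
            # resequencing delay is the time a packet spends in self.buf.
            self.buf[seq] = pkt
            out = []
            while self.next_seq in self.buf:
                out.append(self.buf.pop(self.next_seq))
                self.next_seq += 1
            return out  # packets now deliverable in order

    r = Resequencer()
    assert r.push(1, "p1") == []            # arrived early, held back
    assert r.push(0, "p0") == ["p0", "p1"]  # gap filled, both released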
17
Implementation and testbed
Servers: perform IP-in-IP packet encapsulation for each source-destination pair at the sending side, and packet resequencing at the receiving side
Switches: IP-in-IP packet decapsulation; topology maintenance
Testbed
A three-layer fat-tree with 54 servers
Each switch has 6 ports
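A minimal sketch of the sender-side encapsulation using Scapy (the addresses, the spine list, and the reuse of digit_reverse from the earlier sketch are our assumptions, not the authors' implementation):

    from scapy.all import IP, TCP, send

    SPINES = ["10.3.0.1", "10.3.1.1", "10.3.2.1", "10.3.3.1"]  # hypothetical spine IPs
    seq = {}  # per source-destination-pair packet counter

    def drb_encap(src, dst, payload):
        # Pick the bouncing switch by digit-reversing this pair's packet
        # sequence number (digit_reverse as defined in the earlier sketch).
        i = seq.get((src, dst), 0)
        seq[(src, dst)] = i + 1
        bounce = SPINES[digit_reverse(i % len(SPINES), 2, 2)]
        # The outer header steers the packet to the bouncing switch, which
        # decapsulates and forwards the inner packet toward dst.
        return IP(src=src, dst=bounce) / IP(src=src, dst=dst) / payload

    send(drb_encap("10.0.0.2", "10.1.1.2", TCP(dport=80)))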
18
Experiments: queuing delay
RB results in large queue length (250KB per port)
DRB and RRB perform similarly, since each switch has only 3 uplinks
DRB's queue length is only 2-3 packets
Consistent with the queue modeling and simulation results
19
Related work
Random-based per-packet routing: Random Packet Spraying (RPS), per-packet VLB
Flowlet-based approaches
LocalFlow
DeTail (lossless link layer + per-packet adaptive routing)
Flow-level deadline-based approaches: D3, D2TCP, PDQ
20
Conclusion
DRB achieves:
100% bandwidth utilization
Almost 0 queuing delay
Few out-of-order packet arrivals
DRB can be readily implemented: servers perform packet encap and switches perform packet decap
21
Q & A
This is the end of my presentation. Thank you. Any questions?