
1 Masking Failures from Application Performance in Data Center Networks with Shareable Backup
Dingming Wu+, Yiting Xia+*, Xiaoye Steven Sun+, Xin Sunny Huang+, Simbarashe Dzinamarira+, T. S. Eugene Ng+
+Rice University, *Facebook, Inc.

2 Data Center Network Should be Reliable
but…

3 Network Failures are Disruptive
Median case of failures: 10% less traffic delivered
Worst 20% of failures: 40% less traffic delivered
(Gill et al., SIGCOMM 2011)

4 Today’s Failure Handling---Rerouting
Fast local rerouting → inflated path length
Global optimal rerouting → high latency of route updates
Rerouting impacts flows not traveling through the failure location

5 Impact on Coflow Completion Time (CCT)
[Chart: CCT inflation under failures, simulated with a Facebook coflow trace on a k = 16 fat-tree network using global optimal rerouting]

6 Do We Have Other Options?
Goals:
--restore network capacity immediately after failure
--be cost efficient: small pool of backup switches
How do we achieve that?

7 Circuit Switches
Physical-layer device; circuits controlled by software
[Figure: 4-port circuit switch with software-reconfigurable circuits between ports A, B, C, D]
Examples:
--optical 2D-MEMS switch: ~40us reconfiguration, ~$10 per-port cost
--electrical cross-point switch: ~70ns reconfiguration, ~$3 per-port cost
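The circuit switch here is essentially a software-settable port-to-port patch panel. A minimal sketch of that abstraction, assuming a simple `CircuitSwitch` class with a `connect` call (names are illustrative, not from the paper):

```python
class CircuitSwitch:
    """Physical-layer device: bits flow between two ports joined by a circuit."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.peer = {}  # port -> port it is currently circuited to

    def connect(self, a, b):
        """Software sets up a bidirectional circuit between ports a and b."""
        self.disconnect(a)
        self.disconnect(b)
        self.peer[a], self.peer[b] = b, a

    def disconnect(self, port):
        other = self.peer.pop(port, None)
        if other is not None:
            self.peer.pop(other, None)

# Example: wire A-B and C-D, then reconfigure so A talks to C instead.
sw = CircuitSwitch(4)
sw.connect(0, 1)   # A <-> B
sw.connect(2, 3)   # C <-> D
sw.connect(0, 2)   # A <-> C; B and D are now unconnected
```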

8 Ideal Architecture
[Figure: all regular switches and servers connect through one large circuit switch to a single backup switch]
Entire network shares one backup switch, which can replace any failed switch when necessary
Problems:
--unreasonably high port count on the circuit switch
--single point of failure

9 How to Make It Practical
Feasibility
--small port-count circuit switches
Scalability
--partition the network into failure groups
--distribute circuit switches across the network
Low cost
--small backup pool
--share backup switches within each failure group

10 ShareBackup Architecture
Start from an original fat-tree with k = 6 (edge layer, aggregation layer, core layer)
Partition the switches into failure groups, each with k/2 switches
Add backup switches per failure group
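A rough sketch of this partitioning, under the assumption that each pod's k/2 edge switches and k/2 aggregation switches each form one failure group and the core switches are grouped in sets of k/2 (consistent with the k/2-switches-per-group rule on the slide, though the exact grouping is a guess):

```python
def failure_groups(k):
    """Partition the switches of a k-ary fat-tree into failure groups of k/2 switches."""
    groups = []
    for pod in range(k):
        # Edge and aggregation layers: one group per pod per layer.
        groups.append([f"edge[{pod}][{i}]" for i in range(k // 2)])
        groups.append([f"agg[{pod}][{i}]" for i in range(k // 2)])
    # Core layer: k^2/4 core switches split into groups of k/2.
    core = [f"core[{i}]" for i in range(k * k // 4)]
    groups += [core[i:i + k // 2] for i in range(0, len(core), k // 2)]
    return groups

groups = failure_groups(6)                         # the k = 6 example from the slide
print(len(groups), "groups of", len(groups[0]))    # 15 groups of 3 switches
```

One backup switch is then attached to each group through that group's circuit switches.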

11 Edge Layer
[Figure: edge layer of one failure group; circuit switches sit between the servers and the edge switches 0, 1, 2, and backup switch i is reachable through the same circuit switches]

12 Aggregation Layer
[Figure: same structure at the aggregation layer; circuit switches sit between the edge switches and the aggregation switches of a failure group, with a backup switch attached]

13 Core Layer
[Figure: core switches partitioned into failure groups and connected to the aggregation switches through circuit switches, with a backup switch per group]

14 Recover First, Diagnose Later
Failure recovery
--switch failure → the failed switch is replaced by a backup via circuit reconfiguration
--link failure → the switches on both sides are replaced
Automatic failure diagnosis performed offline
--details in the paper
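A minimal sketch of the recover-first policy, building on the `CircuitSwitch` sketch above: on a switch failure the backup is circuited into the failed switch's position; on a link failure the switches on both sides are swapped out, since the faulty side is not yet known. The helper names and port bookkeeping are illustrative assumptions:

```python
class FailureGroupController:
    """Recover first: swap a backup in immediately, diagnose the suspect offline later."""

    def __init__(self, circuit, backups, ports_of):
        self.circuit = circuit        # CircuitSwitch for this failure group
        self.backups = list(backups)  # idle backup switch IDs
        self.ports_of = ports_of      # switch ID -> circuit-switch ports facing that switch
        self.offline = []             # suspects queued for offline diagnosis

    def _replace(self, failed):
        backup = self.backups.pop()   # assumes the pool is not empty
        for backup_port, failed_port in zip(self.ports_of[backup], self.ports_of[failed]):
            peer = self.circuit.peer.get(failed_port)
            if peer is not None:
                # Re-point each circuit that used to reach the failed switch at the backup.
                self.circuit.connect(peer, backup_port)
        self.offline.append(failed)

    def on_switch_failure(self, switch):
        self._replace(switch)

    def on_link_failure(self, switch_a, switch_b):
        # Either endpoint could be the culprit, so replace both and sort it out offline.
        self._replace(switch_a)
        self._replace(switch_b)
```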

15 Live Impersonation of Failed Switch
[Figure: edge switches and a backup switch above the servers; the routing table of every edge switch holds Routing Table 0 / VLAN 0, Routing Table 1 / VLAN 1, and Routing Table 2 / VLAN 2]

16 Live Impersonation of Failed Switch
Originally, each switch has a different routing table: switch 0 has routing table 0, switch 1 has routing table 1, and so on. In ShareBackup, every switch in a failure group (including the backup) installs all of these routing tables and uses VLAN IDs to tell them apart, so the backup can forward exactly like whichever switch it replaces.
[Figure: Routing Table 0 / VLAN 0, Routing Table 1 / VLAN 1, and Routing Table 2 / VLAN 2 installed on every switch in the group]
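A toy sketch of the impersonation idea: every switch in a failure group installs all of the group's routing tables and selects one by the packet's VLAN ID, so the backup forwards exactly as the switch it stands in for. The prefixes and ports below are made up for illustration:

```python
# Every switch in the failure group installs all three routing tables; the VLAN ID
# carried by a packet selects which table (i.e., which switch identity) to use.
ROUTING_TABLES = {
    0: {"10.0.0.0/24": "port1", "10.0.1.0/24": "port2"},  # routing table of switch 0
    1: {"10.0.0.0/24": "port3", "10.0.1.0/24": "port1"},  # routing table of switch 1
    2: {"10.0.0.0/24": "port2", "10.0.1.0/24": "port3"},  # routing table of switch 2
}

def lookup(vlan_id, prefix):
    """Forwarding decision on any switch in the group, including the backup."""
    return ROUTING_TABLES[vlan_id][prefix]

# When switch 1 fails, the circuit switches steer its (VLAN 1) traffic to the backup,
# which already holds routing table 1 and therefore forwards identically.
assert lookup(1, "10.0.1.0/24") == "port1"
```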


18 What Does the Control System Do?
Collects keep-alive messages and link status reports from switches
Reconfigures circuit switches under failures
Performs offline failure diagnosis
Implications
--needs to talk to many circuit switches and packet switches
--keeps a large amount of state about circuits, switches, and links

19 Distributed Control System
One controller per failure group of k/2 switches
--configures the circuit switches adjacent to the switches in its group
Maintains only the local circuit configurations of its group
--does not share state with other controllers
Talks to circuit switches over an out-of-band control network
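A sketch of what one such per-group controller might look like, keeping only local state; the timeout value and the hook into the recovery logic (e.g. the `FailureGroupController` sketched earlier) are assumptions, not details from the paper:

```python
import time

class GroupController:
    """One controller per failure group; holds only its group's circuit and switch state."""

    KEEPALIVE_TIMEOUT = 0.5  # seconds; illustrative value only

    def __init__(self, group_switches, recovery):
        self.last_seen = {s: time.monotonic() for s in group_switches}
        self.recovery = recovery  # e.g. a FailureGroupController for this group

    def on_keepalive(self, switch):
        self.last_seen[switch] = time.monotonic()

    def on_link_report(self, switch_a, switch_b, link_up):
        if not link_up:
            self.recovery.on_link_failure(switch_a, switch_b)

    def poll(self):
        # Any switch that has gone silent is treated as failed and handed to recovery.
        now = time.monotonic()
        for switch, seen in list(self.last_seen.items()):
            if now - seen > self.KEEPALIVE_TIMEOUT:
                self.recovery.on_switch_failure(switch)
                del self.last_seen[switch]  # the suspect moves to offline diagnosis
```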

20 Summary
Fast failure recovery
--as fast as the underlying circuit switching technology
Live impersonation
--traffic is redirected to the backups at the physical layer
--switches in a failure group have the same routing tables and use VLAN IDs for differentiation
--regular switches recovered from failures become backup switches themselves
Fast failure recovery, no path dilation, no routing disturbance

21 Evaluation
Bandwidth advantage
--iPerf throughput on testbed
Application performance
--MapReduce job completion time

22 Bandwidth Advantage
Testbed: 4 racks, 8 servers, 12 switches; 8 iPerf flows saturate the network core
ShareBackup restores the network to full capacity regardless of failure location

23 Application Performance
[Chart: job completion time for MapReduce Sort with 100 GB input data, annotated with a 4.2X and a 1.2X comparison]
ShareBackup preserves application performance under failures!

24 Extra Cost
Small port-count circuit switches are very inexpensive
--e.g. ~$3 per-port cost for cross-point switches
Small backup switch pool
--1 backup per failure group is usually enough
--k = 48 fat-tree with servers → ~6.7% extra network cost
Partial deployment
--failures are more destructive at the edge layer
--deploy backups only for ToR failures
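A back-of-the-envelope sketch of why the backup pool stays small, assuming one backup per failure group of k/2 switches (the slide's ~6.7% figure comes from the paper's full cost accounting, which presumably also covers the circuit switch hardware; this sketch counts backup packet switches only):

```python
def backup_fraction(k):
    """Backup switches as a fraction of regular switches in a k-ary fat-tree,
    with one backup per failure group of k/2 switches."""
    regular = 5 * k * k // 4                    # k^2/2 edge + k^2/2 agg + k^2/4 core
    groups = 2 * k + (k * k // 4) // (k // 2)   # per-pod edge/agg groups + core groups
    return groups / regular

print(f"{backup_fraction(48):.1%} extra packet switches for k = 48")  # about 4.2%
```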

25 Conclusion
ShareBackup: an architectural solution for failure recovery in DCNs
--uses circuit switching for fast failover
--is an economical way to deploy backups in networks
--preserves application performance under failures
Key takeaways:
--rerouting is not the only approach to failure recovery
--fast, transparent failure recovery is possible through careful backup placement and fast circuit switching

26 Backup---Control System Failures
Because ShareBackup relies on a separate control network for failure recovery, it must also handle potential failures in the control system itself.
Circuit switch software failure / control channel failure
--circuit switches become unresponsive
--keep existing circuit configurations; the data plane is not impacted
--fall back to rerouting
Hardware/power failure
--the controller receives many failure reports in a short time
--call for human intervention
Controller failure
--state replication on shadow controllers

27 Backup---Offline Failure Diagnosis
Chain up circuit switches using their side ports to test the suspect switches offline
Recycle the healthy switch
--only one of the two replaced switches has actually failed
--a switch that is back to normal after reboot is recycled as well
[Figure: an aggregation switch and an edge switch, both marked '?', connected for offline testing through chained circuit switches]

28 Backup---Offline Failure Diagnosis
[Figure: the same offline diagnosis setup, showing the aggregation switch, circuit switches, and edge switches]

