1
Masking Failures from Application Performance in Data Center Networks with Shareable Backup
Dingming Wu+, Yiting Xia+*, Xiaoye Steven Sun+, Xin Sunny Huang+, Simbarashe Dzinamarira+, T. S. Eugene Ng+
+Rice University, *Facebook, Inc.
11/16/2018
2
Data Center Networks Should Be Reliable
but…
3
Network Failures are Disruptive
Median case of failures: 10% less traffic delivered
Worst 20% of failures: 40% less traffic delivered
(Gill et al., SIGCOMM 2011)
4
Today’s Failure Handling---Rerouting
Fast local rerouting → inflated path length
Global optimal rerouting → high latency of route updates
Both impact flows not traveling through the failure location
5
Impact on Coflow Completion Time (CCT)
Facebook coflow trace
k = 16 fat-tree network
Global optimal rerouting
6
Do We Have Other Options?
Restore network capacity immediately after failure
Be cost efficient
--small pool of backup switches
How do we achieve that?
7
Circuit Switches
Physical-layer device
Circuits controlled by software
[Figure: software-controlled circuits between ports A, B, C, D]
Examples
--optical 2D-MEMS switch, 40 µs, $10 per-port cost
--electrical cross-point switch, 70 ns, $3 per-port cost
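To make the role of the circuit switch concrete, here is a minimal Python sketch of a software-controlled cross-point switch: it maintains nothing but a port-to-port mapping, which is exactly what lets a controller splice a backup switch into a failed switch's place. The class and method names are illustrative assumptions, not the ShareBackup implementation.

```python
class CircuitSwitch:
    """Toy model of a software-controlled circuit switch (e.g. a cross-point
    or 2D-MEMS switch): it forwards at the physical layer according to a
    port-to-port mapping and knows nothing about packets or routing."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.circuit = {}  # port -> port, kept symmetric

    def connect(self, a, b):
        """Establish a bidirectional circuit between ports a and b."""
        if not (0 <= a < self.num_ports and 0 <= b < self.num_ports):
            raise ValueError("port out of range")
        # Tear down any existing circuits on these two ports first.
        for p in (a, b):
            self.circuit.pop(self.circuit.pop(p, None), None)
        self.circuit[a] = b
        self.circuit[b] = a

    def peer(self, port):
        """Return the port currently connected to `port`, if any."""
        return self.circuit.get(port)


# Example: port 0 faces a rack-facing link, ports 1 and 2 face a regular
# switch and a backup switch respectively.
cs = CircuitSwitch(num_ports=4)
cs.connect(0, 1)          # normal operation: link -> regular switch
assert cs.peer(0) == 1
cs.connect(0, 2)          # failover: same link -> backup switch
assert cs.peer(0) == 2
```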
8
Ideal Architecture
[Figure: all regular switches and servers attached to a single backup switch through one circuit switch]
Entire network shares one backup switch
Replace any failed switch when necessary
Problem: unreasonably high port count on the circuit switch
Problem: single point of failure
9
How to Make It Practical
Feasibility
-small port-count circuit switches
Scalability
-partition the network into failure groups
-distribute circuit switches across the network
Low cost
-small backup pool
-share backup switches within each failure group
10
ShareBackup Architecture
An original fat-tree with k = 6: edge layer, aggregation layer, core layer
Partition the switches into failure groups, each with k/2 switches
Add backup switches per failure group
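As a hedged illustration of the partitioning step, the sketch below groups a k-ary fat-tree's switches into failure groups of k/2: one group per pod per layer for edge and aggregation switches, and groups of k/2 for the core. The grouping rule is inferred from the k = 6 figures in these slides; the function and switch names are mine.

```python
def failure_groups(k):
    """Partition the switches of a k-ary fat-tree into failure groups of
    k/2 switches each: edge and aggregation switches grouped per pod,
    core switches grouped k/2 at a time.  Switch names are labels only."""
    half = k // 2
    groups = []
    for pod in range(k):
        groups.append([f"edge[{pod}][{i}]" for i in range(half)])
        groups.append([f"agg[{pod}][{i}]" for i in range(half)])
    core = [f"core[{i}]" for i in range(half * half)]
    groups.extend(core[i:i + half] for i in range(0, len(core), half))
    return groups


groups = failure_groups(6)
print(len(groups), "failure groups")          # 6 + 6 + 3 = 15 groups for k = 6
assert all(len(g) == 3 for g in groups)       # each group has k/2 = 3 switches
```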
11
Edge Layer
[Figure: circuit switches inserted between an edge-switch failure group (edge switches plus backup switch) and the servers]
12
Aggregation Layer
[Figure: circuit switches inserted between an aggregation-switch failure group (aggregation switches plus backup switch) and the edge layer]
13
Core Layer
[Figure: circuit switches inserted between the core-switch failure groups (core switches plus backup switches) and the aggregation layer]
14
Recover First, Diagnose Later
Failure Recovery
--switch failure: the failed switch is replaced by a backup via circuit reconfiguration
--link failure: the switches on both sides are replaced
Automatic failure diagnosis is performed offline
-details in the paper
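A minimal sketch of the "recover first" step, assuming each failure group has a free backup reachable through the circuit switches on the failed switch's links. The data layout and names are assumptions for illustration, not the real ShareBackup controller API; diagnosis of the failed device happens later, offline.

```python
def recover_first(failed_switch, group, circuits):
    """'Recover first, diagnose later' in schematic form.

    `group` describes one failure group: its free backups, the circuit
    switches each member's links pass through, and each member's port on
    every circuit switch.  `circuits` holds the port-to-port mappings the
    controller pushes out.  A link failure is handled by calling this
    function for the switches on both ends of the link."""
    backup = group["free_backups"].pop()
    for cs, external_port in group["links"][failed_switch]:
        # Rewire the external side of this link (a server or a switch in the
        # neighboring layer) from the failed switch to the backup, which then
        # impersonates the failed switch at the physical layer.
        circuits[cs][external_port] = group["port_on"][(backup, cs)]
    group["diagnose_offline"].append(failed_switch)
    return backup


# Tiny example: edge switch "e0" fails; its two links pass through circuit
# switches "c0" and "c1", where the backup "b" sits on port 9.
group = {
    "free_backups": ["b"],
    "links": {"e0": [("c0", 0), ("c1", 0)], "e1": [("c0", 1), ("c1", 1)]},
    "port_on": {("b", "c0"): 9, ("b", "c1"): 9},
    "diagnose_offline": [],
}
circuits = {"c0": {}, "c1": {}}
recover_first("e0", group, circuits)
print(circuits)   # {'c0': {0: 9}, 'c1': {0: 9}}
```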
15
Live Impersonation of Failed Switch
[Figure: edge switches 0–2 and the backup switch above the servers; the routing table of every edge switch contains Routing Table 0 (VLAN 0), Routing Table 1 (VLAN 1), and Routing Table 2 (VLAN 2)]
16
Live Impersonation of Failed Switch
[Same figure as the previous slide]
Originally, each switch has a different routing table: switch 0 has routing table 0, switch 1 has routing table 1, and so on. In ShareBackup, every switch in a failure group stores all of these routing tables, and the VLAN ID selects which table a packet is looked up in, so a backup can impersonate any failed switch.
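A hedged sketch of the impersonation idea: every switch in a failure group is preloaded with all of the group's routing tables, and the VLAN tag carried by traffic selects which table to use. The prefixes, port names, and data structures below are made up for illustration; they are not configuration from the paper.

```python
# Every switch in a failure group is preloaded with every member's routing
# table, keyed by VLAN ID; traffic for switch i carries VLAN i.  A backup
# therefore impersonates any failed member simply by receiving its traffic:
# no routing table needs to be installed at failover time.

ROUTING_TABLES = {
    0: {"10.0.0.0/24": "port 1", "default": "port 4"},   # switch 0's table
    1: {"10.0.1.0/24": "port 1", "default": "port 4"},   # switch 1's table
    2: {"10.0.2.0/24": "port 1", "default": "port 4"},   # switch 2's table
}

def forward(vlan_id, dst_prefix):
    """Look the destination up in the routing table selected by the VLAN ID
    (exact prefix match only, to keep the sketch short)."""
    table = ROUTING_TABLES[vlan_id]
    return table.get(dst_prefix, table["default"])

# After switch 1 fails, the backup receives traffic tagged with VLAN 1 and
# forwards it exactly as switch 1 would have.
print(forward(1, "10.0.1.0/24"))   # -> port 1
```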
18
What Does the Control System Do?
Collects keep-alive messages & link status reports from switches
Reconfigures circuit switches under failures
Performs offline failure diagnosis
Implications
-needs to talk to many circuit switches and packet switches
-keeps a large amount of state about circuits, switches, and links
19
Distributed Control System
One controller per failure group of k/2 switches
--configures the circuit switches adjacent to the switches in its group
Maintains only the local circuit configurations of its group
--does not share state with other controllers
Talks to circuit switches over an out-of-band control network
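To tie the control-plane pieces together, here is a hedged sketch of a per-failure-group controller: it tracks keep-alives from the switches in its group, declares a failure after a timeout, and triggers recovery, keeping only its own group's state. The timeout value, method names, and transport are assumptions, not details from the paper.

```python
import time

class GroupController:
    """One controller per failure group; it keeps only its own group's state
    and never consults other controllers.  The timeout and recovery hook are
    placeholders for whatever the real system uses."""

    def __init__(self, switches, recover, keepalive_timeout=0.05):
        self.last_seen = {sw: time.monotonic() for sw in switches}
        self.recover = recover          # e.g. the recover_first sketch above
        self.timeout = keepalive_timeout
        self.failed = set()

    def on_keepalive(self, switch):
        self.last_seen[switch] = time.monotonic()

    def check(self):
        """Declare any switch silent for longer than the timeout as failed
        and immediately trigger circuit reconfiguration for it."""
        now = time.monotonic()
        for sw, seen in self.last_seen.items():
            if sw not in self.failed and now - seen > self.timeout:
                self.failed.add(sw)
                self.recover(sw)


# Example: a controller watching the k/2 = 3 edge switches of one group.
ctrl = GroupController(["e0", "e1", "e2"], recover=lambda sw: print("replace", sw))
time.sleep(0.03)
ctrl.on_keepalive("e0"); ctrl.on_keepalive("e1")
time.sleep(0.03)
ctrl.check()   # only "e2" missed its keep-alives -> prints: replace e2
```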
20
Summary
Fast failure recovery
--as fast as the underlying circuit switching technology
Live impersonation
--traffic is redirected to the backups at the physical layer
--switches in a failure group have the same routing tables and use VLAN IDs for differentiation
--regular switches recovered from failures become backup switches themselves
Fast failure recovery, no path dilation, no routing disturbance
21
Evaluation
Bandwidth advantage
--iperf throughput on testbed
Application performance
--MapReduce job completion time
22
Bandwidth Advantage
Testbed: 4 racks, 8 servers, 12 switches
8 iperf flows saturate the network core
ShareBackup restores the network to full capacity regardless of failure location
23
Application Performance
[Figure: MapReduce Sort with 100 GB input data; annotations: 4.2×, 1.2×]
ShareBackup preserves application performance under failures!
24
Extra Cost
Small port-count circuit switches are very inexpensive
--e.g. $3 per-port cost for cross-point switches
Small backup switch pool
--1 backup per failure group is usually enough
--k = 48 fat-tree with servers: ~6.7% extra network cost
Partial deployment
--failures are more destructive at the edge layer
--employ backups only for ToR failures
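The backup-pool arithmetic behind the cost claim can be sketched as follows, assuming one backup per failure group and the per-pod/per-layer grouping used earlier. The script counts packet switches only, so it yields the fraction of extra switches; the slide's ~6.7% figure comes from the paper's full cost model, which also prices the circuit switches and the servers.

```python
def extra_switch_fraction(k):
    """Backup switches needed for a k-ary fat-tree with one backup per
    failure group of k/2 switches (edge and aggregation switches grouped
    per pod, core switches grouped k/2 at a time), as a fraction of the
    regular switches."""
    regular = 5 * k * k // 4           # k^2/2 edge + k^2/2 agg + k^2/4 core
    groups = k + k + k // 2            # edge groups + agg groups + core groups
    return groups / regular

print(f"{extra_switch_fraction(48):.1%} extra packet switches for k = 48")
# -> 4.2% extra packet switches, before counting circuit switches and cabling
```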
25
Conclusion
ShareBackup: an architectural solution for failure recovery in DCNs
--uses circuit switching for fast failover
--is an economical approach to using backups in networks
--preserves application performance under failures
Key takeaways:
--rerouting is not the only approach to failure recovery
--fast, transparent failure recovery is possible through careful backup placement & fast circuit switching
26
Backup---Control System Failures
Because ShareBackup uses a separate control network for failure recovery, it must handle potential failures in the control system itself.
Circuit switch software failure / control channel failure
--circuit switches become unresponsive
--keep existing circuit configurations; the data plane is not impacted
--fall back to rerouting
Hardware/power failure
--the controller will receive many failure reports in a short time
--call for human intervention
Controller failure
--state replication on shadow controllers
27
Backup---Offline Failure Diagnosis
[Figure: suspect aggregation and edge switches chained together through the circuit switches' side ports]
Chain up circuit switches using side ports
Recycle the healthy switch
-only one switch has actually failed
-back to normal after reboot
28
Backup---Offline Failure Diagnosis
[Figure: aggregation switch, circuit switches, and edge switches in the offline diagnosis setup]