Presentation is loading. Please wait.

Presentation is loading. Please wait.

NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang.

Similar presentations


Presentation on theme: "NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang."— Presentation transcript:

1 NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang Presented by: Chen Li

2 Failures are Common and Harmful Network failures are common 10,000+ switches 2

3 Failures are Common and Harmful Network failures are common Failures cause long down times Time from detection to repair (minutes) Six-month failure logs of production datacenters 25% of failures take 13+ hours to repair 3

4 Failures are Common and Harmful Failures are common due to VERY large datacenters Failures cause long down times Long failure duration  large revenue loss 4

5 How to Shorten Failure Recovery Time?

6 Previous Work Conventional failure recovery takes 3 steps Failure localization/diagnosis – [M. K. Aguilera, SOSP’03] – [M. Y. Chen, NSDI’04] – [R.R Kompella, NSDI ’05] – [P.Bahl, SIGCOMM’07] – [S. Kandula, SIGCOMM’09]… DetectionDiagnosisRepair passive ping active 6

7 Automating Failure Diagnosis is Challenging Root causes are deep in network stack Diagnosis involves multiple parties 7

8 CategoryFailure typesDiagnosis & Repair % Software 21%Link layer loopFind and fix bugs 19% Imbalance  overload2% Hardware 18%FCS errorReplace cable13% Unstable powerRepair power5% Unknown 23%Switch stops forwardingN/A9% Imbalance  overload7% Lost configuration5% High CPU utilization2% Configuration 38% Errors on multiple switches Update configuration 32% Errors on one switch6%6% Six -month failure logs from several production DCNs 1. Root causes are deep in the network stack 2. Diagnosis involves multiple parties Failure Diagnosis Requires Human Intervention ! 8

9 Can we do something other than failure diagnosis?

10 NetPilot: Mitigating rather than Diagnosing Failures Mitigate failure symptoms ASAP, at the cost of reduced capacity Detection DiagnosisRepair Automated Mitigation 10

11 NetPilot Benefits Short recovery time Small network disruption Low operation cost 11 Automated Mitigation Detection DiagnosisRepair

12 Failure Mitigation is Effective Most failures can be mitigated by simple actions Mitigation is feasible due to redundancy 12

13 CategoryFailure typesMitigationRepair% Software 21% Link layer loopDeactivate portFind and fix bugs 19% Imbalance- triggered overload Restart switch 2% Hardware 18% FCS errorDeactivate portReplace cable13% Unstable powerDeactivate switchRepair power5% Unknown 23% Switch stops forwarding Restart switchN/A9% Imbalance- triggered overload Restart switch7% Lost configurationRestart switch5% High CPU utilization Restart switch2% Configurati on 38% Errors on multiple switches n/aUpdate configuration 32% Errors on single switch Deactivate switch6%6% 68% of failures can be mitigated by simple actions 13

14 Mitigation Made Possible by Redundancy Redundancy  deactivation unlikely to partition / overload the network ToR AGG CORE Internet 14

15 Outline Automating failure diagnosis is challenging Failure mitigation is effective How to automate mitigation? NetPilot evaluations Conclusion 15

16 A Strawman NetPilot: Trial-and-error Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization 16

17 NetPilot: Challenges & Solutions 1. Blind trial-and-error takes a long time Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization Failure specific localization 17

18 NetPilot: Challenges & Solutions Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization Estimate impact Localization 2. Partition/overload network Impact estimation 18

19 NetPilot: Challenges & Solutions Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization Estimate impact Rank actions Localization 3. Different actions have different side-effects Rank actions based on impact 19

20 Failure Specific Localization Limited # of failure types Domain knowledge improves accuracy Failure types 1. Link layer loop 2. Imbalance-triggered overload 3. FCS error 4. Unstable power 5. Switch stops forwarding 6. Imbalance-triggered overload 7. Lost configuration 8. High CPU utilization 9. Errors on multiple switches 10. Errors on single switch 20

21 Example : Frame Check Sequence (FCS) Errors 13% of all the failures Cut-through switching – Forward frames before checksums are verified Increase application latency 21

22 Localizing FCS Errors error frames seen on Lframes corrupted by L frames corrupted by other links & traverse L x L : link corruption rate # of variables = # of equations = # of links Corrupted links: x L > 0 22

23 NetPilot Overview Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization Estimate impact Rank actions 23

24 Impact Metrics Derived from Service Level Agreement (SLA) – Availability: online_server_ratio – Packet loss: total_lost_pkt – latency: max_link_utilization Small link utilization  small (queuing) delay Total_lost_pkt & max_link_utilization derived from utilization of individual links 24

25 Estimating Link Utilization # of flows >> redundant paths – Traffic evenly distributed under ECMP Estimate the load contributed by each flow on each link Sum up the loads to compute utilization Impact Estimator Action Traffic Link utilization Topology 25

26 Link Utilization Estimation is Highly Accurate 1-month traffic from a 8000-server network – Log socket events on each server Ground truth: SNMP counters 26

27 NetPilot Overview Network failure Roll back if necessary No Failure mitigated? End Yes Execute an action Localization Estimate impact Rank actions Choose the action with the least impact 27

28 Outline Automating failure diagnosis is challenging Failure mitigation is effective How to automate mitigation? – Localization  impact estimation  ranking NetPilot evaluations – Mitigating load imbalance – Mitigating FCS errors – Mitigating overload Conclusion 28

29 Load Imbalance Agg a stops receiving traffic Localize to 4 suspects core a Agg a core b Agg b 29

30 Mitigating Load Imbalance core a -> agg a core b -> agg a core a -> agg b core b -> agg b Agg a stops receiving traffic Detected & reboot core b Reboot core a Reboot Agg a Mitigation confirmed Load evenly splitted core a Agg a core b Agg b 30

31 Fast FCS Error Mitigation NetPilot: deactivates 2 links in 1 trial within 15 minutes Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated 3.5 hours  15 minutes 31

32 Mitigating Link Overload Mitigate overload by deactivating healthy links 32 core 1 1.5 3 agg core 2 core 1 agg

33 Mitigating Link Overload Mitigate overload by deactivating healthy links – Many candidate links in production networks – Choose the link(s) with the least impact 33 core 1 1.5 3 agg core 2 core 1 1 1.5 3 agg core 2 core 1 0 3 3 agg core 2 lost 0.5

34 Action Ranking Lowers Link Utilization Replay 97 overload incidents due to link failures 34

35 Conclusion Mitigation shortens failure recovery time – Simple actions are effective – Made possible by redundancy NetPilot: automating failure mitigation – Recovery time: hour  minutes – Several mitigation scenarios deployed in Bing 35

36 Thank You! DetectionDiagnosisRepair NetPilot: Automated Mitigation netpilot@microsoft.com 36


Download ppt "NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang."

Similar presentations


Ads by Google