NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

Presentation transcript:

NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang
Presented by: Chen Li

Failures are Common and Harmful
- Network failures are common (datacenters with 10,000+ switches)

Failures are Common and Harmful
- Network failures are common
- Failures cause long down times
(Figure: time from detection to repair, in minutes, from six-month failure logs of production datacenters)
- 25% of failures take 13+ hours to repair

Failures are Common and Harmful
- Failures are common due to VERY large datacenters
- Failures cause long down times
- Long failure duration → large revenue loss

How to Shorten Failure Recovery Time?

Previous Work
- Conventional failure recovery takes 3 steps: Detection → Diagnosis → Repair
- Failure localization/diagnosis:
  – [M. K. Aguilera, SOSP'03]
  – [M. Y. Chen, NSDI'04]
  – [R. R. Kompella, NSDI'05]
  – [P. Bahl, SIGCOMM'07]
  – [S. Kandula, SIGCOMM'09] …
(Figure: Detection → Diagnosis → Repair pipeline)

Automating Failure Diagnosis is Challenging
- Root causes are deep in the network stack
- Diagnosis involves multiple parties

Six-month failure logs from several production DCNs:

Category            | Failure type                 | Diagnosis & Repair   | %
Software (21%)      | Link layer loop              | Find and fix bugs    | 19%
                    | Imbalance → overload         |                      | 2%
Hardware (18%)      | FCS error                    | Replace cable        | 13%
                    | Unstable power               | Repair power         | 5%
Unknown (23%)       | Switch stops forwarding      | N/A                  | 9%
                    | Imbalance → overload         |                      | 7%
                    | Lost configuration           |                      | 5%
                    | High CPU utilization         |                      | 2%
Configuration (38%) | Errors on multiple switches  | Update configuration | 32%
                    | Errors on one switch         |                      | 6%

1. Root causes are deep in the network stack
2. Diagnosis involves multiple parties
→ Failure diagnosis requires human intervention!

Can we do something other than failure diagnosis?

NetPilot: Mitigating rather than Diagnosing Failures
- Mitigate failure symptoms ASAP, at the cost of reduced capacity
(Figure: Detection → Diagnosis → Repair pipeline, with Automated Mitigation inserted right after Detection)

NetPilot Benefits
- Short recovery time
- Small network disruption
- Low operation cost
(Figure: Detection → Automated Mitigation; Diagnosis and Repair can follow later)

Failure Mitigation is Effective
- Most failures can be mitigated by simple actions
- Mitigation is feasible due to redundancy

Category            | Failure type                  | Mitigation        | Repair               | %
Software (21%)      | Link layer loop               | Deactivate port   | Find and fix bugs    | 19%
                    | Imbalance-triggered overload  | Restart switch    |                      | 2%
Hardware (18%)      | FCS error                     | Deactivate port   | Replace cable        | 13%
                    | Unstable power                | Deactivate switch | Repair power         | 5%
Unknown (23%)       | Switch stops forwarding       | Restart switch    | N/A                  | 9%
                    | Imbalance-triggered overload  | Restart switch    |                      | 7%
                    | Lost configuration            | Restart switch    |                      | 5%
                    | High CPU utilization          | Restart switch    |                      | 2%
Configuration (38%) | Errors on multiple switches   | n/a               | Update configuration | 32%
                    | Errors on single switch       | Deactivate switch |                      | 6%

68% of failures can be mitigated by simple actions

Mitigation Made Possible by Redundancy
- Redundancy → deactivation unlikely to partition / overload the network
(Figure: datacenter topology with Internet, CORE, AGG, and ToR layers)
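The claim rests on the Clos-style redundancy in the figure: with multiple parallel core/agg links, deactivating one rarely disconnects anything. A minimal sketch of such a check, assuming a networkx graph of the topology (switch and link names below are made up, not from the paper):

```python
import networkx as nx

def safe_to_deactivate(topology: nx.Graph, link) -> bool:
    """Return True if removing `link` leaves the remaining switches connected."""
    remaining = topology.copy()
    remaining.remove_edge(*link)
    # Drop any switch left with no links at all, then test connectivity.
    remaining.remove_nodes_from(list(nx.isolates(remaining)))
    return remaining.number_of_nodes() == 0 or nx.is_connected(remaining)

# Tiny Clos-like fragment with redundant core/agg links.
g = nx.Graph([("core1", "agg1"), ("core2", "agg1"),
              ("core1", "agg2"), ("core2", "agg2"),
              ("agg1", "tor1"), ("agg2", "tor1")])
print(safe_to_deactivate(g, ("core1", "agg1")))  # True: agg1 still reaches the cores via core2
```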

Outline
- Automating failure diagnosis is challenging
- Failure mitigation is effective
- How to automate mitigation?
- NetPilot evaluations
- Conclusion

A Strawman NetPilot: Trial-and-error
(Flowchart: network failure → localization → execute an action → failure mitigated? If yes, end; if no, roll back if necessary and try the next action)
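A minimal sketch of the strawman loop described on this slide, with hypothetical callables standing in for the flowchart boxes (localize_suspects, execute, is_mitigated, and roll_back are illustrative names, not NetPilot's actual API):

```python
def mitigate(failure, localize_suspects, execute, is_mitigated, roll_back):
    """Try one candidate mitigation action at a time until the symptom clears."""
    for action in localize_suspects(failure):   # suspects -> candidate actions
        execute(action)                         # e.g. deactivate a port, restart a switch
        if is_mitigated(failure):
            return action                       # symptom gone; keep the action in place
        roll_back(action)                       # no improvement; undo and try the next
    return None                                 # no single action mitigated the failure
```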

NetPilot: Challenges & Solutions
- Challenge 1: blind trial-and-error takes a long time → failure-specific localization

NetPilot: Challenges & Solutions
- Challenge 2: an action may partition or overload the network → impact estimation

NetPilot: Challenges & Solutions
- Challenge 3: different actions have different side effects → rank actions based on impact

Failure Specific Localization
- Limited # of failure types
- Domain knowledge improves accuracy
Failure types:
1. Link layer loop
2. Imbalance-triggered overload
3. FCS error
4. Unstable power
5. Switch stops forwarding
6. Imbalance-triggered overload
7. Lost configuration
8. High CPU utilization
9. Errors on multiple switches
10. Errors on single switch

Example: Frame Check Sequence (FCS) Errors
- 13% of all the failures
- Cut-through switching
  – Frames are forwarded before checksums are verified
- Increases application latency

Localizing FCS Errors
- error frames seen on link L = frames corrupted by L + frames corrupted by other links that also traverse L
- x_L: corruption rate of link L
- # of variables = # of equations = # of links
- Corrupted links: those with x_L > 0
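Because cut-through switches propagate corrupted frames, the per-link corruption rates x_L can be recovered by solving one such equation per link with a non-negativity constraint. A toy sketch of that idea (the traffic matrix and error counts are invented for illustration):

```python
import numpy as np
from scipy.optimize import nnls

# traverse[i][j] = frames forwarded over link j that also traverse link i
# (diagonal = total frames seen on link i); errors_seen[i] = error frames on link i.
traverse = np.array([
    [1_000_000.0,   400_000.0],
    [  300_000.0, 2_000_000.0],
])
errors_seen = np.array([500.0, 150.0])

# Solve errors_seen ~= traverse @ x subject to x >= 0 (x = per-link corruption rates).
x, _residual = nnls(traverse, errors_seen)
corrupted_links = [i for i, rate in enumerate(x) if rate > 1e-9]
print("corruption rates:", x)               # ~[5e-4, 0]
print("corrupted links:", corrupted_links)  # only link 0 is flagged
```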

NetPilot Overview
(Flowchart: network failure → localization → estimate impact → rank actions → execute an action → failure mitigated? If yes, end; if no, roll back if necessary and repeat)

Impact Metrics
- Derived from the Service Level Agreement (SLA)
  – Availability: online_server_ratio
  – Packet loss: total_lost_pkt
  – Latency: max_link_utilization
- Small link utilization → small (queuing) delay
- total_lost_pkt and max_link_utilization are derived from the utilization of individual links
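A sketch of how the two link-based metrics might be derived from estimated per-link loads; the simple loss model (traffic offered beyond capacity counts as lost) is an assumption for illustration, not the paper's exact definition:

```python
def impact_metrics(link_load_bps, link_capacity_bps):
    """link_load_bps and link_capacity_bps are dicts keyed by link id."""
    max_link_utilization = max(
        link_load_bps[l] / link_capacity_bps[l] for l in link_load_bps)
    # Count traffic offered beyond a link's capacity as lost.
    total_lost = sum(
        max(0.0, link_load_bps[l] - link_capacity_bps[l]) for l in link_load_bps)
    return max_link_utilization, total_lost

# Example: one link at 80% and one overloaded at 120% of a 10 Gb/s capacity.
util, lost = impact_metrics(
    {"agg1-core1": 8e9, "agg1-core2": 12e9},
    {"agg1-core1": 10e9, "agg1-core2": 10e9})
print(util, lost)   # 1.2, 2000000000.0 (bits/s offered in excess of capacity)
```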

Estimating Link Utilization
- # of flows >> redundant paths
  – Traffic evenly distributed under ECMP
- Estimate the load contributed by each flow on each link
- Sum up the loads to compute utilization
(Figure: Impact Estimator takes an action, traffic, and topology as input and outputs link utilizations)
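A toy sketch of the even-split assumption: because flows vastly outnumber the redundant paths, each flow's traffic can be treated as split evenly across all equal-cost shortest paths, and the per-link shares summed. Illustrative only; the production estimator works on the real topology and measured traffic rather than this hand-built example.

```python
from collections import defaultdict
import networkx as nx

def estimate_link_loads(topology: nx.Graph, flows):
    """flows: iterable of (src, dst, rate). Returns {link: estimated load}."""
    load = defaultdict(float)
    for src, dst, rate in flows:
        paths = list(nx.all_shortest_paths(topology, src, dst))
        share = rate / len(paths)            # even split across equal-cost paths
        for path in paths:
            for u, v in zip(path, path[1:]):
                load[tuple(sorted((u, v)))] += share
    return dict(load)

# Small two-pod example: every link ends up carrying 5.0 of the 10.0-unit flow.
g = nx.Graph([("tor1", "agg1"), ("tor1", "agg2"),
              ("agg1", "core1"), ("agg2", "core1"),
              ("core1", "agg3"), ("core1", "agg4"),
              ("agg3", "tor2"), ("agg4", "tor2")])
print(estimate_link_loads(g, [("tor1", "tor2", 10.0)]))
```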

Link Utilization Estimation is Highly Accurate
- 1 month of traffic from an 8,000-server network
  – Socket events logged on each server
- Ground truth: SNMP counters

NetPilot Overview
(Same flowchart: localization → estimate impact → rank actions → execute; among the candidates, choose the action with the least estimated impact)
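Ranking then reduces to picking the candidate action whose estimated post-action metrics are smallest. A minimal sketch, with a hypothetical estimate_impact callable returning (max_link_utilization, total_lost_pkt) as in the metrics sketch above:

```python
def pick_best_action(candidate_actions, estimate_impact):
    """Return the candidate action whose estimated impact is least."""
    def impact_key(action):
        max_util, total_lost = estimate_impact(action)
        # Prefer actions that lose no traffic, then the lowest peak utilization.
        return (total_lost, max_util)
    return min(candidate_actions, key=impact_key)
```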

Outline
- Automating failure diagnosis is challenging
- Failure mitigation is effective
- How to automate mitigation?
  – Localization → impact estimation → ranking
- NetPilot evaluations
  – Mitigating load imbalance
  – Mitigating FCS errors
  – Mitigating overload
- Conclusion

Load Imbalance
- Agg a stops receiving traffic
- Localized to 4 suspects
(Figure: topology with core a, core b, agg a, agg b)

Mitigating Load Imbalance
(Figure: traffic on the four links core a → agg a, core b → agg a, core a → agg b, core b → agg b)
- Agg a stops receiving traffic
- Detected → reboot core b → reboot core a → reboot agg a
- Mitigation confirmed: load evenly split

Fast FCS Error Mitigation
- NetPilot: deactivates 2 links in 1 trial, within 15 minutes
- Human operator: after 11 trials over 3.5 hours, 2 out of 28 ports deactivated
- 3.5 hours → 15 minutes

Mitigating Link Overload
- Mitigate overload by deactivating healthy links
(Figure: topology with switches core 1, core 2, and agg)

Mitigating Link Overload
- Mitigate overload by deactivating healthy links
  – Many candidate links in production networks
  – Choose the link(s) with the least impact
(Figure: three deactivation candidates on the core 1 / core 2 / agg topology, each with a different amount of lost traffic)

Action Ranking Lowers Link Utilization
- Replayed 97 overload incidents due to link failures

Conclusion
- Mitigation shortens failure recovery time
  – Simple actions are effective
  – Made possible by redundancy
- NetPilot: automating failure mitigation
  – Recovery time: hours → minutes
  – Several mitigation scenarios deployed in Bing

Thank You!
(Figure: Detection → Diagnosis → Repair vs. NetPilot: Automated Mitigation)