Understanding Network Failures in Data Centers: Measurement, Analysis and Implications. Phillipa Gill (University of Toronto), Navendu Jain & Nachiappan Nagappan (Microsoft Research).

Presentation transcript:

Understanding Network Failures in Data Centers: Measurement, Analysis and Implications. Phillipa Gill (University of Toronto), Navendu Jain & Nachiappan Nagappan (Microsoft Research). SIGCOMM 2011, Toronto, ON, Aug. 18, 2011.

Motivation

Data center downtime is estimated to cost $5,600 per minute. We need to understand failures to prevent and mitigate them!

Overview
Our goal: improve reliability by understanding network failures.
1. Failure characterization
 – Most failure-prone components
 – Understanding root cause
2. What is the impact of failure?
3. Is redundancy effective?
Our contribution: the first large-scale empirical study of network failures across multiple data centers; a methodology to extract failures from noisy data sources; correlating events with network traffic to estimate impact; analyzing implications for future data center networks.

Road Map
Motivation
Background & Methodology
Results
1. Characterizing failures
2. Do current network redundancy strategies help?
Conclusions

Data center networks overview (figure: topology from the Internet down to the servers): access routers / network "core" fabric, aggregation ("Agg") switches, load balancers, Top of Rack (ToR) switches, servers.

Data center networks overview: Which components are most failure prone? What causes failures? What is the impact of failure? How effective is redundancy?

Failure event information flow. A failure (e.g., LINK DOWN!) is logged in numerous data sources: network event logs (Syslog, SNMP traps/polling), troubleshooting tickets (ticket ID, diary entries, root cause), and network traffic logs (5-minute traffic averages on links).
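To make the flow concrete, the three sources can be joined on link identity and time. A minimal sketch in Python, with hypothetical record layouts (all field names below are illustrative, not the study's actual schema):

    from datetime import datetime, timedelta

    # Hypothetical records; every field name here is illustrative only.
    event_log = [{"link": "agg1-tor7", "state": "DOWN",
                  "time": datetime(2010, 3, 2, 14, 5), "ticket": 34}]
    tickets = {34: {"diary": "LINK DOWN! Replaced optic on agg1 port 7.",
                    "root_cause": "hardware"}}
    traffic = {"agg1-tor7": {datetime(2010, 3, 2, 14, 0): 1.2e9,
                             datetime(2010, 3, 2, 14, 5): 0.0}}  # 5-min byte averages

    # Join each link-down event with its ticket and the traffic sample for
    # the 5-minute bin in which the event occurred.
    for ev in event_log:
        ticket = tickets.get(ev["ticket"], {})
        t_bin = ev["time"] - timedelta(minutes=ev["time"].minute % 5)
        load = traffic.get(ev["link"], {}).get(t_bin)
        print(ev["link"], ev["state"], ticket.get("root_cause"), load)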

Data summary
One year of event logs from Oct. 2009 to Sept. 2010: network event logs and troubleshooting tickets.
Network event logs are a combination of Syslog, SNMP traps, and polling.
 – Caveat: may miss some events (e.g., notifications sent over UDP, correlated faults).
Filtered by operators to actionable events… still many warnings from various software daemons running.
Key challenge: how to extract failures of interest?

Extracting failures from network event logs
Defining failures:
 – Device failure: the device is no longer forwarding traffic.
 – Link failure: the connection between two interfaces is down; detected by monitoring interface state.
Dealing with inconsistent data:
 – Devices: correlate with link failures.
 – Links: reconstruct state from logged messages.
Correlate with network traffic to determine impact.

Reconstructing device state
Devices may send spurious DOWN messages.
Verify that at least one link on the device fails within five minutes; this is conservative, to account for message loss (correlated failures).
(Figure: a DEVICE DOWN! message confirmed by a LINK DOWN! on a link between the top-of-rack switch and one of its two aggregation switches.)
This sanity check reduces device failures by 10x.
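A minimal sketch of this sanity check, assuming device-down and link-down events are available as (device, timestamp) pairs (the data layout is an assumption, not the study's actual format):

    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)

    def confirmed_device_failures(device_down, link_down):
        """Keep a device-down event only if at least one of the device's
        links also goes down within five minutes of it."""
        confirmed = []
        for device, t_dev in device_down:
            if any(dev == device and abs(t_link - t_dev) <= WINDOW
                   for dev, t_link in link_down):
                confirmed.append((device, t_dev))
        return confirmed

    # The ToR's DOWN message is spurious (none of its links failed);
    # the aggregation switch's is confirmed by a link failure 2 minutes later.
    device_down = [("tor-12", datetime(2010, 1, 5, 9, 0)),
                   ("agg-3", datetime(2010, 1, 5, 9, 2))]
    link_down = [("agg-3", datetime(2010, 1, 5, 9, 4))]
    print(confirmed_device_failures(device_down, link_down))  # only agg-3 remains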

Reconstructing link state
Inconsistencies in link failure events. Note: our logs bind each link down to the time it is resolved.
What we expect (figure): the link state goes DOWN at LINK DOWN! and returns to UP at LINK UP!.

Reconstructing link state
What we sometimes see (figure): a second LINK DOWN!/LINK UP! pair overlaps the first, leaving the link state ambiguous.
How to deal with discrepancies?
1. Take the earliest of the down times.
2. Take the earliest of the up times.
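One way to apply these two rules, assuming each logged link failure has already been bound to its resolution time (here as numeric timestamps), is to collapse overlapping down/up pairs on the same link; a minimal sketch:

    def collapse_failures(intervals):
        """Collapse overlapping (down, up) pairs reported for one link into
        single failures, keeping the earliest down and earliest up time."""
        collapsed = []
        for down, up in sorted(intervals):
            if collapsed and down <= collapsed[-1][1]:
                prev_down, prev_up = collapsed[-1]
                collapsed[-1] = (prev_down, min(prev_up, up))
            else:
                collapsed.append((down, up))
        return collapsed

    # LINK DOWN 1 at t=0 resolved at t=50; LINK DOWN 2 at t=10 resolved at t=30.
    print(collapse_failures([(0, 50), (10, 30)]))  # -> [(0, 30)]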

Identifying failures with impact
Correlate link failures with network traffic (5-minute averages before, during, and after the failure); only consider events where traffic decreases.
Summary of impact:
 – 28.6% of failures impact network traffic.
 – 41.2% of failures were on links carrying no traffic (e.g., scheduled maintenance activities).
Caveat: impact is only on network traffic, not necessarily on applications! Redundancy in the network, compute, and storage layers can mask outages.
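A minimal sketch of the "traffic decreases" filter, comparing medians of the 5-minute traffic averages before and during a failure (using the median here is an assumption about the exact statistic):

    from statistics import median

    def has_impact(before, during):
        """Count a failure as impactful only if median traffic on the link
        during the failure drops below median traffic before it."""
        if not before or not during:
            return False
        return median(during) < median(before)

    # 5-minute byte counts before and during a link failure (made-up values).
    print(has_impact(before=[9e8, 1.1e9, 1.0e9], during=[2e8, 1e8]))  # True
    print(has_impact(before=[0, 0, 0], during=[0]))  # False: link carried no traffic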

Road Map
Motivation
Background & Methodology
Results
1. Characterizing failures
 – Distribution of failures over the measurement period
 – Which components fail most?
 – How long do failures take to mitigate?
2. Do current network redundancy strategies help?
Conclusions

Visualization of failure panorama, Sep'09 to Sep'10: all failures (46K). Each point indicates that link Y had a failure on day X. Annotations highlight widespread failures (many links failing on the same day) and long-lived failures (the same link failing across many days).
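The panorama is essentially a scatter plot with one point per (day, link) failure; a minimal matplotlib sketch with made-up data, just to illustrate the encoding:

    import matplotlib.pyplot as plt

    # Made-up (day, link-index) pairs: a widespread failure hits many links on
    # one day; a long-lived failure keeps one link failing across many days.
    failures = [(1, 10), (1, 11), (1, 12),
                (40, 3), (41, 3), (42, 3)]

    days, links = zip(*failures)
    plt.scatter(days, links, marker=".", s=20)
    plt.xlabel("Day of measurement period (Sep'09 to Sep'10)")
    plt.ylabel("Links (sorted by data center)")
    plt.title("Failure panorama")
    plt.show()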

Visualization of failure panorama, Sep'09 to Sep'10: failures with impact (28%). Annotations: a load balancer update caused failures across multiple data centers; a component failure caused link failures on multiple ports.

Which devices cause most failures?

Which devices cause most failures? (Figure: failures and downtime by device type: Load Balancer 1, Load Balancer 2, Load Balancer 3, Top of Rack 1, Top of Rack 2, Aggregation Switch.)
Top-of-rack switches have few failures (annual probability of failure <5%)… but a lot of downtime!
Load balancer 1: very little downtime relative to its number of failures.
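The "annual probability of failure" per device type can be read as the fraction of devices of that type that logged at least one failure during the year. A minimal sketch with hypothetical population counts:

    from collections import Counter

    def annual_failure_probability(population, failures):
        """Fraction of devices of each type with at least one failure;
        repeated failures of the same device count once."""
        failed = Counter(dev_type for dev_type, dev_id in set(failures))
        return {t: failed[t] / n for t, n in population.items()}

    population = {"ToR": 4000, "AggS": 300, "LB": 200}  # hypothetical counts
    failures = [("ToR", "tor-1"), ("ToR", "tor-2"),
                ("LB", "lb-7"), ("LB", "lb-7"), ("LB", "lb-9")]
    print(annual_failure_probability(population, failures))
    # {'ToR': 0.0005, 'AggS': 0.0, 'LB': 0.01}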

How long do failures take to resolve?

How long do failures take to resolve? (Figure: time to repair for Load Balancer 1, Load Balancer 2, Load Balancer 3, Top of Rack 1, Top of Rack 2, Aggregation Switch, and Overall.)
Load balancer 1: short-lived transient faults; median time to repair 4 minutes.
Correlated failures on ToRs connected to the same aggregation switches; median time to repair 3.6 hours for ToR-1 and 22 minutes for ToR-2.
Overall: median time to repair 5 minutes, mean 2.7 hours.
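Time to repair here is the interval from a failure's down message to its resolution; grouping those durations by device type gives medians like the ones quoted above. A minimal sketch with made-up durations:

    from collections import defaultdict
    from statistics import mean, median

    def repair_time_stats(failures):
        """Median and mean time to repair per device type.
        Each failure is (device_type, down_minute, up_minute)."""
        durations = defaultdict(list)
        for dev_type, down, up in failures:
            durations[dev_type].append(up - down)
        return {t: {"median": median(d), "mean": mean(d)} for t, d in durations.items()}

    failures = [("LB-1", 0, 4), ("LB-1", 100, 103),   # short transient faults
                ("ToR-1", 0, 216), ("ToR-2", 0, 22)]  # made-up values
    print(repair_time_stats(failures))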

Summary
Data center networks are highly reliable: the majority of components have four 9's of reliability.
Low-cost top-of-rack switches have the highest reliability (<5% probability of failure)… but the most downtime, because they are a lower-priority component.
Load balancers experience many short-lived faults; root causes: software bugs, configuration errors, and hardware faults.
Software and hardware faults dominate failures… but hardware faults contribute most downtime.
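As a rough illustration of what "four 9's" means, an availability of 99.99% allows roughly 53 minutes of downtime per year (an illustrative calculation, not a figure from the study):

    \text{availability} = 1 - \frac{\text{downtime}}{\text{total time}}, \qquad
    (1 - 0.9999) \times 365 \times 24 \times 60 \approx 52.6\ \text{minutes per year}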

Road Map
Motivation
Background & Methodology
Results
1. Characterizing failures
2. Do current network redundancy strategies help?
Conclusions

Is redundancy effective in reducing impact?
Redundant devices/links are deployed to mask failures; this is expensive! (management overhead + $$$)
Goal: reroute traffic along available paths (figure: topology with a failed link marked X).
How effective is this in practice?

Measuring the effectiveness of redundancy. (Figure: a redundancy group with primary and backup aggregation switches connected to primary and backup access routers; the failed link is marked X.)
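A minimal sketch of the metric this setup implies: compare traffic during the failure to traffic before it, both on the failed link and summed over its redundancy group; the closer the group-level ratio stays to 1.0, the better redundancy masked the failure (the exact normalization used in the study may differ):

    from statistics import median

    def normalized_traffic(before, during):
        """Ratio of median traffic during a failure to median traffic before it."""
        return median(during) / median(before)

    # Hypothetical 5-minute byte counts.
    link_before, link_during = [100, 110, 105], [10, 15]      # failed link loses traffic...
    group_before, group_during = [200, 210, 205], [190, 195]  # ...its redundancy group barely notices

    print(round(normalized_traffic(link_before, link_during), 2))    # 0.12
    print(round(normalized_traffic(group_before, group_during), 2))  # 0.94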

Is redundancy effective in reducing impact?
Core link failures have the most impact… but redundancy masks it.
There is less impact lower in the topology; redundancy is least effective for aggregation switches (AggS) and access routers (AccR).
Overall increase of 40% in traffic due to redundancy.

Road Map
Motivation
Background & Methodology
Results
1. Characterizing failures
2. Do current network redundancy strategies help?
Conclusions

Conclusions
Goal: understand failures in data center networks, via an empirical study of data center failures.
Key observations:
 – Data center networks have high reliability.
 – Low-cost switches exhibit high reliability.
 – Load balancers are subject to transient faults.
 – Failures may lead to the loss of small packets.
Future directions:
 – Study application-level failures and their causes.
 – Further study of redundancy effectiveness.

Thanks! Contact: Project page: