Failure Analysis of Two Internet Services Archana Ganapathi

Slides:



Advertisements
Similar presentations
(Distributed) Denial of Service Nick Feamster CS 4251 Spring 2008.
Advertisements

Chapter 3: Planning a Network Upgrade
Networking Essentials Lab 3 & 4 Review. If you have configured an event log retention setting to Do Not Overwrite Events (Clear Log Manually), what happens.
Large-Scale Distributed Systems Andrew Whitaker CSE451.
Internet Information Server 6.0. IIS 6.0 Enhancements  Fundamental changes, aimed at: Reliability & Availability Reliability & Availability Performance.
DATA CENTER OUTAGE BRIEFING JANUARY 10–11, 2014 INFORMATION AND EDUCATIONAL TECHNOLOGY.
Predictor of Customer Perceived Software Quality By Haroon Malik.
Skyward Server Management Options Mike Bianco. Agenda: Managed Services Overview OpenEdge Management / OpenEdge Explorer OpenEdge Managed Demo.
Experience with some Principles for Building an Internet-Scale Reliable System Mike Afergan (Akamai and MIT) Joel Wein (Akamai and Polytechnic University,
Reliability Week 11 - Lecture 2. What do we mean by reliability? Correctness – system/application does what it has to do correctly. Availability – Be.
© 2007 Pearson Education Inc., Upper Saddle River, NJ. All rights reserved.1 Computer Networks and Internets with Internet Applications, 4e By Douglas.
Recovery Oriented Computing: Update Armando Fox (in loco Patterson) Summer ROC Retreat, June 2002.
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Maintaining and Updating Windows Server 2008
 Network Management  Network Administrators Jobs  Reasons for using Network Management Systems  Analysing Network Data  Points that must be taken.
Hands-On Microsoft Windows Server 2008 Chapter 11 Server and Network Monitoring.
CH 13 Server and Network Monitoring. Hands-On Microsoft Windows Server Objectives Understand the importance of server monitoring Monitor server.
Windows Server 2008 Chapter 11 Last Update
© 2006 Jupitermedia Corporation Webcast TitleSuccessful Rollout Planning 1 January 19, :00pm EST, 11:00am PST George Spafford, President Spafford.
ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.
System Testing There are several steps in testing the system: –Function testing –Performance testing –Acceptance testing –Installation testing.
®® Microsoft Windows 7 for Power Users Tutorial 8 Troubleshooting Windows 7.
DTS Web Hosting, Rates And Services Web Hosting Internet Services Unit May 2006.
Microsoft ® Official Course Module 10 Optimizing and Maintaining Windows ® 8 Client Computers.
Conditions and Terms of Use
Recovery Oriented Computing (ROC) Aaron Brown*, Pete Broadwell, George Candea †, Mike Chen, Leonard Chung*, James Cutler †, Armando Fox †, Archana Ganapathi*,
Computer Maintenance and Troubleshooting
Failure Analysis of the PSTN: 2000 Patricia Enriquez Mills College Oakland, California Mentors: Aaron Brown David Patterson.
Microsoft Reseach, CambridgeBrendan Murphy. Measuring System Behaviour in the field Brendan Murphy Microsoft Research Cambridge.
Maintaining and Updating Windows Server Monitoring Windows Server It is important to monitor your Server system to make sure it is running smoothly.
Why do Internet services fail, and what can be done about it? David Oppenheimer, Archana Ganapathi, and David Patterson Computer Science Division University.
CS 505: Thu D. Nguyen Rutgers University, Spring CS 505: Computer Structures Fault Tolerance Thu D. Nguyen Spring 2005 Computer Science Rutgers.
Akamai capabilities overview and it’s impact on Iowa.Gov and selected web pages.
CH 13 Server and Network Monitoring. Hands-On Microsoft Windows Server Objectives Understand the importance of server monitoring Monitor server.
NetTech Solutions Common Connectivity Problems Lesson Eight.
Introduction to the new mainframe © Copyright IBM Corp., All rights reserved. 1 Main Frame Computing Objectives Explain why data resides on mainframe.
Information Technology Report Trey Felton Manager, IT Service Delivery October 2011 ERCOT Public.
Module 9 Planning and Implementing Monitoring and Maintenance.
Configuring and Deploying Web Applications Lesson 7.
Network management Network management refers to the activities, methods, procedures, and tools that pertain to the operation, administration, maintenance,
Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP.
Section 13: Troubleshooting the Network CSIS 479R Fall 1999 “Network +” George D. Hickman, CNI, CNE.
Jack Malloch Product Service Advisor Global Support Services.
Internet Information Server 6.0 & new management features.
CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors: Kashi Venkatesh Vishwanath ; Nachiappan Nagappan Presented By: Vibhuti Dhiman.
COP 5611 Operating Systems Spring 2010 Dan C. Marinescu Office: HEC 439 B Office hours: M-Wd 1:00-2:00 PM.
By the end of this lesson you will be able to explain: 1. Identify the support categories for reported computer problems 2. Use Remote Assistance to connect.
Wyoming Technology Readiness February Agenda Wyoming Training - Feb Technology Readiness  Schedule of events  Components and System Requirements.
Maintaining and Updating Windows Server 2008 Lesson 8.
MCSA Windows Server 2012 Pass Upgrading Your Skills to MCSA Windows Server 2012 Exam By The Help Of Exams4Sure Get Complete File From
Windows Vista Configuration MCTS : Maintenance and Optimization.
Sources of Failure in the Public Switched Telephone Network
How to Fix MSN error code 80072efd and Messages ?
Hands-On Microsoft Windows Server 2008
Large Distributed Systems
Fault Tolerance & Reliability CDA 5140 Spring 2006
Maximum Availability Architecture Enterprise Technology Centre.
How to Fix MSN error code 403 and Messages ?
Lec3: Network Management
Integration services: Analysis Services:
What I Learned Making a Global Web App
Fault Tolerance Distributed Web-based Systems
Why do Internet services fail, and what can be done about it?
Mattan Erez The University of Texas at Austin July 2015
Managing Services with VMM and App Controller
SAP R/3 Installation on WIN NT-ORACLE
Bethesda Cybersecurity Club
The Service Portal What is the Self-Service Web Portal?
The Troubleshooting theory
Lec1: Introduction to Network Management
Presentation transcript:

Failure Analysis of Two Internet Services Archana Ganapathi

24x7 availability is important for Internet Services Unavailability = MTTR/(MTTR+MTTF) Determine reasons for failure and time to repair! Motivation

Recently in the News… “Technicians were installing routers to upgrade the.Net Messenger service, which underlies both Windows Messenger and MSN Messenger. The technicians incorrectly configured the routers. Service was out from 9 a.m. to 2 p.m. Eastern time on Monday. Ironically, the routers were being installed to make the service more reliable.” --posted on cnn.com techweb on Wednesday January 8 th, 2003

Recently in the News… “ AN INTERNET ROUTING error by AT&T effectively shut off access from around 40 percent of the Internet to several major Microsoft Web sites and services on Thursday, Microsoft has said. Access to Microsoft's Hotmail, Messenger, and Passport services as well as some other MSN services was cut for more than an hour after AT&T made routing changes on its backbone. The changes were made in preparation for the addition of capacity between AT&T and Microsoft that is meant to improve access to the services hit by the outage, said Adam Sohn, a spokesman for Microsoft. ” --posted on infoworld.com on Thursday January 9 th, 2003

Obtain real failure data from three Internet Services –Data in this talk is from two of them Validate/characterize failure based on post mortem reports Investigate failure mitigation techniques Approach

Internet Services Online Service/ Internet Portal Global Content Hosting Service Hits/day~100 million~7 million Collocation sites 2 (500 machines)15 (500 machines) PlatformSolaris on SPARC & x86 Open source x86 OS # of months studied 7 months3 months # of problems

Internet Service Architecture Users Front End Back End FE – forward user requests to BE and possibly process data that is returned BE – data storage units NET – interconnections between FE and BE

Fe Be Net Unk Hw Sw Op Ol Df Ev Un Sw config Hw repair Initial Upgrade Fixing Part All None ROC techniques to mitigate failure

Types of Failures Component Failure –Not visible to customers –If not masked, evolves into a… Service Failure –Visible to customers –Service unavailable to end user or significant performance degradation –Always due to a component failure

Online Note: these are the statistics for 7 months of data

Note: these are the statistics for 3 months of data Content

Observations Hardware component failures are masked well in both services Operator induced failures are hardest to mask Compared to Online, Content is less successful in masking failures, especially operator-induced failures (Online: 33% unmasked, Content: 50% unmasked) Note: more than 15% of problems tracked at Content pertain to administrative/operations machines or services

Service Failure Cause by Location Front End Back End Net- work Un- known Online 77%3%18%2% Content 66%11%18%4%

Service Failure Cause by Component OnlineContent Total: 61 failures in 12 months Total: 56 failures in 3 months

Time to Repair (hr) MinMaxMedianAverageTotal # of problems OP SW HW0.1 1 OP SW HW ONLINE CONTENT 8 months 2 months

Customer Impact Part of Customers Affected: –Part/All/None –Online: 50% affected part, 50% affected all customers (3 months of data) –Content: 100% affected part of customers (2 months of data)

Multiple Event Failures Vertically Cascaded Failures: chain of cascaded component failures that lead to service failure. Horizontally Related Failures: multiple independent component failures that contribute to service failure.

Multiple Event Failure Results Online Service Failures (3 months of data) –41% Vertically Cascaded –0% Horizontally Related Content Service Failures(2 months of data) –0% Vertically Cascaded –25% Horizontally Related

Service Failure Mitigation Techniques Total = 40 service failures Online testing26 Expose/monitor failures to reduce TTR12 Expose/monitor failures to reduce TTD11 Configuration checking9 Redundancy9 Online fault/load injection6 Component Isolation5 Pre-deployment fault/load injection3 Proactive restart3 Pre-deployment correctness testing2

Conclusions/Lessons Operator errors are most impacting –Largest single cause of failure –Largest MTTR Most valuable failure mitigation techniques are: –Online testing –Better monitoring tools, and better exposure of component health and error conditions –Configuration testing and increased redundancy

Research Directions Improve classification based on Vertically Cascading and Horizontally Related service failures Further explore failure mitigation techniques Investigate failure models based on time of day Apply current statistics to develop accurate benchmarks

Related Work “Why do Internet services fail, and what can be done about it?” (Oppenheimer D.) “Failure analysis of the PSTN” (Enriquez P.) “Why do computers stop and what can be done about it?” (Gray J.) “Lessons from giant-scale services” (Brewer E.) “How fail-stop are faulty programs?” (Chandra S. and Chen P.M.) “Networked Windows NT System Field Failure Data Analysis” (Xu J., Kalbarczyk Z. and Iyer R.)