Why do Internet services fail, and what can be done about it?

Why do Internet services fail, and what can be done about it?
David Oppenheimer (davidopp@cs.berkeley.edu)
ROC Group, UC Berkeley
ROC Retreat, June 2002

Motivation
- Little understanding of the real problems in maintaining 24x7 Internet services
- Identify the common failure causes of real-world Internet services (these are often closely guarded corporate secrets)
- Identify techniques that would mitigate the observed failures
- Determine a fault model for availability and recoverability benchmarks

Sites examined
- Online service/portal: ~500 machines, 2 facilities; ~100 million hits/day; all service software custom-written (SPARC/Solaris)
- Global content hosting service: ~500 machines, 4 colo facilities + customer sites; all service software custom-written (x86/Linux)
- Read-mostly Internet site: thousands of machines, 4 facilities; all service software custom-written (x86)

Outline
- Motivation
- Terminology and methodology of the study
- Analysis of root causes of faults and failures
- Analysis of techniques for mitigating failure
- Potential future work

Terminology and Methodology (I)
Examined 2 operations problem tracking databases and 1 failure post-mortem report log.
Two kinds of failures:
- Component failure (“fault”): hard drive failure, software bug, network switch failure, operator configuration error, ...; may be masked, but if not, becomes a...
- Service failure (“failure”): prevents an end-user from accessing the service or a part of the service, or significantly degrades a user-visible aspect of performance; inferred from the problem report, not measured externally.
Every service failure is due to a component failure.

Terminology and Methodology (II)

Service      # of component failures   # of resulting service failures   Period covered in problem reports
Online       85                        18                                4 months
Content      99                        20                                1 month
ReadMostly   N/A                       21                                6 months

(Note that the services are not directly comparable.)
Problems are categorized by “root cause”: the first component that failed in the chain of events leading up to the observed failure.
Two axes for categorizing root cause (illustrated in the sketch below):
- location: front-end, back-end, network, unknown
- type: node h/w, node s/w, net h/w, net s/w, operator, environment, overload, unknown
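
The records behind such a categorization might look like the following minimal sketch; the field and enum names are assumptions for illustration, not the schema of the studied problem-tracking databases:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Location(Enum):
    FRONT_END = auto()
    BACK_END = auto()
    NETWORK = auto()
    UNKNOWN = auto()

class CauseType(Enum):
    NODE_HW = auto()
    NODE_SW = auto()
    NET_HW = auto()
    NET_SW = auto()
    OPERATOR = auto()
    ENVIRONMENT = auto()
    OVERLOAD = auto()
    UNKNOWN = auto()

@dataclass
class ProblemReport:
    service: str                 # "Online", "Content", or "ReadMostly"
    location: Location           # where the root-cause component lives
    cause: CauseType             # what kind of component failed first
    service_failure: bool        # True if the fault became user-visible
    ttr_hours: Optional[float]   # time to repair, if recorded
```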

Component failure → service failure
Redundancy does a good job of preventing hardware, software, and network component failures from turning into service failures, but it is much less effective at preventing operator failures from becoming service failures (operators affect the entire service). Network failures are masked less often because there is a smaller degree of network redundancy than of node redundancy.
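
Continuing the sketch above, the degree of masking per cause type could be estimated from such records; this is an illustrative calculation over the hypothetical ProblemReport records introduced earlier:

```python
from collections import Counter

def masking_rate(reports):
    """Fraction of component failures of each cause type that were masked,
    i.e. did not turn into user-visible service failures."""
    faults, failures = Counter(), Counter()
    for r in reports:
        faults[r.cause] += 1
        if r.service_failure:
            failures[r.cause] += 1
    return {cause: 1 - failures[cause] / faults[cause] for cause in faults}
```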

Service failure (“failure”) causes

By location:
             front-end   back-end   network   unknown
Online       72%         28%        -         -
Content      55%         20%        5%        -
ReadMostly   0%          10%        81%       9%

Front-end machines are a significant cause of failure.

By type (columns: node op, net op, node hw, net hw, node sw, net sw, node unk, net unk):
Online       33%  6%  17%  22%  6%
Content      45%  5%  25%  15%
ReadMostly   5%   14%  10%  5%   19%

Contrary to conventional wisdom, front-end machines are a significant source of failure, largely due to operator error in configuring and administering front-end software. Almost all of the problems in ReadMostly were network-related (probably because of simpler and better-tested application software, less operator futzing, and a higher degree of hardware redundancy). Operator error was the largest single cause of service failures in Online and Content, and these were configuration rather than procedural errors (100%, 89%, and 50% of operator errors were configuration errors, mostly made while operators were making changes to the system during normal maintenance). Networking problems were the largest cause of failure in ReadMostly; they are somewhat harder to mask than node hardware and software problems, and networks are often a single point of failure. There were no environment problems, thanks to the colocation facilities.
Operator error is the largest cause of failure for two services; network problems for one service.

Service failure average TTR (hours)

By location (average TTR in hours):
             front-end   back-end   network
Online       9.7         10.2       0.75 (*)
Content      2.5         14         1.2 (*)
ReadMostly   -           0.17 (*)   1.2

By type (columns: node op, net op, node hw, net hw, node sw, net sw, net unk; average TTR in hours):
Online       15   1.7 (*)   0.5 (*)   3.7 (4)
Content      1.2  0.23      1.2 (*)
ReadMostly   0.17 (*)   0.13   6.0 (*)   1.0   0.11

(*) denotes only 1-2 failures in this category.

Front-end TTR < back-end TTR. Network problems have the smallest TTR.
The TTR and mitigation analyses for Online cover a different set of problems. Front-end failures outnumber back-end failures, but TTR(front-end) < TTR(back-end): in Content this was driven by back-end software bugs that were hard to isolate and fix, and in Online by operator errors that were hard to undo. Network problems have the smallest TTR.
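
An average-TTR breakdown like the one above could be computed from the same hypothetical records; a minimal sketch, assuming the ProblemReport fields introduced earlier:

```python
from collections import defaultdict

def average_ttr_by_location(reports):
    """Average time-to-repair (hours) per root-cause location, counting only
    user-visible service failures that have a recorded TTR."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in reports:
        if r.service_failure and r.ttr_hours is not None:
            totals[r.location] += r.ttr_hours
            counts[r.location] += 1
    return {loc: totals[loc] / counts[loc] for loc in totals}
```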

Component failure (“fault”) causes

By location:
             front-end   back-end   network
Online       76%         5%         19%
Content      34%         30%        -

Component failures arise primarily in the front-end.

By type:
             node op   net op   node hw   net hw   node sw   net sw   node unk   net unk   env
Online       12%       5%       38%       5%       12%       4%       4%         5%        0%
Content      18%       1%       4%        1%       41%       1%       1%         27%       1%

Operator errors are less common than hardware/software component failures, but are less frequently masked.
Component failures, like service failures, arise primarily in the front-end. Among component failures, hardware and/or software problems dominate operator errors; operator errors end up dominating service failures because they are less frequently masked, not because they are more common.

Techniques for mitigating failure (I)
How techniques could have helped. Techniques we studied:
- testing (pre-deployment or online)
- redundancy
- fault injection and load testing (pre-deployment or online)
- configuration checking (see the sketch below)
- isolation
- restart
- better exposing and diagnosing problems

How each technique helps:
- Testing: pre-deployment testing keeps faulty components out of the deployed system; online testing detects faulty components once deployed.
- Redundancy: prevents component failures from becoming service failures.
- Fault/load injection: pre-deployment injection avoids deploying faulty components; online injection detects them before they fail.
- Configuration checking: prevents faulty configuration files in deployed systems.
- Isolation: prevents a component failure from becoming a service failure.
- Restart: prevents faulty components with latent errors from failing in the deployed system.
- Exposing/monitoring: reduces time to detect (TTD) and time to repair (TTR).
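
As one concrete illustration of the configuration-checking idea, here is a minimal sketch of a pre-deployment check; the keys and rules are invented for illustration and are not taken from the studied services:

```python
# Hypothetical keys and rules; real services would derive these from their own
# deployment requirements.
REQUIRED_KEYS = {"db_host", "db_port", "cache_size_mb", "replica_count"}

def check_config(config: dict) -> list[str]:
    """Return a list of problems found in a candidate configuration;
    an empty list means the config passes the check."""
    errors = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if int(config.get("replica_count", 0)) < 2:
        errors.append("replica_count < 2 leaves no redundancy")
    if not (1 <= int(config.get("db_port", 0)) <= 65535):
        errors.append("db_port out of range")
    return errors
```

A check like this would run as part of rollout, blocking deployment whenever the error list is non-empty.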

Techniques for mitigating failure (II)
Number of problems mitigated (out of 19):
- online testing: 11
- redundancy: 8
- online fault/load injection, configuration checking: 3
- isolation, pre-deployment fault/load injection: 2
- restart, pre-deployment correctness testing, better exposing/monitoring errors (TTD), better exposing/monitoring errors (TTR): 1

Comments on studying failure data
- The problem tracking DB may skew results: an operator can cover up an error before it manifests as a (new) failure.
- The multiple-choice fields of the problem reports were much less useful than the operator narrative: form categories were not filled out correctly, were not specific enough, and didn’t allow multiple causes.
- No measure of customer impact.
- How would you build an anonymized meta-database?
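
A sketch of what an anonymized meta-database record might look like; the hashing scheme and field names are assumptions, not a proposal from the study:

```python
import hashlib

def anonymize_report(report: dict, site_salt: str) -> dict:
    """Strip identifying detail from a problem report before contributing it
    to a shared, cross-site failure database."""
    def blind(value: str) -> str:
        return hashlib.sha256((site_salt + value).encode()).hexdigest()[:12]
    return {
        "site_id": blind(report["site_name"]),   # stable but unlinkable site label
        "component": blind(report["hostname"]),  # per-site pseudonym for the host
        "location": report["location"],          # front-end / back-end / network
        "cause": report["cause"],                # operator, node hw, node sw, ...
        "ttr_hours": report["ttr_hours"],
        "narrative": None,                       # free text dropped: too identifying
    }
```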

Future work (I)
Continuing analysis of failure data:
- A new site? (e-commerce, storage system vendor, ...)
- More problems from Content and Online, to say something more statistically meaningful about MTTR, the value of approaches to mitigating problems, cascading failures, and problem scopes.
- A different time period from Content (longitudinal study).
- Additional metrics? Taking customer impact into account (customer-minutes, fraction of service affected, ...).
- The nature of the original fault, and how it was fixed?
- A standardized, anonymized failure database?

Future work (II)
Recovery benchmarks (akin to dependability benchmarks):
- Use failure data to determine the fault model for fault injection.
- Recovery benchmark goals:
  - evaluate existing recovery mechanisms: common-case overhead, recovery performance, correctness, ...
  - match user needs/policies to available recovery mechanisms
  - design systems with efficient, tunable recovery properties: systems can be built/configured to have different recoverability characteristics (RAID levels, checkpointing frequency, degree of error checking, etc.)
- Procedure (a harness for this is sketched below):
  - choose an application (storage system, three-tier application, globally distributed/p2p app, etc.)
  - choose a workload (user requests + operator preventative maintenance and service upgrades)
  - choose a representative faultload based on failure data
  - choose QoS metrics (latency, throughput, fraction of service available, # of users affected, data consistency, data loss, degree of remaining redundancy, ...)
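
A minimal sketch of the benchmark procedure as a harness loop; the system, workload, faultload, and QoS-probe interfaces are invented placeholders, not an existing framework:

```python
import random
import time

def recovery_benchmark(system, workload, faultload, qos_probe, duration_s=600):
    """Drive a steady workload, inject faults drawn from a failure-data-derived
    faultload, and sample QoS metrics so recovery behavior can be examined
    over time."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        workload.issue_requests(system)          # user requests + operator actions
        if random.random() < faultload.injection_probability():
            fault = faultload.sample_fault()     # e.g. kill a node, drop a link
            fault.inject(system)
        samples.append((time.time(), qos_probe(system)))  # latency, throughput, ...
        time.sleep(1.0)
    return samples
```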

Future Work (III)
Recovery benchmarks, continued. Issues:
- A language for describing faults and their frequencies (hardware, software, network including WAN, operator) that allows automated stochastic fault injection (see the sketch below).
- Quantitative models for describing data protection/recovery mechanisms and how faults affect QoS, for isolated and correlated faults; we would like to allow prediction of the recovery behavior of a single component and of systems of components.
- Synthesizing overall recoverability metric(s).
- Defining workloads for systems with complicated interfaces (e.g., whole “services”).
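
A toy example of what such a fault-description language might reduce to in practice; the fault kinds and weights below are placeholders, not measurements from the study:

```python
import random

# Each entry pairs a fault description with a relative frequency; in a real
# benchmark the weights would come from measured failure data.
FAULT_MODEL = [
    {"kind": "operator_misconfig", "scope": "front-end", "weight": 0.30},
    {"kind": "node_sw_crash",      "scope": "front-end", "weight": 0.25},
    {"kind": "net_switch_down",    "scope": "network",   "weight": 0.20},
    {"kind": "node_hw_disk",       "scope": "back-end",  "weight": 0.15},
    {"kind": "wan_partition",      "scope": "network",   "weight": 0.10},
]

def sample_fault(model=FAULT_MODEL):
    """Draw one fault for stochastic injection, weighted by relative frequency."""
    return random.choices(model, weights=[f["weight"] for f in model], k=1)[0]
```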

Conclusion
Failure causes:
- Operator error is the #1 contributor to service failures.
- Operator error is the most difficult type of failure to mask; it is generally due to configuration errors.
- Front-end software can be a significant cause of user-visible failures.
- Back-end failures, while infrequent, take longer to repair than front-end failures.
Mitigating failures:
- Online correctness testing would have helped a lot, but is hard to implement.
- Better exposing of and monitoring for failures would have helped a lot, but must be built in from the ground up.
- For configuration problems, match the system architecture to the actual configuration.
- Redundancy, isolation, incremental rollout, restart, offline testing, and operator+developer interaction are all important (and often already used).

Backup Slides

Techniques for mitigating failure (III)
Implementation cost / potential reliability cost / performance impact:
- online correctness testing: medium to high / low to medium
- redundancy: low / very low
- online fault/load injection: high
- config checking: medium / zero
- isolation
- pre-deployment fault/load injection
- restart
- pre-deployment testing
- better exposing/monitoring failures: low (false alarms)

Geographic distribution
1. Online service/portal
2. Global storage service
3. High-traffic Internet site

1. Online service/portal site (architecture diagram): clients on the Internet reach a load-balancing switch and a web proxy cache tier (x86/Solaris); behind them sit stateless workers for stateless services (e.g. content portals) and stateless workers for stateful services (e.g. mail, news, favorites), both on SPARC/Solaris; state is held in filesystem-based storage (NetApp) for customer records, crypto keys, billing info, etc., per-user state for ~65K users (email, newsrc, prefs, etc.), and news article storage, plus a database.

2. Global content hosting service site (architecture diagram): clients and paired service proxies reach the site over the Internet; a load-balancing switch fronts metadata servers (14 total) and data storage servers (100 total), with a link to a paired backup site.

3. Read-mostly Internet site (architecture diagram): user queries/responses from clients arrive over the Internet at load-balancing switches, which feed web front-ends (30 total) and storage back-ends (3000 total), with links to paired backup sites.