Why do Internet services fail, and what can be done about it? David Oppenheimer, Archana Ganapathi, and David Patterson Computer Science Division University of California at Berkeley IBM Conference on Proactive Problem Prediction, Avoidance and Diagnosis April 28, 2003

Slide 2 Motivation Internet service availability is important – e-mail, instant messenger, web search, e-commerce, … User-visible failures are relatively frequent –especially if one uses a non-binary definition of “failure” To improve availability, must know what causes failures –know where to focus research –objectively gauge potential benefit of techniques Approach: study failures from real Internet svcs. –evaluation includes impact of humans & networks

Slide 3 Outline Describe methodology and services studied Identify most significant failure root causes –source: type of component –impact: number of incidents, contribution to TTR Evaluate HA techniques to see which of them would mitigate the observed failures Drill down on one cause: operator error Future directions for studying failure data

Slide 4 Methodology Obtain “failure” data from three Internet services –two services: problem tracking database –one service: post-mortems of user-visible failures

Slide 5 Methodology Obtain “failure” data from three Internet services –two services: problem tracking database –one service: post-mortems of user-visible failures We analyzed each incident –failure root cause »hardware, software, operator, environment, unknown –type of failure »“component failure” vs. “service failure” –time to diagnose + repair (TTR)

Slide 6 Methodology Obtain “failure” data from three Internet services –two services: problem tracking database –one service: post-mortems of user-visible failures We analyzed each incident –failure root cause »hardware, software, operator, environment, unknown –type of failure »“component failure” vs. “service failure” –time to diagnose + repair (TTR) Did not look at security problems
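The classification above maps naturally onto a small structured record per incident. The sketch below is a hypothetical illustration only (field names, category labels, and the example values are invented; it is not the authors' actual analysis tooling):

```python
from dataclasses import dataclass
from enum import Enum

class RootCause(Enum):
    # Categories from slide 5, plus "network", which appears in the later breakdowns.
    HARDWARE = "hardware"
    SOFTWARE = "software"
    OPERATOR = "operator"
    NETWORK = "network"
    ENVIRONMENT = "environment"
    UNKNOWN = "unknown"

class FailureType(Enum):
    COMPONENT_FAILURE = "component failure"   # component failed, service masked it
    SERVICE_FAILURE = "service failure"       # user-visible impact on the service

@dataclass
class Incident:
    service: str              # "Online", "ReadMostly", or "Content"
    root_cause: RootCause
    failure_type: FailureType
    ttr_hours: float          # time to diagnose + repair (TTR)

# One hand-coded example record; the values are illustrative only.
example = Incident("Online", RootCause.OPERATOR, FailureType.SERVICE_FAILURE, ttr_hours=6.0)
```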

Slide 7 Comparing the three services
characteristic: Online / ReadMostly / Content
hits per day: ~100 million / – / ~7 million
# of machines: 2 sites / > 4 sites / ~15 sites
front-end node architecture: custom s/w, Solaris on SPARC and x86 / custom s/w, open-source OS on x86 / custom s/w, open-source OS on x86
back-end node architecture: Network Appliance filers / custom s/w, open-source OS on x86 / –
period studied: 7 months / 6 months / 3 months
# component failures: 296 / N/A / 205
# service failures: – / – / –

Slide 8 Outline Describe methodology and services studied Identify most significant failure root causes –source: type of component –impact: number of incidents, contribution to TTR Evaluate HA techniques to see which of them would mitigate the observed failures Drill down on one cause: operator error Future directions for studying failure data

Slide 9 Failure cause by % of service failures
Online: hardware 10%, software 25%, network 20%, operator 33%, unknown 12%
Content: hardware 2%, software 25%, network 15%, operator 36%, unknown 22%
ReadMostly: software 5%, network 62%, operator 19%, unknown 14%

Slide 10 Failure cause by % of TTR
Online: hardware 6%, software 17%, network 1%, operator 76%, unknown 1%
Content: software 6%, network 19%, operator 75%
ReadMostly: network 97%, operator 3%
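The breakdowns on slides 9 and 10 are simple aggregations over incident records: count service failures per root cause, or weight each by its TTR. A minimal sketch under that assumption (the record format and sample values are invented for illustration):

```python
from collections import defaultdict

# Each record: (root_cause, is_service_failure, ttr_hours) -- illustrative only.
def cause_breakdown(records, by_ttr=False):
    """Percentage of service failures (or of total TTR) attributed to each cause.

    by_ttr=False -> share of incident count (slide 9 style)
    by_ttr=True  -> share of summed time-to-repair (slide 10 style)
    """
    totals = defaultdict(float)
    for cause, is_service_failure, ttr_hours in records:
        if not is_service_failure:
            continue                      # slides 9-10 count only service failures
        totals[cause] += ttr_hours if by_ttr else 1.0
    grand = sum(totals.values()) or 1.0
    return {c: round(100.0 * w / grand) for c, w in totals.items()}

sample = [("operator", True, 8.0), ("network", True, 1.5), ("software", False, 0.5)]
print(cause_breakdown(sample))               # share of service failures by cause
print(cause_breakdown(sample, by_ttr=True))  # share of downtime (TTR) by cause
```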

Slide 11 Most important failure root cause? Operator error generally the largest cause of service failure –even more significant as fraction of total “downtime” –configuration errors > 50% of operator errors –generally happened when making changes, not repairs Network problems significant cause of failures

Slide 12 Related work: failure causes Tandem systems (Gray) –1985: Operator 42%, software 25%, hardware 18% –1989: Operator 15%, software 55%, hardware 14% VAX (Murphy) –1993: Operator 50%, software 20%, hardware 10% Public Telephone Network (Kuhn, Enriquez) –1997: Operator 50%, software 14%, hardware 19% –2002: Operator 54%, software 7%, hardware 30%

Slide 13 Outline Describe methodology and services studied Identify most significant failure root causes –source: type of component –impact: number of incidents, contribution to TTR Evaluate HA techniques to see which of them would mitigate the observed failures Drill down on one cause: operator error Future directions for studying failure data

Slide 14 Potential effectiveness of techniques?
technique:
–post-deployment correctness testing*
–expose/monitor failures*
–redundancy*
–automatic configuration checking
–post-deploy. fault injection/load testing
–component isolation*
–pre-deployment fault injection/load test
–proactive restart*
–pre-deployment correctness testing*
* indicates technique already used by Online

Slide 15 Potential effectiveness of techniques? (40 service failures examined)
technique — failures avoided/mitigated:
–post-deployment correctness testing*: 26
–expose/monitor failures*: 12
–redundancy*: 9
–automatic configuration checking: 9
–post-deploy. fault injection/load testing: 6
–component isolation*: 5
–pre-deployment fault injection/load test: 3
–proactive restart*: 3
–pre-deployment correctness testing*: 2
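The counts were obtained by asking, for each of the 40 service failures, which techniques could plausibly have avoided or mitigated it, then tallying per technique. A hypothetical sketch of that tally (the failure IDs and the particular mapping are invented):

```python
from collections import Counter

# Hypothetical mapping: failure id -> set of techniques judged able to avoid/mitigate it.
mitigations = {
    "F-01": {"post-deployment correctness testing", "redundancy"},
    "F-02": {"automatic configuration checking"},
    "F-03": {"expose/monitor failures", "post-deployment correctness testing"},
    # ... one entry per examined service failure
}

tally = Counter()
for techniques in mitigations.values():
    tally.update(techniques)             # one failure can credit several techniques

for technique, n in tally.most_common():
    print(f"{technique}: {n} failures avoided/mitigated")
```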

Slide 16 Outline Describe methodology and services studied Identify most significant failure root causes –source: type of component –impact: number of incidents, contribution to TTR Evaluate existing techniques to see which of them would mitigate the observed failures Drill down on one cause: operator error Future directions for studying failure data

Slide 17 Drilling down: operator error
Why does operator error cause so many svc. failures? Existing techniques (e.g., redundancy) are minimally effective at masking operator error
% of component failures resulting in service failures:
Content: operator 50%, software 24%, network 19%, hardware 6%
Online: operator 25%, software 21%, network 19%, hardware 3%
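These conversion rates are conditional frequencies over the component-failure records; a small sketch, assuming an invented record format, of how "percentage of component failures resulting in service failures" per cause could be computed:

```python
from collections import defaultdict

# Each record: (root_cause, became_service_failure) for one component failure -- illustrative.
def service_failure_rate_by_cause(records):
    totals, escalated = defaultdict(int), defaultdict(int)
    for cause, became_service_failure in records:
        totals[cause] += 1
        if became_service_failure:
            escalated[cause] += 1
    return {c: round(100.0 * escalated[c] / totals[c]) for c in totals}

sample = [("operator", True), ("operator", False), ("hardware", False), ("network", True)]
print(service_failure_rate_by_cause(sample))  # {'operator': 50, 'hardware': 0, 'network': 100}
```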

Slide 18 Drilling down: operator error TTR
Why does operator error contribute so much to TTR? Detection and diagnosis difficult because of non-failstop failures and poor error checking
Failure cause by % of TTR:
Online: hardware 6%, software 17%, network 1%, operator 76%, unknown 1%
Content: software 6%, network 19%, operator 75%

Slide 19 Future directions Correlate problem reports with end-to-end and per-component metrics –retrospective: pin down root cause of “unknown” problems –introspective: detect and determine root cause online –prospective: detect precursors to failure or SLA violation –include interactions among distributed services Create a public failure data repository –standard failure causes, impact metrics, anonymization –security (not just reliability) –automatic analysis (mine for detection, diagnosis, repairs) Study additional types of sites –transactional, intranets, peer-to-peer Perform controlled laboratory experiments

Slide 20 Conclusion Operator error large cause of failures, downtime Many failures could be mitigated with –better post-deployment testing –automatic configuration checking –better error detection and diagnosis Longer-term: concern for operators must be built into systems from the ground up –make systems robust to operator error –reduce time it takes operators to detect, diagnose, and repair problems »continuum from helping operators to full automation

Willing to contribute failure data, or information about problem detection/diagnosis techniques?

Slide 22 Backup Slides

Slide 23 Online architecture (diagram). Labels: web proxy cache (400 total); stateless services (e.g. content portals) (8); stateful services (e.g. mail, news) (48 total); (50 total); storage of customer records, crypto keys, billing info, etc.; Internet; load-balancing switch; clients; (6 total); filesystem-based storage (NetApp) – ~65K users: e-mail, newsrc, prefs, etc.; news article storage; database; to second site; user queries/responses

Slide 24 ReadMostly architecture (diagram). Labels: clients; Internet; load-balancing switch; web front-ends (O(10) total); load-balancing switch; storage back-ends (O(1000) total); to paired backup site; user queries/responses

Slide 25 Content architecture (diagram). Labels: clients; Internet; load-balancing switch; paired client service proxies (14 total); data storage servers (100 total); metadata servers; to paired backup site; user queries/responses

Slide 26 Operator case study #1 Symptom: postings to internal newsgroups are not appearing Reason: news server drops postings Root cause: operator error –username lookup daemon removed from news server Lessons –operators must understand high-level dependencies and interactions among components –online testing »e.g., regression testing after configuration changes –better exposing failures, better diagnosis, …
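As one concrete (and deliberately simplistic) reading of the "regression testing after configuration changes" lesson, a post-change smoke test for the news server might look like the sketch below; the host name and the bare NNTP greeting check are assumptions for illustration, not the service's actual tooling:

```python
import socket

def smoke_test_news_server(host: str, port: int = 119, timeout: float = 5.0) -> bool:
    """Crude post-change check: can we reach the NNTP port and get a posting-allowed greeting?

    A fuller regression test would post to a scratch newsgroup and verify the article
    appears, which is the kind of check that would have caught the dropped postings here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            greeting = sock.recv(512).decode(errors="replace")
            return greeting.startswith("200")   # 200 = service available, posting allowed
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical host name; run immediately after any news-server configuration change.
    ok = smoke_test_news_server("news.internal.example.com")
    print("news server OK" if ok else "ALERT: news server post-change check failed")
```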

Slide 27 Operator case study #2 Symptom: chat service stops working Reason: service nodes cannot connect to (external) chat service Root cause: operator error –operator at chat service reconfigured firewall; accidentally blocked service IP addresses Lessons –same as before, but must extend across services »operators must understand high-level dependencies and interactions among components »online testing »better error reporting and diagnosis –cross-service human collaboration important

Slide 28 Improving detection and diagnosis Understanding system config. and dependencies –operator mental model should match changing reality –including across administrative boundaries Enabling collaboration –among operators within and among services Integration of historical record –past configs., mon. data, actions, reasons, results (+/-) –need structured expression of sys. config, state, actions »problem tracking database is unstructured version Statistical/machine learning techniques to infer misconfiguration and other operator errors?
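One simple statistical approach to the question in the last bullet is majority-vote comparison of configuration values across a fleet, flagging hosts that deviate from a strong consensus. The sketch below is an illustrative assumption of ours, not a technique described in the talk:

```python
from collections import Counter

def flag_config_outliers(configs: dict[str, dict[str, str]], min_agreement: float = 0.8):
    """configs: host -> {config key -> value}.

    Returns (host, key, value, majority_value) tuples where a host disagrees with a
    value shared by at least min_agreement of the hosts that define that key."""
    outliers = []
    keys = {k for kv in configs.values() for k in kv}
    for key in keys:
        values = [kv[key] for kv in configs.values() if key in kv]
        (majority, count), = Counter(values).most_common(1)
        if count / len(values) < min_agreement:
            continue                      # no strong consensus; nothing to flag
        for host, kv in configs.items():
            if key in kv and kv[key] != majority:
                outliers.append((host, key, kv[key], majority))
    return outliers

fleet = {
    "fe1": {"max_conns": "1024", "timeout_s": "30"},
    "fe2": {"max_conns": "1024", "timeout_s": "30"},
    "fe3": {"max_conns": "1024", "timeout_s": "3"},   # likely an operator typo
}
print(flag_config_outliers(fleet, min_agreement=0.6))  # small fleet, so relax the threshold
```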

Slide 29 Reducing operator errors Understanding configuration (previous slide) Impact analysis Sanity checking –built-in sanity constraints –incorporate site-specific or higher-level rules? Abstract service description language –specify desired system configuration/architecture –for checking: high-level config. is form of semantic redundancy –enables automation: generate low-level configurations from high-level specification –extend to dynamic behavior?

Slide 30 The operator problem Operator largely ignored in designing server systems –operator assumed to be an expert, not a first-class user –impact: causes failures & extends TTD and TTR for failures –more than 15% of problems tracked at Content pertain to administrative/operations machines or services More effort needed in designing systems to –prevent operator error –help humans detect, diagnose, repair problems due to any cause Hypothesis: making server systems human-centric –reduce incidence and impact of operator error –reduce time to detect, diagnose, and repair problems The operator problem is largely a systems problem –make the uncommon case fast, safe, and easy

Slide 31 Failure location by % of incidents
Online: front-end 77%, back-end 3%, network 18%, unknown 2%
Content: front-end 66%, back-end 10%, network 18%, unknown 4%
ReadMostly: back-end 11%, network 81%, unknown 9%

Slide 32 Summary: failure location For two services, front-end nodes largest location of service failure incidents Failure location by fraction of total TTR was service-specific Need to examine more services to understand what this means –e.g., what is dependence between # of failures and # of components in each part of service

Slide 33 Operator case study #3 Symptom: problem tracking database disappears Reason: disk on primary died, then operator re-imaged backup machine Root cause: hardware failure; operator error? Lessons –operators must understand high-level dependencies and interactions among components »including dynamic system configuration/status »know when margin of safety is reduced »hard when redundancy masks component failures –minimize window of vulnerability whenever possible –not always easy to define what is a failure

Slide 34 Difficulties with prob. tracking DB’s Forms are unreliable –incorrectly filled out, vague categories, single cause, … –we relied on operator narratives Only gives part of the picture »better if correlated with per-component logs and end-user availability »filtering may skew results »operator can cover up errors before they manifest as a (new) failure => operator failure % is an underestimate »only includes unplanned events

Slide 35 What’s the problem? Many, many components, with complex interactions Many failures –4-19 user-visible failures per month in “24x7” services System in constant flux Modern services span administrative boundaries Architecting for high availability, performance, and modularity often hides problems –layers of hierarchy, redundancy, and indirection => hard to know which components are involved in processing a request –asynchronous communications => may have no explicit failure notification (if lucky, a timeout) –built-in redundancy, retry, “best effort” => subtle performance anomalies instead of fail-stop failures –each component has its own low-level configuration file/mechanism => misunderstood config, wrong new config (e.g., inconsistent)

Slide 36 Failure timeline (diagram): component fault → component failure → failure detected → problem in queue for diagnosis → diagnosis initiated → problem in diagnosis → diagnosis completed → problem in queue for repair → repair initiated → component in repair → repair completed. During this interval, service QoS may be impacted negligibly or significantly impacted (a “service failure”); normal operation resumes after repair completes.
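The timeline implies the usual decomposition of repair time into detection, diagnosis, and repair intervals. A small sketch (with invented timestamps) of how TTD and TTR, as defined on slide 5, would be derived from these events:

```python
from datetime import datetime

# Hypothetical event timestamps for one incident, matching the timeline's stages.
events = {
    "component_failure":   datetime(2003, 4, 1, 2, 0),
    "failure_detected":    datetime(2003, 4, 1, 2, 40),
    "diagnosis_initiated": datetime(2003, 4, 1, 3, 0),
    "diagnosis_completed": datetime(2003, 4, 1, 5, 0),
    "repair_initiated":    datetime(2003, 4, 1, 5, 10),
    "repair_completed":    datetime(2003, 4, 1, 6, 0),
}

ttd       = events["failure_detected"]    - events["component_failure"]    # time to detect
diagnosis = events["diagnosis_completed"] - events["diagnosis_initiated"]
repair    = events["repair_completed"]    - events["repair_initiated"]
ttr       = events["repair_completed"]    - events["failure_detected"]     # diagnose + repair

print(f"TTD={ttd}, diagnosis={diagnosis}, repair={repair}, TTR={ttr}")
```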

Slide 37 Failure mitigation costs (implementation cost / reliability cost / performance impact)
online correctness testing: D / B / B
expose/monitor failures: C / A / A
redundancy: A / A / A
configuration checking: C / A / A
online fault/load injection: F / F / D
component isolation: C / A / C
pre-deploy. fault/load injection: F / A / A
proactive restart: A / A / A
pre-deploy. correctness testing: D / A / A

Slide 38 Failure cause by % of TTR
Online: operator 76%, network 1%, node hardware 6%, node software 17%, node unknown 1%
Content: operator 75%, network 19%, node software 6%
ReadMostly: operator 3%, network 97%

Slide 39 Failure location by % of TTR
Online (FE:BE 100:1): front-end 69%, back-end 17%, network 14%
Content (FE:BE 0.1:1): front-end 36%, back-end 61%, network 3%
ReadMostly (FE:BE 1:100): network 99%, front-end/back-end 1%

Slide 40 Geographic distribution (map): 1. Online service/portal; 2. Global storage service; 3. High-traffic Internet site