Why do Internet services fail, and what can be done about it?
David Oppenheimer, Archana Ganapathi, and David Patterson
Computer Science Division, University of California at Berkeley
IBM Conference on Proactive Problem Prediction, Avoidance and Diagnosis
April 28, 2003
Slide 2 Motivation
Internet service availability is important
– email, instant messenger, web search, e-commerce, …
User-visible failures are relatively frequent
– especially if use non-binary definition of “failure”
To improve availability, must know what causes failures
– know where to focus research
– objectively gauge potential benefit of techniques
Approach: study failures from real Internet svcs.
– evaluation includes impact of humans & networks
Slide 3 Outline
Describe methodology and services studied
Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
Evaluate HA techniques to see which of them would mitigate the observed failures
Drill down on one cause: operator error
Future directions for studying failure data
Slide 4 Methodology
Obtain “failure” data from three Internet services
– two services: problem tracking database
– one service: post-mortems of user-visible failures
Slide 5 Methodology
Obtain “failure” data from three Internet services
– two services: problem tracking database
– one service: post-mortems of user-visible failures
We analyzed each incident
– failure root cause
» hardware, software, operator, environment, unknown
– type of failure
» “component failure” vs. “service failure”
– time to diagnose + repair (TTR)
Slide 6 Methodology
Obtain “failure” data from three Internet services
– two services: problem tracking database
– one service: post-mortems of user-visible failures
We analyzed each incident
– failure root cause
» hardware, software, operator, environment, unknown
– type of failure
» “component failure” vs. “service failure”
– time to diagnose + repair (TTR)
Did not look at security problems
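The tallying behind the percentages in the next slides can be pictured with a short script. This is only an illustrative sketch, not the authors' actual analysis tooling; the record fields (root_cause, kind, ttr_hours) are hypothetical stand-ins for what a problem-tracking entry or post-mortem provides.

```python
# Illustrative sketch only -- not the authors' tooling. Tally service failures
# by root cause and sum time-to-repair (TTR), as the methodology describes.
from collections import defaultdict

# Hypothetical incident records distilled from a problem-tracking database.
incidents = [
    {"root_cause": "operator", "kind": "service failure",   "ttr_hours": 6.0},
    {"root_cause": "network",  "kind": "component failure", "ttr_hours": 1.5},
    {"root_cause": "software", "kind": "service failure",   "ttr_hours": 2.0},
]

counts = defaultdict(int)        # service failures per root cause
ttr_totals = defaultdict(float)  # summed TTR per root cause

for inc in incidents:
    if inc["kind"] != "service failure":
        continue  # component failures that never became user-visible
    counts[inc["root_cause"]] += 1
    ttr_totals[inc["root_cause"]] += inc["ttr_hours"]

total_failures = sum(counts.values())
for cause, n in counts.items():
    print(f"{cause}: {100 * n / total_failures:.0f}% of service failures, "
          f"{ttr_totals[cause]:.1f} h of TTR")
```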
Slide 7 Comparing the three services
characteristic | Online | ReadMostly | Content
hits per day | ~100 million | ~100 million | ~7 million
# of machines | ~500 @ 2 sites | >2000 @ 4 sites | ~500 @ ~15 sites
front-end node architecture | custom s/w; Solaris on SPARC, x86 | custom s/w; open-source OS on x86 | custom s/w; open-source OS on x86
back-end node architecture | Network Appliance filers | custom s/w; open-source OS on x86 | custom s/w; open-source OS on x86
period studied | 7 months | 6 months | 3 months
# component failures | 296 | N/A | 205
# service failures | 40 | 21 | 56
Slide 8 Outline
Describe methodology and services studied
Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
Evaluate HA techniques to see which of them would mitigate the observed failures
Drill down on one cause: operator error
Future directions for studying failure data
Slide 9 Failure cause by % of service failures
cause | Online | Content | ReadMostly
operator | 33% | 36% | 19%
software | 25% | 25% | 5%
network | 20% | 15% | 62%
hardware | 10% | 2% | —
unknown | 12% | 22% | 14%
Slide 10 Failure cause by % of TTR
cause | Online | Content | ReadMostly
operator | 76% | 75% | 3%
software | 17% | 6% | —
network | 1% | 19% | 97%
hardware | 6% | — | —
unknown | 1% | — | —
Slide 11 Most important failure root cause?
Operator error generally the largest cause of service failure
– even more significant as fraction of total “downtime”
– configuration errors > 50% of operator errors
– generally happened when making changes, not repairs
Network problems significant cause of failures
Slide 12 Related work: failure causes
Tandem systems (Gray)
– 1985: operator 42%, software 25%, hardware 18%
– 1989: operator 15%, software 55%, hardware 14%
VAX (Murphy)
– 1993: operator 50%, software 20%, hardware 10%
Public Telephone Network (Kuhn, Enriquez)
– 1997: operator 50%, software 14%, hardware 19%
– 2002: operator 54%, software 7%, hardware 30%
Slide 13 Outline
Describe methodology and services studied
Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
Evaluate HA techniques to see which of them would mitigate the observed failures
Drill down on one cause: operator error
Future directions for studying failure data
Slide 14 Potential effectiveness of techniques?
technique
– post-deployment correctness testing*
– expose/monitor failures*
– redundancy*
– automatic configuration checking
– post-deploy. fault injection/load testing
– component isolation*
– pre-deployment fault injection/load test
– proactive restart*
– pre-deployment correctness testing*
* indicates technique already used by Online
Slide 15 Potential effectiveness of techniques?
technique | failures avoided / mitigated
post-deployment correctness testing* | 26
expose/monitor failures* | 12
redundancy* | 9
automatic configuration checking | 9
post-deploy. fault injection/load testing | 6
component isolation* | 5
pre-deployment fault injection/load test | 3
proactive restart* | 3
pre-deployment correctness testing* | 2
(40 service failures examined)
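Automatic configuration checking ranks high in the table even though none of the studied services used it. A minimal sketch of the idea follows; the keys and rules are invented for illustration, since the services' real configuration formats are not described in the talk.

```python
# Minimal sketch of automatic configuration checking; keys and rules invented.
REQUIRED_KEYS = {"listen_port", "db_host", "replica_count"}

def check_config(cfg):
    """Return a list of sanity-constraint violations (empty list = config OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS - cfg.keys()]
    if cfg.get("replica_count", 0) < 2:
        problems.append("replica_count < 2: change removes redundancy")
    if not 1 <= cfg.get("listen_port", 0) <= 65535:
        problems.append("listen_port out of range")
    return problems

# Run before pushing a change; a non-empty result blocks the deployment.
print(check_config({"listen_port": 8080, "db_host": "db1", "replica_count": 1}))
# -> ['replica_count < 2: change removes redundancy']
```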
Slide 16 Outline
Describe methodology and services studied
Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
Evaluate existing techniques to see which of them would mitigate the observed failures
Drill down on one cause: operator error
Future directions for studying failure data
Slide 17 Drilling down: operator error
Why does operator error cause so many svc. failures?
Existing techniques (e.g., redundancy) are minimally effective at masking operator error
% of component failures resulting in service failures:
cause | Content | Online
operator | 50% | 25%
software | 24% | 21%
network | 19% | 19%
hardware | 6% | 3%
Slide 18 Drilling down: operator error TTR
Why does operator error contribute so much to TTR?
Detection and diagnosis difficult because of non-failstop failures and poor error checking
% of TTR by cause:
cause | Online | Content
operator | 76% | 75%
software | 17% | 6%
network | 1% | 19%
hardware | 6% | —
unknown | 1% | —
Slide 19 Future directions
Correlate problem reports with end-to-end and per-component metrics
– retrospective: pin down root cause of “unknown” problems
– introspective: detect and determine root cause online
– prospective: detect precursors to failure or SLA violation
– include interactions among distributed services
Create a public failure data repository
– standard failure causes, impact metrics, anonymization
– security (not just reliability)
– automatic analysis (mine for detection, diagnosis, repairs)
Study additional types of sites
– transactional, intranets, peer-to-peer
Perform controlled laboratory experiments
Slide 20 Conclusion
Operator error large cause of failures, downtime
Many failures could be mitigated with
– better post-deployment testing
– automatic configuration checking
– better error detection and diagnosis
Longer-term: concern for operators must be built into systems from the ground up
– make systems robust to operator error
– reduce time it takes operators to detect, diagnose, and repair problems
» continuum from helping operators to full automation
Willing to contribute failure data, or information about problem detection/diagnosis techniques? davidopp@cs.berkeley.edu
Slide 22 Backup Slides
Slide 23 Online architecture
[Architecture diagram: clients send queries/responses over the Internet through a load-balancing switch to roughly 400 web proxy caches, stateless service nodes (e.g. content portals), and stateful service nodes (e.g. mail, news); storage comprises filesystem-based NetApp filers (~65K users; email, newsrc, prefs, etc.), news article storage, and a database of customer records, crypto keys, billing info, etc.; the site is connected to a second site.]
Slide 24 ReadMostly architecture
[Architecture diagram: user queries/responses arrive over the Internet through a load-balancing switch to web front-ends (O(10) total), which reach storage back-ends (O(1000) total) through a second load-balancing switch; the site is connected to a paired backup site.]
Slide 25 Content architecture
[Architecture diagram: user queries/responses arrive over the Internet through a load-balancing switch to paired client service proxies (14 total), which access data storage servers (100 total) and metadata servers; the site is connected to a paired backup site.]
Slide 26 Operator case study #1
Symptom: postings to internal newsgroups are not appearing
Reason: news email server drops postings
Root cause: operator error
– username lookup daemon removed from news email server
Lessons
– operators must understand high-level dependencies and interactions among components
– online testing
» e.g., regression testing after configuration changes (see the sketch below)
– better exposing failures, better diagnosis, …
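The "online testing" lesson could look something like this: after a configuration change, post a uniquely tagged test article and verify it becomes visible end to end. Everything here is hypothetical scaffolding (the posting and lookup functions are stubs, not the studied service's API); it only illustrates the shape of a post-change regression test.

```python
# Hypothetical post-configuration-change regression test (stubs, not a real API).
import time

def post_test_article(newsgroup, body):
    """Submit a uniquely tagged test posting via the news email path. (Stub.)"""
    tag = f"config-regression-{int(time.time())}"
    # ...send the posting tagged with `tag` here...
    return tag

def article_visible(newsgroup, tag):
    """Ask the news front-end whether the tagged posting is readable. (Stub.)"""
    # ...query the front-end here...
    return False

def post_change_regression_test(newsgroup="internal.test", timeout_s=300):
    """Return True if a test posting shows up within the timeout."""
    tag = post_test_article(newsgroup, "post-change canary")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if article_visible(newsgroup, tag):
            return True
        time.sleep(15)
    return False  # alert the operator: the change likely broke posting
```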
Slide 27 Operator case study #2
Symptom: chat service stops working
Reason: service nodes cannot connect to (external) chat service
Root cause: operator error
– operator at chat service reconfigured firewall; accidentally blocked service IP addresses
Lessons
– same as before, but must extend across services
» operators must understand high-level dependencies and interactions among components
» online testing
» better error reporting and diagnosis
– cross-service human collaboration important
Slide 28 Improving detection and diagnosis
Understanding system config. and dependencies
– operator mental model should match changing reality
– including across administrative boundaries
Enabling collaboration
– among operators within and among services
Integration of historical record
– past configs., mon. data, actions, reasons, results (+/-)
– need structured expression of sys. config, state, actions (see the sketch below)
» problem tracking database is unstructured version
Statistical/machine learning techniques to infer misconfiguration and other operator errors?
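A structured historical record might be as simple as appending one machine-readable entry per operator action, so later diagnosis can correlate failures with recent changes. The fields below are illustrative, not a format proposed in the talk.

```python
# Illustrative structured action log (fields invented for this sketch).
import json
from datetime import datetime, timezone

def record_action(operator, component, action, reason, expected_effect,
                  path="action_log.jsonl"):
    """Append one structured entry describing an operator action."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "component": component,   # e.g. a specific news email server
        "action": action,
        "reason": reason,
        "expected_effect": expected_effect,
    }
    with open(path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry

record_action("on-call", "news-email-server",
              action="removed username lookup daemon",
              reason="believed unused",
              expected_effect="none user-visible")
```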
Slide 29 Reducing operator errors
Understanding configuration (previous slide)
Impact analysis
Sanity checking
– built-in sanity constraints
– incorporate site-specific or higher-level rules?
Abstract service description language
– specify desired system configuration/architecture
– for checking: high-level config. is form of semantic redundancy
– enables automation: generate low-level configurations from high-level specification (see the sketch below)
– extend to dynamic behavior?
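The last two bullets fit together: if the desired architecture is written once at a high level, per-node configurations can be generated from it, and the spec doubles as a check against drift. A toy sketch with an invented spec format:

```python
# Toy sketch of generating low-level per-node configs from a high-level spec.
# The spec format is invented for illustration.
spec = {
    "frontend": {"count": 4, "port": 80,   "depends_on": ["storage"]},
    "storage":  {"count": 2, "port": 5432, "depends_on": []},
}

def generate_node_configs(spec):
    """Expand the high-level description into one config dict per node."""
    configs = []
    for tier, t in spec.items():
        upstreams = [f"{dep}-{i}"
                     for dep in t["depends_on"]
                     for i in range(spec[dep]["count"])]
        for i in range(t["count"]):
            configs.append({"node": f"{tier}-{i}",
                            "listen_port": t["port"],
                            "upstreams": upstreams})
    return configs

for cfg in generate_node_configs(spec):
    print(cfg)
# The same spec can also be compared against what is actually deployed
# (semantic redundancy), instead of hand-editing each node's config.
```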
Slide 30 The operator problem
Operator largely ignored in designing server systems
– operator assumed to be an expert, not a first-class user
– impact: causes failures & extends TTD and TTR for failures
– more than 15% of problems tracked at Content pertain to administrative/operations machines or services
More effort needed in designing systems to
– prevent operator error
– help humans detect, diagnose, repair problems due to any cause
Hypothesis: making server systems human-centric will
– reduce incidence and impact of operator error
– reduce time to detect, diagnose, and repair problems
The operator problem is largely a systems problem
– make the uncommon case fast, safe, and easy
Slide 31 Failure location by % of incidents
location | Online | Content | ReadMostly
front-end | 77% | 66% | —
back-end | 3% | 10% | 11%
network | 18% | 18% | 81%
unknown | 2% | 4% | 9%
Slide 32 Summary: failure location
For two services, front-end nodes largest location of service failure incidents
Failure location by fraction of total TTR was service-specific
Need to examine more services to understand what this means
– e.g., what is dependence between # of failures and # of components in each part of service
Slide 33 Operator case study #3
Symptom: problem tracking database disappears
Reason: disk on primary died, then operator re-imaged backup machine
Root cause: hardware failure; operator error?
Lessons
– operators must understand high-level dependencies and interactions among components
» including dynamic system configuration/status
» know when margin of safety is reduced
» hard when redundancy masks component failures
– minimize window of vulnerability whenever possible
– not always easy to define what is a failure
Slide 34 Difficulties with prob. tracking DBs
Forms are unreliable
– incorrectly filled out, vague categories, single cause, …
– we relied on operator narratives
Only gives part of the picture
– better if correlated with per-component logs and end-user availability
– filtering may skew results
» operator can cover up errors before one manifests as a (new) failure => operator failure % is underestimate
– only includes unplanned events
Slide 35 What’s the problem?
Many, many components, with complex interactions
Many failures
– 4-19 user-visible failures per month in “24x7” services
System in constant flux
Modern services span administrative boundaries
Architecting for high availability, performance, and modularity often hides problems
– layers of hierarchy, redundancy, and indirection => hard to know what components involved in processing req
– asynchronous communications => may have no explicit failure notification (if lucky, a timeout)
– built-in redundancy, retry, “best effort” => subtle performance anomalies instead of fail-stop failures
– each component has its own low-level configuration file/mechanism => misunderstood config, wrong new config (e.g., inconsistent)
Slide 36 Failure timeline
[Timeline diagram: component fault → component failure → failure detected → problem in queue for diagnosis → diagnosis initiated → diagnosis completed → problem in queue for repair → repair initiated → repair completed → normal operation; while the component is failed or in repair, service QoS is either impacted negligibly or significantly impacted (a “service failure”).]
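In terms of this timeline, the TTR figures earlier in the talk cover diagnosis plus repair, measured from detection to repair completion. A small worked example with made-up timestamps:

```python
# Made-up timestamps (hours) for the timeline events above.
events = {
    "component_failure":   0.0,
    "failure_detected":    0.5,
    "diagnosis_initiated": 1.0,
    "diagnosis_completed": 3.0,
    "repair_initiated":    3.5,
    "repair_completed":    5.0,
}

ttd = events["failure_detected"] - events["component_failure"]   # time to detect
ttr = events["repair_completed"] - events["failure_detected"]    # diagnose + repair
print(f"TTD = {ttd:.1f} h, TTR = {ttr:.1f} h")  # TTD = 0.5 h, TTR = 4.5 h
```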
Slide 37 Failure mitigation costs
technique | implement. cost | reliability cost | perf. impact
online correctness testing | D | B | B
expose/monitor failures | C | A | A
redundancy | A | A | A
configuration checking | C | A | A
online fault/load injection | F | F | D
component isolation | C | A | C
pre-deploy. fault/load inject | F | A | A
proactive restart | A | A | A
pre-deploy. correctness test | D | A | A
Slide 38 Failure cause by % of TTR
cause | Online | Content | ReadMostly
operator | 76% | 75% | 3%
network | 1% | 19% | 97%
node hardware | 6% | — | —
node software | 17% | 6% | —
node unknown | 1% | — | —
Slide 39 Failure location by % of TTR
location | Online (FE:BE 100:1) | Content (FE:BE 0.1:1) | ReadMostly (FE:BE 1:100)
front-end | 69% | 36% | —
back-end | 17% | 61% | 1%
network | 14% | 3% | 99%
Slide 40 Geographic distribution
1. Online service/portal
2. Global storage service
3. High-traffic Internet site