Embracing Failure: A Case for Recovery-Oriented Computing

Aaron B. Brown, David A. Patterson
Presented by John Calandrino

Motivation
A survey in 2000 (one year before this paper was written) found:
  65% of surveyed web sites had customer-visible downtime at least once every 6 months
  25% had downtime 3 or more times in that period
Is this “five-nines” availability?
  180 days contain 259,200 minutes
  “Five-nines” allows no more than ~2.6 minutes of downtime in that span (barely “customer visible”)
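The downtime budget quoted above can be checked with a few lines of arithmetic. This is only a sketch; the function name `downtime_budget_minutes` is made up for illustration and does not come from the paper.

```python
def downtime_budget_minutes(availability: float, days: float) -> float:
    """Maximum downtime, in minutes, allowed over `days` at the given availability."""
    total_minutes = days * 24 * 60          # 180 days -> 259,200 minutes
    return total_minutes * (1.0 - availability)


# "Five-nines" (99.999%) over the 6-month (180-day) survey window.
budget = downtime_budget_minutes(0.99999, 180)
print(f"{budget:.3f} minutes")  # prints "2.592 minutes"
```

A single customer-visible outage in six months is therefore very likely to exceed the five-nines budget on its own.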

In modern computer systems…
Availability is more important than ever
  Businesses can lose millions of dollars during a one-hour web site outage
Availability is harder than ever to guarantee
  Modern systems are distributed, heterogeneous, and complex, involving numerous interacting applications: web server, internal database, etc.
  Availability is limited by the “weakest link” in the system
In such an environment, failures are inevitable
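One way to make the “weakest link” point concrete is a simple serial model: if a request needs every component to be up, and components fail independently, end-to-end availability is the product of the individual availabilities and can never exceed the worst one. Both the independence assumption and the component figures below are assumptions for illustration, not numbers from the paper.

```python
from math import prod


def serial_availability(components: dict[str, float]) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the result is the product of the
    per-component availabilities.
    """
    return prod(components.values())


# Hypothetical per-component availabilities.
stack = {"web server": 0.999, "internal database": 0.9995, "network": 0.998}

end_to_end = serial_availability(stack)
weakest = min(stack.values())
print(f"end-to-end {end_to_end:.6f}, weakest link {weakest:.6f}")
```

Under this model the whole system is strictly less available than its weakest component whenever any other component is imperfect.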

Traditional Solutions
Use fault-tolerant components
Employ rigorous software testing practices
Such solutions rely on outdated assumptions:
  We can design hardware/software to have negligible failure rates
  Maintenance and repair are error-free
  We can predict and tolerate system failure
These assumptions emphasize failure avoidance rather than failure recovery
Systems built on them are unprepared when failures occur

Hardware/Software Failures
Fault-tolerant hardware may exist, but that does not mean it is used…
  Commodity hardware is cheap and ubiquitous
  And error-prone: IDE disks, non-ECC memory, etc.
  Even low per-node failure rates add up to frequent failures in large clusters (e.g., Google cluster)
It may be possible to develop fault-tolerant software, however…
  Software is being developed, updated, and deployed faster than ever in the Internet age
  “In Internet time, people get sloppy”
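The cluster remark can be illustrated with a back-of-the-envelope calculation. Assuming nodes fail independently with the same probability p over some period (an assumption made here, not a claim from the paper), the chance that at least one of n nodes fails is 1 - (1 - p)^n. The probabilities and cluster sizes below are hypothetical.

```python
def prob_any_node_fails(per_node_failure_prob: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - per_node_failure_prob) ** nodes


# Hypothetical: each node has a 0.1% chance of failing in a given period.
for n in (1, 100, 1000, 10000):
    print(f"{n:>6} nodes: {prob_any_node_fails(0.001, n):.4f}")
```

With these made-up numbers, a failure somewhere in a 1,000-node cluster is more likely than not in every period, even though each individual node is 99.9% reliable.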

Human Failures
Arise primarily during maintenance and repair
  Consider trying to diagnose and fix a subtle bug in even a few thousand lines of code
Also arise during other activities: configuration, upgrading, performance tuning
Human error rates are nowhere near zero
  Even highly trained, intelligent people make mistakes, especially under pressure
Therefore, maintenance and repair are not error-free

Unanticipated Failures
Some failures cannot be anticipated
  Humans are good at breaking systems, especially unintuitive ones
  Systems are combined in unanticipated ways, generating unexpected interactions
  These interactions generate “normal accidents”
In this environment, it is impossible to predict all types of failures

Recovery-Oriented Computing
As we cannot design a system with 100% availability, modern systems must accept failure as inevitable
Focus on recovery and repair in addition to avoidance
  Provides an essential failure “safety net” that complements failure avoidance methods
  Focus on improving MTTR (mean time to repair) as well as MTTF (mean time to failure)
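Why MTTR deserves as much attention as MTTF follows from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). A small sketch with hypothetical numbers (1,000 hours between failures, 10 hours to repair) shows that halving repair time buys the same availability as doubling time to failure.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1,000 hours, takes 10 hours to repair.
baseline = availability(1000, 10)
doubled_mttf = availability(2000, 10)   # failure avoidance: fail half as often
halved_mttr = availability(1000, 5)     # recovery: repair twice as fast

print(f"baseline      {baseline:.6f}")
print(f"doubled MTTF  {doubled_mttf:.6f}")
print(f"halved MTTR   {halved_mttr:.6f}")
```

The last two results are equal because only the ratio MTTR/MTTF matters in the formula, so either lever improves availability by the same amount.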

Recovery-Oriented Computing
ROC relies on a system-integrated recovery-oriented framework that should:
Detect and repair failures as quickly as possible
Prevent propagation of errors through the system
  Helpful to have a physically partitioned system
Must tolerate errors during recovery/repair
Be “trustworthy” (seems to imply: low failure rate)
  Extensive (self-)testing of the framework, failure “fire drills”
What about unanticipated failures?
Is availability a requirement of this framework?
How do we implement such a framework?
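The slide leaves implementation open. As one possible reading of “detect and repair” plus “tolerate errors during recovery,” here is a toy supervisor pass over isolated components. Everything in it (`Component`, `recover_pass`, the retry limit, the simulated components) is hypothetical and is not the framework proposed in the paper.

```python
from typing import Callable, Dict, List


class Component:
    """Hypothetical isolated unit with a health check and a restart action."""

    def __init__(self, name: str,
                 is_healthy: Callable[[], bool],
                 restart: Callable[[], None]) -> None:
        self.name = name
        self.is_healthy = is_healthy
        self.restart = restart


def recover_pass(components: List[Component],
                 max_attempts: int = 3) -> Dict[str, str]:
    """Check each component once and try to restart the unhealthy ones.

    Returns a status per component: "ok", "recovered", or "failed".
    """
    status: Dict[str, str] = {}
    for comp in components:
        if comp.is_healthy():               # detection
            status[comp.name] = "ok"
            continue
        status[comp.name] = "failed"
        for _ in range(max_attempts):       # repair, with bounded retries
            try:
                comp.restart()
            except Exception:
                # Recovery itself can fail; contain the error and retry
                # instead of letting it take down the supervisor.
                continue
            if comp.is_healthy():
                status[comp.name] = "recovered"
                break
    return status


# Simulated components: "db" is down and its first restart attempt fails.
state = {"up": False, "restarts": 0}


def flaky_restart() -> None:
    state["restarts"] += 1
    if state["restarts"] == 1:
        raise RuntimeError("restart failed")
    state["up"] = True


web = Component("web", lambda: True, lambda: None)
db = Component("db", lambda: state["up"], flaky_restart)
print(recover_pass([web, db]))  # prints {'web': 'ok', 'db': 'recovered'}
```

The sketch also makes the slide's open questions visible: the supervisor only handles failures its health checks can see, and nothing here protects against bugs in the supervisor itself.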

Questions
What if there are errors in the recovery-oriented framework?
  How are these failures handled?
Alternatively, can the framework be guaranteed not to fail?
  Probably could not be a 100% guarantee
  If this is instead a “five-nines” style of guarantee, aren’t we back where we started?
ROC may not be the “catch-all” safety net that we desired
