Published by Zoe Weast. Modified over 9 years ago.
1
Fabián E. Bustamante, Winter 2006
Recovery Oriented Computing: Embracing Failure
A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
A little of …
A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper)
2
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University
Availability and today’s apps
Availability is the most important metric for modern computer systems
Availability used to be a solved problem
  – Expensive fault-tolerant servers
  – Vendor-supplied high-availability database system
  – All in a box, well firewalled
Today’s apps are quite different
  – Distributed, heterogeneous environment
  – Conglomeration of interconnected systems: databases, application servers, middleware, web servers
As a result, 65% of surveyed sites suffered at least one customer-visible outage in a 6-month period; 25% suffered three or more in the same period
3
Problem with assumptions
Basic model
  – Hardware and software can be built w/ negligible failure rates
  – Failure modes of systems can be predicted and tolerated
  – Maintenance and repair are error-free procedures
More realistically
  – Hardware and software failures are inevitable
  – Human failures are inevitable
  – Unanticipated failures are inevitable
Your only option: get used to it and embrace failure, i.e. Recovery Oriented Computing (ROC)
4
HW & SW failures are inevitable
Software: functionality is king; a constant race to offer new functionality → sloppy people & buggy code
Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
Scale only multiplies the problem!
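To make the scale point concrete, here is a minimal sketch. It assumes that components fail independently and that each fails with the same probability p over some period; both are simplifying assumptions for illustration, not figures from the slides. The function name and the value p = 0.01 are hypothetical.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Assumes independent failures with identical per-component
    probability p (an illustrative assumption, not measured data).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(no component fails) = (1 - p)^n, so take the complement.
    return 1.0 - (1.0 - p) ** n


# With a hypothetical 1% per-component failure probability, the chance
# of seeing at least one failure grows quickly with system size.
for n in (1, 10, 100, 1000):
    print(n, prob_any_failure(0.01, n))
```

Under these assumptions, a component that is individually quite reliable still yields frequent failures once hundreds of them are combined.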
5
Human failures are inevitable
Large systems rely on human beings for
  – Maintenance and repair
  – Software configuration and upgrading
  – Performance tuning
  – Diagnosing and fixing failures
Human beings make mistakes
  – At a rate of 10-100% under stress
  – Blamed for 70% of failures in electronic systems, 20-53% in missile systems, 60-70% of aircraft failures, 50% in VAX systems, 42% in Tandem systems, …
But modern systems do not take into account the possibility of human failure
6
Unanticipated failures are inevitable
Could you solve this w/ good engineering?
  – Not really
Perrow’s work on high-risk technology
  – Large servers: complex, reasonably tightly coupled systems performing complex tasks under human guidance … prone to “normal accidents”
  – Accidents that arise from the multiple and unexpected hidden interactions of smaller failures and the recovery systems designed to handle them
7
Recovery Oriented Computing
Focus on repair instead of avoiding failures
Recovery needs to be a first-class part of the system
It must
  – Ensure problems are detected fast (for containment)
  – Provide assistance in diagnosing their root cause
  – Offer repair mechanisms that are trustworthy
  – Tolerate errors during recovery
  – Complement fault tolerance rather than replace it (redundancy is thus still necessary)
  – Automatically track the health of all components, so it should include fault-injection mechanisms
  – …
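One way to see why repair deserves this much attention is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below is an illustration under that formula only; the function name and the hour values are hypothetical and not taken from the slides.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only to compare the two strategies.
baseline = availability(mttf_hours=1000, mttr_hours=10)
fail_less_often = availability(mttf_hours=10000, mttr_hours=10)  # 10x MTTF
recover_faster = availability(mttf_hours=1000, mttr_hours=1)     # MTTR / 10

print(baseline, fail_less_often, recover_faster)
# Cutting repair time tenfold helps as much as failing ten times less often.
print(math.isclose(fail_less_often, recover_faster))
```

Under this model, a tenfold cut in repair time buys the same availability as a tenfold increase in time between failures, which is the arithmetic behind focusing on recovery.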
8
Undoable e-mail store
You have undo in Office, but not for admins?!
Operator undo involves three steps
  – Rewind: physically roll the system back to before the damage
  – Repair: do not constrain admins in what repairs they can make
  – Replay: logically bring the system back up to date (to incorporate the repair)
Two challenges in the 3Rs model
  – Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
  – Consistency: keep the system consistent from an external observer’s point of view (even ‘after’ repair)
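The three steps can be sketched with a toy in-memory mail store. This is a minimal illustration, not the design from the paper: the class `UndoableStore`, the `deliver` helper, and the single-checkpoint scheme are all hypothetical, and the sketch ignores the external-consistency challenge entirely.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Mailbox = Dict[str, List[str]]    # user -> list of messages
Verb = Callable[[Mailbox], None]  # one logged user-level operation


def deliver(user: str, msg: str) -> Verb:
    """Build a verb that appends msg to the user's mailbox."""
    def apply(store: Mailbox) -> None:
        store.setdefault(user, []).append(msg)
    return apply


@dataclass
class UndoableStore:
    state: Mailbox = field(default_factory=dict)
    snapshot: Mailbox = field(default_factory=dict)     # state at checkpoint
    timeline: List[Verb] = field(default_factory=list)  # verbs since checkpoint

    def checkpoint(self) -> None:
        self.snapshot = copy.deepcopy(self.state)
        self.timeline.clear()

    def execute(self, verb: Verb) -> None:
        self.timeline.append(verb)  # record first, then apply
        verb(self.state)

    def undo(self, repair: Callable[[Mailbox], None]) -> None:
        self.state = copy.deepcopy(self.snapshot)  # Rewind: physical rollback
        repair(self.state)                         # Repair: any operator change
        for verb in self.timeline:                 # Replay: re-run logged verbs
            verb(self.state)


store = UndoableStore()
store.execute(deliver("alice", "hello"))
store.checkpoint()
store.execute(deliver("alice", "meeting at 3"))
store.state["alice"].clear()  # operator mistake; not a logged user verb
store.execute(deliver("alice", "lunch?"))

# Here the "repair" is simply not repeating the mistaken wipe.
store.undo(repair=lambda s: None)
print(store.state["alice"])
```

Because user operations are logged as verbs rather than as raw state changes, replay restores the mail that arrived both before and after the mistake.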
9
Undo system architecture
[Figure: architecture diagram. Components: User, Undo Proxy, Service App, Time-travel storage, Timeline log, Undo Manager, Control UI. Connection labels: Control, Verbs.]
Annotations on the diagram: “To be able to roll back the system”; “Service specific”; “In part to make the undo manager generic”
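The split between a service-specific proxy and a generic undo manager might look roughly like the sketch below. Every name here (`Verb`, `UndoManager`, `MailProxy`, `DeliverVerb`, `MailService`) and the `DELIVER user body` request format are assumptions made up for illustration; they are not the interfaces of the actual system, and time-travel storage and the control UI are left out.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Verb(ABC):
    """Service-neutral operation: all the generic undo manager ever sees."""

    @abstractmethod
    def execute(self, service: object) -> None: ...


class UndoManager:
    """Generic part: keeps the timeline log and can re-run it."""

    def __init__(self) -> None:
        self.timeline: List[Verb] = []

    def record(self, verb: Verb) -> None:
        self.timeline.append(verb)

    def replay(self, service: object) -> None:
        for verb in self.timeline:
            verb.execute(service)


class MailService:
    """Stand-in for the service application."""

    def __init__(self) -> None:
        self.mailboxes: Dict[str, List[str]] = {}


class DeliverVerb(Verb):
    """Service-specific verb: deliver one message to one user."""

    def __init__(self, user: str, body: str) -> None:
        self.user, self.body = user, body

    def execute(self, service: MailService) -> None:
        service.mailboxes.setdefault(self.user, []).append(self.body)


class MailProxy:
    """Service-specific part: turns user requests into verbs."""

    def __init__(self, service: MailService, manager: UndoManager) -> None:
        self.service, self.manager = service, manager

    def handle(self, request: str) -> None:
        # Hypothetical request format: "DELIVER <user> <body>".
        command, user, body = request.split(" ", 2)
        if command != "DELIVER":
            raise ValueError(f"unsupported request: {command}")
        verb = DeliverVerb(user, body)
        self.manager.record(verb)   # log to the timeline
        verb.execute(self.service)  # then forward to the service


service, manager = MailService(), UndoManager()
proxy = MailProxy(service, manager)
proxy.handle("DELIVER alice hello there")

restored = MailService()  # e.g. a rolled-back copy of the service
manager.replay(restored)
print(restored.mailboxes == service.mailboxes)
```

Since the manager only calls `execute` on opaque verbs, supporting another service in this sketch would mean writing a new proxy and verb set while leaving `UndoManager` untouched.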