Practical Reports on Dependability
Manifestation of System Failure
  Site unavailability
  System exception / access violation
  Incorrect result
  Data loss / corruption
  Slowdown
PAGE UNAVAILABLE
System Exception
Performance Slowdown
DOWNTIME 15% contribution
DOWNTIME: unplanned 20%, planned 80%
DOWNTIME
UNPLANNED DOWNTIME
Software Errors: Triggers
  Resource exhaustion
  Logical errors
  System overload
  Recovery code
  Failed upgrade
Logical Error
SYSTEM OVERLOAD
Operator Errors: Triggers
  Configurational
    – Incorrect parameter setting
  Procedural
    – Omitted or incorrect maintenance action
  Miscellaneous
FAILURE
  DURATION
    Short (minutes)
    Long (weeks)
      – Implies large fault chains
  FREQUENCY
    Permanent (down until problem fixed)
    Transient (resolves without intervention)
    Intermittent (transient + occasional)
  SCOPE
    Entire system
    Parts of the system
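The three classification axes above (duration, frequency, scope) can be written down as a small record type. This is only a minimal sketch in Python: the names `FailureReport` and `needs_intervention` and the example incident are hypothetical illustrations, not part of the original material.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"  # a long failure implies a large fault chain


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTIAL = "parts of the system"


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical record: one failure classified along the three axes."""
    description: str
    duration: Duration
    frequency: Frequency
    scope: Scope

    def needs_intervention(self) -> bool:
        # By the definitions above, only a permanent failure stays down
        # until someone fixes the underlying problem.
        return self.frequency is Frequency.PERMANENT


# Made-up example incident, used only to show the classification.
report = FailureReport(
    description="storage subsystem unreachable",
    duration=Duration.SHORT,
    frequency=Frequency.PERMANENT,
    scope=Scope.PARTIAL,
)
```

Keeping each axis as its own enumeration means a report cannot mix up, say, a scope with a frequency, and further rules can be added as methods later.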
Fault Chains
  “the series of component failures that led up to a user-visible failure”
  Uncoupled
    – Independent failures
  Tightly coupled
    – Cascading/correlated failure
Non-Malicious Software Failure
  Most Common Causes
    – Routine maintenance
    – Software upgrade
    – System integration
  Other Causes
    – System overload
    – Resource exhaustion
    – Complex fault-tolerant routines
“ROUTINE” MAINTENANCE
  Danske Bank 2003
    – March 11: routine operation to replace a defective electrical unit in IBM DB2 disk system
    – System failure: disks become inaccessible
    – 6 hours later: system restarted
    – March 12: batch systems running incorrectly
    – Three more errors discovered:
        1. Recovery process on several tables won’t start
        2. Recovery jobs won’t run simultaneously
        3. Recovery jobs can’t reestablish data in tables
    – March 14: all data recovered and system functional