Response to Undesired Events in Software Systems
Kimberly Hanks and Phil Varner
A Presentation brought to you by David Parnas

Undesired != Unexpected
Undesired Events (UEs) are:
–Deviations from the ideal operation of the system
–Not always correctable like errors
–A fact of life we have to deal with
“Correct programs”, in the sense that would make UE handling unnecessary, do not exist

UE Handling Overview
Problem: Even with "correct" programs, UEs will occur at runtime
To deal with them, we need to:
–Know what to look for
–Diagnose them successfully
–Recover if possible
This paper proposes how

Problems with a Perfect World
Everyone makes errors
Machines fail, causing programs/data to fail
Programs change, and new errors pop up
Incorrect or inconsistent data may be supplied

Where Do We Start?
Parnas assumes systems that are structured according to good information hiding and the “uses” hierarchy
This shapes all of his proposals
In particular, UE detection and handling is predicated on a system of levels

Levels of Detection and Handling
The first clue that something is wrong can appear at a level other than the one where a UE originates
Example:
–Initiate a read on a storage resource, e.g., a tape block, which turns out to be bad
–Detection occurs at the HW level when the read can’t be completed, even though the read was initiated by some high-level application

Levels…(2)
Should the HW be responsible for a recovery attempt?
Parnas says no, but where should it be handled?
–At the originating level
Why?

Levels…(3)
The originating level is “where the knowledge is”
–The failed read happens at the HW level, but the HW does not know what the failure implies
–The level where the read was initiated sits on a virtual machine (VM) that provides certain abstractions to the user
–The UE is only meaningful in the context of these abstractions

Levels…(4)
What would meaningful handling look like?
–A diagnostic stated in the abstractions of the level
–A provision of an alternative, in the context of those abstractions

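The tape-block example can be pictured with a minimal sketch. It is written in Python rather than the Eiffel or Java the slides mention, and every name in it (`read_block`, `RecordStore`, `RecordUnavailable`, the backup tape) is a hypothetical stand-in, not something from the paper. The lowest level only detects the failure; the level that initiated the read offers an alternative and, failing that, a diagnostic in its own terms.

```python
class HardwareReadError(Exception):
    """Raised by the (hypothetical) lowest level when a block cannot be read."""


class RecordUnavailable(Exception):
    """Diagnostic stated in the abstractions of the record-store level."""


def read_block(tape, block_no):
    # Lowest level: it can detect the failure but knows nothing about records.
    block = tape.get(block_no)
    if block is None:
        raise HardwareReadError(f"block {block_no} unreadable")
    return block


class RecordStore:
    """Originating level: maps record names to tape blocks, with a backup copy."""

    def __init__(self, primary, backup, index):
        self.primary = primary  # dict: block number -> bytes
        self.backup = backup    # dict: block number -> bytes
        self.index = index      # dict: record name -> block number

    def read_record(self, name):
        block_no = self.index[name]
        try:
            return read_block(self.primary, block_no)
        except HardwareReadError:
            pass  # fall through to the alternative below
        try:
            # Alternative offered in this level's own terms: the backup copy.
            return read_block(self.backup, block_no)
        except HardwareReadError as err:
            # Diagnostic names the record, not the block; the cause is kept.
            raise RecordUnavailable(f"record {name!r} is unavailable") from err
```

The caller of `read_record` never sees a block number, which is the point of handling the UE at the level where it is meaningful.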
UE Handlers and Info Hiding
We want to handle the UE at the level at which it is meaningful, but…
The information needed to handle it is not necessarily available at that level (it may have been abstracted away to effect good information hiding)
How should we manage this tradeoff?

UE Handlers and Info Hiding (2)
“Everything should be made as simple as possible, but not simpler.” –Einstein
Hide all, and only, the information that is unlikely to be useful in diagnosing and recovering from UEs
Prediction is the key

Meta-structure
The general policy proposed constitutes a meta-structure of UE handling
It has several advantages:
–Handlers don’t violate info-hiding
–“Uses” hierarchy is intact
–Allows refinement without major revision
–Aids debugging

Separation of Case and EH/R
Separation of Normal Case and Error Handling/Recovery
Java: try{} catch{}

Suggestion 1
Make the “abstract machine” responsible for detecting attempts to violate its specification
–trap metaphor - hide the detection mechanism, expose an interface for handling UEs
–should be able to handle errors in the context of the VM abstraction

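A minimal sketch of the trap metaphor, in Python rather than the slides' Eiffel or Java. The `Table` class, its event names, and `set_trap` are all hypothetical illustrations, not an interface from the paper. The module detects violations of its own specification internally, and the using level only installs handlers.

```python
class Table:
    """Hypothetical abstract machine: a name -> value table with UE traps."""

    UNDEFINED_NAME = "undefined_name"
    DUPLICATE_NAME = "duplicate_name"

    def __init__(self):
        self._values = {}
        self._traps = {}

    def set_trap(self, event, handler):
        # The using level installs a handler for one class of UE.
        self._traps[event] = handler

    def _trap(self, event, *args):
        # How violations are detected stays hidden; only the event is exposed.
        handler = self._traps.get(event)
        if handler is None:
            raise RuntimeError(f"unhandled UE: {event} {args}")
        return handler(*args)

    def define(self, name, value):
        if name in self._values:
            # Violation detected before any state change.
            return self._trap(self.DUPLICATE_NAME, name, value)
        self._values[name] = value

    def lookup(self, name):
        if name not in self._values:
            return self._trap(self.UNDEFINED_NAME, name)
        return self._values[name]
```

A caller might install `table.set_trap(Table.UNDEFINED_NAME, lambda name: 0)` to supply a default for unknown names, without ever learning how the table stores its entries.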
Degrees of Recovery
Hardware - handle or crash
Instead, design for multi-level recovery
Policy determined by cost and aim

No Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
    -- no rescue clause: a failed read propagates to the caller
end

Simple Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
rescue
    attempts := attempts + 1
    retry
end

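Degrees of Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
rescue
    attempts := attempts + 1
    if attempts = 1 then
        retry                -- first failure: retry immediately
    elseif attempts = 2 then
        sleep (2)            -- second failure: wait, then retry
        retry
    elseif attempts = 3 then
        flush_buffers        -- third failure: flush, wait longer, then retry
        sleep (4)
        retry
    end
    -- otherwise: no retry, so the exception propagates to the caller
end
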
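For readers without an Eiffel compiler, the same escalation can be approximated in Python. This is a rough sketch under stated assumptions, not a translation of the slide: Python has no `retry`, so a loop stands in for it, the `Max_attempts` guard is omitted, and `read`, `flush_buffers`, and the injectable `sleep` are hypothetical parameters. Treating `OSError` as the failure signal is also an assumption.

```python
import time


def read_with_recovery(read, flush_buffers, sleep=time.sleep):
    """Call `read()` up to four times, escalating the recovery step each time.

    `read` and `flush_buffers` are hypothetical callables supplied by the caller.
    """
    attempts = 0
    while True:
        try:
            return read()
        except OSError:
            attempts += 1
            if attempts == 1:
                pass                  # first failure: retry immediately
            elif attempts == 2:
                sleep(2)              # second failure: wait, then retry
            elif attempts == 3:
                flush_buffers()       # third failure: flush, wait longer, retry
                sleep(4)
            else:
                raise                 # give up: let the level above decide
```

Passing `sleep` as a parameter keeps the policy testable without real delays.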
Suggestion 2
Do not specify a module to have properties which UEs will frequently violate
Examples:
–don’t use limited-capacity data structures when the number of objects is unknown, etc.
–don’t allow the possibility of, e.g., calling pop() on an empty stack

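One possible reading of both examples, as a Python sketch (the `Stack` class and its `default` parameter are hypothetical, not from the paper): the specification promises no capacity limit, and `pop()` is defined for every state, so neither property can be violated at runtime.

```python
from collections import deque


class Stack:
    """Hypothetical stack whose specification has no capacity limit and no
    precondition on pop()."""

    def __init__(self):
        self._items = deque()   # grows as needed; no fixed capacity to violate

    def push(self, item):
        self._items.append(item)

    def pop(self, default=None):
        # Defined for every state: an empty stack yields `default`.
        return self._items.pop() if self._items else default

    def __len__(self):
        return len(self._items)
```

The tradeoff is that the caller must choose a `default` that cannot be confused with a stored item. Whether that is better than a trap depends on how often the empty case is expected.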
Error Indication
Strongly "typed" errors - Java
Limitations on values of parameters (Eiffel)
Capacity limitations
Requests for undefined information
Restrictions on the order of operations (encapsulation?)
Detection of actions which are likely to be unintentional (defined how?)

Error Indication II
Sufficiency - ensure your module will work correctly or complain
Priority of Traps - multiple error handling?
Size of Trap Vector - how many commands in one trap (try{} catch{})?
State after Trap - Atomicity
Errors of Mechanism - tradeoff between simplicity and detail

Redundancy and Efficiency
Redundant error checks slow the system
Can often be removed in later versions
–Retain upper level, remove lower
–Retain lower, remove upper
Which is best, and why?

Reliability/Dependability
How does UE handling relate to reliability and dependability?

Conclusion
Things go wrong
UE detection and handling is good
Must be correctly implemented to be useful