Software Dependability
CIS 376
Bruce R. Maxim
UM-Dearborn
Dependability
• The extent to which a critical system is trusted by its users
• Dependability is usually the most important property of a critical system
• A system does not have to be trusted to be useful
• Dependability reflects the extent of the user's confidence that it will not fail in normal operation
Dimensions of Dependability
• Availability – the ability of the system to deliver services when requested
• Reliability – the ability of the system to deliver services as specified
• Safety – the ability of the system to operate without catastrophic failure
• Security – the ability of the system to defend itself against intrusion
Maintainability
• Concerned with the ease of repairing a system after failure
• Many critical system failures are caused by faults introduced during maintenance
• Maintainability is the only static dimension of dependability; the other three are dynamic
Survivability
• The ability of a system to deliver services after a deliberate or accidental attack
• This is especially important for distributed systems, whose security can be compromised
• Resilience – the ability of a system to continue operation despite component failures
Dependability Costs
• Tend to increase exponentially as increasing levels of dependability are required
• More expensive development techniques and hardware are required to achieve higher levels of reliability
• Increased testing and validation are required to convince users that higher levels of dependability have been achieved
Dependability and Performance
• Untrustworthy systems are rejected by users
• System failure costs may be very high
• It is hard to retrofit dependability into an existing system
• It may be possible to compensate for poor performance, but not for poor dependability
• Untrustworthy systems may lead to information loss
Dependability Economics
• Sometimes it is more cost-effective to pay for failures than to try to improve dependability
• However, a reputation for products that cannot be trusted can lead to loss of business
• The required level of trustworthiness depends on the type of system being developed
Availability and Reliability
• Reliability – the probability of failure-free operation over a specified time period, in a given environment, for a given purpose
• Availability – the probability that a system will be operational at a given point in time and able to deliver the requested services
Comparing Availability and Reliability
• If a system is not available when it is needed, it is unreliable from the user's point of view
• It is possible to have systems with low reliability but high availability, if failures can be repaired quickly and do not damage data
• Availability must take repair time into account (see the sketch below)
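Because availability must account for repair time, it is commonly estimated from mean time to failure (MTTF) and mean time to repair (MTTR) as A = MTTF / (MTTF + MTTR). This is a standard textbook formulation rather than one stated on the slide; the numbers in the sketch below are invented for illustration.

```python
# Minimal sketch: steady-state availability from MTTF and MTTR.
# The formula A = MTTF / (MTTF + MTTR) is a standard model;
# the figures below are illustrative, not from the slides.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is able to deliver service."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A system that fails rarely but takes long to repair...
print(availability(mttf_hours=1000.0, mttr_hours=50.0))  # ~0.952
# ...versus one that fails often but is repaired quickly
# (low reliability, high availability, as the slide notes).
print(availability(mttf_hours=100.0, mttr_hours=0.1))    # ~0.999
```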
Faults and Failures
• Failures are usually the result of system errors that derive from faults in the system
• Faults do not always result in system failure
  – a transient system state may be corrected before an error occurs
• Errors do not always lead to system failures
  – an error can be corrected by built-in error detection and recovery procedures (see the sketch below)
  – failure can be protected against by protecting system resources from damage
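As an illustration of built-in error detection and recovery, a program can check for an erroneous state and repair it before it propagates into a failure. This is a hypothetical sketch, not code from the course; the sensor scenario and valid range are assumptions.

```python
# Hypothetical sketch of built-in error detection and recovery:
# a fault (a bad sensor reading) produces an erroneous state, which
# is detected and corrected before it can lead to a system failure.

VALID_RANGE = (0.0, 150.0)  # assumed plausible range for the reading

def read_temperature(raw: float, last_good: float) -> float:
    """Return a usable reading, recovering from out-of-range errors."""
    low, high = VALID_RANGE
    if low <= raw <= high:
        return raw            # normal case: no error
    # Error detected: fall back to the last known-good value
    # instead of passing the corrupt state downstream.
    return last_good

print(read_temperature(raw=21.5, last_good=20.0))    # 21.5
print(read_temperature(raw=-999.0, last_good=20.0))  # 20.0 (recovered)
```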
User's Reliability Perceptions
• The formal definition of reliability may not reflect the user's perception of reliability
  – the user's environment may not match the developer's assumptions about the application environment
• The consequences of failure affect the user's perception of reliability
  – failures with serious consequences are given more weight by users than failures that are merely inconvenient
Reliability Achievement
• Fault avoidance – development techniques that minimize the possibility of mistakes or reduce the consequences of errors
• Fault detection and removal – verification and validation techniques that increase the probability of detecting and correcting errors before deployment
• Fault tolerance – run-time techniques that ensure system faults do not result in system errors, and that system errors do not result in system failures (see the sketch below)
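One classic fault-tolerance technique is N-version programming with majority voting. The sketch below is a hypothetical illustration (the versions and the conversion task are invented): a voter masks the failure of a single faulty version so the fault never becomes a system failure.

```python
# Hypothetical sketch of fault tolerance via majority voting
# (software triple modular redundancy): three independently
# written versions compute the same result, and a voter masks
# a single faulty version.

from collections import Counter

def version_a(x: float) -> float: return round(x * 1.8 + 32, 2)
def version_b(x: float) -> float: return round(x * 9 / 5 + 32, 2)
def version_c(x: float) -> float: return round(x + 32, 2)  # faulty version

def voted_celsius_to_fahrenheit(x: float) -> float:
    results = [version_a(x), version_b(x), version_c(x)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: cannot mask the fault")
    return value

print(voted_celsius_to_fahrenheit(100.0))  # 212.0 despite version_c's fault
```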
Reliability Modeling
• You can model a system as an input-output mapping in which some inputs lead to erroneous outputs
• The reliability of the system is the probability that a given input does not lie in the set of inputs that cause erroneous outputs
• This probability is not static; it depends on the system's environment, i.e. the distribution of inputs actually used (illustrated below)
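Under this model, reliability can be estimated by sampling inputs from an assumed operational profile and measuring how often the output is erroneous. This is a hypothetical sketch: the failure-causing input set and the input distribution are invented for illustration.

```python
# Hypothetical sketch of the input-output reliability model:
# estimate reliability by drawing inputs from an (assumed)
# operational profile and counting erroneous outputs.

import random

def system_under_test(x: int) -> bool:
    """Returns True on correct output; the failing input set
    {x : x % 97 == 0} is invented purely for illustration."""
    return x % 97 != 0

def estimate_reliability(trials: int = 100_000) -> float:
    random.seed(42)
    failures = 0
    for _ in range(trials):
        x = random.randint(0, 9_999)   # assumed operational profile
        if not system_under_test(x):
            failures += 1
    return 1.0 - failures / trials

print(f"estimated reliability: {estimate_reliability():.4f}")
# A different input profile gives a different estimate, which is
# the slide's point: reliability depends on the environment.
```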
Improving Reliability
• Removing X% of the system faults does not necessarily improve system reliability by X%
  – remember the 90/10 rule
• Program defects may lie in code that is rarely executed by the user, so removing them does little to improve perceived reliability
• A program with known faults may still be perceived by its users as reliable
Safety
• A system property that reflects the system's ability to operate, normally or abnormally, without danger to its environment
• As more devices become software-controlled, safety becomes a greater concern
• Safety requirements are exclusive: they exclude undesirable situations rather than specify required system services
Safety Criticality
• Primary safety-critical systems – embedded software systems whose failure can cause the associated hardware to fail and directly threaten people
• Secondary safety-critical systems – systems whose faults can cause other systems to fail, which in turn can threaten people
Safety and Reliability
• They are related, but not identical
• Reliability – concerned with conformance to a specification and delivery of a service
• Safety – concerned with ensuring that a system cannot cause damage, regardless of its conformance (or nonconformance) to its specification
Unsafe Reliable Systems
• Specification errors
  – if the specification is incorrect, conformance to the specification can still cause damage
• Hardware failures generating spurious outputs
  – hard to anticipate in the specification
• Context-sensitive commands, e.g. issuing the right command at the wrong time
  – often caused by operator error
Safety Achievement
• Hazard avoidance – system is designed so that some classes of hazard cannot arise
• Hazard detection and removal – system is designed so that hazards are detected and removed before they result in an accident (see the sketch below)
• Damage limitation – system includes protection features that minimize the damage that may result from an accident
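As a toy illustration of hazard detection combined with damage limitation, a controller can monitor for a hazardous condition and drive the system to a safe state. This is a hypothetical sketch; the pressure threshold and the shutdown action are invented, and a real safety controller would be far more involved.

```python
# Hypothetical sketch: hazard detection with a transition to a
# safe state. The threshold and the shutdown action are invented
# for illustration.

MAX_SAFE_PRESSURE = 10.0  # assumed hazard threshold (bar)

def control_step(pressure: float, heater_on: bool) -> bool:
    """Return the heater command for the next step."""
    if pressure > MAX_SAFE_PRESSURE:
        # Hazard detected: limit damage by forcing the safe state
        # (heater off) before the hazard can become an accident.
        print("hazard detected: entering safe state")
        return False
    return heater_on

print(control_step(pressure=8.2, heater_on=True))   # True (normal)
print(control_step(pressure=11.7, heater_on=True))  # False (safe state)
```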
Accidents
• Rarely have a single cause in a complex system (assigning blame is a credit assignment problem)
• Most accidents are the result of combinations of malfunctions
• Anticipating all combinations of malfunctions may not be possible in a software-controlled system, so complete safety may be impossible
Security
• Reflects a system's ability to protect itself from attack
• Security is increasingly important as systems are networked to each other
• Security is an essential prerequisite for availability, reliability, and safety
Fundamental Security
• If a system is networked and insecure, then claims about its reliability and safety cannot be trusted
• An intrusion (attack) can change the system's operating environment or its data, invalidating the assumptions on which the reliability and safety claims are based
Insecurity Damage
• Denial of service – the system is forced into a state where providing service is impossible or significantly degraded
• Corruption of programs or data – modifications made by an unauthorized user
• Disclosure of confidential information – information managed by the system is exposed to people who are not authorized users
Security Assurance
• Vulnerability avoidance – system is designed so that vulnerabilities cannot occur
  – e.g. no network connection
• Attack detection and elimination – system is designed so that attacks on vulnerabilities are detected and neutralized before they cause damage
  – e.g. use of anti-virus software
• Exposure limitation – system is designed so that damage from attacks is minimal
  – e.g. a backup policy that allows restoration of damaged files (see the sketch below)
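As a small illustration of exposure limitation, corrupted files can be detected against recorded checksums and restored from a backup copy. This is a hypothetical sketch, not from the slides; the file paths, the hash-based check, and the restore policy are all assumptions.

```python
# Hypothetical sketch of exposure limitation: detect corruption of
# a file by comparing its hash to a recorded checksum, and restore
# it from a backup copy. Paths and policy are invented.

import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def restore_if_corrupted(current: Path, backup: Path, known_hash: str) -> bool:
    """Restore `current` from `backup` if its contents were tampered with.
    Returns True if a restoration was performed."""
    if sha256(current) == known_hash:
        return False                  # file is intact
    shutil.copyfile(backup, current)  # limit damage: roll back
    return True

# Usage (assumed files): record the hash at backup time, check later.
# known = sha256(Path("config.yaml"))
# restore_if_corrupted(Path("config.yaml"), Path("backup/config.yaml"), known)
```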