Published by Victor McGee. Modified over 8 years ago.
1
Building Dependable Distributed Systems
Chapter 1
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org
Building Dependable Distributed Systems, Copyright Wenbing Zhao
2
Outline
Basic terminology
Dependability concepts
  Attributes
  Fault, error, and failure
Approaches to achieving dependability
3
Terminology
A system is an entity that interacts with other entities, i.e., other systems, including hardware, software, humans, and the physical world with its natural phenomena
These other systems are the environment of the given system
The system boundary is the common frontier between the system and its environment
A system may consist of one or more components, such as nodes or processes
[Figure: a system, its environment, and the system boundary between them]
4
Terminology
State: determines the status of the system
  A system may be recovered to where it was before a failure if its state was captured and survives the failure
Service delivered by a system: work done that benefits its users
User/Client: another system that interacts with the system delivering the service
Function of a system: what the system is intended to do
(Functional) Specification: description of the system function
Correct service: when the delivered service implements the system function
5
Dependability and its Attributes
Dependability refers to the ability of a distributed system to provide correct services to its users despite various threats to the system such as undetected software defects, hardware failures, and malicious attacks
A dependable system has the following attributes
  Availability: a measure of the readiness of the system
  Reliability: a measure of the system's capability of providing correct services continuously for a period of time
  Integrity: the capability of the system to protect its state from being compromised due to various threats
  Maintainability: the capability of the system to evolve after it is deployed
  Safety: when the system fails, it does not cause catastrophic consequences
6
Quantitative Dependability Measures
Availability - a measure of the readiness of the system
  It is the probability that the system is operational at a given instant of time
  A 0.999999 availability means that the system is non-operational for at most one hour in a million hours
  A system with high availability may in fact fail. However, its failure frequency and recovery time must be small enough to achieve the desired availability
  Soft real-time systems such as telephone switching and airline reservation require high availability
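The trade-off between failure frequency and recovery time can be made concrete with a small calculation. The sketch below is not from the slides: it assumes the common steady-state model in which availability equals MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to recover. The function names and the two example systems are hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is operational.

    Assumes the system alternates between an 'up' period with mean length
    MTTF and a 'down' (recovery) period with mean length MTTR.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours(availability: float, period_hours: float) -> float:
    """Expected non-operational time within a period of the given length."""
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    # The slide's example: 0.999999 availability over a million hours.
    print(expected_downtime_hours(0.999999, 1_000_000))

    # Hypothetical system A: fails rarely, but recovery is slow (10 hours).
    print(steady_state_availability(mttf_hours=10_000, mttr_hours=10))

    # Hypothetical system B: fails every 10 hours, but recovers in 36 ms.
    print(steady_state_availability(mttf_hours=10, mttr_hours=0.00001))
```

Under this model, system B is more available than system A (roughly 0.999999 versus 0.999) even though it fails a thousand times more often, which is the point of the remark that a highly available system may in fact fail.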
8
Quantitative Dependability Measures
Reliability - a measure of continuous delivery of correct service
  It is the probability of surviving (potentially despite failures) over an interval of time
  May also be evaluated as time to failure
  For example, the reliability requirement might be stated as a 0.999999 reliability for a 10-hour mission. In other words, the probability of failure during the mission may be at most 10^-6
  Hard real-time systems such as flight control and process control, in which a failure could mean loss of life, demand high reliability
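The 10-hour mission example can be turned into a failure-rate budget. The sketch below is an illustration only: it assumes a constant failure rate, so that reliability over a mission of length t is R(t) = exp(-rate * t). That exponential model is an assumption added here, not something the slides state, and the function names are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """Probability of surviving the whole mission, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * mission_hours)


def max_failure_rate(required_reliability: float, mission_hours: float) -> float:
    """Largest constant failure rate that still meets the reliability requirement."""
    return -math.log(required_reliability) / mission_hours


def mean_time_to_failure(failure_rate_per_hour: float) -> float:
    """Under a constant failure rate, MTTF is the reciprocal of the rate."""
    return 1.0 / failure_rate_per_hour


if __name__ == "__main__":
    # The slide's requirement: 0.999999 reliability for a 10-hour mission.
    rate = max_failure_rate(0.999999, mission_hours=10)
    print(rate)                        # allowed failures per hour
    print(mean_time_to_failure(rate))  # corresponding MTTF in hours
    print(1.0 - reliability(rate, 10)) # probability of failure during the mission
```

Under this assumption the requirement corresponds to a failure rate of about 10^-7 per hour, that is, a mean time to failure on the order of ten million hours.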
9
Fault, Error, and Failure
The adjudged or hypothesized cause of an error is called a fault
An error is a manifestation of a fault in a system, in which the logical state of an element differs from its intended value
A service failure occurs if the error propagates to the service interface and causes the service delivered by the system to deviate from correct service
The failure of a component causes a permanent or transient fault in the system that contains the component
Service failure of a system causes a permanent or transient external fault for the other system(s) that receive service from the given system
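The chain from fault to error to failure can be illustrated with a toy component. The sketch below is hypothetical and not from the slides: `EventCounter` contains a design fault (its count wraps at 256) that stays dormant until the 256th event, at which point the internal state becomes erroneous, and the error turns into a service failure once `report()` exposes it at the service interface.

```python
class EventCounter:
    """Toy component whose service is to report how many events it recorded."""

    def __init__(self) -> None:
        self._count = 0  # internal state

    def record(self) -> None:
        # FAULT: the count is kept modulo 256, as if stored in a single byte.
        # The fault is dormant for the first 255 events.
        self._count = (self._count + 1) % 256

    def report(self) -> int:
        # Service interface: an erroneous state that reaches this point
        # becomes a service failure.
        return self._count


def run(events: int) -> tuple:
    """Return (delivered, intended) counts after recording `events` events."""
    counter = EventCounter()
    for _ in range(events):
        counter.record()
    return counter.report(), events


if __name__ == "__main__":
    print(run(255))  # fault still dormant: delivered service is correct
    print(run(256))  # fault activated: delivered count deviates from intended
```

Any system that relies on `report()` would in turn see this service failure as an external fault in its own operation.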
10
Fault
Faults can arise during all stages in a computer system's evolution - specification, design, development, manufacturing, assembly, and installation - and throughout its operational life
Most faults that occur before full system deployment are discovered through testing and eliminated
Faults that are not removed can reduce a system's dependability when it is in the field
A fault can be classified by its duration, nature of output, and correlation to other faults (and many other criteria)
11
Fault Types - Based on Duration
Permanent faults are caused by irreversible device/software failures within a component due to damage, fatigue, improper manufacturing, or bad design and implementation
  Permanent software faults are also called Bohrbugs
  Easier to detect
Transient/intermittent faults are triggered by environmental disturbances or incorrect design
  Transient software faults are also referred to as Heisenbugs
  Studies show that Heisenbugs account for the majority of software faults
  Harder to detect
12
Fault Types - Based on Nature of Output
Malicious fault: a fault that causes a unit to behave arbitrarily or maliciously. Also referred to as a Byzantine fault
  A sensor sending conflicting outputs to different processors
  A compromised software system that attempts to cause service failure
Non-malicious faults: the opposite of malicious faults
  Faults that are not caused with malicious intention
  Faults that exhibit themselves consistently to all observers, e.g., fail-stop
  A fail-stop system simply stops executing once it fails
Malicious faults are much harder to detect than non-malicious faults
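The difference in detectability can be sketched with the sensor example. The code below is a hypothetical illustration, not a detection protocol from the slides: a fail-stop sensor shows the same symptom (no reading) to every processor, so each one can notice the fault on its own, whereas a Byzantine sensor gives each processor a plausible but different value, so the fault only shows up when the processors compare what they received. All names are made up for this example.

```python
from typing import Dict, List, Optional

Readings = Dict[str, Optional[float]]


def fail_stop_sensor(processors: List[str], true_value: float, failed: bool) -> Readings:
    """A failed fail-stop sensor delivers nothing (None) to every processor."""
    return {p: (None if failed else true_value) for p in processors}


def byzantine_sensor(processors: List[str], true_value: float) -> Readings:
    """A Byzantine sensor sends a different value to each processor."""
    return {p: true_value + i for i, p in enumerate(processors)}


def detected_locally(reading: Optional[float]) -> bool:
    """A single processor can detect a fail-stop fault from a missing reading."""
    return reading is None


def detected_by_comparison(readings: Readings) -> bool:
    """Conflicting outputs only become visible when observers compare readings."""
    return len(set(readings.values())) > 1


if __name__ == "__main__":
    procs = ["p1", "p2", "p3"]
    stopped = fail_stop_sensor(procs, 20.0, failed=True)
    conflicting = byzantine_sensor(procs, 20.0)
    print([detected_locally(r) for r in stopped.values()])
    print([detected_locally(r) for r in conflicting.values()])
    print(detected_by_comparison(conflicting))
```

In a real system the comparison step itself has to cope with faulty participants, which is why Byzantine faults call for dedicated agreement protocols rather than a simple check like this one.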
13
Fault Types - Based on Correlation
Component faults may be independent of one another or correlated
  A fault is said to be independent if it does not directly or indirectly cause another fault
  Faults are said to be correlated if they are related. Faults could be correlated due to physical or electrical coupling of components
Correlated faults are more difficult to detect than independent faults
14
Approaches to Achieving Dependability
Fault Avoidance - how to prevent, by construction, the fault occurrence or introduction
Fault Removal - how to minimize, by verification, the presence of faults
Fault Tolerance - how to provide, by redundancy, a service complying with the specification in spite of faults
Fault Forecasting - how to estimate, by evaluation, the presence, the creation, and the consequence of faults
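Fault tolerance by redundancy can be sketched with majority voting over replicas. The example below is one well-known instance of the idea (triple modular redundancy), chosen here for illustration and not named in the slides. It assumes replicas are plain functions that return hashable results and that a strict majority of them are correct; `majority_vote` and `replicated_call` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def replicated_call(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run the same request on every replica and mask minority faults by voting."""
    return majority_vote([replica(x) for replica in replicas])


if __name__ == "__main__":
    def correct(x: int) -> int:
        return x * x

    def faulty(x: int) -> int:
        return x * x + 1  # a replica that returns a wrong value

    # One faulty replica out of three is masked by the two correct ones.
    print(replicated_call([correct, faulty, correct], 4))
```

With three replicas the vote masks a single faulty replica; if two replicas return different wrong values, no majority exists and the sketch raises an error instead of delivering an incorrect result.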