Dependability
Dependability is the ability to avoid service failures that are more frequent or more severe than desired. It is an important goal of distributed systems.
Dependability is the trustworthiness of a computing system that allows reliance to be justifiably placed on the service it delivers: the system is worthy of confidence, and users can be confident about relying on its service.

Requirements for dependable systems (these attributes are a way to assess the dependability of a system):
  Availability (readiness for usage): the probability that the system is available to perform its functions at any given moment.
    99.999% availability (five 9s) corresponds to about 5 minutes of downtime per year.
  Reliability (continuity of service delivery): the ability of the system to run continuously without failure.
    Down for 1 ms every hour: over 99.9999% availability, but highly unreliable.
    Down for two weeks every year: high reliability, but only about 96% availability.
  Safety (very low probability of catastrophes): when a system temporarily fails to operate correctly, nothing catastrophic happens. Example: control systems for airplanes and nuclear power plants.
  Maintainability: how easily a failed system can be repaired.
  Security: covered in Chapter 9.

Failures are caused by faults, so dependability requires dealing with faults. Fault tolerance means that a system can provide its services even in the presence of faults.

Topics of Chapter 8: process resilience, reliable communication, failure recovery, distributed commit.
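The availability figures quoted on the Dependability slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and a year is assumed to have 365 days.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60
SECONDS_PER_YEAR = MINUTES_PER_YEAR * 60


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(down_seconds, period_seconds):
    """Fraction of a period during which the system is up."""
    return 1.0 - down_seconds / period_seconds


# Five 9s: roughly 5 minutes of downtime per year.
five_nines_downtime = downtime_minutes_per_year(0.99999)

# Down for 1 ms every hour: very high availability, yet it fails every hour.
one_ms_per_hour = availability_from_downtime(0.001, 3600)

# Down for two weeks every year: about 96% availability.
two_weeks_per_year = availability_from_downtime(14 * 24 * 3600, SECONDS_PER_YEAR)
```

The last two values show why availability and reliability are separate attributes: the first system is almost always up but never runs for more than an hour without failing, while the second runs for weeks without failure yet is unavailable for a noticeable part of the year.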
Failures and Faults
Building a dependable system comes down to preventing failures.
A failure of a system occurs when the system cannot meet its promises.
Failures are caused by faults. A fault is an anomalous condition.
There are three categories of faults:
  Transient faults: occur once and do not recur (e.g., wireless communication being interrupted by external interference).
  Intermittent faults: recur irregularly (e.g., a loose contact on a connector).
  Permanent faults: persist until the faulty component is replaced (e.g., software bugs).
Types of Failures
  Fail-stop: the server stops in a way that lets clients tell that it has halted.
  Fail-silent: clients do not know that the server has halted.
  State transition failure: execution of a component brings it into a wrong state.
  Arbitrary failures: also known as Byzantine failures.
Fault Tolerance
In a single-machine system, a failure is almost always total.
  All components are affected and the entire system may be brought down (e.g., OS crash, disk failures).
In distributed systems, partial failures are possible.
  When one component fails, it may affect some components while leaving others unaffected.
Fault tolerance means that a system can provide its services even in the presence of faults.
Fault tolerance requires:
  preventing faults and failures from affecting other components of the system;
  automatically recovering from partial failures.
A distributed system consists of multiple independent nodes, so the probability that some failure occurs is the probability that at least one component fails. This probability grows with the number of components.
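To make "Prob(failure) = Prob(any one component fails)" concrete: if the nodes fail independently with probabilities p_1, ..., p_n, the probability that at least one fails is 1 - (1 - p_1)...(1 - p_n). The helper below is a minimal sketch under that independence assumption; the function name and the example numbers are hypothetical.

```python
def prob_any_failure(fail_probs):
    """Probability that at least one component fails,
    assuming the components fail independently."""
    prob_all_ok = 1.0
    for p in fail_probs:
        prob_all_ok *= (1.0 - p)
    return 1.0 - prob_all_ok


# Hypothetical numbers: each node fails with probability 0.01.
single_node = prob_any_failure([0.01])
ten_nodes = prob_any_failure([0.01] * 10)       # roughly 0.096
hundred_nodes = prob_any_failure([0.01] * 100)  # roughly 0.63
```

Even with reliable individual nodes, partial failures become the common case as the system grows, which is why they must be tolerated rather than treated as exceptional.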
Failure Masking
Failure masking is a fault tolerance technique that hides the occurrence of failures from other processes.
The most common approach to failure masking is redundancy.
Three types of redundancy:
  Information redundancy: add extra bits to allow recovery from garbled bits.
  Time redundancy: repeat an action if needed.
  Physical redundancy: add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components.
RAID disks and backup name servers are examples of physical redundancy.
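As an illustration of the first two kinds of redundancy, the sketch below uses a 3x repetition code for information redundancy and a simple retry loop for time redundancy. Both are stand-ins chosen for brevity (real systems typically use stronger codes such as Hamming codes, and retries with timeouts); the function names are assumptions made for this example.

```python
def encode(bits):
    """Information redundancy: repeat every bit three times."""
    return [b for b in bits for _ in range(3)]


def decode(coded):
    """Recover each bit by majority over its three copies.
    Corrects one garbled copy per original bit."""
    decoded = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        decoded.append(1 if sum(triple) >= 2 else 0)
    return decoded


def retry(action, attempts=3):
    """Time redundancy: repeat an action until it succeeds
    or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # a transient fault, retry
            last_error = exc
    raise last_error
```

Time redundancy is most useful for transient or intermittent faults; retrying does not help against a permanent fault.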
An Example of Physical Redundancy
Place voters after each stage to pick the majority outcome of the stage. Each voter is responsible for picking the majority winner of its three inputs.
Figure: (a) No redundancy. (b) Triple modular redundancy: the effect of a single component failing is completely masked.
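The voting idea behind triple modular redundancy can be sketched as follows. This is a simplified model, not the circuit in the figure: each stage is represented by three replica functions and a single voter (the names `vote`, `tmr_stage`, and `run_pipeline` are hypothetical), whereas a hardware design would usually replicate the voters as well.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replica outputs")
    return value


def tmr_stage(replicas, x):
    """Run three replicas of one stage on the same input and vote."""
    return vote([replica(x) for replica in replicas])


def run_pipeline(stages, x):
    """Feed the voted output of each stage into the next stage."""
    for replicas in stages:
        x = tmr_stage(replicas, x)
    return x
```

With one faulty replica per stage the pipeline still produces the correct result. Two faulty replicas in the same stage that happen to agree would outvote the correct one, which is why this scheme masks only a single failure per stage.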
Process Resilience
Protection against process failures can be achieved by organizing several identical processes into a group; the group can then mask one or more faulty processes among its members.
A group of replicated processes is said to be k fault tolerant if it can survive k faults and still meet its specifications.
Assume all requests arrive in the same order at all servers in a process group (this requires the use of atomic multicast). Only then are we sure that all members do exactly the same thing.
  With crash failures, k+1 processes are sufficient to survive k faults.
  With Byzantine failures, processes keep running even when faulty and may produce erroneous, random, or malicious results. 2k+1 processes are required to survive k faults (the client just believes the majority).
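The two group-size rules can be written down directly, together with the client-side majority rule for the Byzantine case. The sketch below assumes the client collects one reply per group member; the function names are invented for this example.

```python
from collections import Counter


def min_group_size(k, failure_model):
    """Smallest group that survives k faults under the given model."""
    if failure_model == "crash":
        return k + 1        # one surviving process is enough
    if failure_model == "byzantine":
        return 2 * k + 1    # k+1 correct replies outvote k faulty ones
    raise ValueError("unknown failure model: " + failure_model)


def client_decision(replies, k):
    """Believe the majority: accept a value only if at least k+1
    replies agree on it, otherwise return None."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None
```

With 2k+1 members and at most k faulty ones, the correct value always appears at least k+1 times, and no wrong value can reach that count.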
Agreement in Faulty Systems
Distributed processes often need to agree on something (e.g., electing a coordinator, committing a transaction).
The goal of distributed agreement algorithms is to have all the non-faulty processes reach consensus on some issue within a finite number of steps.
Can consensus be reached with non-faulty processes and an unreliable communication channel? Answer: No!
Can consensus be reached with faulty (Byzantine) processes and a reliable channel? Answer: It depends.
Two-army problem: two blue armies must agree to attack simultaneously in order to defeat the white army.
  The blue armies coordinate through a messenger.
  The messenger can be captured by the white army.
  Can the two blue armies reach agreement?
Conditions under which consensus is possible (assume processes may be faulty and communication is reliable).
[Table: whether consensus is possible, by process behavior (synchronous or asynchronous), communication delay (bounded or unbounded), message order (unordered or ordered), and message transmission (unicast or multicast).]
A system is synchronous iff the processes operate in lock-step mode (i.e., there is a constant c ≥ 1 such that if any process has taken c+1 steps, every other process has taken at least one step).
Byzantine Agreement Problem
Byzantine agreement problem: can N generals reach consensus about each other’s troop strengths when the communication channel is perfect but some of the generals are traitors who will lie to prevent agreement?
In other words, N army generals head different divisions, but k of them are traitors (faulty) and try to prevent the others from reaching agreement by feeding them incorrect information. The question is whether the loyal generals can still reach agreement.
Formally, there are N processes, and each process i provides a value vi to the others. The goal is to let each process construct a vector V of length N such that V[i] = vi if process i is non-faulty; otherwise V[i] is undefined.
Assume that processes are synchronous, messages are unicast with ordering preserved, and communication delay is bounded. Then, with k faulty processes, agreement can be achieved if there are 2k+1 non-faulty processes [Lamport et al., 1982]. This means that more than 2/3 of the generals must be loyal.
In Lamport’s paper, the Byzantine generals problem requires two conditions to be met:
  1) All loyal lieutenants obey the same order.
  2) If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.
The Byzantine agreement problem for 3 non-faulty processes and 1 faulty process with vi=i. Consensus is reached for the non-faulty processes. (a) Each process sends its value to the others. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives after each process passes its vector from (b) to every other process.
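The steps (a) to (c), followed by a per-element majority vote, can be simulated in a few lines. The sketch below is a simplified model under stated assumptions: processes are numbered 1..n with vi = i, a faulty process sends a fresh, distinct bogus value in every message (so its lies never agree with each other), and the names `byzantine_agreement` and `UNKNOWN` are made up for this example.

```python
from collections import Counter
from itertools import count

UNKNOWN = None  # marks an element for which no majority exists


def byzantine_agreement(n, faulty):
    """Simulate the exchange for processes 1..n with v_i = i.
    Returns the result vector of every non-faulty process."""
    fresh = count(1)

    def lie():
        return "lie%d" % next(fresh)  # never equal to any other value

    procs = range(1, n + 1)

    # (a)+(b): every process announces its value; each process p
    # assembles a vector of the values it received.
    vectors = {
        p: [i if (i == p or i not in faulty) else lie() for i in procs]
        for p in procs
    }

    results = {}
    for p in procs:
        if p in faulty:
            continue
        # (c): p receives the vector of every other process;
        # a faulty process sends arbitrary contents.
        received = [
            [lie() for _ in procs] if q in faulty else list(vectors[q])
            for q in procs if q != p
        ]
        # Final step: take the majority value of each element.
        result = []
        for i in range(n):
            value, votes = Counter(v[i] for v in received).most_common(1)[0]
            result.append(value if votes > len(received) / 2 else UNKNOWN)
        results[p] = result
    return results


four = byzantine_agreement(4, faulty={3})   # 3 non-faulty, 1 faulty
three = byzantine_agreement(3, faulty={3})  # 2 non-faulty, 1 faulty
```

In the run with four processes, every non-faulty process ends up with the same vector: the values of processes 1, 2, and 4, and UNKNOWN for the faulty process 3. In the run with three processes, each non-faulty process receives only two vectors, no element reaches a majority, and every entry stays UNKNOWN.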
The Byzantine agreement problem for 2 non-faulty processes and 1 faulty process. The algorithm fails to produce agreement.
Process Resilience
Protection against process failures can be achieved by organizing several identical processes into a group.
  Flat group: all processes are equal; the processes make decisions collectively.
    No single point of failure, but decision making is more complicated.
  Hierarchical group: a single coordinator makes all decisions.
    Decision making is simpler, but the coordinator is a single point of failure.
The group is transparent to its users: the whole group is dealt with as a single process.