Published by Chad Snow. Modified over 8 years ago.
1
“Designing Masking Fault Tolerance via Nonmasking Fault Tolerance” Oğuzhan YILDIRIM – Erkin GÜVEL Boğaziçi University Computer Engineering Department oguzhan.yildirim@boun.edu.tr erkin.guvel@boun.edu.tr
2
Introduction Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. Nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification.
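The distinction can be sketched in code. This is a minimal illustration, assuming an execution is modeled as a finite list of booleans that record whether the specification holds at each step. The trace model and the function names are assumptions made for illustration; they are not part of the presentation.

```python
from typing import List


def is_masking(trace: List[bool]) -> bool:
    """Masking: the specification holds at every step, faults or not."""
    return all(trace)


def is_nonmasking(trace: List[bool]) -> bool:
    """Nonmasking (finite-trace approximation): the specification may be
    violated for a while, but the trace ends in a state where it holds."""
    return len(trace) > 0 and trace[-1]


# Hypothetical traces: True means the specification holds at that step.
masked = [True, True, True, True]
recovered = [True, False, False, True, True]   # violated, then converges
stuck = [True, False, False, False]            # never recovers
```

Here `recovered` is acceptable for a nonmasking program but not for a masking one, while `stuck` is acceptable for neither.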
3
Objectives We will show that a practical method for designing masking fault-tolerance is to first design nonmasking fault-tolerance and then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance.
4
Outline Novel method for the design of “masking” fault-tolerant systems Actions –Critical –Noncritical Overview of Methodology Case Study
5
The Importance It is often simpler and cheaper to design nonmasking fault-tolerance than to design masking fault-tolerance. It is often simpler and cheaper to design safe programs or programs with well-defined failure than to design masking fault-tolerant programs.
6
Critical Actions Critical actions are those actions whose execution in the presence of faults can violate the system specification. –In database transactions, the actions that produce an output or commit a result are critical.
7
Noncritical Actions Noncritical actions need not mask faults; in other words, when noncritical actions execute, the system state may be “unsafe”. However, executing noncritical actions in unsafe states must not allow the system to remain in unsafe states forever; otherwise the system will never execute its critical actions.
8
Overview First Stage: The system is designed so that after faults stop occurring, subsequent execution of the system actions guarantees that the system reaches a safe state. Second Stage: The critical actions are modified so that their execution always masks faults.
9
First Stage In this stage, first, a nonmasking fault-tolerant version of the program is designed. Then, certain actions of the nonmasking fault-tolerant program are distinguished as being critical. No specific approach is prescribed; many acceptable methods exist.
10
First Stage (Cont.) Possible approaches: design the tolerance requirement hand-in-hand with the other requirements of the program, or transform an existing fault-intolerant program into one that is nonmasking fault-tolerant.
11
Second Stage In this stage, first, a “safe predicate” is identified for each critical action. Then, each critical action is augmented so that it is executed only in states where its safe predicate holds. Finally, the augmentation is shown to itself mask the effects of faults. The resulting program is masking fault-tolerant. No specific approach is prescribed; many acceptable methods exist.
12
Second Stage (Cont.) Possible approaches: add actions that check whether the program state satisfies the safe predicate and allow execution to proceed only when the check succeeds, or enforce real-time constraints on the execution of critical actions.
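The first approach, guarding a critical action with a check, can be sketched as follows. This is a simplified single-process illustration, not the presentation's construction: the names `guard`, `no_errors`, and `commit` and the dictionary state are hypothetical, and the sketch does not show that the check itself masks faults, which the method also requires.

```python
from typing import Callable, Dict

State = Dict[str, int]


def guard(safe: Callable[[State], bool],
          action: Callable[[State], None]) -> Callable[[State], bool]:
    """Wrap a critical action so it runs only when its safe predicate holds."""
    def guarded(state: State) -> bool:
        if not safe(state):
            return False          # unsafe state: skip the critical action
        action(state)
        return True
    return guarded


# Hypothetical example: committing is safe only with no pending errors.
def no_errors(state: State) -> bool:
    return state["errors"] == 0


def commit(state: State) -> None:
    state["committed"] += 1


guarded_commit = guard(no_errors, commit)
```

Calling `guarded_commit` in a state with `errors > 0` leaves the state unchanged and returns `False`; the noncritical actions are then expected to bring the system back to a state where the predicate holds.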
13
Application Case Study: Leader Election –System Logic –Arora’s Program: Spanning Tree –Leader Election Study
14
System Logic A system consists of processes that have unique integer ids, and channels that each connect a unique pair of processes. At any instant, each process is either “up” or “down”. Systems are subject to fail-stop faults and repairs of processes.
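A minimal sketch of this system model, assuming an in-memory representation. The class name `System` and its methods are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class System:
    up: Dict[int, bool] = field(default_factory=dict)        # id -> up/down
    channels: Set[FrozenSet[int]] = field(default_factory=set)

    def add_process(self, pid: int) -> None:
        if pid in self.up:
            raise ValueError(f"duplicate process id {pid}")   # ids are unique
        self.up[pid] = True

    def connect(self, a: int, b: int) -> None:
        if a == b or a not in self.up or b not in self.up:
            raise ValueError("a channel connects two distinct known processes")
        self.channels.add(frozenset((a, b)))   # a set keeps one channel per pair

    def fail_stop(self, pid: int) -> None:
        self.up[pid] = False

    def repair(self, pid: int) -> None:
        self.up[pid] = True

    def up_neighbors(self, pid: int) -> Set[int]:
        """Neighbors of pid that are currently up."""
        return {q for ch in self.channels if pid in ch
                for q in ch if q != pid and self.up[q]}
```

Storing each channel as a `frozenset` of two ids makes the pair unordered, so connecting 1 to 2 and then 2 to 1 yields a single channel.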
15
Arora’s Nonmasking Program We start from Arora’s nonmasking fault-tolerant program for distributed maintenance of a rooted spanning tree. Specifically, it allows faults to yield program states where there are multiple trees and unrooted trees. To deal with unrooted trees, the program has actions that inform all processes in unrooted trees that they have no root process.
16
Leader Election Problem The relevant action is the one that declares a process to be the leader. A unique process is to be elected as the leader; at no point during election may multiple processes declare themselves as leaders. The goal is to design a masking fault-tolerant program for leader election.
17
Leader Election Problem Our tree maintenance program elects a unique process as leader. However, in the presence of faults, our tree maintenance program allows multiple processes to declare themselves as leaders.
18
Defining Critical Action In keeping with the proposed method, we proceed by identifying the critical actions in the nonmasking fault-tolerant tree maintenance program. After this identification, the nonmasking fault-tolerant program is augmented to yield a masking fault-tolerant system.
19
Defining Critical Action (Cont.) The critical actions in the tree maintenance program are the actions that elect a process as leader. Such an action is safely executed only in states where no process is elected as leader.
20
Section 2: Checking that the critical action is executed in a safe state, and guaranteeing that the critical action is implemented in a masking fault-tolerant way.
21
Checking Critical Action A diffusing computation is used to check whether the critical action is executed in a safe state. It verifies the safe predicate by reaching all other processes and determining that no process is leader.
22
Diffusing Computation The diffusing computation we design consists of two phases: “propagate” and “complete”. The computation proceeds top-down: upon receiving a diffusing computation from its parent in the tree, a process enters the propagate phase and propagates the computation to all of its neighbors. Upon receiving a response from all of its neighbors, the process sends a response to its parent and changes its phase to complete.
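The two phases can be sketched with a sequential simulation. This is a simplification under stated assumptions: recursion stands in for message passing, the computation is propagated only to tree children rather than to all neighbors, and no faults occur. The function `diffuse` and the example tree are hypothetical.

```python
from typing import Dict, List, Set


def diffuse(node: int,
            children: Dict[int, List[int]],
            leaders: Set[int],
            phase: Dict[int, str]) -> bool:
    """Return True if no process in node's subtree is a leader."""
    phase[node] = "propagate"                 # computation received from parent
    result = node not in leaders
    for child in children.get(node, []):      # propagate down the tree
        # A child's response is the result for its whole subtree.
        if not diffuse(child, children, leaders, phase):
            result = False
    phase[node] = "complete"                  # all responses in: reply to parent
    return result


# Hypothetical tree rooted at process 1.
tree = {1: [2, 3], 2: [4]}
```

Starting `diffuse` at the root returns `True` only when every process was reached and none is a leader, which is the safe predicate for the election action.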
23
Masking The Critical Action If a child fail-stops, the parent completes its diffusing computation prematurely with result = false. A new diffusing computation is then created and assigned a sequence number. In this way, masking is achieved through redundancy: the diffusing computation is repeated.
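The retry idea can be sketched as a loop over sequence numbers. This is an illustration under assumptions, not the presentation's program: `attempt` is a hypothetical callback standing for one diffusing computation, the `premature` flag and the `max_rounds` bound are introduced only for this sketch, and who starts the new computation is not specified in the slides.

```python
from typing import Callable, Tuple

# attempt(seq) -> (result, premature): a fail-stopped child makes the
# parent report result=False with premature=True.
Attempt = Callable[[int], Tuple[bool, bool]]


def check_safe(attempt: Attempt, max_rounds: int = 10) -> Tuple[int, bool]:
    """Rerun the diffusing computation under fresh sequence numbers until a
    round completes without a fault; return (sequence number, result)."""
    for seq in range(max_rounds):
        result, premature = attempt(seq)
        if not premature:
            return seq, result     # answer from a fault-free round
        # Premature False: never act on it; retry with the next sequence number.
    raise RuntimeError("no fault-free round within max_rounds")


def flaky_attempt(seq: int) -> Tuple[bool, bool]:
    # Hypothetical fault pattern: rounds 0 and 1 are hit by a fail-stop.
    return (False, True) if seq < 2 else (True, False)
```

Because a premature result is always false, a fault can delay the critical action but cannot cause it to run in an unsafe state.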
24
Fail-Stop Repair Recomputation in case of a fault [Figure labels: ROOT waiting for answer; FAULT; diffusing computation]
25
Conclusion In this presentation, we presented a novel method for designing masking fault-tolerant programs. First, a nonmasking fault-tolerant program was designed to ensure that once faults stop occurring the program eventually reaches a safe state. Then, a masking component was designed to ensure that the composite program is masking fault-tolerant.
26
References –Designing Masking Fault-tolerance via Nonmasking Fault-tolerance, Department of Computer and Information Science, The Ohio State University, Columbus, Ohio. –B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, pages 220-232, 1975. –J.-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. Proceedings of the 15th International Symposium on Fault-Tolerant Computing, pages 2-11, 1985. –Internet research
27
Thanks For Listening… Any Questions?