Middleware for Fault Tolerant Applications

Lihua Xu and Sheng Liu, June 5, 2003

Outline
Basic technologies in fault tolerance
Middleware for fault tolerant applications: Egida, AQuA

Why Fault Tolerance?
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” (Leslie Lamport, May 1987)

Basic Technologies in Fault Tolerant Distributed Systems
Hardened hardware component technologies
Fault detection and membership maintenance
Log-based scheme and checkpointing

Hardened hardware component technologies
Hardened processor modules: pair of self-checking processors (PSP)
RAID (redundant array of inexpensive disks): popular even in database-centric business computing applications

Fault detection and membership maintenance
Fault detection
  Timeout
  Comparison of the results of repeated or redundant executions
  Error-detection and error-correction codes
  Acceptance test: tests the reasonableness of intermediate computation results
Membership maintenance
  Simplest version: the master node makes a periodic roll call of the other nodes
  Heartbeat message exchange
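The timeout and heartbeat ideas above can be combined into a very small failure detector. The following is a minimal sketch, not taken from the slides: the class name `HeartbeatMonitor`, its methods, and the 3-second timeout are all assumptions chosen for illustration. Time is passed in explicitly so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical detector: suspects any node that has been silent too long."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated before suspicion
        self.last_seen = {}      # node id -> time of the last heartbeat received

    def heartbeat(self, node, now):
        """Record a heartbeat from `node` received at time `now`."""
        self.last_seen[node] = now

    def members(self, now):
        """Nodes still considered alive at time `now`."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def suspected(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("a", now=0.0)
monitor.heartbeat("b", now=0.0)
monitor.heartbeat("a", now=2.5)      # "b" stays silent after its first heartbeat
alive = monitor.members(now=4.0)     # only "a" is within the timeout
failed = monitor.suspected(now=4.0)  # "b" has been silent for 4 seconds
```

A timeout can only suspect a failure: a slow node or a delayed message looks the same as a crashed node, so the timeout value trades detection speed against false suspicions.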

Log-based scheme and checkpointing
Log-based schemes record, on stable storage, information describing all the modifications that a transaction makes to the data it accesses. Checkpointing is a technique to minimize the time taken to recover after a system crash.
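How a log and a checkpoint work together can be sketched with a tiny key-value store. This is an illustrative sketch under assumptions, not a scheme described in the slides: the class `LoggedStore`, the file names, and the JSON record format are all hypothetical. Each update is written to the log before it is applied, a checkpoint saves the whole state and empties the log, and recovery loads the checkpoint and replays whatever the log still holds.

```python
import json
import os
import tempfile


class LoggedStore:
    """Hypothetical key-value store with a write-ahead log and checkpoints."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "log.jsonl")
        self.ckpt_path = os.path.join(directory, "checkpoint.json")
        self.data = {}

    def put(self, key, value):
        # Record the modification on stable storage before applying it.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Write the full state to a temporary file, then rename it into place.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
        # The checkpoint now covers every logged update, so the log can be emptied.
        open(self.log_path, "w").close()

    @classmethod
    def recover(cls, directory):
        store = cls(directory)
        if os.path.exists(store.ckpt_path):
            with open(store.ckpt_path) as f:
                store.data = json.load(f)
        if os.path.exists(store.log_path):
            with open(store.log_path) as f:
                for line in f:           # replay updates made after the checkpoint
                    record = json.loads(line)
                    store.data[record["key"]] = record["value"]
        return store


with tempfile.TemporaryDirectory() as d:
    s = LoggedStore(d)
    s.put("x", 1)
    s.checkpoint()
    s.put("y", 2)                        # recorded only in the log
    recovered = LoggedStore.recover(d).data
```

Recovery only has to replay the updates logged since the last checkpoint, which is why checkpointing shortens recovery time. Replaying a `put` twice is harmless here, so a crash between writing the checkpoint and emptying the log does not corrupt the state.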

Middleware for Fault Tolerant Applications

Egida
Egida is an object-oriented toolkit designed to support transparent rollback recovery for low-overhead fault tolerance.

Log-based rollback recovery protocols
Log information is recorded on stable storage during failure-free execution.
That information is used to recover after a failure.
The protocols come in several variants, including checkpointing and message logging.

Checkpointing

Message Logging
Pessimistic logging allows processes to communicate only from recoverable states.
Optimistic logging allows processes to communicate with other processes even from states that are not yet recoverable.
Causal logging allows the possibility that a state from which a process communicates may become unrecoverable because of a failure, but only if no correct process depends on that state.
A correct process is one that exhibits no failures at any point in the execution under consideration. So a process that crashes at some point is “non-failed” before that point, but is not “correct” before that point.
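The pessimistic rule can be illustrated with a receiver that logs every message before the message is allowed to affect its state. This is a minimal sketch under assumptions: the class `PessimisticProcess` is hypothetical, a plain Python list stands in for stable storage that survives the crash, and the message handler is deterministic (it adds the message to a running total).

```python
class PessimisticProcess:
    """Hypothetical process that logs each message before delivering it."""

    def __init__(self, stable_log):
        self.stable_log = stable_log   # stand-in for storage that survives a crash
        self.state = 0

    def _apply(self, msg):
        # Deterministic handler: the same messages in the same order
        # always produce the same state.
        self.state += msg

    def deliver(self, msg):
        self.stable_log.append(msg)    # log first, synchronously
        self._apply(msg)               # only then let the message change the state

    @classmethod
    def recover(cls, stable_log):
        """Rebuild the pre-crash state by replaying the logged messages in order."""
        process = cls(stable_log)
        for msg in list(stable_log):
            process._apply(msg)
        return process


log = []
p = PessimisticProcess(log)
p.deliver(5)
p.deliver(7)
state_before_crash = p.state
del p                                   # simulate a crash: volatile state is lost
q = PessimisticProcess.recover(log)
state_after_recovery = q.state
```

Because the log write completes before the message takes effect, every state from which the process could later send a message is already recoverable. An optimistic variant would append to the log asynchronously, which lowers the failure-free overhead but allows communication from states that may not yet be recoverable.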

Deconstructing Log-Based Rollback-Recovery Protocols
The diversity of rollback-recovery protocols reflects the heterogeneity in the requirements of applications. Beneath this diversity lies a simple event-driven structure that all these protocols share: they are all interested in the same set of “relevant” events.

Relevant Events
Non-deterministic events: events whose outcome may change for different executions of the same program.
Dependency-generating events: these events can increase the number of processes that depend on the non-deterministic events executed by a process.
Output-commit events: these events can make the external environment depend on the non-deterministic events executed by a process.
Checkpointing events: these events instruct the protocols to write to stable storage the state of one or more processes.
Failure-detection events: these events are generated on detecting the failure of one or more processes.

A Simple Language Specifying Rollback-recovery Protocols
A protocol is defined in terms of the actions it takes in response to non-deterministic events, dependency-generating events, output-commit events, checkpointing events and failure-detection events. Implementing a specific protocol amounts to selecting the set of actions performed in response to each relevant event. A simple language is used to specify the rollback-recovery protocols.
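The idea that a protocol is just a choice of actions per relevant event can be sketched as a lookup table. The sketch below is not Egida's specification language, whose syntax is not shown in these slides; the `Event` enum, the `make_protocol` helper, and all action names are assumptions made for illustration.

```python
from enum import Enum, auto


class Event(Enum):
    """The five kinds of relevant events listed above."""
    NON_DETERMINISTIC = auto()
    DEPENDENCY_GENERATING = auto()
    OUTPUT_COMMIT = auto()
    CHECKPOINT = auto()
    FAILURE_DETECTION = auto()


def make_protocol(actions):
    """Return a handler that performs the actions bound to each event, in order."""
    def handle(event, trace):
        for action in actions.get(event, []):
            trace.append(action)   # a real action would do work; here we record its name
        return trace
    return handle


# Two hypothetical protocols that differ only in how determinants are logged.
pessimistic = make_protocol({
    Event.NON_DETERMINISTIC: ["create_determinant", "log_determinant_sync"],
    Event.CHECKPOINT: ["write_checkpoint"],
    Event.FAILURE_DETECTION: ["rollback", "replay_log"],
})
optimistic = make_protocol({
    Event.NON_DETERMINISTIC: ["create_determinant", "log_determinant_async"],
    Event.OUTPUT_COMMIT: ["flush_log"],
    Event.CHECKPOINT: ["write_checkpoint"],
    Event.FAILURE_DETECTION: ["rollback", "replay_log"],
})

pessimistic_trace = pessimistic(Event.NON_DETERMINISTIC, [])
optimistic_trace = optimistic(Event.NON_DETERMINISTIC, [])
```

Both protocols share the same event-driven skeleton; swapping one entry in the table is enough to move from one variant to the other.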

Module Definitions
To define a protocol completely, it is necessary to instantiate a set of variables which specify, for instance, the set of non-deterministic events, the form of their determinants, the implementation of stable storage, etc. Egida identifies a set of building blocks which, when incorporated into the protocol structure, yield different rollback-recovery protocols.

Architecture

Synthesizing Protocols through Module Composition
Egida allows the co-existence of multiple implementations for each of the modules. To synthesize a protocol, a specific implementation of each module must be selected. Egida maintains a binding between the values for the modules and their corresponding implementations. Therefore, synthesizing a protocol requires processing the specification along with the binding information to initialize the modules to their appropriate implementations.
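The binding between module values and implementations can be pictured as a registry that is consulted when a specification is processed. This is a hypothetical sketch, not Egida's actual API: the module names (`checkpointing`, `logging`), the implementation classes, and the `synthesize` function are all assumptions.

```python
class FullCheckpoint:
    """Hypothetical implementation: save the whole state every time."""
    def snapshot(self, state, previous):
        return dict(state)


class IncrementalCheckpoint:
    """Hypothetical implementation: save only entries changed since `previous`."""
    def snapshot(self, state, previous):
        return {k: v for k, v in state.items()
                if k not in previous or previous[k] != v}


class SyncLog:
    """Hypothetical implementation: every record reaches 'stable' storage at once."""
    def __init__(self):
        self.stable = []

    def append(self, record):
        self.stable.append(record)


class AsyncLog:
    """Hypothetical implementation: records are buffered until flush() is called."""
    def __init__(self):
        self.buffer = []
        self.stable = []

    def append(self, record):
        self.buffer.append(record)

    def flush(self):
        self.stable.extend(self.buffer)
        self.buffer.clear()


# Binding from (module, value) to the class that implements it.
BINDINGS = {
    "checkpointing": {"full": FullCheckpoint, "incremental": IncrementalCheckpoint},
    "logging": {"sync": SyncLog, "async": AsyncLog},
}


def synthesize(spec):
    """Instantiate one implementation per module, as selected by `spec`."""
    missing = set(BINDINGS) - set(spec)
    if missing:
        raise ValueError(f"no implementation selected for: {sorted(missing)}")
    protocol = {}
    for module, value in spec.items():
        if module not in BINDINGS or value not in BINDINGS[module]:
            raise ValueError(f"no binding for {module}={value!r}")
        protocol[module] = BINDINGS[module][value]()
    return protocol


protocol = synthesize({"checkpointing": "incremental", "logging": "sync"})
delta = protocol["checkpointing"].snapshot({"a": 1, "b": 2}, previous={"a": 1})
```

Adding a new implementation means registering one more class in the binding table; existing specifications keep working unchanged.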

Advantages
Promotes extensibility and flexibility by allowing multiple implementations of each of the core functionalities.
Facilitates rapid implementation of rollback-recovery protocols with minimal programming effort by gluing together objects from the available library of building blocks.
Enables designers of fault-tolerance protocols to develop new rollback-recovery protocols by combining different implementations of the core functionalities in novel ways.

AQuA: An Adaptive Architecture that provides dependable distributed objects

Overview
AQuA allows distributed applications to request and obtain a desired level of availability using a QuO contract through a property manager. Fault tolerance in AQuA is provided by Proteus, which dynamically manages the replication of distributed objects to make them dependable.

Background
Ensemble group communication system
1. ensures reliable communication between groups of processes,
2. ensures atomic delivery of multicasts to groups with changing membership,
3. detects and excludes from the group members that fail by crashing.
The Ensemble protocol stacks used in AQuA provide inter-process communication based on virtual synchrony. Both total and causal multicast are used in the AQuA group structure, resulting in a total order of delivered messages between different groups of replicated objects. Ensemble detects process failures through dummy “I am alive” messages.
Maestro
Object-oriented interface to Ensemble. Object-oriented applications can be written by deriving from Maestro classes that provide reliable communication.

Background (cont)
Quality Objects (QuO)
1. transmits applications’ availability requirements to Proteus, which attempts to configure the system to achieve the desired availability,
2. provides an adaptation mechanism that is used when Proteus is unable to provide a specified level of availability.
QuO allows distributed object-oriented applications to specify dynamic QoS requirements at the application level, using the notion of a “contract”: a finite state machine specifying actions to be taken based on the state of the distributed system and the desired requirements of the application.
The goal of QuO is to develop a common middleware framework, based on distributed object computing, that can manage and integrate non-functional system properties such as network resource constraints, availability requirements and security needs.
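The notion of a contract as a finite state machine can be sketched as follows. This is not QuO's contract description language or API; the class `AvailabilityContract`, its two states, and the action names are hypothetical, and the "state of the distributed system" is reduced to a single number, the count of live replicas.

```python
class AvailabilityContract:
    """Hypothetical contract: a two-state machine driven by the live replica count."""

    def __init__(self, required_replicas):
        self.required = required_replicas   # the application's requirement
        self.state = "SATISFIED"
        self.actions = []                   # actions taken on state transitions

    def observe(self, live_replicas):
        """Feed in the observed system state and take an action on a transition."""
        if live_replicas >= self.required:
            new_state = "SATISFIED"
        else:
            new_state = "VIOLATED"
        if new_state != self.state:
            if new_state == "VIOLATED":
                # The requested availability cannot be met: let the application adapt.
                self.actions.append("notify_application")
            else:
                self.actions.append("resume_normal_operation")
            self.state = new_state
        return self.state


contract = AvailabilityContract(required_replicas=3)
contract.observe(3)   # requirement met, no transition
contract.observe(2)   # a replica is lost: SATISFIED -> VIOLATED
contract.observe(2)   # still violated, no new action
contract.observe(3)   # replica restored: VIOLATED -> SATISFIED
actions_taken = contract.actions
```

Actions fire only on transitions between states, so the application is notified once when the requirement stops being met rather than on every observation.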

Background (cont)
Proteus
♠ dependability manager: replicated; consists of an advisor and a protocol coordinator
♠ handlers: implement voters and monitors in the gateway
♠ object factories: implemented on each host

AQuA Architecture Overview

Group Structure in AQuA
AQuA uses one general object model, based on interactions between objects that can be replicated. Objects can initiate requests (acting as clients) and respond to requests (acting as servers).
Replication groups
Connection groups
PCS (Proteus Communication Service) group
Point-to-point groups

Fault Tolerance in AQuA
Fault model: crash failures, value faults, time faults
Error detection: Proteus, voter, monitor
Fault treatment: Proteus manager, advisor
A crash failure occurs when an object stops sending out messages and its internal state is lost. The crash failure of an object is due to the crash failure of at least one element composing the object. In that case the process is killed, along with its gateway process.
A value fault occurs when a message arrives in time but contains the wrong content. Proteus deals with it.
A time fault is either a delay fault or an omission fault. The handler in each gateway tolerates it.
The mechanism used by Proteus to detect crash failures is based on the detection mechanism implemented in Ensemble. Value faults are detected by voters, which are implemented in the gateway. Time faults are detected by monitors, which record information about message times and omissions.
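Detecting value faults by voting can be sketched with a simple majority voter over the replies of a replication group. This is an illustration under assumptions, not the voter implemented in the AQuA gateway: the function `vote`, the replica names, and the strict-majority rule are all hypothetical choices.

```python
from collections import Counter


def vote(replies):
    """Return (majority value, replicas that disagreed with it).

    `replies` maps a replica name to the value it returned. Values must be
    hashable. Raises ValueError if no value has a strict majority.
    """
    if not replies:
        raise ValueError("no replies to vote on")
    value, count = Counter(replies.values()).most_common(1)[0]
    if count * 2 <= len(replies):
        raise ValueError("no majority among replies")
    # Replicas whose reply differs from the majority are suspected of a value fault.
    suspects = sorted(name for name, reply in replies.items() if reply != value)
    return value, suspects


result, suspects = vote({"r1": 42, "r2": 42, "r3": 41})
```

With three replicas a strict majority masks one faulty reply. Under this rule, masking f simultaneous value faults requires at least 2f + 1 replies.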
