Middleware for Fault Tolerant Applications


Middleware for Fault Tolerant Applications
Lihua Xu and Sheng Liu
June 5, 2003

Outline
- Basic technologies in fault tolerance
- Middleware for fault tolerant applications
  - Egida
  - AQuA

Why Fault Tolerance?
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
Leslie Lamport, May 1987

Basic Technologies in Fault Tolerant Distributed Systems
- Hardened hardware component technologies
- Fault detection and membership maintenance
- Log-based scheme and checkpointing

Hardened hardware component technologies
- Hardened processor modules: pair of self-checking processors (PSP)
- RAID (redundant array of inexpensive disks): popular even in database-centric business computing applications

Fault detection and membership maintenance
Fault detection
- Timeout
- Comparison of the results of repeated or redundant executions
- Error-detection and error-correction codes
- Acceptance test: tests the reasonableness of intermediate computation results
Membership maintenance
- Simplest version: the master node makes a periodic roll call of the other nodes
- Heartbeat message exchange
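The timeout and heartbeat ideas above can be combined into a very small failure detector. The following is a minimal sketch, not taken from the slides: the class name `HeartbeatDetector`, its methods, and the node names are all hypothetical, and time is passed in explicitly so that the behaviour is deterministic.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat arrived within `timeout` time units.

    Hypothetical illustration; names and interface are assumptions.
    """

    def __init__(self, nodes, timeout, now=0.0):
        self.timeout = timeout
        # Treat every node as heard from at start-up time.
        self.last_seen = {node: now for node in nodes}

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

    def members(self, now):
        """Nodes still considered part of the group."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)


detector = HeartbeatDetector(["a", "b", "c"], timeout=3.0)
detector.heartbeat("a", now=2.0)
detector.heartbeat("b", now=2.5)
print(detector.suspected(now=4.0))  # ['c']: silent since time 0.0
print(detector.members(now=4.0))    # ['a', 'b']
```

A timeout-based detector like this can only suspect a node: a slow node and a crashed node look the same to it.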

Log-based scheme and checkpointing
- Log-based schemes record, on stable storage, information describing all the modifications made by a transaction to the data it accessed.
- Checkpointing periodically saves the system state to stable storage, so that recovery after a crash only has to replay the log written since the last checkpoint. This minimizes the time taken to recover.
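How a log and a checkpoint work together during recovery can be sketched as follows. This is an illustrative toy under stated assumptions: the class `Store` and its methods are hypothetical, and plain Python attributes stand in for stable storage.

```python
class Store:
    """Toy key-value store with a log and a checkpoint (hypothetical)."""

    def __init__(self):
        self.data = {}        # volatile state, lost on a crash
        self.checkpoint = {}  # stand-in for a checkpoint on stable storage
        self.log = []         # stand-in for the log on stable storage

    def write(self, key, value):
        self.log.append((key, value))  # record the modification first
        self.data[key] = value

    def take_checkpoint(self):
        self.checkpoint = dict(self.data)
        self.log.clear()  # older log records are no longer needed

    def crash(self):
        self.data = {}

    def recover(self):
        """Restore the checkpoint, then replay the log written after it."""
        self.data = dict(self.checkpoint)
        for key, value in self.log:
            self.data[key] = value
        return len(self.log)  # number of log records replayed


store = Store()
store.write("x", 1)
store.write("y", 2)
store.take_checkpoint()
store.write("x", 3)
store.crash()
replayed = store.recover()
print(store.data, replayed)  # {'x': 3, 'y': 2} 1
```

Without the checkpoint, recovery would have to replay all three log records instead of one.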

Middleware for Fault Tolerant Applications

Egida
Egida is an object-oriented toolkit designed to support transparent rollback recovery for low-overhead fault tolerance.

Log-based rollback recovery protocols
- Log information is recorded on stable storage during failure-free execution.
- That information is used to recover after a failure.
- The protocols come in a set of variants, including checkpointing and message logging.

Checkpointing

Message Logging
- Pessimistic logging allows processes to communicate only from recoverable states.
- Optimistic logging allows processes to communicate with other processes even from states that are not yet recoverable.
- Causal logging allows the possibility that a state from which a process communicates may become unrecoverable because of a failure, but only if no correct process depends on that state.
A correct process is one that exhibits no failures at any point in the execution under consideration. So a process that crashes at some point is “non-failed” up to that point, but it is not “correct”, even before the crash.
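The difference between pessimistic and optimistic logging can be illustrated with a toy receiver. This is a sketch under simplifying assumptions, not an implementation of any real protocol: `LoggingProcess` and its methods are hypothetical, lists stand in for stable and volatile storage, and causal logging is not modelled.

```python
class LoggingProcess:
    """Toy receiver that logs incoming messages (hypothetical names)."""

    def __init__(self, pessimistic):
        self.pessimistic = pessimistic
        self.stable = []     # stand-in for stable storage, survives a crash
        self.volatile = []   # in-memory buffer, lost on a crash
        self.delivered = []  # messages the application has seen

    def receive(self, msg):
        if self.pessimistic:
            # Log synchronously before the message can affect the state.
            self.stable.append(msg)
        else:
            # Buffer now; write to stable storage later.
            self.volatile.append(msg)
        self.delivered.append(msg)

    def flush(self):
        """Asynchronous write of the buffered log to stable storage."""
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash_and_recover(self):
        """Lose volatile data, then replay what stable storage holds."""
        self.volatile.clear()
        self.delivered = list(self.stable)
        return self.delivered


pessimistic = LoggingProcess(pessimistic=True)
pessimistic.receive("m1")
pessimistic.receive("m2")
print(pessimistic.crash_and_recover())  # ['m1', 'm2']

optimistic = LoggingProcess(pessimistic=False)
optimistic.receive("m1")
optimistic.flush()
optimistic.receive("m2")                # not yet on stable storage
print(optimistic.crash_and_recover())   # ['m1']
```

In the optimistic run, the state reached after `m2` was never recoverable. Any process that had already received a message sent from that state would have to roll back as well.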

Deconstructing Log-Based Rollback-Recovery Protocols
The diversity of rollback-recovery protocols reflects the heterogeneity in the requirements of applications. Beneath this diversity lies a simple event-driven structure that all these protocols share: they are all interested in the same set of “relevant” events.

Relevant Events
- Non-deterministic events: events whose outcome may change for different executions of the same program.
- Dependency-generating events: events that can increase the number of processes that depend on the non-deterministic events executed by a process.
- Output-commit events: events that can make the external environment depend on the non-deterministic events executed by a process.
- Checkpointing events: events that instruct the protocols to write to stable storage the state of one or more processes.
- Failure-detection events: events that are generated on detecting the failure of one or more processes.

A Simple Language Specifying Rollback-recovery Protocols
A protocol is defined in terms of the actions it takes in response to non-deterministic events, dependency-generating events, output-commit events, checkpointing events and failure-detection events. Implementing a specific protocol amounts to selecting the set of actions performed in response to each relevant event. A simple language is used to specify the rollback-recovery protocols.
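The idea that a protocol is just a choice of actions per relevant event can be sketched as a lookup table. This is not Egida's specification language, whose syntax the slides do not show. The enum `Event`, the function `run`, and the action names are all hypothetical.

```python
from enum import Enum, auto


class Event(Enum):
    """The five kinds of relevant events named in the slides."""
    NON_DETERMINISTIC = auto()
    DEPENDENCY_GENERATING = auto()
    OUTPUT_COMMIT = auto()
    CHECKPOINT = auto()
    FAILURE_DETECTION = auto()


# A protocol is a mapping from each event to the actions taken in response.
# The action names below are made up for illustration.
PESSIMISTIC_LOGGING = {
    Event.NON_DETERMINISTIC: ["log_determinant_to_stable_storage"],
    Event.CHECKPOINT: ["write_checkpoint"],
    Event.FAILURE_DETECTION: ["restore_checkpoint", "replay_log"],
}


def run(protocol, events):
    """Return the sequence of actions the protocol takes for `events`."""
    trace = []
    for event in events:
        # Events without an entry trigger no action in this protocol.
        trace.extend(protocol.get(event, []))
    return trace


trace = run(PESSIMISTIC_LOGGING,
            [Event.NON_DETERMINISTIC, Event.OUTPUT_COMMIT,
             Event.FAILURE_DETECTION])
print(trace)
```

Swapping in a different table yields a different protocol while the event loop in `run` stays the same, which is the shared structure the slides describe.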

Module Definitions
To define a protocol completely, it is necessary to instantiate a set of variables which specify, for instance, the set of non-deterministic events, the form of their determinants, the implementation of stable storage, etc. Egida identifies a set of building blocks which, when incorporated into the protocol structure, yield different rollback-recovery protocols.

Architecture

Synthesizing Protocols through Module Composition
Egida allows the co-existence of multiple implementations for each of the modules. To synthesize a protocol, a specific implementation of each module must be selected. Egida maintains a binding between the values for the modules and their corresponding implementations. Therefore, synthesizing a protocol requires processing the specification along with the binding information to initialize the modules to their appropriate implementations.
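The binding between module values and implementations can be pictured as a registry that is consulted while processing a specification. The sketch below is an assumption-laden illustration, not Egida's API: `BINDINGS`, `synthesize`, and all class names are hypothetical.

```python
class InMemoryStableStorage:
    """Hypothetical stable-storage module that keeps records in a list."""

    def __init__(self):
        self.records = []

    def save(self, record):
        self.records.append(record)


class IndependentCheckpointing:
    """Hypothetical checkpointing module (placeholder)."""


class CoordinatedCheckpointing:
    """Hypothetical checkpointing module (placeholder)."""


# Binding: module name -> specification value -> implementation class.
BINDINGS = {
    "stable_storage": {"memory": InMemoryStableStorage},
    "checkpointing": {
        "independent": IndependentCheckpointing,
        "coordinated": CoordinatedCheckpointing,
    },
}


def synthesize(spec):
    """Instantiate one implementation per module named in `spec`."""
    modules = {}
    for module, value in spec.items():
        implementations = BINDINGS.get(module, {})
        if value not in implementations:
            raise ValueError(f"no implementation bound to {module}={value!r}")
        modules[module] = implementations[value]()
    return modules


protocol = synthesize({"stable_storage": "memory",
                       "checkpointing": "coordinated"})
print(type(protocol["checkpointing"]).__name__)  # CoordinatedCheckpointing
```

Adding a new implementation only requires a new entry in the registry. Existing specifications and the synthesis step are untouched.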

Advantages
- Promotes extensibility and flexibility by allowing multiple implementations of each of the core functionalities.
- Facilitates rapid implementation of rollback-recovery protocols with minimal programming effort by gluing together objects from the available library of building blocks.
- Enables designers of fault-tolerance protocols to develop new rollback-recovery protocols by combining different implementations of the core functionalities in novel ways.

AQuA: An Adaptive Architecture that provides dependable distributed objects

Overview
- AQuA allows distributed applications to request and obtain a desired level of availability using a QuO contract through a property manager.
- Fault tolerance in AQuA is provided by Proteus, which dynamically manages the replication of distributed objects to make them dependable.

Background
Ensemble group communication system
1. ensures reliable communication between groups of processes,
2. ensures atomic delivery of multicasts to groups with changing membership,
3. detects and excludes from the group members that fail by crashing.
The Ensemble protocol stacks used in AQuA provide inter-process communication based on virtual synchrony. Both total and causal multicast are used in the AQuA group structure, resulting in a total order of delivered messages between different groups of replicated objects. Ensemble detects process failures through dummy “I am alive” messages.
Maestro
- Object-oriented interface to Ensemble
- Object-oriented applications can be written by deriving from Maestro classes that provide reliable communication.

Background (cont)
Quality Objects (QuO)
1. transmits applications’ availability requirements to Proteus, which attempts to configure the system to achieve the desired availability,
2. provides an adaptation mechanism that is used when Proteus is unable to provide a specified level of availability.
QuO allows distributed object-oriented applications to specify dynamic QoS requirements at the application level, using the notion of a “contract”: a finite state machine specifying actions to be taken based on the state of the distributed system and the desired requirements of the application. The goal of QuO is to develop a common middleware framework, based on distributed object computing, that can manage and integrate non-functional system properties such as network resource constraints, availability requirements and security needs.
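A contract in the sense described above, a finite state machine that triggers actions when the observed system state changes, might be sketched like this. It does not use QuO's own contract notation, which the slides do not show. The class `AvailabilityContract`, its two states, and the action names are hypothetical.

```python
class AvailabilityContract:
    """Two-state machine driven by the number of live replicas.

    Hypothetical illustration; not QuO's actual contract syntax.
    """

    def __init__(self, required_replicas):
        self.required = required_replicas
        self.state = "satisfied"
        self.actions = []  # actions triggered by state transitions

    def observe(self, live_replicas):
        """Feed a new observation and fire an action on a state change."""
        if live_replicas >= self.required:
            new_state = "satisfied"
        else:
            new_state = "violated"
        if new_state != self.state:
            if new_state == "violated":
                # Adaptation hook: requirement can no longer be met.
                self.actions.append("notify_application")
            else:
                self.actions.append("resume_normal_operation")
            self.state = new_state
        return self.state


contract = AvailabilityContract(required_replicas=3)
contract.observe(3)
contract.observe(2)   # a replica was lost
contract.observe(2)   # no transition, so no new action
contract.observe(3)   # replica restored
print(contract.state, contract.actions)
```

Actions fire only on transitions, so repeated observations of the same condition do not notify the application again.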

Background (cont)
Proteus
♠ dependability manager: replicated; consists of advisor and protocol coordinator
♠ handlers: implement voters and monitors in the gateway
♠ object factories: implemented on each host

AQuA Architecture Overview

Group Structure in AQuA
AQuA uses one general object model, based on interactions between objects that can be replicated. Objects can initiate requests (acting as clients) and respond to requests (acting as servers).
- Replication groups
- Connection groups
- PCS (Proteus Communication Service) group
- Point-to-point groups

Fault Tolerance in AQuA
- Fault model: crash failures, value faults, time faults
- Error detection: Proteus, voter, monitor
- Fault treatment: Proteus manager, advisor
Crash failure: occurs when an object stops sending out messages and its internal state is lost. The crash failure of an object is due to the crash failure of at least one element composing the object. Treatment: kill the process as well as the gateway process.
Value fault: occurs when a message arrives in time but contains the wrong content. Proteus deals with it.
Time fault: includes delay and omission faults. The handler in each gateway tolerates it.
The mechanism used by Proteus to detect crash failures is based on the detection mechanism implemented in Ensemble. Value faults are detected by voters, which are implemented in the gateway. Time faults are detected by monitors that record information about various times and omissions.
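The way a voter can detect value faults among replicated replies can be illustrated with simple majority voting. This is a generic sketch, not the AQuA voter: the function `vote` and the replica names are hypothetical, and replies are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def vote(replies):
    """Majority vote over replies from replicas.

    `replies` maps a replica name to the value it returned.
    Returns the majority value and the replicas that disagreed with it.
    Raises ValueError when no value has a strict majority.
    """
    counts = Counter(replies.values())
    value, count = counts.most_common(1)[0]
    if count <= len(replies) // 2:
        raise ValueError("no majority among replies")
    # Replicas that returned anything else are suspected of a value fault.
    faulty = sorted(r for r, v in replies.items() if v != value)
    return value, faulty


result, suspects = vote({"r1": 42, "r2": 42, "r3": 41})
print(result, suspects)  # 42 ['r3']
```

With three replicas a single value fault is outvoted. If every replica disagrees, the function raises instead of guessing.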