Abstractions for Fault Tolerance

Kris Malfettone, Adrian Dumchus

Fault Tolerant System
Difficult to design and understand
Must maintain control over failure-free behavior as well as failure behavior
Fault Tolerance
Behavior remains well defined when components fail
Masks component failures from user

Service
A service is a system behavior perceived by the user
Current service state is a summary of past behavior
Services with the same set of operations and allowed behaviors are of the same type
A service implementation carries out operations for a service and consists of one or more servers

Server
Encapsulates private state data using a set of procedures (instructions or methods)
Allows user to access and change a server’s state
Servers are local or centralized; services can be distributed

“Depends On” relation
Servers implement services using services implemented by other servers
Service U depends on service R if R is used in showing that server u, which implements U, is correct
Server u is a user (client) of r
r is a resource of u
Resources may depend on other resources, and so on
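The relation can be pictured as a directed graph from users to resources. A minimal Python sketch, assuming hypothetical server names (`file_server`, `disk_server`, `network_server`, `disk_driver`) that do not come from the slides, collects every resource a server depends on directly or indirectly:

```python
# Hypothetical "depends on" graph: each server maps to the resources it uses directly.
DEPENDS_ON = {
    "file_server": ["disk_server", "network_server"],
    "disk_server": ["disk_driver"],
    "network_server": [],
    "disk_driver": [],
}


def all_resources(server, graph):
    """Return every resource the server depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(server, []))
    while stack:
        resource = stack.pop()
        if resource not in seen:
            seen.add(resource)
            # Resources may themselves depend on other resources.
            stack.extend(graph.get(resource, []))
    return seen
```

A failure in any member of the returned set can affect the correctness of the server at the top.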

Failure Classification
Server is correct if input results in behavior consistent with service specification
Failure occurs when server does not behave according to specification

Types of Failures
Omission Failure - does not provide response to an input
Timing Failure - response is correct, but either early or late (performance failure)
Response Failure - responds incorrectly
  value failure
  state transition failure
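The three classes can be told apart by checking one observed response against its specification. The sketch below is an illustration only: the `classify` function, its parameters, and the use of `None` to stand for "no response" are assumptions, and it detects value failures but not state transition failures, which would require inspecting the server's state.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(response, elapsed, expected, earliest, latest):
    """Classify one observed response against a hypothetical specification.

    response: value returned by the server, or None if no response arrived
    elapsed:  seconds between the request and the response
    expected: value the specification requires
    earliest, latest: bounds of the time window the specification allows
    """
    if response is None:
        return Outcome.OMISSION
    if response != expected:
        return Outcome.RESPONSE  # wrong value
    if not (earliest <= elapsed <= latest):
        return Outcome.TIMING  # right value, but too early or too late
    return Outcome.CORRECT
```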

Types of Failures (cont.)
Crash Failure - after first omission failure, all input results in omission until restart
  amnesia crash - restarts in predefined init state independent of pre-crash inputs
  partial-amnesia crash - part of state is same as before crash, rest is init
  pause-crash - restarts in state prior to crash
  halting-crash - never restarts
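The four crash kinds differ only in the state the server holds after restarting. A small sketch, assuming a hypothetical dictionary-based server state and a made-up `INITIAL_STATE`, makes the difference concrete:

```python
import copy

INITIAL_STATE = {"counter": 0, "session": None}  # hypothetical predefined init state


def restart_state(pre_crash, kind, preserved_keys=()):
    """Return the state a server restarts in for a given crash failure kind."""
    if kind == "amnesia":
        # Everything learned before the crash is forgotten.
        return copy.deepcopy(INITIAL_STATE)
    if kind == "partial-amnesia":
        # Some keys survive the crash, the rest are reset.
        state = copy.deepcopy(INITIAL_STATE)
        for key in preserved_keys:
            state[key] = copy.deepcopy(pre_crash[key])
        return state
    if kind == "pause":
        # The server resumes exactly where it stopped.
        return copy.deepcopy(pre_crash)
    if kind == "halting":
        return None  # the server never restarts
    raise ValueError(f"unknown crash kind: {kind}")
```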

Failure Semantics
Specification should, in addition to including failure-free semantics, include all likely failure behaviors
The more failure behaviors allowed, the weaker the failure class
Arbitrary failure semantics: weakest, any failure behavior can occur for a service
The stronger the failure semantics, the more expensive and complex the server

Failure Masking
Server failures can be masked from users in two ways:
By higher level servers
By using server groups

Hierarchical masking
Failure at lower level may result in different type of failure at higher level
Preferable for servers to have semantics stronger than arbitrary (omission, crash, or performance)
Exception handling: a way to propagate failure information across abstraction levels and to mask lower-level failures
If masking fails at one level, failure information is propagated to the next level, where attempts to mask continue
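Exception-based masking can be sketched in a few lines. The names here (`OmissionFailure`, `read_block`, `read_with_masking`) and the idea of disks as dictionaries are assumptions for illustration, not part of the presentation:

```python
class OmissionFailure(Exception):
    """Raised when a server gives no answer to a request."""


def read_block(disk, block_id):
    """Lower-level server: fails with an omission if the block is unavailable."""
    if block_id not in disk:
        raise OmissionFailure(f"block {block_id} unavailable")
    return disk[block_id]


def read_with_masking(disks, block_id):
    """Higher-level server: tries each disk in turn to mask lower-level failures."""
    for disk in disks:
        try:
            return read_block(disk, block_id)
        except OmissionFailure:
            continue  # masking attempt: try the next resource
    # Masking failed at this level, so failure information propagates upward.
    raise OmissionFailure(f"block {block_id} unavailable on all disks")
```

The caller of `read_with_masking` sees a failure only when every lower-level attempt has failed, and then sees it as a single, well-defined exception.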

Group Masking
Ensures services remain available using a group of redundant, physically independent servers
If some fail, remaining servers provide the service and mask the original failure
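A minimal sketch of group masking, assuming crash failure semantics for each member and a hypothetical service that squares a number. The classes `Server` and `ServerGroup` are invented for this example:

```python
class CrashedError(Exception):
    """Raised by a server that has crashed and answers no requests."""


class Server:
    def __init__(self, name):
        self.name = name
        self.crashed = False

    def handle(self, x):
        if self.crashed:
            raise CrashedError(self.name)
        return x * x  # hypothetical service: squaring a number


class ServerGroup:
    """Presents a group of redundant servers as one service."""

    def __init__(self, servers):
        self.servers = servers

    def handle(self, x):
        for server in self.servers:
            try:
                return server.handle(x)
            except CrashedError:
                continue  # a surviving member masks the failure
        raise CrashedError("all group members have crashed")
```

With three members, the group keeps answering while at least one member is running.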

Understanding Fault-Tolerance
Basic goals of fault-tolerance:
Hide component failures as long as enough redundancy is available
Provide manageable failure behavior so that users can easily recover from failures

Commonly used Fault-Tolerant Services
Processors with crash failure semantics
Most processors provide amnesia-crash
Two main methods:
Error detecting codes
  More complex, less reliable, increased testing and design costs
Duplication and matching
  Provides better approximation of crash failure semantics
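Duplication and matching can be illustrated in software, although real processors do it in hardware. In this sketch the two "units" are plain Python callables and `ProcessorHalt` is a hypothetical exception standing in for a crash:

```python
class ProcessorHalt(Exception):
    """Raised to stop the processor instead of returning a doubtful result."""


def matched_execute(unit_a, unit_b, operand):
    """Run the same operation on two independent units and compare the results."""
    result_a = unit_a(operand)
    result_b = unit_b(operand)
    if result_a != result_b:
        # A mismatch means at least one unit is faulty, so halt (crash)
        # rather than risk delivering a wrong value.
        raise ProcessorHalt("results differ")
    return result_a
```

The point of the design is that a faulty unit produces a crash, which is easy for users to handle, in place of a wrong answer.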

Commonly used Fault-Tolerant Services (cont.)
Stable storage with omission failure semantics
Error detecting/correcting codes
  Ensure volatile and persistent storage
  Ensure read omission failure semantics assuming no operating system crash occurs
When the operating system crashes, a common method is to use mirrored persistent storage servers
  Failure by one server is masked by another server
  If masking attempt fails, omission failure reported
Higher level crashes (files, databases, etc.) use logging and recovery algorithms
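The combination of error detecting codes and mirroring can be sketched as follows. This assumes CRC-32 (from Python's standard `zlib` module) as the error detecting code and in-memory dictionaries as the two storage servers. `MirroredStore` and `ReadOmission` are hypothetical names.

```python
import zlib


class ReadOmission(Exception):
    """Raised when no copy of a block can be read correctly."""


class MirroredStore:
    """Two persistent storage servers, modelled here as in-memory dicts."""

    def __init__(self):
        self.replicas = [{}, {}]

    def write(self, key, data):
        # Store a checksum next to the data on every replica.
        record = (zlib.crc32(data), data)
        for replica in self.replicas:
            replica[key] = record

    def read(self, key):
        for replica in self.replicas:
            record = replica.get(key)
            if record is None:
                continue
            checksum, data = record
            if zlib.crc32(data) == checksum:
                return data  # a good copy masks any bad one
        # Masking failed: report an omission, never corrupt data.
        raise ReadOmission(key)
```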

Restartable servers
Mask failures of non-replicated servers by restarting them when they fail
Clients resend service requests until they eventually succeed
Requires servers to periodically checkpoint their state
When a server is restarted after a crash, it must re-execute all logged service requests since the last checkpoint
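Checkpointing plus log replay can be sketched with a toy server whose state is a running total. The class and method names are assumptions. In a real server the checkpoint and the log would live on stable storage, which this sketch only imitates with ordinary attributes.

```python
class RestartableCounter:
    """Hypothetical server whose state is a running total."""

    def __init__(self):
        self.total = 0        # volatile state, lost on a crash
        self.checkpoint = 0   # stands in for a stable copy of the state
        self.log = []         # stands in for a stable log of requests

    def add(self, amount):
        self.log.append(amount)  # log the request before applying it
        self.total += amount
        return self.total

    def take_checkpoint(self):
        self.checkpoint = self.total
        self.log.clear()  # earlier requests are now covered by the checkpoint

    def crash(self):
        self.total = None  # volatile state is gone

    def restart(self):
        self.total = self.checkpoint
        for amount in self.log:  # re-execute requests logged since the checkpoint
            self.total += amount
```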

Point-to-point communication services
Error correcting codes used at lowest, physical layer for masking
Negative or positive acknowledgements used at higher levels for masking
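Masking with positive acknowledgements amounts to resending until an acknowledgement arrives. The sketch below simulates this under stated assumptions: `channel` is a hypothetical callable that returns `True` when a transmission is acknowledged, and `lossy_channel` is a test helper that loses a fixed number of transmissions.

```python
def send_with_retries(channel, message, max_attempts=5):
    """Resend a message until the channel reports a positive acknowledgement.

    Returns the number of attempts used. If no acknowledgement arrives within
    max_attempts, the failure is reported to the caller instead of being masked.
    """
    for attempt in range(1, max_attempts + 1):
        if channel(message):
            return attempt  # loss masked by retransmission
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


def lossy_channel(drops):
    """Build a test channel that loses the first `drops` transmissions."""
    remaining = [drops]

    def channel(message):
        if remaining[0] > 0:
            remaining[0] -= 1
            return False  # message or acknowledgement lost
        return True

    return channel
```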

Distributed storage services
Clients may see omission and performance failures, but not more complex response failures
Atomic commit problem: goal is to ensure that a sequence of updates is either made permanent or entirely aborted
To ensure failure masking, uses dually persistent storage devices and hierarchical techniques to use alternate paths should the primary path fail
Communication failures are either masked by hierarchical techniques (retransmissions) or result in omission failures
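The slides state the atomic commit problem without naming a protocol. One widely used approach is two-phase commit, sketched here as an assumption, not as the presenters' method. The sketch ignores crashes of the coordinator or of participants between the two phases, which practical protocols handle with logging and timeouts. `Participant` and `atomic_commit` are hypothetical names.

```python
class Participant:
    """Hypothetical storage server taking part in a commit."""

    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.committed = {}
        self.pending = None

    def prepare(self, updates):
        # Phase 1: stage the updates and vote yes or no.
        if self.can_commit:
            self.pending = dict(updates)
        return self.can_commit

    def commit(self):
        # Phase 2, all voted yes: make the staged updates permanent.
        self.committed.update(self.pending)
        self.pending = None

    def abort(self):
        # Phase 2, someone voted no: discard the staged updates.
        self.pending = None


def atomic_commit(participants, updates):
    """Make `updates` permanent on every participant, or on none."""
    votes = [p.prepare(updates) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```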

Restartable arbitrary distributed services
Checkpointing is difficult because it must capture a global state made up of all local server states, as well as the state of communications between the servers

Replicated storage and servers
Use of server groups raises issues:
  How to maintain consistent state
  Must agree on group membership
  Must agree on order of service requests and state updates within group (atomic broadcast problem)
  How servers in group should communicate
  How to ensure required number of servers are running
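Agreement on update order is what keeps replica states consistent. One simple way to obtain a single order, assumed here for illustration and not taken from the slides, is a fixed sequencer that numbers every update. Replicas then apply updates strictly by number, whatever order they arrive in. The sketch ignores failures, including failure of the sequencer itself.

```python
class Sequencer:
    """Assigns consecutive sequence numbers to updates."""

    def __init__(self):
        self.counter = 0

    def stamp(self, update):
        seq = self.counter
        self.counter += 1
        return seq, update


class Replica:
    """Applies updates in sequence-number order, buffering early arrivals."""

    def __init__(self):
        self.state = []
        self.next_seq = 0
        self.held = {}  # updates that arrived ahead of their turn

    def deliver(self, seq, update):
        self.held[seq] = update
        # Apply every update whose turn has come.
        while self.next_seq in self.held:
            self.state.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

Two replicas that receive the same stamped updates in different network orders end up with the same state.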

In Conclusion
Fundamental concepts include notions of service, server, and the ‘depends upon’ relation
Concepts which capture the goals of fault-tolerant computing: failure semantics, hierarchical failure masking, and group failure masking
Mask failures when possible
Ensure system has clearly specified, manageable failure semantics when masking is not possible
