Abstractions for Fault Tolerance


Abstractions for Fault Tolerance
Kris Malfettone, Adrian Dumchus

Fault Tolerant System
  Difficult to design and understand
  Must maintain control over failure-free behavior as well as failure behavior
Fault Tolerance
  Behavior remains well defined when components fail
  Masks component failures from the user

Service
  A service is a system behavior perceived by the user
  The current service state is a summary of past behavior
  Services with the same set of operations and allowed behaviors are of the same type
  A service implementation carries out the operations of a service and consists of one or more servers

Server
  Encapsulates private state data behind a set of procedures (instructions or methods)
  Allows users to access and change the server’s state
  Servers are local or centralized; services can be distributed

“Depends On” relation
  Servers implement services using services implemented by other servers
  Service U depends on R if the correctness of R is used in showing that server u, which implements U, is correct
  Server u is a user (client) of r
  r is a resource of u
  Resources may depend on other resources, etc.
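
Because resources may themselves depend on other resources, the full set of servers that a given server relies on is the transitive closure of the relation. A minimal sketch, assuming the relation is stored as a dictionary; the server names `db`, `fs`, and `disk` are hypothetical and not taken from the slides:

```python
def all_dependencies(depends_on, server):
    """Return every resource that `server` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(server, ()))
    while stack:
        resource = stack.pop()
        if resource not in seen:
            seen.add(resource)
            # A resource may depend on further resources, so keep walking.
            stack.extend(depends_on.get(resource, ()))
    return seen


# Hypothetical example: a database uses a file system, which uses a disk.
depends_on = {"db": ["fs"], "fs": ["disk"], "disk": []}
print(sorted(all_dependencies(depends_on, "db")))
```

A failure of any server in the returned set can surface, possibly in a different form, as a failure of the server at the top.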

Failure Classification
  A server is correct if every input results in behavior consistent with the service specification
  A failure occurs when a server does not behave according to its specification

Types of Failures
  Omission failure: does not provide a response to an input
  Timing failure: response is correct, but arrives either early or late (a late response is a performance failure)
  Response failure: responds incorrectly
    Value failure: the value of the output is incorrect
    State transition failure: the state transition that takes place is incorrect
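
The classification can be read as a decision procedure over one observed response. The sketch below is illustrative only: it assumes a missing response is represented by `None`, that the specification supplies an expected value and a time window, and it does not model state transition failures. The names `Failure` and `classify` are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    RESPONSE = "response"


def classify(expected, actual, elapsed, earliest, latest):
    """Classify one observed response against its specification."""
    if actual is None:
        return Failure.OMISSION          # no response at all
    if actual != expected:
        return Failure.RESPONSE          # wrong value
    if not (earliest <= elapsed <= latest):
        return Failure.TIMING            # right value, too early or too late
    return Failure.NONE


print(classify("ok", "ok", 2.0, earliest=0.0, latest=1.0))
```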

Types of Failures (cont.)
  Crash failure: after the first omission failure, every input results in omission until restart
    Amnesia crash: restarts in a predefined initial state, independent of pre-crash inputs
    Partial-amnesia crash: part of the state is the same as before the crash, the rest is reset to the initial state
    Pause crash: restarts in the state it had before the crash
    Halting crash: never restarts
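
The four crash variants differ only in what the server's state looks like after a restart. A toy model makes the difference concrete; the class `CrashableServer`, its two state fields, and the choice of which field survives a partial-amnesia crash are all assumptions made for illustration.

```python
INIT = {"total": 0, "session": 0}


class CrashableServer:
    def __init__(self):
        self.state = dict(INIT)
        self.up = True

    def handle(self, amount):
        if not self.up:
            return None                  # crashed: every input is omitted
        self.state["total"] += amount
        self.state["session"] += amount
        return dict(self.state)

    def crash(self):
        self.up = False

    def restart(self, kind):
        if kind == "halting":
            return                       # never comes back
        if kind == "amnesia":
            self.state = dict(INIT)      # all pre-crash state is lost
        elif kind == "partial-amnesia":
            self.state["session"] = INIT["session"]  # only part is lost
        elif kind != "pause":
            raise ValueError(kind)       # "pause" keeps the state as it was
        self.up = True


server = CrashableServer()
server.handle(5)
server.crash()
server.restart("partial-amnesia")
print(server.state)
```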

Failure Semantics
  A specification should include, in addition to failure-free semantics, all likely failure behaviors
  The more failure behaviors allowed, the weaker the failure semantics
  Arbitrary failure semantics is the weakest: any failure behavior can occur for a service
  The stronger the failure semantics, the more expensive and complex the server
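
One way to picture "weaker" and "stronger" is set inclusion: a failure semantics is the set of failure behaviors a server is allowed to exhibit, and one semantics is weaker than another when it allows everything the other allows and more. The particular sets below are an illustrative assumption, not a definition taken from the slides.

```python
CRASH = frozenset({"crash"})
OMISSION = CRASH | {"omission"}
OMISSION_PERFORMANCE = OMISSION | {"performance"}
ARBITRARY = OMISSION_PERFORMANCE | {"early timing", "value", "state transition"}


def is_weaker_or_equal(a, b):
    """True if semantics `a` allows every failure behavior that `b` allows."""
    return a >= b


print(is_weaker_or_equal(ARBITRARY, CRASH))   # arbitrary allows everything crash does
print(is_weaker_or_equal(CRASH, OMISSION))    # crash allows less than omission
```

A client written to cope with the weaker set also copes with any stronger one, which is why stronger semantics make clients simpler and servers more expensive.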

Failure Masking
  Server failures can be masked from users in two ways:
    By higher-level servers
    By using server groups

Hierarchical masking
  A failure at a lower level may result in a different type of failure at a higher level
  Preferable for servers to have semantics stronger than arbitrary (omission, crash, or performance)
  Exception handling: a way to propagate failure information across abstraction levels and to mask lower-level failures
  If masking fails at one level, failure information is propagated to the next level, where attempts to mask continue
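
The exception-handling pattern described above can be sketched as a higher-level function that retries a lower-level read and, if masking does not succeed, raises its own exception to the next level. All names here (`StorageOmission`, `ServiceUnavailable`, `fetch_with_masking`, `make_flaky_reader`) are hypothetical, and retrying is just one possible masking strategy.

```python
class StorageOmission(Exception):
    """Lower-level failure: the storage server gave no answer."""


class ServiceUnavailable(Exception):
    """Higher-level failure reported when masking did not succeed."""


def fetch_with_masking(read, key, retries=3):
    for _ in range(retries):
        try:
            return read(key)             # success: the failure stays hidden
        except StorageOmission:
            continue                     # mask by trying again
    # Masking failed at this level: propagate a failure the caller understands.
    raise ServiceUnavailable(f"could not read {key!r} after {retries} attempts")


def make_flaky_reader(failures):
    """Build a reader that omits its first `failures` responses."""
    remaining = {"n": failures}

    def read(key):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise StorageOmission(key)
        return f"value-of-{key}"

    return read


print(fetch_with_masking(make_flaky_reader(2), "a"))
```

Note how the failure type changes across the level boundary: the caller sees a `ServiceUnavailable`, not the storage-level omission that caused it.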

Group Masking
  Ensures a service remains available using a group of redundant, physically independent servers
  If some fail, the remaining servers provide the service and mask the original failure
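
A minimal sketch of group masking, assuming crash failure semantics for the members (a member either answers correctly or not at all). The `Replica` class and `group_call` function are hypothetical names; real groups also need the membership and ordering agreements discussed later in the deck.

```python
class ServerCrashed(Exception):
    pass


class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def handle(self, request):
        if not self.alive:
            raise ServerCrashed(self.name)
        return request.upper()           # stand-in for the real service


def group_call(replicas, request):
    """Return the first answer from a live member; fail only if all are down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ServerCrashed:
            continue                     # another member masks this failure
    raise ServerCrashed("all replicas failed")


group = [Replica("r1", alive=False), Replica("r2"), Replica("r3")]
print(group_call(group, "ping"))
```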

Understanding Fault-Tolerance
  Basic goals of fault-tolerance:
    Hide component failures as long as enough redundancy is available
    Provide manageable failure behavior so that users can easily recover from failures

Commonly used Fault-Tolerant Services
  Processors with crash failure semantics
    Most processors provide amnesia-crash semantics
    Two main methods:
      Error detecting codes
        More complex, less reliable, increased testing and design costs
      Duplication and matching
        Provides a better approximation of crash failure semantics
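
Duplication and matching can be illustrated in software, although the slides refer to hardware processors: the same computation runs on two units, the results are compared, and any mismatch is turned into a stop rather than a wrong answer. The names `CrashFailure` and `duplicated` are hypothetical.

```python
class CrashFailure(Exception):
    """Raised instead of returning a result the two units disagree on."""


def duplicated(unit_a, unit_b, x):
    result_a = unit_a(x)
    result_b = unit_b(x)
    if result_a != result_b:
        # A response failure in one unit is converted into a crash.
        raise CrashFailure(f"mismatch: {result_a!r} != {result_b!r}")
    return result_a


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1                     # simulated fault in one unit


print(duplicated(square, square, 4))
```

The comparison only helps if the two units fail independently; a fault that affects both in the same way goes undetected.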

Commonly used Fault-Tolerant Services (cont.)
  Stable storage with omission failure semantics
    Error detecting/correcting codes
      Give volatile and persistent storage read omission failure semantics, assuming no operating system crash occurs
    To handle operating system crashes, a common method is to use mirrored persistent storage servers
      A failure of one server is masked by another server
      If the masking attempt fails, an omission failure is reported
    Higher-level services (files, databases, etc.) handle crashes using logging and recovery algorithms
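
Both ideas can be combined in a small sketch: a checksum turns a corrupted block into a read omission, and a mirror masks that omission when the other copy is intact. CRC-32 from Python's `zlib` module stands in for whatever error detecting code a real device uses; `Disk`, `MirroredStore`, and `ReadOmission` are hypothetical names.

```python
import zlib


class ReadOmission(Exception):
    pass


class Disk:
    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = (data, zlib.crc32(data))

    def read(self, addr):
        entry = self.blocks.get(addr)
        if entry is None:
            raise ReadOmission(addr)
        data, checksum = entry
        if zlib.crc32(data) != checksum:
            raise ReadOmission(addr)     # detected corruption: omit, never lie
        return data


class MirroredStore:
    def __init__(self):
        self.disks = [Disk(), Disk()]

    def write(self, addr, data):
        for disk in self.disks:
            disk.write(addr, data)

    def read(self, addr):
        for disk in self.disks:
            try:
                return disk.read(addr)
            except ReadOmission:
                continue                 # the mirror masks this disk's failure
        raise ReadOmission(addr)         # masking failed: report an omission


store = MirroredStore()
store.write(1, b"payload")
# Simulate silent corruption of the first copy (stored checksum is left stale).
store.disks[0].blocks[1] = (b"garbage", zlib.crc32(b"payload"))
print(store.read(1))
```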

Commonly used Fault-Tolerant Services (cont.)
  Restartable servers
    Mask failures of non-replicated servers by restarting them when they fail
    Clients resend service requests until they eventually succeed
    Requires servers to periodically checkpoint their state
    When a server is restarted after a crash, it must re-execute all logged service requests since the last checkpoint
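
The checkpoint-and-replay recovery described above can be sketched with a toy counter server. The sketch assumes the checkpoint and the request log live on stable storage and survive the crash, while the working state is volatile; the class `RestartableCounter` is hypothetical.

```python
class RestartableCounter:
    def __init__(self):
        self.checkpoint = 0              # assumed to be on stable storage
        self.log = []                    # requests since the last checkpoint
        self.state = 0                   # volatile working state

    def handle(self, amount):
        self.log.append(amount)          # log the request before applying it
        self.state += amount
        return self.state

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()                 # older requests are now covered

    def crash(self):
        self.state = None                # volatile state is lost

    def restart(self):
        self.state = self.checkpoint
        for amount in self.log:          # re-execute logged requests in order
            self.state += amount


server = RestartableCounter()
server.handle(5)
server.take_checkpoint()
server.handle(3)
server.handle(2)
server.crash()
server.restart()
print(server.state)
```

Replay is only safe here because re-executing the logged requests from the checkpoint is deterministic; a server with nondeterministic operations would need more care.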

Commonly used Fault-Tolerant Services (cont.)
  Point-to-point communication services
    Error correcting codes are used at the lowest (physical) layer for masking
    Negative or positive acknowledgements are used at higher levels for masking
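
A sketch of masking with positive acknowledgements: the sender retransmits until it sees an acknowledgement, and the receiver uses sequence numbers to discard duplicates caused by a lost acknowledgement. The network is simulated by a `drop_plan` list, which is purely an illustration device; `Receiver` and `send` are hypothetical names.

```python
class Receiver:
    def __init__(self):
        self.delivered = []
        self.seen = set()

    def on_packet(self, seq, payload):
        if seq not in self.seen:         # deliver each sequence number once
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq                       # acknowledge, even for duplicates


def send(receiver, seq, payload, drop_plan, max_tries=5):
    """drop_plan[i] says what is lost on try i: "data", "ack", or None."""
    for attempt in range(max_tries):
        lost = drop_plan[attempt] if attempt < len(drop_plan) else None
        if lost == "data":
            continue                     # packet lost: time out and retransmit
        ack = receiver.on_packet(seq, payload)
        if lost == "ack":
            continue                     # ack lost: sender retransmits anyway
        if ack == seq:
            return attempt + 1           # number of transmissions needed
    raise TimeoutError(f"no acknowledgement after {max_tries} tries")


receiver = Receiver()
tries = send(receiver, seq=0, payload="hello", drop_plan=["data", "ack"])
print(tries, receiver.delivered)
```

If the retry budget runs out, the communication failure is no longer masked and surfaces to the caller as an omission.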

Commonly used Fault-Tolerant Services (cont.)
  Distributed storage services
    Clients may see omission and performance failures, but not more complex response failures
    Atomic commit problem: the goal is to ensure that a sequence of updates is either made permanent or entirely aborted
    To ensure failure masking, uses dually persistent storage devices and hierarchical techniques that switch to alternate paths should the primary path fail
    Communication failures are either masked by hierarchical techniques (retransmissions) or result in omission failures
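
The slides state the atomic commit problem without naming a protocol. Two-phase commit is one well-known approach, sketched here in a single process with no crashes or timeouts modelled; `Participant` and `atomic_commit` are hypothetical names.

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.committed = {}              # stands in for durable state
        self.pending = None

    def prepare(self, updates):
        if not self.can_commit:
            return False                 # vote "no"
        self.pending = dict(updates)
        return True                      # vote "yes"

    def commit(self):
        self.committed.update(self.pending)
        self.pending = None

    def abort(self):
        self.pending = None


def atomic_commit(participants, updates):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare(updates) for p in participants]
    # Phase 2: commit everywhere only if all voted yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


nodes = [Participant(), Participant(can_commit=False)]
print(atomic_commit(nodes, {"x": 1}), [n.committed for n in nodes])
```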

Commonly used Fault-Tolerant Services (cont.)
  Restartable arbitrary distributed services
    The difficulty in checkpointing is capturing the global state: all local server states as well as the state of communications between the servers

Commonly used Fault-Tolerant Services (cont.)
  Replicated storage and servers
    Use of server groups raises issues:
      How to maintain consistent state
      Must agree on group membership
      Must agree on the order of service requests and state updates within the group (atomic broadcast problem)
      How servers in the group should communicate
      How to ensure the required number of servers are running
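
To see why agreeing on order matters, consider replicas that apply updates in whatever order the network delivers them: their states can diverge. One simple way to get a common order, chosen here for illustration and not prescribed by the slides, is a single sequencer that numbers every update; replicas then apply updates strictly by number. `Sequencer` and `OrderedReplica` are hypothetical names, and sequencer failure is not handled.

```python
import itertools


class Sequencer:
    def __init__(self):
        self._next = itertools.count()

    def order(self, message):
        return (next(self._next), message)


class OrderedReplica:
    def __init__(self):
        self.state = []
        self.buffer = {}
        self.expected = 0

    def receive(self, seq, message):
        self.buffer[seq] = message
        # Apply buffered updates only once every earlier one has been applied.
        while self.expected in self.buffer:
            self.state.append(self.buffer.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
updates = [sequencer.order(m) for m in ["a", "b", "c"]]

r1, r2 = OrderedReplica(), OrderedReplica()
for seq, message in updates:             # arrives in order at r1
    r1.receive(seq, message)
for seq, message in reversed(updates):   # arrives reversed at r2
    r2.receive(seq, message)
print(r1.state == r2.state, r1.state)
```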

In Conclusion
  Fundamental concepts include the notions of service, server, and the ‘depends on’ relation
  Concepts which capture the goals of fault-tolerant computing: failure semantics, hierarchical failure masking, and group failure masking
  Mask failures when possible
  Ensure the system has clearly specified, manageable failure semantics when masking is not possible