Fault Tolerance

Presentation transcript:

Slide 1: Fault Tolerance (V1.7)

Slide 2
A characteristic of distributed systems is that they tolerate partial failures within the system. This is one feature that distinguishes them from non-distributed systems. A distributed system must be able to recover from partial failures and continue to run in an acceptable way.

Slide 3: Basic Concepts
– Availability: the probability that the system is operating correctly at any given time
– Reliability: the length of time that a system can run without failure
– Safety: if part (or the whole) of a system fails, nothing catastrophic should happen
– Maintainability: how easy it is to repair a failed system

Slide 4: Failure Models
Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to incoming requests
  Receive omission           A server fails to receive incoming messages
  Send omission              A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure              The value of the response is wrong
  State transition failure   The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times

Slide 5: Failure Masking by Redundancy
If three replicated servers each have a mean time between failures of 10 days and are down for 12 hours on average when they fail, what is the availability of the service?

Slide 6: Failure Masking by Redundancy
If three replicated servers each have a mean time between failures of 10 days and are down for 12 hours on average when they fail, what is the availability of the service?
– Probability that any one server is unavailable: 12/(10*24), or 0.05
– Probability that all three servers are unavailable (assuming independent failures): 0.05^3, or 0.000125
– Probability that at least one server is available: 1 - 0.000125, or 0.999875
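The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes, as the slide implicitly does, that the servers fail independently and that the 12 hours of downtime fall within each 10-day period.

```python
def unavailability(mtbf_hours, downtime_hours):
    """Fraction of time a single server is down (downtime per failure period)."""
    return downtime_hours / mtbf_hours


def replicated_availability(p_down, replicas):
    """Probability that at least one of `replicas` independent servers is up."""
    return 1 - p_down ** replicas


p_down = unavailability(mtbf_hours=10 * 24, downtime_hours=12)
all_down = p_down ** 3
service_up = replicated_availability(p_down, replicas=3)

print(f"one server down:   {p_down:.6f}")
print(f"all three down:    {all_down:.6f}")
print(f"service available: {service_up:.6f}")
```

If failures are correlated (for example, all replicas share one power supply), the independence assumption breaks and the real availability will be lower than this estimate.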

Slide 7: Triple Modular Redundancy

Slide 8: Process Resilience
Design Issues
– Organise identical processes into a group
– Group membership may be dynamic
– Group membership should be hidden from clients
– How requests get to group members must be decided

Slide 9: Agreement in Faulty Systems
Two Army Problem
– Perfect processes, faulty communication (lost messages)
– Red army (1 x 5000) vs Blue army (2 x 3000)
– Blue 1 to Blue 2: “Attack at dawn?”
– Blue 2 to Blue 1: “OK”
– Blue 1 to Blue 2: “OK message received”
– etc., ad infinitum
Agreement between two processes in the face of unreliable communication cannot be guaranteed: the sender of the last message never knows whether it arrived.

Slide 10: Byzantine Generals Problem (1)
– Perfect communication, imperfect processes
– One red army, n blue armies (m traitorous generals)
– Communication by telephone (fully connected, point to point)
– Blue generals want to exchange troop strengths
– Traitorous generals are pathological liars

Slide 11: Byzantine Generals Problem (2)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

Slide 12: Byzantine Generals Problem (3)
In the final step each general looks for a majority value in each position of the vectors received; if there is no majority, that troop strength is marked unknown. Lamport et al. proved that in a system with m faulty processes, agreement can be reached only if there are at least 2m+1 correctly functioning processes, i.e. 3m+1 processes in total (more than two-thirds must be correct).
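The four steps described on these slides can be simulated in a short Python sketch. It is an illustration under stated assumptions, not a faithful protocol implementation: the function names are hypothetical, traitors lie with random values in the range 100 to 199 (chosen so a lie never coincides with a true strength), and message passing is replaced by direct list access.

```python
import random
from collections import Counter

UNKNOWN = None  # marks an entry for which no majority exists


def majority(values):
    """Return the value held by a strict majority of `values`, else UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN


def byzantine_agreement(strengths, traitors, rng):
    """Return {loyal general: result vector} after the four-step exchange."""
    n = len(strengths)

    def lie():
        return rng.randrange(100, 200)  # assumption: lies are random values

    # Steps 1 and 2: vectors[j][i] is what general j heard from general i.
    # A traitor tells every other general a different, arbitrary value.
    vectors = [
        [strengths[i] if (i == j or i not in traitors) else lie() for i in range(n)]
        for j in range(n)
    ]

    results = {}
    for j in range(n):
        if j in traitors:
            continue
        # Step 3: j receives a vector from every other general;
        # a traitor sends a vector of arbitrary values.
        received = [
            [lie() for _ in range(n)] if k in traitors else vectors[k]
            for k in range(n) if k != j
        ]
        # Step 4: take the majority in each position (j trusts its own strength).
        results[j] = [
            strengths[i] if i == j else majority([vec[i] for vec in received])
            for i in range(n)
        ]
    return results


# 3 loyal generals and 1 traitor (general 2), as on the slide.
print(byzantine_agreement([1, 2, 3, 4], traitors={2}, rng=random.Random(0)))
```

With four generals and one traitor, every loyal general recovers the true strength of every other loyal general. Calling the same function with three generals and one traitor leaves those entries as `UNKNOWN`, which matches the 3m+1 bound.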

Slide 13: Reliable Group Communication
Update messages often need to be sent reliably to a group of servers, e.g. replicated databases.
– Need to know who is in the group
– Need to ensure that every message sent gets to every member of the group

Slide 14: Basic Reliable Multicast System (1)
A weak multicast system may only require that all messages get delivered. This can be implemented simply by attaching a monotonically increasing message identifier to each message. Each receiver returns an acknowledgement for each message it receives.
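A minimal in-memory sketch of this scheme is shown below. The class and method names are hypothetical, message loss is simulated with a `lost_for` argument rather than a real network, and receivers are assumed never to fail. The sender keeps each message in a history buffer until every receiver has acknowledged it.

```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.next_seq = 0      # identifier of the next message expected
        self.delivered = []    # payloads delivered so far, in order

    def receive(self, seq, payload):
        """Deliver the message if it is the expected one; return the ACK."""
        if seq == self.next_seq:
            self.delivered.append(payload)
            self.next_seq += 1
        # The ACK carries the highest identifier delivered so far (-1 if none).
        return self.next_seq - 1


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.seq = 0                                   # monotonically increasing identifier
        self.history = {}                              # seq -> payload, kept until fully ACKed
        self.acked = {r.name: -1 for r in receivers}   # highest ACK seen per receiver

    def multicast(self, payload, lost_for=()):
        seq, self.seq = self.seq, self.seq + 1
        self.history[seq] = payload
        for r in self.receivers:
            if r.name in lost_for:
                continue                               # simulate a lost message
            self.acked[r.name] = max(self.acked[r.name], r.receive(seq, payload))
        self._trim()

    def retransmit(self):
        """Resend every message that some receiver has not yet acknowledged."""
        for r in self.receivers:
            for seq in range(self.acked[r.name] + 1, self.seq):
                self.acked[r.name] = max(self.acked[r.name], r.receive(seq, self.history[seq]))
        self._trim()

    def _trim(self):
        fully_acked = min(self.acked.values())
        for seq in [s for s in self.history if s <= fully_acked]:
            del self.history[seq]


a, b, c = Receiver("A"), Receiver("B"), Receiver("C")
sender = Sender([a, b, c])
sender.multicast("m0", lost_for={"B"})   # B misses m0
sender.multicast("m1")                   # B sees a gap and delivers nothing
sender.retransmit()                      # B now receives m0 and m1 in order
print(b.delivered, sender.history)
```

Note that every receiver returns one acknowledgement per message, so the feedback traffic grows with the size of the group.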

Slide 15: Basic Reliable Multicast System (2)
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.
a) Message transmission
b) Reporting feedback

Slide 16: Basic Reliable Multicast System (3)
– Not very scalable: with N processes, each message generates N-1 acknowledgement messages (feedback implosion)
– Could return only negative acknowledgements, but the sender is then forced to keep sent messages for an unbounded time
– Negative acks may be broadcast to further reduce the risk of feedback implosion
– Hierarchical approaches may also be used
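The negative-acknowledgement variant can be sketched as follows. As before, the names are hypothetical and the network is simulated: `lost` is a set of `(receiver index, message identifier)` pairs to drop. A receiver stays silent unless it detects a gap in the identifiers, and the sender never trims its history because it never learns that a message arrived.

```python
class NackReceiver:
    def __init__(self):
        self.next_seq = 0     # identifier of the next message to deliver
        self.pending = {}     # out-of-order messages held back behind a gap
        self.delivered = []

    def receive(self, seq, payload):
        """Accept a message; return the identifiers to NACK (empty means silence)."""
        if seq >= self.next_seq:
            self.pending[seq] = payload
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if self.pending:
            # Something is still held back, so earlier identifiers are missing.
            return [s for s in range(self.next_seq, max(self.pending))
                    if s not in self.pending]
        return []


def run(num_receivers, messages, lost):
    receivers = [NackReceiver() for _ in range(num_receivers)]
    history = {}   # the sender cannot discard anything in this scheme
    nacks = 0
    for seq, payload in enumerate(messages):
        history[seq] = payload
        for idx, receiver in enumerate(receivers):
            if (idx, seq) in lost:
                continue                              # simulate a lost message
            for missing in receiver.receive(seq, payload):
                nacks += 1                            # one NACK per missing message
                receiver.receive(missing, history[missing])
    return receivers, nacks, len(history)


receivers, nacks, kept = run(3, ["a", "b", "c", "d"], lost={(1, 1)})
print(nacks, kept, [r.delivered for r in receivers])
```

In this run a single NACK replaces the twelve positive acknowledgements that three receivers would send for four messages, at the price of a history buffer that only grows. A known weakness, visible in the sketch, is that the loss of the last message is never detected because no later message reveals the gap.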

Slide 17: Atomic Multicast
Attempts to ensure:
– Messages are delivered to all or none of the processes in the group
– Messages are delivered in the same order to every process
Several replicas of a database may exist. If one crashes, there must be a mechanism to deliver the messages it missed, in the right order.

Slide 18: Message Ordering
– Reliable unordered
– Reliable FIFO ordered: messages sent by the same process are delivered in the order in which they were sent
– Causally ordered: if message m1 could have caused message m2 to be sent, m1 must be delivered before m2
– Totally ordered: messages are delivered in the same order to all group members
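One common way to obtain causally ordered delivery is to stamp each message with a vector timestamp and hold it back until everything it depends on has been delivered. The slides do not say which mechanism they have in mind, so the sketch below is an assumption, and the class name and message format are hypothetical. A message from sender s with timestamp ts is deliverable at a process when ts[s] is exactly one more than the number of messages already delivered from s, and ts[k] does not exceed the local count for every other process k.

```python
class CausalProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n   # clock[k]: messages from process k seen here
        self.held = []         # messages waiting for their causal predecessors
        self.delivered = []    # payloads received from others, in delivery order

    def send(self, payload):
        """Stamp a new message; the caller hands it to the other processes."""
        self.clock[self.pid] += 1
        return (self.pid, list(self.clock), payload)

    def _deliverable(self, sender, ts):
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, message):
        self.held.append(message)
        progress = True
        while progress:                     # keep going while a held message unblocks
            progress = False
            for held in list(self.held):
                sender, ts, payload = held
                if self._deliverable(sender, ts):
                    self.held.remove(held)
                    self.clock[sender] = ts[sender]
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.send("m1")
p1.receive(m1)
m2 = p1.send("m2")     # m2 is sent after p1 saw m1, so m1 may have caused m2
p2.receive(m2)         # arrives first: held back, because m1 is still missing
p2.receive(m1)         # m1 is delivered, which then releases m2
print(p2.delivered)
```

FIFO ordering is the special case that only checks the sender's own entry; total ordering needs a different mechanism, such as a sequencer, and is not covered by this sketch.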