Last Class: Weak Consistency

Presentation transcript:

Last Class: Weak Consistency
- Eventual consistency and epidemic protocols
- Implementing consistency techniques
  - Primary-based
  - Replicated writes-based
  - Quorum protocols
CS677: Distributed OS

Gifford’s Quorum-Based Protocol
Three examples of the voting algorithm:
(a) A correct choice of read and write set
(b) A choice that may lead to write-write conflicts
(c) A correct choice, known as ROWA (read one, write all)
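The three cases can be checked against the two quorum constraints of Gifford's scheme: a read quorum N_R and a write quorum N_W over N replicas must satisfy N_R + N_W > N (no read-write conflicts) and N_W > N/2 (no write-write conflicts). The sketch below is only an illustration; the function name and the replica counts (12 replicas) are assumptions chosen to match the three cases, not values taken from the slide.

```python
def check_quorum(n, n_r, n_w):
    """Return (read_write_ok, write_write_ok) for a quorum configuration.

    n   -- total number of replicas
    n_r -- size of the read quorum
    n_w -- size of the write quorum
    """
    read_write_ok = n_r + n_w > n   # every read quorum overlaps every write quorum
    write_write_ok = 2 * n_w > n    # any two write quorums overlap
    return read_write_ok, write_write_ok


# Hypothetical configurations with 12 replicas (assumed for illustration).
examples = {
    "(a) correct choice": (12, 3, 10),
    "(b) write-write conflicts possible": (12, 7, 6),
    "(c) ROWA": (12, 1, 12),
}

for label, (n, n_r, n_w) in examples.items():
    rw_ok, ww_ok = check_quorum(n, n_r, n_w)
    print(f"{label}: N={n} N_R={n_r} N_W={n_w} "
          f"read-write ok={rw_ok} write-write ok={ww_ok}")
```

In case (b) two disjoint groups of six replicas could each accept a write, which is exactly the write-write conflict the second constraint rules out.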

Today: Fault Tolerance
- Basic concepts in fault tolerance
- Masking failure by redundancy
- Process resilience

Motivation
- Single-machine systems
  - Failures are all or nothing
  - OS crash, disk failures
- Distributed systems: multiple independent nodes
  - Partial failures are also possible (some nodes fail)
- Question: can we automatically recover from partial failures?
  - Important issue, since the probability of failure grows with the number of independent components (nodes) in the system
  - Prob(failure) = Prob(at least one component fails) = 1 - Prob(no component fails)
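To see how quickly the failure probability grows with system size, the formula can be evaluated numerically. This is a minimal sketch that assumes every node fails independently with the same probability p, which the slide does not state; the function name and the sample values are illustrative.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n components fails.

    Assumes each component fails independently with probability p,
    so Prob(no failure) = (1 - p) ** n.
    """
    return 1.0 - (1.0 - p) ** n


# Illustrative values: a 1% per-node failure probability.
for nodes in (1, 10, 100, 1000):
    print(f"{nodes:5d} nodes -> P(failure) = {prob_any_failure(0.01, nodes):.4f}")
```

Under these assumptions, a component that is individually quite reliable still makes a failure somewhere in the system likely once there are enough nodes.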

A Perspective
- Computing systems are not very reliable
  - OS crashes frequently (Windows), buggy software, unreliable hardware, software/hardware incompatibilities
- Until recently, computer users were “tech savvy”
  - Could depend on users to reboot and troubleshoot problems
- Growing popularity of the Internet/World Wide Web
  - “Novice” users
  - Need to build more reliable/dependable systems
- Example: what if your TV (or car) broke down every day?
  - Users don’t want to “restart” a TV or fix it (by opening it up)
- Need to make computing systems more reliable

Basic Concepts
- Need to build dependable systems
- Requirements for dependable systems
  - Availability: the system should be available for use at any given time
    - 99.999% availability (five 9s) => very small downtimes
  - Reliability: the system should run continuously without failure
  - Safety: temporary failures should not result in a catastrophe
    - Example: computing systems controlling an airplane or a nuclear reactor
  - Maintainability: a failed system should be easy to repair
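The "five 9s" figure can be turned into a concrete downtime budget. The sketch below assumes a 365-day year and treats availability as the fraction of time the system is up; the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability):
    """Allowed downtime in minutes per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, availability in (("99%", 0.99), ("99.9%", 0.999), ("99.999%", 0.99999)):
    minutes = downtime_minutes_per_year(availability)
    print(f"{label:>8} availability -> {minutes:9.2f} minutes of downtime per year")
```

Under this assumption, five 9s leaves a little over five minutes of downtime per year.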

Basic Concepts (contd)
- Fault tolerance: the system should provide services despite faults
  - Transient faults
  - Intermittent faults
  - Permanent faults

Failure Models
Different types of failures.

Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy
Triple modular redundancy.
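Triple modular redundancy runs three copies of a component and passes their outputs through a voter that forwards the majority value, so one faulty copy is masked. The following is a minimal sketch of such a voter; the replica functions and the injected fault are hypothetical and only serve to show the masking effect.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of the three replicas.

    Raises ValueError when all three outputs differ, because a single
    voter cannot mask more than one faulty replica.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: more than one replica appears faulty")


# Hypothetical replicas of the same computation; the third one is faulty.
def replica_ok(x):
    return x * x


def replica_faulty(x):
    return x * x + 1  # injected fault for illustration


result = vote(replica_ok(7), replica_ok(7), replica_faulty(7))
print(result)  # the faulty output is outvoted
```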

Process Resilience
- Handling faulty processes: organize several processes into a group
  - All processes perform the same computation
  - All messages are sent to all members of the group
  - A majority needs to agree on the results of a computation
  - Ideally want multiple, independent implementations of the application (to prevent identical bugs)
- Use process groups to organize such processes

Flat Groups versus Hierarchical Groups
Advantages and disadvantages?

Agreement in Faulty Systems
- How should processes agree on the results of a computation?
- K-fault tolerant: the system can survive k faults and still function
- Assume processes fail silently
  - Need (k+1) redundancy to tolerate k faults
- Byzantine failures: processes keep running even if sick
  - Produce erroneous, random or malicious replies
  - Byzantine failures are the most difficult to deal with
  - Need ? redundancy to handle Byzantine faults

Byzantine Faults
- Simplified scenario: two perfect processes with an unreliable channel
  - Need to reach agreement on a 1-bit message
- Two-army problem: two armies waiting to attack
  - Each army coordinates with a messenger
  - The messenger can be captured by the hostile army
  - Can the generals reach agreement?
  - Property: two perfect processes can never reach agreement in the presence of an unreliable channel
- Byzantine generals problem: can N generals reach agreement with a perfect channel?
  - M generals out of N may be traitors

Byzantine Generals Problem
Recursive algorithm by Lamport
Figure: The Byzantine generals problem for 3 loyal generals and 1 traitor.
(a) The generals announce their troop strengths (in units of 1 kilosoldiers).
(b) The vectors that each general assembles based on (a).
(c) The vectors that each general receives in step 3.

Byzantine Generals Problem Example
Figure: The same as in the previous slide, except now with 2 loyal generals and one traitor.
- Property: with m faulty processes, agreement is possible only if 2m+1 processes function correctly (3m+1 processes in total) [Lamport 82]
- Need more than two-thirds of the processes to function correctly
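The two scenarios can be reproduced with a small simulation of the vector-exchange steps described in the figure captions: announce strengths, assemble a vector, forward the vector, then take a per-element majority over the received vectors. This is a simplified sketch, not Lamport's full recursive algorithm. The traitor model (every lie is a fresh, distinct number), the general numbering, and all function names are assumptions made for illustration.

```python
from collections import Counter

UNKNOWN = None  # marks an element on which no majority exists


def majority(values):
    """Return the value held by a strict majority of `values`, else UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else UNKNOWN


def run_agreement(strengths, traitors):
    """Simulate the exchange; return each loyal general's decided vector.

    strengths -- dict mapping general id to its true troop strength
    traitors  -- set of general ids that lie in every message
    """
    generals = sorted(strengths)
    lie = iter(range(1000, 10**6))  # hypothetical: each lie is a new number

    # Steps 1-2: everyone announces a strength; each general assembles a vector.
    vectors = {g: {} for g in generals}
    for sender in generals:
        for receiver in generals:
            honest = sender == receiver or sender not in traitors
            vectors[receiver][sender] = strengths[sender] if honest else next(lie)

    # Step 3: every general forwards its vector to every other general.
    received = {g: [] for g in generals}
    for sender in generals:
        for receiver in generals:
            if sender == receiver:
                continue
            if sender in traitors:
                received[receiver].append({g: next(lie) for g in generals})
            else:
                received[receiver].append(dict(vectors[sender]))

    # Step 4: per element, take the majority over the received vectors.
    decided = {}
    for g in generals:
        if g in traitors:
            continue
        decided[g] = {
            other: strengths[g] if other == g
            else majority([vec[other] for vec in received[g]])
            for other in generals
        }
    return decided


four = run_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3})
three = run_agreement({1: 1, 2: 2, 3: 3}, traitors={3})
print("3 loyal + 1 traitor:", four)
print("2 loyal + 1 traitor:", three)
```

With four generals, every loyal general ends up with the same vector, in which only the traitor's entry is unknown. With three generals, a loyal general cannot even confirm the other loyal general's value, because the single honest report is not a majority of the two vectors it receives.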