
Replicated State Machines ITV Model-based Analysis and Design of Embedded Software Techniques and methods for Critical Software Anders P. Ravn Aalborg University September 2011

A simple State machine

Object-oriented:
    class SM {
      void method m_1(par_1) { ... OB.m(arg); ... }
      ...
    }

Message-oriented:
    process SM {
      (m, args) = getMessage();
      switch m { case m_1: ... sendMessage(OB, m, arg) ... }
    }

Note: Asynchronous communication, cf. Module 1
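The message-oriented form can be sketched in runnable code. This is only an illustration under assumptions: `Mailbox`, `CounterSM`, and the `add`/`reset`/`report` messages are hypothetical names, not part of the slides, and a plain in-process queue stands in for asynchronous message passing.

```python
from collections import deque


class Mailbox:
    """Unbounded FIFO buffer standing in for asynchronous message passing."""

    def __init__(self):
        self._queue = deque()

    def send(self, method, *args):
        # The sender never blocks: the message is only appended.
        self._queue.append((method, args))

    def get(self):
        # Returns the oldest message, or None when the buffer is empty.
        return self._queue.popleft() if self._queue else None


class CounterSM:
    """Hypothetical state machine whose state is a single integer."""

    def __init__(self, out):
        self.inbox = Mailbox()
        self.out = out  # mailbox of another object (OB in the slide)
        self.total = 0

    def step(self):
        """Process one message; return False if none was pending."""
        msg = self.inbox.get()
        if msg is None:
            return False
        method, args = msg
        if method == "add":
            self.total += args[0]
            self.out.send("report", self.total)
        elif method == "reset":
            self.total = 0
        return True


def run_demo():
    out = Mailbox()
    sm = CounterSM(out)
    sm.inbox.send("add", 2)
    sm.inbox.send("add", 3)
    while sm.step():
        pass
    return sm.total, [out.get(), out.get()]
```

Each call to `step` handles exactly one message and depends only on the current state and that message, which is what makes the machine deterministic and therefore replicable.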

Constraints Asynchronous message passing (unbounded buffering). Thus, for an implementation it must be proved that no buffer overflow can occur. No timing (delays, timeouts) in state machines. State machines are scheduled as a set of periodic or sporadic processes.

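Fault Tolerance Byzantine failures: SMs may fail in any way. Requires 2t+1 replicas to tolerate t failures, so that the correct output is the one produced by a majority. Fail-stop failures: Failing processors stop, and the stopped state is detectable. Only t+1 replicas are needed, since any replica that still produces output is correct.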
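The replica counts can be checked with a small sketch. It assumes that replica outputs are combined by a voter that accepts a value only when at least t+1 replicas produced it; the function names `replicas_needed` and `vote` are hypothetical.

```python
from collections import Counter


def replicas_needed(t, fail_stop):
    """Replicas required to tolerate t failures under each failure model."""
    return t + 1 if fail_stop else 2 * t + 1


def vote(outputs, t):
    """Return the value produced by at least t+1 replicas, else None.

    With 2t+1 replicas and at most t Byzantine ones, the t+1 correct
    replicas always agree, so their value reaches this threshold.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= t + 1 else None
```

For example, with t = 1 and outputs `[7, 7, 9]` the voter returns 7, while `[7, 8, 9]` gives no decision because more than t replicas must have failed.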

Agreement and Order Agreement: Every request message is received by every non-faulty processor. This requires reliable message passing; a fault in a particular link translates to a Byzantine failure for the receiving state machine. Order: Requests are processed in order. Requests sent from the same source cannot overtake each other. Cf. TCP and UDP in the Internet.
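The rule that requests from one source cannot overtake each other can be sketched with per-source sequence numbers, similar in spirit to how TCP restores order. This is an assumed mechanism for illustration; the slides do not prescribe it, and `FifoReceiver` is a hypothetical name.

```python
class FifoReceiver:
    """Delivers requests from each source in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # source -> next sequence number to deliver
        self.pending = {}    # source -> {seq: request} held back
        self.delivered = []  # (source, request) in delivery order

    def receive(self, source, seq, request):
        expected = self.next_seq.setdefault(source, 0)
        if seq < expected:
            return  # duplicate of an already delivered request
        held = self.pending.setdefault(source, {})
        held[seq] = request
        # Deliver every request that is now contiguous with the last one.
        while self.next_seq[source] in held:
            self.delivered.append((source, held.pop(self.next_seq[source])))
            self.next_seq[source] += 1
```

A request that arrives early is held back until its predecessors from the same source have been delivered; requests from different sources are not ordered relative to each other by this mechanism.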

Agreement IC1: Select a non-faulty transmitter. IC2: Ensure that the value sent by the transmitter is received by all other non-faulty processors. The difficult part is implementing a move of the transmitter, cf. token rings. Alternative: broadcasts.

Watch-dogs for Fail-stop Logical clock stability test
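A logical clock stability test can be sketched as follows. The sketch assumes the usual formulation: a request is stable once a request with a larger timestamp has been received from every client not known to have stopped, given FIFO channels and unique integer timestamps. The name `stable_requests` and the data layout are hypothetical.

```python
def stable_requests(pending, last_seen, failed):
    """Return the pending requests that are safe to process, in order.

    pending:   list of (timestamp, client, request) tuples
    last_seen: client -> highest timestamp received from that client
    failed:    set of clients detected as stopped (e.g. by a watch-dog)
    """
    live = [c for c in last_seen if c not in failed]
    stable = [
        entry for entry in pending
        # No live client can still send something with a lower timestamp.
        if all(last_seen[c] > entry[0] for c in live)
    ]
    return sorted(stable)  # timestamp order is the processing order
```

Detecting a stopped client matters here: without removing it from the live set, a silent client would block stability of every later request.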

Dynamic Configurations C – clients; S – state machines; O – output devices. This state machine could be the watch-dog.

Integration after repair Resynchronization by obtaining a checkpointed state from a replica. Alignment with received messages.
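The two steps, resynchronization and alignment, can be sketched as below. The sketch assumes that every request carries its index in the agreed processing order and that the repaired replica buffers requests while it waits for the checkpoint; `Replica`, `reintegrate`, and the integer state are hypothetical.

```python
class Replica:
    """Hypothetical replica whose state is the sum of applied requests."""

    def __init__(self):
        self.state = 0
        self.applied = 0  # number of requests applied so far

    def apply(self, request):
        self.state += request
        self.applied += 1

    def checkpoint(self):
        # A snapshot of the state plus how many requests it reflects.
        return {"state": self.state, "applied": self.applied}


def reintegrate(repaired, donor, buffered):
    """Bring a repaired replica up to date.

    buffered: list of (index, request) received during repair, in order.
    """
    snapshot = donor.checkpoint()
    # Resynchronization: adopt the checkpointed state.
    repaired.state = snapshot["state"]
    repaired.applied = snapshot["applied"]
    # Alignment: replay only requests the checkpoint does not cover.
    for index, request in buffered:
        if index >= repaired.applied:
            repaired.apply(request)
```

Skipping requests whose index lies below the checkpoint avoids applying them twice, which would otherwise make the repaired replica diverge.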

Perspective A general paradigm suitable for highly critical distributed processing. Fail-stop may be feasible for medium-level criticality. Both may become cost-efficient in a multi-core setting. Requires highly dependable hardware and kernel support.