
CSci8211: Distributed Systems: State Machines
Detour: Some Theory of Distributed Systems (Supplementary Materials)
• Replicated state machines
–Notion of processes and (stuttering) transitions
–Problems: processes may crash and then recover; consistency issues
–Assumption: there is nonvolatile (stable) storage that survives process crashes
• Liveness and safety properties
• Consensus

Highly Available Computing
High availability means either perfection or redundancy.
–The system can work even when some parts are broken.
The simplest redundancy is replication:
–Several copies of each part.
–Each non-faulty copy does the same thing.
Every computing system works as a state machine, so a replicated state machine can do highly available computing.

(Replicated) State Machines
If a state machine is deterministic, then feeding two copies the same inputs will produce the same outputs and states.
–We call each copy a process.
–So all we need is to agree on the inputs.
Examples:
–Replicated storage with Read(a) and Write(a, d) steps.
–Airplane flight control system with ReadInstrument(i) and RaiseFlaps(d) steps.
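The determinism argument can be illustrated with a minimal Python sketch of the replicated-storage example. The class name `StorageMachine`, the tuple encoding of commands, and the helper `run` are assumptions made for illustration; they are not from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class StorageMachine:
    """Deterministic state machine for storage with read/write steps (hypothetical)."""
    store: dict = field(default_factory=dict)

    def step(self, command):
        """Apply one command to the state and return its output."""
        op = command[0]
        if op == "write":
            _, addr, data = command
            self.store[addr] = data
            return "ok"
        if op == "read":
            _, addr = command
            return self.store.get(addr)
        raise ValueError(f"unknown command: {command!r}")


def run(machine, inputs):
    """Feed the whole input sequence to one copy and collect its outputs."""
    return [machine.step(cmd) for cmd in inputs]


# Two copies (processes) that are given the same inputs in the same order.
inputs = [("write", "a", 1), ("read", "a"), ("write", "a", 2), ("read", "a")]
replica_1, replica_2 = StorageMachine(), StorageMachine()
outputs_1 = run(replica_1, inputs)
outputs_2 = run(replica_2, inputs)
```

Because `step` depends only on the current state and the command, both copies end in the same state with the same outputs. The hard part, agreeing on `inputs`, is exactly what the sketch leaves out.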

Replicated State Machines & Consistency
• Problem: processes may crash and then recover.
• Assumption: each process has stable (nonvolatile) storage that survives crashes.
• Need to ensure consistency across all replicated processes:
–a read(x) operation at any process returns the same value;
–a write(x<-v) operation results in the same state change in all processes.
• E.g., use a lock and a two-phase commit (2PC) protocol.
[Figure: message labels write(x<-v), ok, read(x), x=v]
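As a rough sketch of the 2PC idea mentioned above, the following Python code applies a write to all replicas only if every replica votes yes in the prepare phase. The names `Replica` and `two_phase_write` are hypothetical. The `pending` field stands in for what would be kept in stable storage, and locking, timeouts, and recovery are omitted.

```python
class Replica:
    """One replicated process; `pending` stands in for stable storage (assumption)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.state = {}
        self.pending = None

    def prepare(self, key, value):
        """Phase 1: record the proposed write and vote yes, or vote no if down."""
        if not self.up:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        """Phase 2 (commit): apply the recorded write."""
        key, value = self.pending
        self.state[key] = value
        self.pending = None

    def abort(self):
        """Phase 2 (abort): discard the recorded write."""
        self.pending = None


def two_phase_write(replicas, key, value):
    """Coordinator: commit everywhere only if every replica voted yes."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False
```

Note the design consequence: a single unavailable replica makes the write abort, so this scheme keeps replicas consistent but is not itself fault tolerant.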

Two Types of Failures or Faults  Non-Byzantine  Failed nodes stop communicating with other nodes "Clean" failure Fail-stop behavior  Byzantine  Failed nodes will keep sending messages Incorrect and potentially misleading Failed node becomes a traitor Fail-stop faults/failures are typically what we assume in dealing with failures in distributed systems

State Machine Approach
A distributed system is:
–A finite set of processes
A process is:
–A set of states, with one initial state
–A set of events or actions
An execution is a possibly infinite sequence of alternating states and actions: $s_0\, \alpha_0\, s_1\, \alpha_1\, s_2\, \alpha_2 \cdots$

Properties
A stuttering transition has the form $s \to s$: the state does not change.
A property is a set of executions closed under stuttering [Abadi, Lamport 1990]
–The clock still ticks after a program terminates
–Stuttering is also useful in mapping between levels of abstraction
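To make "closed under stuttering" concrete, here is a small Python sketch over finite executions, represented only by their state sequences (a simplification; actions are dropped). The helper names are illustrative assumptions. A predicate counts as a property in the above sense only if it gives the same answer before and after stuttering steps are removed.

```python
from itertools import groupby


def destutter(states):
    """Collapse runs of repeated states, i.e. remove stuttering transitions s -> s."""
    return [state for state, _ in groupby(states)]


def never_decreases(states):
    """Stuttering-insensitive: repeating a state cannot create a decrease."""
    return all(a <= b for a, b in zip(states, states[1:]))


def takes_exactly_three_steps(states):
    """Stuttering-sensitive: counts transitions, so adding s -> s changes the answer."""
    return len(states) - 1 == 3


plain = [0, 1, 2, 3]
stuttered = [0, 0, 1, 1, 1, 2, 3]
```

Here `never_decreases` agrees on `plain` and `stuttered`, while `takes_exactly_three_steps` does not, so the latter would not qualify as a property under this definition.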

Safety Properties
Informally: a safety property is one that says something bad doesn’t happen.
Formally: a property P is a safety property iff:
–If $\sigma$ is in P, then every finite prefix of $\sigma$ is in P
Additionally,
–If $\sigma$ is not in P, then there is some finite prefix of $\sigma$ that is not in P (no extension of that prefix is in P)
–There is a point at which an illegal transition occurred
–Safety properties can be finitely refuted.
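"Finitely refuted" can be sketched in Python for the special case of a state invariant: a monitor scans the execution and reports the first position at which the bad thing has happened. The function name and the mutual-exclusion example are assumptions chosen for illustration.

```python
def first_violation(execution, is_bad_state):
    """Return the index of the first bad state, or None if the prefix seen so far is fine.

    A non-None result is a finite refutation: no extension of
    execution[: index + 1] can satisfy the invariant.
    """
    for index, state in enumerate(execution):
        if is_bad_state(state):
            return index
    return None


def two_in_critical_section(state):
    """Bad state for mutual exclusion: more than one process in the critical section."""
    return len(state) > 1


# Each state is the set of processes currently in the critical section.
good_run = [set(), {"p1"}, set(), {"p2"}, set()]
bad_run = [set(), {"p1"}, {"p1", "p2"}, {"p2"}]
```

For `bad_run` the monitor returns index 2, the point at which the illegal transition has occurred. A `None` result says only that no violation has been seen yet.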

Liveness Properties
Informally: a liveness property says something good eventually happens.
Formally: a property P is a liveness property iff:
–Every finite behavior is a prefix of some behavior in P
Additionally,
–Can always "complete" a finite behavior into one that is in P
–Liveness properties cannot be finitely refuted.

Safety and Liveness
Every property (i.e., every set of behaviors) is the conjunction of:
–A safety property and
–A liveness property
Due to Alpern and Schneider, based on basic results from topology.
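For reference, the standard definitions behind this decomposition can be written out as follows. The notation is an assumption, not from the slides: $S$ is the set of states, $S^*$ and $S^\omega$ are the finite and infinite sequences over $S$, $\sigma[..i]$ is the prefix of $\sigma$ of length $i$, and $\mathrm{cl}(P)$ is the topological closure of $P$.

```latex
% P \subseteq S^\omega is a property; \alpha ranges over finite behaviors,
% \sigma and \beta over infinite ones.
\begin{align*}
\text{Safety:}\quad & \forall \sigma \in S^\omega:\ \sigma \notin P \implies
    \exists i \ge 0:\ \forall \beta \in S^\omega:\ \sigma[..i]\,\beta \notin P \\
\text{Liveness:}\quad & \forall \alpha \in S^*:\ \exists \beta \in S^\omega:\ \alpha\beta \in P \\
\text{Closure:}\quad & \mathrm{cl}(P) = \{\sigma \in S^\omega :
    \forall i \ge 0:\ \exists \beta \in S^\omega:\ \sigma[..i]\,\beta \in P\} \\
\text{Decomposition:}\quad & P = \mathrm{cl}(P) \cap
    \bigl(P \cup (S^\omega \setminus \mathrm{cl}(P))\bigr)
\end{align*}
```

In the last line, $\mathrm{cl}(P)$ is the safety part and $P \cup (S^\omega \setminus \mathrm{cl}(P))$ is the liveness part. Since $P \subseteq \mathrm{cl}(P)$, their intersection is exactly $P$.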

Visible Behavior
A specification identifies a subset of its actions (or its state variables) as externally visible.
A state machine defines a set of allowable executions:
–state: a set of values, usually divided into named variables.
–actions: named changes in the state; internal and external.
Actions may be nondeterministic.
–In fact, Lampson encourages this in specs to allow flexibility in implementations.

Implements
Y implements X if
–every external behavior of Y is an external behavior of X.
This expresses the idea that Y implements X if you can’t tell Y apart from X by looking only at the external actions.
Examples: abstract data types, databases, distributed systems.
Note: Lampson implicitly deals with finite behaviors, and therefore states the liveness property separately. (He doesn’t treat liveness in the proofs.)
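For small finite machines, the "implements" relation can be approximated by a bounded check of trace inclusion. The Python sketch below is only that: it assumes the spec has no internal actions, explores executions up to `max_steps` actions, and uses a hypothetical one-place buffer as the example. All names are illustrative.

```python
def external_traces(transitions, initial, external, max_steps):
    """External-action traces of all executions with at most max_steps actions."""
    traces = set()
    frontier = {(initial, ())}
    for _ in range(max_steps + 1):
        traces.update(trace for _, trace in frontier)
        next_frontier = set()
        for state, trace in frontier:
            for action, target in transitions.get(state, []):
                # Internal actions change the state but leave the visible trace alone.
                visible = trace + (action,) if action in external else trace
                next_frontier.add((target, visible))
        frontier = next_frontier
    return traces


def implements(impl, spec, external, max_steps):
    """Bounded check: every external behavior of impl is one of spec."""
    return (external_traces(*impl, external, max_steps)
            <= external_traces(*spec, external, max_steps))


EXTERNAL = {"put", "get"}
# Spec X: a one-place buffer with only external actions.
spec = ({"empty": [("put", "full")], "full": [("get", "empty")]}, "empty")
# Y1 adds an internal "move" step between put and get.
staged_impl = ({"empty": [("put", "staged")],
                "staged": [("move", "full")],
                "full": [("get", "empty")]}, "empty")
# Y2 wrongly allows get on an empty buffer.
broken_impl = ({"empty": [("put", "full"), ("get", "empty")],
                "full": [("get", "empty")]}, "empty")
```

`staged_impl` passes because its internal step is invisible from outside, while `broken_impl` fails because the trace `("get",)` is not a behavior of the spec. A passing bounded check is evidence, not a proof.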

How to Build Highly Available Systems
Leslie Lamport’s idea of how to build highly available distributed systems:
–Every computing system works as a state machine
–So a replicated state machine can do highly available computing

A Simple Way to Build Consensus
Fault-tolerant consensus is expensive!
–Exclusive access (locking) is cheap, but not fault tolerant
–Leases are “fault-tolerant” locks: they time out
–Like ordinary locks, leases can be hierarchical; only the root lease needs to be granted by consensus
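A lease can be sketched as a lock with an expiry time. In the Python sketch below, the class `Lease` and its methods are hypothetical, and the current time is passed in explicitly so the behavior is deterministic. A real system would also need bounded clock skew between holder and grantor, which is ignored here.

```python
class Lease:
    """A lock that times out: it can be re-granted once the holder's time is up."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who, now):
        """Grant (or renew) the lease if it is free, expired, or already held by `who`."""
        if self.holder is None or now >= self.expires_at or self.holder == who:
            self.holder = who
            self.expires_at = now + self.duration
            return True
        return False

    def held_by(self, who, now):
        """True only while `who` holds an unexpired lease."""
        return self.holder == who and now < self.expires_at
```

If holder A crashes, nobody has to detect the crash: B simply waits until the lease expires and acquires it, which is what makes the lock "fault tolerant".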

What is Consensus?
Agreement on shared state (single system image)
Recovers from server failures autonomously
–Minority of servers fail: no problem
–Majority fail: lose availability, retain consistency
Key to building consistent storage systems
[Figure: servers]
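The minority/majority rule is simple arithmetic, sketched below in Python. The function names are assumptions for illustration. The last helper checks by brute force the fact that makes majorities useful: any two of them share at least one server.

```python
from itertools import combinations


def majority(n):
    """Smallest number of servers that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest number of failed servers that still leaves a majority."""
    return (n - 1) // 2


def is_available(n, failed):
    """The system stays available while the live servers form a majority."""
    return n - failed >= majority(n)


def majorities_intersect(n):
    """Brute-force check that every pair of majority quorums overlaps."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)
```

With five servers, two failures are tolerated and a third costs availability. Going from four servers to five raises the tolerated failures from one to two without changing the majority size of three.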

How Is Consensus Used?
–Top-level system configuration
–Replicate entire database state

Hierarchical Leases

Consensus & Synchronicity

Inside a Consistent System
Key challenge: eliminate the single point of failure. Options:
–An ad hoc algorithm: “This case is rare and typically occurs as a result of a network partition with replication lag.”
OR
–A consensus algorithm (built-in or library): Paxos, Raft, …
–A consensus service: ZooKeeper, etcd, consul, …

Consensus Algorithms: Requirements
Consensus algorithms typically satisfy the following properties:
• Safety: never return an incorrect result under any non-Byzantine failure
• Availability: remain available as long as a majority of the servers remain operational and can communicate with each other and with clients
• Robustness: do not depend on timing to ensure the consistency of states
• Responsiveness: commands typically complete as soon as a majority of servers have responded to a single round of remote procedure calls
–i.e., one or two slow servers should not impact overall system response times
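The responsiveness claim can be illustrated with a small Python calculation. It assumes, hypothetically, a leader-based protocol in which the leader counts its own vote immediately and sends one round of RPCs to all followers in parallel. The function name and the numbers are made up for illustration.

```python
def commit_latency(follower_response_ms):
    """Time until a majority of the cluster (leader included) has acknowledged.

    follower_response_ms holds one round-trip time per follower.
    """
    cluster_size = len(follower_response_ms) + 1   # followers plus the leader
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1                     # the leader's own vote is immediate
    if acks_needed == 0:
        return 0.0
    # The command completes when the acks_needed-th fastest follower replies.
    return sorted(follower_response_ms)[acks_needed - 1]


# Five servers: the leader plus four followers, two of which are very slow.
latency = commit_latency([10, 500, 12, 900])
```

Under these assumptions the command completes after 12 ms: the two slow followers (500 ms and 900 ms) never enter the critical path because the two fast ones plus the leader already form a majority.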

References
Specifications
–Lamport, A simple approach to specifying concurrent systems. Comm. ACM 32, 1, Jan. 1989.
Impossibility
–Fischer, Lynch, and Paterson, Impossibility of distributed consensus with one faulty process. J. ACM 32, 2, April 1985.
Paxos algorithm
–Lamport, The part-time parliament. Technical Report 49, Digital Equipment Corp., Palo Alto, Sep. 1989.
–Liskov and Oki, Viewstamped replication. Proc. 7th PODC, Aug. 1988.
State machines
–Lamport, Using time instead of timeout for fault-tolerant distributed systems. ACM TOPLAS 6, 2, April 1984.
–Schneider, Implementing fault-tolerant services using the state-machine approach: A tutorial. Computing Surveys 22, Dec. 1990.
Lampson’s talk
–Lampson, How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, LNCS 1151, Springer, 1996.
