Published by Ann Cox. Modified over 9 years ago.
1
CSci8211: Distributed Systems: State Machines
Detour: Some Theory of Distributed Systems (Supplementary Materials)
Replicated State Machines
– Notion of processes and (stuttering) transitions
– Problems: processes may crash and then recover; consistency issues
– Assumption: there is nonvolatile (stable) storage that survives process crashes
Liveness and Safety Properties
Consensus
2
Highly Available Computing
High availability means either perfection or redundancy.
– The system can work even when some parts are broken.
The simplest redundancy is replication:
– Several copies of each part.
– Each non-faulty copy does the same thing.
Every computing system works as a state machine, so a replicated state machine can do highly available computing.
3
(Replicated) State Machines
If a state machine is deterministic, then feeding two copies the same inputs will produce the same outputs and states.
– We call each copy a process.
– So all we need is to agree on the inputs.
Examples:
– Replicated storage with Read(a) and Write(a, d) steps.
– Airplane flight control system with ReadInstrument(i) and RaiseFlaps(d) steps.
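The determinism argument can be illustrated with a minimal sketch of the replicated storage example. The class name `StorageMachine`, the tuple encoding of commands, and the helper `run` are assumptions made for illustration; they are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class StorageMachine:
    """A deterministic state machine whose state maps addresses to data."""
    state: dict = field(default_factory=dict)

    def step(self, command):
        """Apply one command and return its output."""
        op, *args = command
        if op == "write":
            addr, data = args
            self.state[addr] = data
            return "ok"
        if op == "read":
            (addr,) = args
            return self.state.get(addr)
        raise ValueError(f"unknown command: {op}")


def run(machine, inputs):
    """Feed a sequence of inputs to a machine and collect its outputs."""
    return [machine.step(cmd) for cmd in inputs]


# The agreed-upon input sequence (hypothetical; agreeing on it is the hard part).
inputs = [("write", "a", 1), ("read", "a"), ("write", "a", 2), ("read", "a")]

replica_1, replica_2 = StorageMachine(), StorageMachine()
outputs_1 = run(replica_1, inputs)
outputs_2 = run(replica_2, inputs)
```

Because `step` depends only on the current state and the command, both replicas end in the same state with the same outputs. The sketch says nothing about how the replicas come to agree on `inputs`.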
4
Replicated State Machines & Consistency
Problem: processes may crash and then recover.
Assumption: each process has stable (nonvolatile) storage that survives crashes.
We need to ensure consistency across all replicated processes:
– a read(x) operation at any process returns the same value;
– a write(x<-v) operation results in the same state change at all processes.
E.g., use a lock and a two-phase commit (2PC) protocol.
[Figure labels: read(x), x=v, write(x<-v), ok]
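One way to picture the lock-plus-2PC suggestion is the following sketch. It is a simplification under stated assumptions: the names `Replica` and `replicated_write` are hypothetical, the per-key staging area doubles as the lock, an in-memory dict stands in for stable storage, and coordinator crashes are ignored entirely.

```python
class Replica:
    """One replicated process; `store` stands in for its stable storage."""

    def __init__(self):
        self.store = {}   # committed values
        self.staged = {}  # prepared but uncommitted writes; an entry also acts as a lock

    def prepare(self, key, value):
        """Phase 1: lock the key and stage the write. Returns the vote."""
        if key in self.staged:  # key is locked by another in-flight write
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        """Phase 2 (commit): apply the staged write and release the lock."""
        self.store[key] = self.staged.pop(key)

    def abort(self, key):
        """Phase 2 (abort): discard the staged write and release the lock."""
        self.staged.pop(key, None)

    def read(self, key):
        return self.store.get(key)


def replicated_write(replicas, key, value):
    """Coordinator: commit at all replicas only if every replica votes yes."""
    prepared = []
    for replica in replicas:
        if not replica.prepare(key, value):
            break
        prepared.append(replica)
    if len(prepared) == len(replicas):
        for replica in replicas:
            replica.commit(key)
        return True
    # Abort only where we hold the lock, so other writers' locks stay intact.
    for replica in prepared:
        replica.abort(key)
    return False
```

Either every replica applies the write or none does, so a later `read` returns the same value at any replica. A usable protocol would also log each phase to stable storage so that a recovering process can finish or undo an in-flight write.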
5
Two Types of Failures or Faults
Non-Byzantine
– Failed nodes stop communicating with other nodes
– "Clean" failure
– Fail-stop behavior
Byzantine
– Failed nodes keep sending messages
– The messages may be incorrect and potentially misleading
– The failed node becomes a traitor
Fail-stop failures are what we typically assume when dealing with failures in distributed systems.
6
State Machine Approach
A distributed system is:
– A finite set of processes
A process is:
– A set of states, with one initial state
– A set of events or actions
An execution is a possibly infinite sequence of alternating states and actions:
$s_0 \xrightarrow{\alpha_0} s_1 \xrightarrow{\alpha_1} s_2 \xrightarrow{\alpha_2} \cdots$
7
Properties
A stuttering transition has the form $s \to s$: the state does not change.
A property is a set of executions closed under stuttering [Abadi, Lamport 1990].
– The clock still ticks after a program terminates.
– Stuttering is also useful in mapping between levels of abstraction.
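Closure under stuttering can be made concrete with a small sketch. It assumes finite executions written as lists of states only (actions omitted); the function names are hypothetical.

```python
from itertools import groupby


def destutter(states):
    """Collapse runs of repeated states, i.e. remove stuttering steps s -> s."""
    return [state for state, _ in groupby(states)]


def stutter_equivalent(a, b):
    """Two executions are stutter-equivalent if they differ only by stuttering."""
    return destutter(a) == destutter(b)
```

A property closed under stuttering cannot distinguish `[0, 1, 2]` from `[0, 0, 1, 2, 2]`: either both executions belong to it or neither does.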
8
Safety Properties
Informally: a safety property says that something bad doesn't happen.
Formally: a property P is a safety property iff:
– If σ is in P, then every finite prefix of σ is in P.
Additionally,
– If σ is not in P, then there is some finite prefix of σ that is not in P.
– That is, there is a point at which an illegal transition occurred.
– Safety properties can be finitely refuted.
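Finite refutation can be sketched for the special case of a safety property given by a predicate on single transitions. That restriction is an assumption of this sketch (not every safety property has this shape), and the names `shortest_bad_prefix` and `never_decreases` are hypothetical.

```python
def shortest_bad_prefix(states, step_ok):
    """Return the shortest prefix ending in an illegal transition, or None."""
    for i in range(1, len(states)):
        if not step_ok(states[i - 1], states[i]):
            return states[: i + 1]
    return None


def never_decreases(prev, nxt):
    """Example safety rule: a counter must never go down."""
    return nxt >= prev
```

For the execution `[0, 1, 1, 3, 2, 5]` the refuting prefix is `[0, 1, 1, 3, 2]`: no continuation of that prefix can repair the violation, which is what "finitely refuted" means.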
9
Liveness Properties
Informally: a liveness property says that something good eventually happens.
Formally: a property P is a liveness property iff:
– Every finite behavior is a prefix of some behavior in P.
Additionally,
– One can always “complete” a finite behavior into one that is in P.
– Liveness properties cannot be finitely refuted.
10
Safety and Liveness
Every property (i.e., every set of behaviors) is the conjunction of:
– A safety property, and
– A liveness property
This result is due to Alpern and Schneider and is based on basic results from topology.
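In topological terms the decomposition can be written out as follows. The notation is an assumption of this sketch: $\Sigma^{\omega}$ is the set of all infinite behaviors, and $\operatorname{cl}(P)$ is the closure of $P$, i.e. the set of behaviors all of whose finite prefixes can be extended to a behavior in $P$. Safety properties correspond to closed sets and liveness properties to dense sets.

```latex
\[
  P \;=\;
  \underbrace{\operatorname{cl}(P)}_{\text{safety}}
  \;\cap\;
  \underbrace{\bigl(P \,\cup\, (\Sigma^{\omega} \setminus \operatorname{cl}(P))\bigr)}_{\text{liveness}}
\]
```

The identity holds because $P \subseteq \operatorname{cl}(P)$: intersecting $\operatorname{cl}(P)$ with the second set leaves $P$, since the part outside $\operatorname{cl}(P)$ contributes nothing.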
11
Visible Behavior
A specification identifies a subset of its actions (or its state variables) as externally visible.
A state machine defines a set of allowable executions:
– state: a set of values, usually divided into named variables.
– actions: named changes in the state; internal and external.
Actions may be nondeterministic.
– In fact, Lampson encourages this in specs to allow flexibility in implementations.
12
Implements
Y implements X if
– every external behavior of Y is an external behavior of X.
This expresses the idea that Y implements X if you can’t tell Y apart from X by looking only at the external actions.
Examples: abstract data types, databases, distributed systems.
Note: Lampson implicitly deals with finite behaviors, and therefore states the liveness property separately. (He doesn’t treat liveness in the proofs.)
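For finite sets of finite executions, the "implements" relation reduces to set inclusion after hiding internal actions. The sketch below assumes executions are tuples of action names; the function names and the example actions are hypothetical.

```python
def external_behaviors(executions, external):
    """Project each execution onto its externally visible actions."""
    return {tuple(a for a in execution if a in external) for execution in executions}


def implements(impl_executions, spec_executions, external):
    """Y implements X if every external behavior of Y is one of X's."""
    return external_behaviors(impl_executions, external) <= external_behaviors(
        spec_executions, external
    )


# Hypothetical example: the implementation adds internal logging and replication.
EXTERNAL = {"put", "get"}
spec = {("put", "get")}
good_impl = {("put", "log", "replicate", "get")}
bad_impl = {("get", "put")}
```

Here `implements(good_impl, spec, EXTERNAL)` holds because the internal `log` and `replicate` actions are invisible, while `bad_impl` exposes an ordering the specification does not allow. As the slide notes, this view covers finite behaviors only and leaves liveness aside.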
13
How to Build Highly Available Systems
Leslie Lamport’s idea of how to build highly available distributed systems:
– Every computing system works as a state machine
– So a replicated state machine can do highly available computing
14
A Simple Way to Build Consensus
Fault-tolerant consensus is expensive!
– Exclusive access (locking) is cheap, but not fault tolerant.
– Leases are “fault-tolerant” locks: they work like ordinary locks, except that they time out.
– Leases can be hierarchical: only the root lease needs to be granted by consensus.
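The difference between a lock and a lease can be sketched in a few lines. The `Lease` class and its injected `clock` callable are assumptions for illustration; the sketch ignores clock drift between grantor and holder, which a real lease mechanism has to bound.

```python
class Lease:
    """A lock that expires `duration` time units after it is granted."""

    def __init__(self, duration, clock):
        self.duration = duration
        self.clock = clock  # callable returning the current time
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who):
        """Grant or renew the lease if it is free, expired, or already ours."""
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == who:
            self.holder = who
            self.expires_at = now + self.duration
            return True
        return False

    def held_by(self, who):
        """True only while the lease is unexpired and belongs to `who`."""
        return self.holder == who and self.clock() < self.expires_at


# A fake clock makes expiry easy to demonstrate.
now = [0.0]
lease = Lease(duration=10.0, clock=lambda: now[0])
granted_to_a = lease.acquire("A")  # granted at time 0
denied_to_b = lease.acquire("B")   # refused: A still holds the lease
now[0] = 11.0                      # A crashes and never renews
granted_to_b = lease.acquire("B")  # granted: the lease timed out
```

With an ordinary lock, a crashed holder would block everyone else forever; the timeout is what lets the system make progress without the holder's cooperation.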
15
What is Consensus?
Agreement on shared state (single system image)
Recovers from server failures autonomously
– Minority of servers fail: no problem
– Majority fail: lose availability, retain consistency
Key to building consistent storage systems
[Figure: servers]
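The minority and majority cases come down to simple quorum arithmetic, sketched below with hypothetical function names.

```python
def majority(n_servers):
    """Smallest number of servers that forms a majority."""
    return n_servers // 2 + 1


def is_available(n_servers, n_failed):
    """The cluster can make progress while a majority is still running."""
    return n_servers - n_failed >= majority(n_servers)


def max_tolerated_failures(n_servers):
    """Largest number of failures that still leaves a majority."""
    return n_servers - majority(n_servers)
```

A 5-server cluster needs 3 servers to proceed and therefore tolerates 2 failures. A 4-server cluster also needs 3 and tolerates only 1, which is why odd cluster sizes are the usual choice.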
16
How Is Consensus Used?
– Top-level system configuration
– Replicate entire database state
17
Hierarchical Leases
18
Consensus & Synchronicity
19
Inside a Consistent System
Key challenge: eliminate the single point of failure
An ad hoc algorithm
– “This case is rare and typically occurs as a result of a network partition with replication lag.”
OR
A consensus algorithm (built-in or library)
– Paxos, Raft, …
A consensus service
– ZooKeeper, etcd, consul, …
20
Consensus Algorithms: Requirements
Consensus algorithms typically satisfy the following properties:
– Safety: never return an incorrect result under any kind of non-Byzantine failure.
– Availability: remain available as long as a majority of the servers remain operational and can communicate with each other and with clients.
– Robustness: do not depend on timing to ensure the consistency of states.
– Responsiveness: a command typically completes as soon as a majority of servers have responded to a single round of remote procedure calls; i.e., one or two slow servers should not impact overall system response times.
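The responsiveness requirement can be illustrated with a small calculation: if a command completes once a majority has responded, its latency is set by the majority-th fastest server rather than the slowest one. The function name and the response times below are hypothetical, and the list is assumed to cover every server in the cluster, including the one coordinating the round.

```python
def commit_latency(response_times):
    """Time at which a majority of servers have responded."""
    quorum = len(response_times) // 2 + 1
    return sorted(response_times)[quorum - 1]


# Hypothetical per-server response times in milliseconds.
healthy = [5, 7, 6, 8, 9]
two_slow = [5, 900, 6, 950, 7]
```

With five servers the command waits for the third response, so `commit_latency(two_slow)` is 7 ms even though two servers take close to a second.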
21
References
Specifications
– Lamport, A simple approach to specifying concurrent systems. Comm. ACM 32, 1, Jan. 1989.
Impossibility
– Fischer, Lynch, and Paterson, Impossibility of distributed consensus with one faulty process. J. ACM 32, 2, April 1985.
Paxos algorithm
– Lamport, The part-time parliament. Technical Report 49, Digital Equipment Corp., Palo Alto, Sep. 1989.
– Oki and Liskov, Viewstamped replication. Proc. 7th PODC, Aug. 1988.
State machines
– Lamport, Using time instead of timeout for fault-tolerant distributed systems. ACM TOPLAS 6, 2, April 1984.
– Schneider, Implementing fault-tolerant services using the state-machine approach: A tutorial. Computing Surveys 22 (Dec. 1990).
Lampson’s talk
– Lampson, How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, LNCS 1151, Springer, 1996.