CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Set 17: Fault-Tolerant Register Simulations
CSCE 668, Spring 2014, Prof. Jennifer Welch

Fault-Tolerant Shared Memory Simulations
Previous algorithms implemented a shared variable on top of message passing, assuming no failures.
What if some processors might crash? Can we still provide a shared read/write variable on top of message passing?
Yes, even in an asynchronous system, provided enough processors are nonfaulty.
First, we must specify a failure-prone shared memory.

Specification of f-Resilient Shared Memory
Inputs are invocations on the shared object. Outputs are responses of the shared object.
A sequence of inputs and outputs is allowable iff there is a partitioning of the proc. indices into "faulty" and "nonfaulty" such that:
  Correct Interaction: each proc. alternates invocations and matching responses.
  Nonfaulty Liveness: every invocation by a nonfaulty proc. has a matching response.
  Extended Linearizability: linearizability holds for all the completed operations and some subset of the pending operations (some ops might never complete).

Assumptions for Algorithm
Each read/write variable ("register") to be simulated has one reader and one writer. (The next topic will be to build more powerful variables out of these.)
There are n procs. cooperating to simulate a collection of such variables (any number).
The underlying communication system is asynchronous point-to-point message passing.
n > 2f (fewer than half the processors can crash).

Main Ideas of Algorithm
Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register.
Include a sequence number with the register value to indicate the most recently written value.
Use the redundant storage to provide fault-tolerance.
Describe the algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.

Writing the Simulated Register
The writer generates the next sequence number.
It sends a message with the value and the sequence number to all the procs.
Each recipient updates its local copy of the register if the value just received has a newer sequence number than the stored value, and sends back an ack.
The writer waits to get back an ack from > n/2 procs. This is safe since n - f > n/2.
The writer then does the ack for the write.

Reading the Simulated Register
The reader sends a request to all the procs.
Each recipient sends back the current value of its replica together with the sequence number of that value.
The reader waits to get replies from > n/2 procs.
It returns the value associated with the largest sequence number.
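The write and read steps above can be sketched as a small in-process simulation. This is only an illustration under stated assumptions, not the course's pseudocode: the names `Replica` and `QuorumRegister` are hypothetical, message passing is replaced by direct method calls, and asynchrony is modelled by letting an arbitrary majority of live replicas be the ones that respond.

```python
import random


class Replica:
    """One proc.'s local copy of the simulated register (hypothetical name)."""

    def __init__(self, initial=0):
        self.seq = 0            # sequence number of the stored value
        self.val = initial
        self.crashed = False

    def on_write(self, seq, val):
        # Adopt the value only if it carries a newer sequence number.
        if seq > self.seq:
            self.seq, self.val = seq, val
        return "ack"

    def on_read(self):
        return self.seq, self.val


class QuorumRegister:
    """Single-writer, single-reader register replicated at n procs."""

    def __init__(self, n, initial=0, rng=None):
        self.n = n
        self.replicas = [Replica(initial) for _ in range(n)]
        self.next_seq = 0
        self.rng = rng or random.Random(0)

    def _some_majority(self):
        # Assumption: any set of more than n/2 live replicas may be the
        # ones whose replies arrive first.
        live = [r for r in self.replicas if not r.crashed]
        need = self.n // 2 + 1
        if len(live) < need:
            raise RuntimeError("fewer than a majority of procs. are alive")
        return self.rng.sample(live, need)

    def write(self, val):
        self.next_seq += 1
        # Worst case: only the replicas that ack have applied the write.
        for r in self._some_majority():
            r.on_write(self.next_seq, val)
        return "ack"

    def read(self):
        replies = [r.on_read() for r in self._some_majority()]
        # Return the value paired with the largest sequence number.
        return max(replies, key=lambda reply: reply[0])[1]
```

Because the write and the read each contact more than n/2 replicas, at least one replica seen by the read holds the latest sequence number, even though the two majorities are chosen independently.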

Key Idea for Correctness
Each read should return the value of "the most recent" write.
Each read or write communicates with > n/2 procs., so the set of procs. participating in operation OP1 is guaranteed to intersect with the set of procs. participating in any other operation OP2.
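The intersection claim follows from a one-line counting argument. In the sketch below, Q_1 and Q_2 are names introduced here for the sets of procs. contacted by OP1 and OP2; they are not notation from the slides.

```latex
|Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2|
\;>\; \frac{n}{2} + \frac{n}{2} - n \;=\; 0,
\qquad \text{since } |Q_1|, |Q_2| > \frac{n}{2} \text{ and } |Q_1 \cup Q_2| \le n .
```

So the two sets share at least one proc., and that proc. carries the sequence number of the earlier operation to the later one.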

But What About Asynchrony?
The underlying communication system is asynchronous: a message on behalf of one operation could be overtaken by a message on behalf of a later operation.
Avoid such problems by adding an additional mechanism to the algorithm:
  the reader and writer keep track of the "status" of each link;
  they don't send a msg on a link until the ack for the previous msg has been received.

Outline of Correctness Proof
The interesting part is proving (extended) linearizability.
Let ts(W) = sequence number of W.
Let ts(R) = sequence number of the write that R reads from.
Let O1 → O2 denote that O1 finishes before O2 starts.
Key lemmas:
  If W1 → W2, then ts(W1) < ts(W2).
  If W → R, then ts(W) ≤ ts(R).
  If R → W, then ts(R) < ts(W).
  If R1 → R2, then ts(R1) ≤ ts(R2).

Matching Lower Bound on Resiliency
Theorem (10.22): No simulation of a 1-reader, 1-writer linearizable read/write register using n procs. and asynchronous message passing can tolerate f ≥ n/2 crash failures.
Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.

Lower Bound Proof
Partition the procs. into two sets, S0 and S1, each of size f.
Let α0 be an admissible exec. of A s.t.:
  the initial value of the simulated register is 0;
  all procs. in S1 crash initially;
  proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked;
  the write completes at some time t0 without any proc. in S0 receiving a message from any proc. in S1. This must happen since A is supposed to tolerate f failures.

[Figure: execution α0, showing S0 (containing p0) and S1, with the procs. in S1 crashed (X).]

Lower Bound Proof
Let α1 be an admissible exec. of A s.t.:
  the initial value of the simulated register is 0;
  all procs. in S0 crash initially;
  proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked;
  the read completes at some time t1 without any proc. in S1 receiving a message from any proc. in S0. This must happen since A is supposed to tolerate f failures;
  the read returns 0. This must be so since A guarantees linearizability.

[Figure: execution α1, showing p1 in S1, with the procs. in S0 crashed (X).]

Lower Bound Proof
Now create an admissible execution β by "merging" the views of the procs. in S0 from α0 and the views of the procs. in S1 from α1: messages that go between S0 and S1 are delayed so that they don't arrive until after time t1.
β is not linearizable, since read(0) follows write(1). Contradiction.

[Figure: executions α0, α1, and the merged execution β for S0 (with p0) and S1 (with p1); in β, messages between S0 and S1 are delayed until after t1.]

Lower Bound Diagram for n = 2
[Figure: timelines for n = 2 with times t0, t0+1, and t1 marked. In α0, p1 is crashed (X) and p0 performs write(1). In α1, p0 is crashed (X) and p1 performs read(0). In β, p0 performs write(1) and p1 then performs read(0).]

Simulating R/W Registers Using R/W Registers
The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing.
How can we get more powerful (flexible) registers, i.e., with
  more readers,
  more writers?
We'll start with a warm-up: simulate a multi-valued register using binary-valued registers (1-reader and 1-writer).

Wait-Free Register Simulations
Asynchronous model.
Linearizable shared registers.
Wait-free: tolerate any number of crash failures.
We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient shared memory:
  recall the earlier definition of f-resilient shared memory;
  recall the earlier definition of one kind of communication system simulating another.

Alternative Definition of Wait-Free Simulation
Alternative definition for the wait-free shared memory case:
  the failure-free version of one (SM) communication system simulates the failure-free version of the other, and
  for any prefix of an admissible execution of the simulation algorithm in which pi has a pending operation, there is an extension in which the operation completes and only pi takes steps.
Equivalent to the previous definition, and sometimes more convenient.

Proving Linearizability
We've seen one approach: explicitly construct a permutation and prove that it has the desired properties.
Alternative approach: identify a time point for each operation, between its invocation and response. These are the linearization points.
  The linearization points give the permutation.
  Real-time order is preserved, since each point lies within its operation's interval.
  We just need to show that legality holds.

Overview of Register Simulations
From least to most powerful, each kind simulated from the one before it:
  single-reader single-writer binary-valued
  single-reader single-writer multi-valued
  multi-reader single-writer multi-valued
  multi-reader multi-writer multi-valued

Multi-Valued From Binary
Some ideas…
Use a different binary register to store each bit of the multi-valued register being simulated.
The read algorithm is to read all the binary registers and return the resulting value.
The write algorithm is to write the new bits in some order.
Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits.

A Unary Approach
Suppose the simulated register is to take on the values {0,…,K-1}.
Use an array of K binary registers, B[0..K-1].
Represent value v by having B[v] = 1 and the other entries 0.
Read algorithm: read B[0], B[1],…, until finding the first 1; return the index.
Write algorithm: zero out the old entry of B and set the new entry.

Problems with Unary Approach
OK if reads and writes don't overlap. If they do, we have to worry about:
  the reader never finding a 1 in B;
  new-old inversion: the writer writes 1, then 2, but the reader reads 2, then 1.
A counter-example execution is on the next slide. Since the binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers.

Counter-Example
Initially B[0] = B[1] = B[2] = 0 and B[3] = 1.
[Figure: timeline of a new-old inversion. The writer performs write 1 (write 1 to B[1], then write 0 to B[3]) followed by write 2 (write 1 to B[2], then write 0 to B[1]). The reader performs a read that returns 2 (read 0 from B[0], read 0 from B[1], read 1 from B[2]) followed by a read that returns 1 (read 0 from B[0], read 1 from B[1]).]

Corrected Multi-Valued Algorithm
To prevent "falling off the edge" of the end of the B array without finding a 1, the write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1).
To prevent new-old inversions, the read algorithm
  scans up to find the first 1, and then scans down to make sure those entries are still 0;
  returns the smallest value associated with a 1 entry in B that is observed during the downward scan.
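The corrected read and write algorithms can be sketched as follows. This is a sequential illustration only: the class name `MultiValuedRegister` and the operation counters are hypothetical, and the write order used here (set B[v] first, then clear the smaller entries from v-1 down to 0) is an assumption that the slide text does not spell out.

```python
class MultiValuedRegister:
    """K-valued register built from K binary registers B[0..K-1]."""

    def __init__(self, K, initial=0):
        self.K = K
        self.B = [0] * K
        self.B[initial] = 1
        self.low_level_reads = 0    # counters for the wait-freedom bounds
        self.low_level_writes = 0

    def _read_bit(self, i):
        self.low_level_reads += 1
        return self.B[i]

    def _write_bit(self, i, bit):
        self.low_level_writes += 1
        self.B[i] = bit

    def write(self, v):
        # Set the new entry, then clear only the smaller entries.
        self._write_bit(v, 1)
        for i in range(v - 1, -1, -1):
            self._write_bit(i, 0)

    def read(self):
        # Upward scan: find the first 1.
        i = 0
        while self._read_bit(i) == 0:
            i += 1
        v = i
        # Downward scan: keep the smallest index observed to hold a 1.
        for j in range(i - 1, -1, -1):
            if self._read_bit(j) == 1:
                v = j
        return v
```

Entries above the written value are never cleared, so the upward scan always finds a 1. A write uses at most K low-level writes and a read at most 2K-1 low-level reads.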

Multi-Valued Construction
[Figure: the reader algorithm reads, and the writer algorithm writes, the binary (0/1) registers B[0],…,B[K-1].]

Algorithm is Wait-Free
The algorithm for the writer does not involve any waiting: it just does at most K (low-level) writes.
The algorithm for the reader does not involve any waiting: it just does at most 2K-1 (low-level) reads.

Algorithm Ensures Linearizability
Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering).
Then show that it respects the real-time ordering of non-overlapping operations.
Fix any admissible execution of the algorithm.
Fix any linearization of the low-level operations (on the binary registers). One exists since the execution is admissible, which implies that the underlying communication system (the binary registers) behaves properly (is linearizable).

Reads-From Relations
Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations.
High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.

Reads-From Diagram
[Figure: a high-level write containing write 1 to B[1] and write 0 to B[0], and a high-level read containing read 0 from B[0] and a read from B[1]. Arrows mark the low-level reads-from relationships and the resulting high-level reads-from relationship.]

Construct Permutation
Place all (high-level) writes in the order in which they occur (there are no concurrent writes).
Consider each (high-level) read in the order in which they occur (there are no concurrent reads).
Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.

Correctness of Permutation
The permutation is legal by construction: each read is placed after the write that it reads from.
Why does it preserve the order of non-overlapping operations?
  Two writes: by construction.
  A read that precedes a write in the execution: OK, since the read cannot read from a later write.

Lemma (10.1): Suppose (high-level) read R returns v; R reads B[u], with u < v, during its upward scan; and this read of B[u] reads from a (low-level) write contained in high-level write W1. Then R does not read from any (high-level) write that precedes W1.

Figure for Lemma 10.1
[Figure: high-level writes "write v" (containing write 1 to B[v]) and "write w" (containing write 1 to B[w] and write 0 to B[u]), and a high-level "read v" that reads 0 from B[u] during its upward scan (u < v) and reads from B[v] at the top of its upward scan or during its downward scan. The low-level reads-from relationships are marked, and the high-level reads-from relationship shown is labeled "can't happen".]

Two cases remain to show that the real-time order of non-overlapping operations is preserved:
  a write that precedes a read in the execution;
  two reads.
The proofs of both cases are by contradiction, showing that there is a situation that violates Lemma 10.1.

Multi-Reader from Single-Reader
First consider a simple idea:
Use a different single-reader register for each reader (Val[1],…,Val[n]), where n is the number of readers.
Write algorithm: write the new value in each of the single-reader registers.
Read algorithm: read your own single-reader register and return that value.

Counter-Example
Suppose 0 is the initial value of the multi-reader register. Suppose n = 2.
[Figure: new-old inversion. pw performs write 1 as two low-level writes, write 1 to Val[1] and then write 1 to Val[2]. p1 reads 1 from Val[1]; afterwards p2 reads 0 from Val[2], before pw has written Val[2].]

New Idea for Correct Algorithm
Have the readers, as part of the multi-reader algorithm, write some information to the single-reader registers to prevent new-old inversions on the simulated register.
This is provably necessary…

Readers Must Write
Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write.
Proof: Suppose in contradiction there is an algorithm in which readers never write.

Readers Must Write
pw is the writer; p1 and p2 are the readers.
The initial value of the simulated register is 0.
S1 is the set of single-reader registers that are read by p1.
S2 is the set of single-reader registers that are read by p2.

Readers Must Write
Consider an execution in which pw writes 1 to the simulated register.
The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers.
Each wj is a write to a register in either S1 or S2.
Let v_j^i be the value that would be returned if pi were to do a read immediately after wj.

Readers Must Write
[Figure: pw's write 1 consists of the low-level writes w1,…,wj,…,wk; a read by pi placed immediately after wj returns v_j^i.]

Readers Must Write
For each reader (p1 and p2), there is a point when the writes w1,…,wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new).
For p1: v_1^1 = v_2^1 = … = v_{a-1}^1 = 0 and v_a^1 = … = v_k^1 = 1.
For p2: v_1^2 = v_2^2 = … = v_{b-1}^2 = 0 and v_b^2 = … = v_k^2 = 1.
a cannot equal b!

Readers Must Write
Why must a and b be different?
a marks the point when p1's view of the simulated register's current value changes from old to new. So wa must write to a register in S1.
Similarly, wb must write to a register in S2.
W.l.o.g., assume a < b.

Readers Must Write
[Figure: pw's write 1 consists of w1,…,wa, wa+1,…,wk. p1 reads immediately after wa and returns v_a^1 = 1; p2 then reads and returns v_a^2 = 0. Not linearizable!]

Readers Must Write
Where did we use the assumption in this proof that readers don't write?
The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading.
The readers are oblivious to each other.

Corrected Multi-Reader Algorithm
As part of the algorithm for the read on the simulated register, announce the value to be returned.
Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier.
Timestamps are needed to determine the relative age of returned values.
Reader pi uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal).

Writer's Algorithm
Get the next sequence number (use integers that are increased by one each time).
Write the value and sequence number to Val[1],…,Val[n] (one copy for each reader).

Reader pi's Algorithm
Read the value and timestamp written by the writer to Val[i].
Read the value and timestamp written by each reader pj to Report[j,i].
Choose the value-timestamp pair with the largest timestamp.
Write that pair to row i of Report.
Return the value associated with that pair.
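The writer's and reader's algorithms can be sketched together as below. This is an illustrative, single-threaded model under stated assumptions: the class name `MultiReaderRegister` and the generator `write_steps` (which pauses after each low-level write so that reads can be interleaved) are hypothetical, readers are indexed from 0 rather than 1, and each register holds a (timestamp, value) pair.

```python
class MultiReaderRegister:
    """Multi-reader, single-writer register from single-reader registers."""

    def __init__(self, n, initial=0):
        self.n = n
        self.seq = 0
        # Val[i]: written by the writer, read only by reader i.
        self.Val = [(0, initial)] * n
        # Report[i][j]: written by reader i, read only by reader j.
        self.Report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, v):
        """Perform the write one low-level write at a time."""
        self.seq += 1
        for i in range(self.n):
            self.Val[i] = (self.seq, v)
            yield i

    def write(self, v):
        for _ in self.write_steps(v):
            pass

    def read(self, i):
        # n+1 low-level reads: Val[i] and column i of Report.
        candidates = [self.Val[i]]
        candidates += [self.Report[j][i] for j in range(self.n)]
        best = max(candidates, key=lambda pair: pair[0])
        # n low-level writes: announce the chosen pair in row i.
        for j in range(self.n):
            self.Report[i][j] = best
        return best[1]
```

For example, if the writer has updated only `Val[0]` when reader 0 returns the new value, a later read by reader 1 still returns the new value, because it finds the newer timestamp in `Report[0][1]`.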

Multi-Reader Construction
[Figure: construction for 3 readers. The writer algorithm writes the Val registers; each reader algorithm reads its Val register and reads and writes the Report registers.]

Correctness of Multi-Reader Algorithm
Wait-free:
  the writer does n low-level writes;
  a reader does n+1 low-level reads and n low-level writes.
To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves the real-time order of non-overlapping operations.

Constructing the Permutation π
Put in all writes in the order in which they occur in the execution (since there is a single writer, writes do not overlap).
Consider the reads in the order of their responses in the execution.
  Read R reads from write W if W generates the timestamp associated with the value R returns.
  Place R immediately before the write that follows W.
By construction, the permutation is legal.
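The construction can be sketched as a short function. The name `build_permutation` and the input encoding (operation ids as strings, `None` standing for the initial value) are assumptions made for this illustration.

```python
def build_permutation(writes, reads, reads_from):
    """Build the permutation of high-level operations.

    writes:     write ids in the order they occur in the execution
    reads:      read ids in the order of their responses
    reads_from: maps each read id to the write id it reads from,
                or to None if it returns the initial value
    """
    # Group the reads behind the write they read from, keeping the
    # order in which the reads are considered.
    after = {None: []}
    for w in writes:
        after[w] = []
    for r in reads:
        after[reads_from[r]].append(r)

    # Each read ends up immediately before the write that follows
    # the write it reads from.
    perm = list(after[None])
    for w in writes:
        perm.append(w)
        perm.extend(after[w])
    return perm


example = build_permutation(
    writes=["W1", "W2"],
    reads=["R1", "R2", "R3"],
    reads_from={"R1": "W1", "R2": None, "R3": "W1"},
)
```

Here `example` is `["R2", "W1", "R1", "R3", "W2"]`: every read sits after the write it reads from and before the next write, which is why the permutation is legal by construction.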

Preserving Real-Time Order
(π is the permutation; α is the execution.)
write-write: by construction of π.
read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and so reads from W or a later write. Thus R is placed in π after W.
read-read: Suppose Ri by pi precedes Rj by pj in α. Then pj reads Ri's timestamp or a larger one from Report[i,j]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.

Multi-Writer from Single-Writer
Idea:
  each writer should announce each value it wants to write to all the readers, by writing the value to its own (SW,MR) register;
  each reader reads all the values written by the writers and returns the latest one.
How to determine the latest value?
  Use timestamps.
  The new wrinkle is that multiple processes generate timestamps, so they need to coordinate.

Using Vector Timestamps
Each proc. keeps a data structure VT consisting of a vector of m integers, where m is the number of writers.
To get a new timestamp, writer pi increments VT[i] by one.
To compare timestamps, use lexicographic order. This is a total order that extends the partial order defined for vector timestamps.

Writer pw's Algorithm
Get the next vector timestamp:
  read the timestamps written by the writers to TS[0],…,TS[m-1];
  extract the i-th entry of each TS[i];
  increment my own entry by 1.
Write my new timestamp to TS[w].
Write the value and timestamp to Val[w].

Reader pr's Algorithm
Read the value and timestamp written by each writer to Val[0],…,Val[m-1].
Choose the value-timestamp pair with the largest timestamp.
Return the value associated with that pair.
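The writer's and reader's algorithms with vector timestamps can be sketched as below. This is a sequential illustration under stated assumptions: the class name `MultiWriterRegister` is hypothetical, timestamps are Python tuples (which compare lexicographically), and each `Val[w]` holds a (timestamp, value) pair.

```python
class MultiWriterRegister:
    """Multi-writer register from single-writer, multi-reader registers."""

    def __init__(self, m, initial=0):
        self.m = m
        zero = (0,) * m
        self.TS = [zero] * m                 # TS[w]: written only by writer w
        self.Val = [(zero, initial)] * m     # Val[w]: written only by writer w

    def write(self, w, v):
        # m low-level reads: take the i-th entry of each TS[i].
        ts = [self.TS[i][i] for i in range(self.m)]
        ts[w] += 1                           # increment my own entry
        ts = tuple(ts)
        # 2 low-level writes.
        self.TS[w] = ts
        self.Val[w] = (ts, v)

    def read(self):
        # m low-level reads, then pick the lexicographically largest
        # timestamp (tuple comparison is lexicographic).
        ts, v = max(self.Val, key=lambda pair: pair[0])
        return v
```

With m = 2, the writes `write(0, "a")`, `write(1, "b")`, `write(0, "c")` produce timestamps (1,0), (1,1), (2,1) in that order, so a read after each write returns the value just written.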

Multi-Writer Construction
[Figure: construction for 2 writers and 3 readers. The writer algorithms read and write the TS registers and write the Val registers; the reader algorithms read the Val registers.]

Correctness of Multi-Writer Algorithm
Wait-free:
  a writer does m low-level reads and 2 low-level writes;
  a reader does m low-level reads.
To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves the real-time order of non-overlapping operations.

Constructing the Permutation π
Put in all writes in timestamp order. Lemma 10.6 shows this preserves the order of non-overlapping writes.
Consider the reads in the order of their responses in the execution.
  Read R reads from write W if W generates the timestamp associated with the value R returns.
  Place R immediately before the write that follows W.
By construction, the permutation is legal.

Preserving Real-Time Order
(π is the permutation; α is the execution.)
write-write: by construction of π.
read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and so reads from W or a later write. Thus R is placed in π after W.
read-read: Suppose Ri by pi precedes Rj by pj in α. By Lemmas 10.6 and 10.7, pj reads Ri's timestamp or a larger one from Val[ ]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.

Atomic Snapshot Objects (ASO)
An array of elements:
  each one can be updated by just one proc.;
  a proc. can scan the whole array "atomically".
A useful abstraction for designing shared memory algorithms.
Can be wait-free implemented from read/write variables (registers).

ASO Sequential Specification
Operations are:
  invocation scan_i, response return_i(V), where V is an array of n values, 0 ≤ i ≤ n-1;
  invocation update_i(d), where d is a data value, response ack_i, 0 ≤ i ≤ n-1.
Legal sequences: for each V returned by a scan, V[i] equals the parameter of the latest preceding update_i (or the initial value if there is none), i.e., V is an atomic snapshot of the entire array.

ASO Example
Suppose array = [a,b,c] initially.
This sequence is legal: update_1(x), update_2(y), scan([a,x,y]), update_0(z), scan([z,x,y]).
This sequence is not legal: update_1(x), update_2(y), update_0(z), scan([a,x,y]).
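The two sequences can be checked mechanically against the sequential specification. In this sketch the function name `is_legal` and the tuple encoding of operations are assumptions made for illustration.

```python
def is_legal(initial, ops):
    """Check a sequence of ASO operations against the sequential spec.

    ops is a list of ("update", i, d) and ("scan", V) tuples, where V is
    the array returned by the scan.
    """
    array = list(initial)
    for op in ops:
        if op[0] == "update":
            _, i, d = op
            array[i] = d
        else:
            _, view = op
            # A scan must return exactly the current contents.
            if list(view) != array:
                return False
    return True


legal = is_legal(
    ["a", "b", "c"],
    [("update", 1, "x"), ("update", 2, "y"), ("scan", ["a", "x", "y"]),
     ("update", 0, "z"), ("scan", ["z", "x", "y"])],
)
not_legal = is_legal(
    ["a", "b", "c"],
    [("update", 1, "x"), ("update", 2, "y"), ("update", 0, "z"),
     ("scan", ["a", "x", "y"])],
)
```

The second sequence fails because the scan comes after the update to segment 0 but still reports the old value a there.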

Sketch of Wait-Free Implementation Using Registers
Store each array entry ("segment") in a different read/write variable.
Update algorithm: write to the variable holding that segment.
Scan algorithm:
  collect (read) all the values in the segments twice;
  if no segment is updated during the "double collect", then we got a valid snapshot, so return it.
Issues:
  how to tell if a segment is updated?
  what to do if a segment is updated?

Detecting Updates
A simple idea is to tag each value stored in a segment with a counter (1,2,3,…). This requires unbounded space.
A more complex, bounded-space solution is given in the textbook. It uses a "handshaking" mechanism.
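The counter-based double collect can be sketched as below. This is a single-threaded model under stated assumptions: the class name `SnapshotArray` and the `on_collect` hook (used only to inject updates between collects) are hypothetical, and this simple loop is not wait-free on its own, since continual updates could keep it retrying.

```python
class SnapshotArray:
    """Double-collect scan using unbounded per-segment counters."""

    def __init__(self, initial):
        # Each segment holds (counter, data).
        self.seg = [(0, d) for d in initial]
        self.on_collect = None   # hypothetical hook to simulate concurrency

    def update(self, i, d):
        counter, _ = self.seg[i]
        self.seg[i] = (counter + 1, d)

    def collect(self):
        view = list(self.seg)    # read every segment once
        if self.on_collect:
            self.on_collect()    # updates may happen between collects
        return view

    def scan(self):
        prev = self.collect()
        while True:
            cur = self.collect()
            # Equal counters in two successive collects mean that no
            # segment was updated in between: a valid snapshot.
            if cur == prev:
                return [d for _, d in cur]
            prev = cur


snap = SnapshotArray(["a", "b", "c"])
pending = [(1, "x"), (2, "y")]

def interfere():
    # Apply one pending update after each collect, if any remain.
    if pending:
        i, d = pending.pop(0)
        snap.update(i, d)

snap.on_collect = interfere
result = snap.scan()
```

In this run the first two double collects see a changed counter and are retried; the scan returns only once two successive collects agree.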

Reacting to Update During Scan
If a scanner observes enough changes to a particular segment, then one of the overlapping updaters has performed a complete update during this scan.
Embed a scan at the beginning of each update: the view obtained in this scan is written with the data to the segment.
The scanner then returns the view that it read from that segment in its last collect.

Complexity of ASO Algorithm
The number of building-block read/write variables is O(n) (although some are large).
The scan algorithm uses O(n^2) low-level reads and writes.
The update algorithm uses O(n^2) low-level reads and writes.
