Computing with Byzantine Shared Memory Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar.

1 Computing with Byzantine Shared Memory Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar

2 Sources Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory –Abraham, Chockler, Keidar, Malkhi, PODC'04. Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions –Chockler, Keidar, Malkhi, FuDiCo II, June 2004. Some slides stolen from Chockler and Malkhi

3 Internet Information Services Internet

4 Internet Information Services web Network storage [Fleet (Hebrew U and Lucent), Agile (GTech), Coca (Cornell), SBQL (UTexas), others]

5 Internet Information Services web Peer-to-peer storage [OceanStore, etc.]

6 Storage Area Network (SAN)

7 Large-Scale Deployment Clients come and go –cannot be expected to be around –hence, direct communication infeasible Communication among storage nodes is also infeasible –In SAN impossible –Thin (scalable) servers, more logic at clients Some of the storage nodes can be compromised by a malicious adversary

8 Fault/Security Scenarios Client crash, unresponsiveness Access control for clients –access restricted to data they own –if bypassed, what can we do? A malicious client can mess up the data anyway Servers entrusted with others’ data –compromised servers a bigger problem –Byzantine faults possible

9 Assumptions Any number of client failures (crash) –no Byzantine faults (assume access control) Threshold (t-out-of-n) of storage components (per object) can be faulty –Byzantine, unresponsive: NR-Arbitrary Faults Monitoring/reconfiguration service –Each object has enough healthy replicas Naming service: e.g., DHT

10 Example: Internet-Wide Store NS Client: ha=lookup(a); read(ha); write(ha); … X Y Z A ha={X,Y,Z,A}

11 Data-Centric Replication No server-to-server comm NR-arbitrary failures No client-to-client comm Bounded, unknown number of clients

12 Some System Examples Fleet, HUJI and Lucent; Rosebud, MIT; Agile, GTech; Coca, Cornell; SBQL, UTexas; OceanStore, Berkeley; Pasis, CMU; others

13 Our Focus: Foundations Wanted: generic services useful for applications –Reliable R/W registers –Consensus Which register semantics can be supported? –Atomic, regular, safe –Termination conditions At which cost? –Failure resilience –Communication rounds –Memory consumption By understanding tradeoffs, system designers will be able to intelligently decide what to choose

14 Formal Model Asynchronous shared memory –servers=shared objects, clients=processes Shared objects may experience NR-arbitrary failures [JCT98] –a faulty object can respond with an arbitrary value or fail to respond Processes can fail by crashing

15 Previous Work I: Wait-Free Constructions Safe register: A read that does not overlap a write returns the register’s last written value –Malkhi & Reiter 98: n > 4t One round read/write, unbounded timestamps –Jayanti, Chandra & Toueg 98: n > 5t One round read/write, no timestamps Self-construction: implement a safe register from a collection of fault-prone safe registers

16 Byzantine Quorum Systems: Example [Malkhi and Reiter 98] At most one server can be penetrated x = 7, t = 1 x = 7 x = 0 t = 0 x = 2 t = 5 x = 7 t = 1 x = 7 t = 1

17 Byzantine Quorum Systems: Example [Malkhi and Reiter 98] x = 7, t = 1 x = 7 x = 0 t = 0 x = 0 t = 0 x = 7 t = 1 x = 7 t = 1 Why timestamps?
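The two quorum examples above can be sketched in a few lines of Python. This is only an illustration of the read rule they depict, under stated assumptions: a reader keeps the (value, timestamp) pairs reported by at least t+1 servers and returns the one with the highest timestamp. The function name `masking_read` and the response lists are hypothetical, built from the server states that survive in the transcript.

```python
from collections import Counter

def masking_read(responses, t):
    """Pick the pair reported by at least t+1 servers with the highest timestamp.

    responses: list of (value, timestamp) pairs, one per responding server.
    t: maximum number of Byzantine servers.
    Returns the chosen (value, timestamp) pair, or None if no pair has
    enough support (hypothetical convention for this sketch).
    """
    counts = Counter(responses)
    # A pair reported by t+1 servers was reported by at least one correct server.
    vouched = [pair for pair, c in counts.items() if c >= t + 1]
    if not vouched:
        return None
    # Timestamps order the vouched pairs; the newest one wins.
    return max(vouched, key=lambda pair: pair[1])

# Slide 16: the penetrated server reports (2, 5); it has no support, so it is ignored.
slide_16 = [(7, 1), (0, 0), (2, 5), (7, 1)]
# Slide 17: both (0, 0) and (7, 1) have t+1 supporters; the timestamp breaks the tie.
slide_17 = [(0, 0), (0, 0), (7, 1), (7, 1)]

print(masking_read(slide_16, t=1))
print(masking_read(slide_17, t=1))
```

The second call suggests one answer to "Why timestamps?": without them, a stale value and the latest value can both gather t+1 supporters and the reader has no way to choose.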

18 Previous Work II: Optimal Failure Resilience (n>3t) Synchronous system [Bazzi, DC’00] –Safe register Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] –MWMR atomic register Reliable clients [Attiya & Bar-Or, SRDS’03] Missing: optimal-resilience data-centric asynchronous implementations resilient to process failures

19 Optimal Resilience: The Challenge (v,1) The writer / The reader (v0,0) delayed n=4, t=1

20 Optimal Resilience: The Challenge ack The writer / The reader (v,1) (v0,0)

21 Optimal Resilience: The Challenge The writer / The reader (v,1) (v0,0) delayed

22 Optimal Resilience: The Challenge The writer / The reader (v,1) (v0,0) (v,1) ? Cannot return v0

23 Optimal Resilience: The Challenge The writer / The reader (v0,0) (v,1) ? Cannot return v 1 (v0,0) No write happened

24 Reliable Writer Solution: Wait The writer / The reader (v,1) (v0,0) (v,1)

25 Faulty Writer Scenario The writer / The reader Cannot wait! (v,1) (v0,0) (v,1) ?
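The ambiguity in the scenarios above can be traced to a short counting argument. The sketch below assumes that the writer and the reader each proceed after hearing from n−t objects (they cannot wait for more, since t objects may never respond); Q_W and Q_R are hypothetical names for the sets of objects that answered the write and the read.

```latex
% Two sets of n - t objects out of n must overlap:
|Q_W \cap Q_R| \;\ge\; (n - t) + (n - t) - n \;=\; n - 2t
% Up to t objects in the overlap may be Byzantine, so the number of
% correct objects that saw the write and answer the read is at least
n - 3t
% With n = 4t + 1:  n - 3t = t + 1  (more witnesses than there are possible liars)
% With n = 3t + 1:  n - 3t = 1      (a single witness)
```

With n=4 and t=1, a single correct object reporting (v,1) looks the same to the reader as a single Byzantine object inventing (v,1), which is the "?" in the slides.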

26 What Does This Mean? Is a solution with n=3t+1 impossible?

27 Lower Bound for Optimal Resilience The reader cannot return any value, and cannot wait for more, so what can it do? –Invoke more rounds! Will this help? –No! Exactly the same thing can happen in every round. Conclusion?

28 Write Lower Bound For a 1W1R binary safe register (the weakest meaningful object type) constructed from any base object type: if n ≤ 4t and processes can crash, emulating Write in one round is impossible. To emulate Write, there must be at least one base object on which two operations are invoked

29 Two Rounds Save the Day! (v,1) The writer / The reader pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v,1 pw=v,1 w=v,1 pw=v0,0 w=v0,0

30 Writer Fails During Pre-Write The writer / The reader pw=v,1 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 Can return v0 (v,1)

31 Writer Fails During Write The writer / The reader pw=v,1 w=v,1 pw=v,1 w=v0,0 pw=v0,0 w=v0,0 Can wait to hear more v,1 (v,1)

32 Write Never Happened The writer / The reader pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v,1 Can wait to hear more without v,1
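The two-round write in the scenarios above can be simulated with a minimal sketch. It assumes each base object stores a `pw` (pre-write) and a `w` (write) field, that the writer moves to the next round after n−t acknowledgements, and that a crash can occur between any two single-object updates. `BaseObject`, `two_round_write` and `crash_after` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

@dataclass
class BaseObject:
    pw: tuple = ("v0", 0)  # pre-written (value, timestamp)
    w: tuple = ("v0", 0)   # written (value, timestamp)

def two_round_write(objects, value, ts, t, crash_after=None):
    """Pre-write (value, ts) to n-t objects, then write it to n-t objects.

    crash_after: number of single-object updates the writer performs
    before crashing (None means the writer never crashes).
    Returns True if the write completed, False if the writer crashed.
    """
    steps = 0
    for field_name in ("pw", "w"):
        acks = 0
        for obj in objects:
            if crash_after is not None and steps == crash_after:
                return False  # writer crashes here
            setattr(obj, field_name, (value, ts))
            steps += 1
            acks += 1
            if acks == len(objects) - t:
                break  # n-t acks suffice; the remaining object may be slow
    return True

# Writer crashes after three pre-writes and one write (n=4, t=1).
objs = [BaseObject() for _ in range(4)]
two_round_write(objs, "v", 1, t=1, crash_after=4)
print([(o.pw, o.w) for o in objs])
```

In this sketch, whenever some object holds (v,1) in `w`, n−t objects already hold it in `pw`, which is the extra evidence a reader can use to tell a started write from a fabricated one.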

33 Can Read Always Complete in One Round?

34 Overlapping Write Scenario The writer / The reader pw=v,1 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 (v,1) pw=v,1 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v,1 (v0,0) (v,1) Cannot wait

35 Solution? More read rounds By the next round, all objects responding to the Read will have seen the first phase of the Write –By causality But what if there are additional overlapping writes? –Safe register implementation can deduce in the next round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value

36 Determining Value to Return Read values (from w field) are candidates to return If there are 2t+1 responses without v (in either pw or w) then v is no longer a candidate –Either was not written before Read, or some Write overlaps read If the highest timestamped candidate occurs t+1 times, it can be returned If the candidates set is empty, return any value –There must be an overlapping Write Each round, wait for at least one more response than in previous rounds After min(t+1,f+2) rounds, it will be possible to return
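The rules on this slide can be sketched as a single decision function over the responses gathered so far. This is a rough illustration, not the authors' algorithm: it assumes each response is a (pw, w) pair of (value, timestamp) tuples, and it counts a candidate's occurrences in either field, which the slide does not specify. The name `decide` and the returned tags are hypothetical.

```python
def decide(responses, t, default="any"):
    """Apply the candidate rules to the responses collected so far.

    responses: list of (pw, w) pairs, each field a (value, timestamp) tuple.
    Returns ("return", value) if a value can be returned now,
    or ("wait", None) if the reader needs more responses or another round.
    """
    # Values read from the w field are the candidates.
    candidates = {w for _, w in responses}

    def missing(c):
        # Number of responses that mention c in neither pw nor w.
        return sum(1 for pw, w in responses if c != pw and c != w)

    # A candidate absent from 2t+1 responses is dropped.
    candidates = {c for c in candidates if missing(c) < 2 * t + 1}
    if not candidates:
        # Some write must overlap the read, so a safe register may return anything.
        return ("return", default)

    best = max(candidates, key=lambda c: c[1])
    # Assumption of this sketch: support is counted over pw and w alike.
    support = sum(1 for pw, w in responses if best in (pw, w))
    if support >= t + 1:
        return ("return", best[0])
    return ("wait", None)

V0, V1 = ("v0", 0), ("v", 1)
# "Write never happened": one object fabricates (v,1); three do not mention it.
print(decide([(V0, V0), (V0, V0), (V0, V0), (V1, V1)], t=1))
# Only one report of (v,1) so far and not yet 2t+1 responses without it.
print(decide([(V1, V1), (V0, V0), (V0, V0)], t=1))
```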

37 Summary: Wait-Free Safe Register Construction for n>3t Invokes a bounded number of Read rounds –Optimal round complexity Constant storage Unbounded timestamps Read takes two rounds (after synchronization) in (eventually) synchronous runs

38 Summary: Lower Bounds for n>3t and Fault-Prone Clients 1W1R binary safe register construction: –WRITE: 2 communication rounds are necessary irrespective of base object types Termination: obstruction freedom –READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects Termination: Every write terminates, a read terminates if it eventually runs in isolation (Finite-Writes Termination)

39 A Note on Safe Registers Wait-free safe registers are too weak to be directly useful to applications –They are good for lower bounds Bounded constructions of wait-free atomic registers from safe registers are known, but… –Complicated –Add communication rounds and memory –Too costly to deploy in a distributed setting

40 Registers with Stronger Semantics Can we come up with an efficient, direct construction of a regular/atomic register? Yes, if we are ready to compromise wait freedom

41 Termination Conditions Wait-Freedom: –Every operation must complete within a finite number of steps regardless of steps of other processes Lock-Freedom: –If there are concurrent operations, at least one of them must complete Obstruction-Freedom: –If one process runs alone (without interference) for sufficiently long, it must complete its operation

42 Notes Wait-freedom often hard to achieve Lock-freedom and obstruction-freedom are achieved by simpler algorithms

43 What is the Right Condition? Ultimately, we want to solve consensus –For state machine replication –Want it to be wait-free Alas, impossible in asynchronous systems We need a leader-oracle in order to solve consensus in asynchronous shared memory –ensures that at most one process writes (proposes values), while others only read

44 Reminder: Disk Paxos
    forever()
      if leader (by leader oracle) then
        b ← choose new unique rank
        write ⟨b,v,n,⊥⟩ to xi
        read all xj’s
        if some xj.bal > b then continue
        if all read xj.val’s = ⊥ then v ← my initial value
        else v ← read val with highest num
        n ← b
        write ⟨b,v,n,⊥⟩ to xi
        read all xj’s
        if some xj.bal > b then continue
        write ⟨b,v,n,v⟩ to xi
        return v
      else /* non-leader */
        read x_ld where ld is leader
        if x_ld.dec ≠ ⊥ then return x_ld.dec
    Observation: Only leader writes; others just read. Leader stops writing upon deciding
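One iteration of the leader branch above can be written as a small sequential sketch in Python. It is only an illustration under simplifying assumptions: registers are plain in-memory records, "read all" is an instantaneous scan, ranks passed in by the caller are unique, and `None` stands for ⊥. `Block`, `leader_attempt` and `follower_read` are hypothetical names, not part of the original pseudocode.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    bal: int = 0                 # highest rank this process has used
    val: Optional[str] = None    # current estimate (None plays the role of ⊥)
    num: int = 0                 # rank at which val was adopted
    dec: Optional[str] = None    # decision, once reached

def leader_attempt(i, regs, b, initial):
    """Run one pass of the leader loop for process i with rank b.

    Returns the decided value, or None if a higher rank was observed
    (the 'continue' in the pseudocode).
    """
    v, n = regs[i].val, regs[i].num          # only process i ever writes regs[i]
    regs[i] = Block(b, v, n, None)           # write <b, v, n, ⊥>
    if any(x.bal > b for x in regs):         # read all registers
        return None
    seen = [x for x in regs if x.val is not None]
    v = max(seen, key=lambda x: x.num).val if seen else initial
    n = b
    regs[i] = Block(b, v, n, None)           # write <b, v, n, ⊥>
    if any(x.bal > b for x in regs):
        return None
    regs[i] = Block(b, v, n, v)              # write <b, v, n, v>: decide
    return v

def follower_read(regs, ld):
    """Non-leader branch: return the leader's decision, or None if undecided."""
    return regs[ld].dec

regs = [Block() for _ in range(3)]
print(leader_attempt(0, regs, b=1, initial="a"))
print(leader_attempt(1, regs, b=2, initial="b"))  # adopts the value already chosen
print(follower_read(regs, ld=1))
```

A later leader with a higher rank adopts the value with the highest `num` it reads, which is why the second call returns the first leader's value rather than its own initial value.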

45 What is the Right Condition? (Cont’d) Will Lock-Freedom help? –No, because as long as the leader does not succeed in writing, no decision is possible Is Wait-Freedom really needed?

46 Finite-Writes Termination Finite-Writes Termination (FW-termination) –Every write (by a correct process) eventually returns –Every read (by a correct process) eventually returns unless infinitely many writes are invoked Wait-free consensus with an Ω leader oracle is solvable with FW-terminating regular 1WMR registers

47 Finite-Writes Termination

48 Consensus w/ FW-Terminating Registers We observe that existing wait-free shared-memory failure-detector-based (Ω) consensus algorithms* work with FW-terminating registers –Lo and Hadzilacos’96 –Gafni and Lamport’00 *with small modifications

49 Back to the Register Emulation To implement a regular FW-terminating register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses When overlapping Writes stop occurring, Read eventually returns –Number of rounds unbounded by construction

50 The Complete System

51 Conclusions Optimally resilient R/W register constructions out of Byzantine-fault-prone base objects Lower bounds –2 rounds for write; min(t+1,f+2) rounds for read Matching wait-free safe construction Simple and direct construction of FW-terminating regular registers –Efficient in synchronous runs Wait-free Ω-based consensus is implementable with FW-terminating regular registers
