1 UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department
CS 739 Distributed Systems
Andrea C. Arpaci-Dusseau
Fail-Stop Processors
"Byzantine Generals in Action: Implementing Fail-Stop Processors," Fred Schneider, TOCS, May 1984
Example usage of byzantine agreement
Why fail-stop processors can simplify replicated services
Why fail-stop processors are expensive (impractical?) to build
Remaining time: Byzantine Werewolves (improved?)

2 Motivation
Goal: Build systems that continue to work in the presence of component failure
Difficulty/cost of building those systems depends upon how components can fail
Fail-stop components make building reliable systems easier than components with byzantine failures

3 Fail-Stop Processors
What is a failure?
Output (or behavior) that is inconsistent with specification
What is a Byzantine failure?
Arbitrary, even malicious, behavior
Components may collude with each other
Cannot necessarily detect output is faulty
What is a fail-stop processor?
Halts instead of performing erroneous transformations
Others can detect halted state
Others can access uncorrupted stable storage even after failure

4 Questions to Answer
1) What are the advantages of fail-stop processors?
2) Real processors are not fail-stop
Can we build one? How can we build an approximation of one?
3) Approximations of fail-stop processors are expensive to build
Under what circumstances is a replicated service with fail-stop processors "better"?

5 1) Distributed State Machine
Common approach for building a reliable system
Idea: Replicate (possibly faulty) servers; coordinate client interactions with the replicas
[Diagram: a client sends an input sequence through byzantine agreement to state-machine replicas (R, R, R); their outputs are combined into a single output]
t-fault tolerant: Satisfies specification as long as no more than t components fail
Failure model of components determines how many replicas, R, are needed and their interactions

6 How to build a t-fault tolerant state machine?
Inputs
Key: All replicas receive and process the same sequence of inputs
1) Agreement: Every nonfaulty replica receives the same request (interactive consistency or byzantine agreement)
2) Ordering: Every nonfaulty replica processes requests in the same order (logical clocks)
Outputs (combining sketched below):
                      Byzantine   Fail-Stop
Combine output?       majority    any
Number of replicas?   2t+1        t+1
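
A minimal sketch (not from the paper) of how a client could combine replica outputs under the two failure models in the table above. It assumes `outputs` holds one response per replica, and that a halted fail-stop replica is reported as None; the function names are ours.

```python
from collections import Counter

def combine_byzantine(outputs, t):
    """2t+1 replicas, at most t Byzantine faults: the majority value is correct."""
    value, count = Counter(outputs).most_common(1)[0]
    assert count >= t + 1, "no majority: more than t replicas must have failed"
    return value

def combine_fail_stop(outputs):
    """t+1 fail-stop replicas: any response that arrives is correct, because a
    failed replica halts instead of answering incorrectly."""
    for out in outputs:
        if out is not None:      # None marks a replica known to have halted
            return out
    raise RuntimeError("all t+1 replicas halted: more than t failures")
```

For example, combine_byzantine(['ok', 'ok', 'bad'], t=1) returns 'ok', and combine_fail_stop([None, 'ok']) returns 'ok' even after the first replica halts.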

7 2) Building a Fail-Stop Processor
Must provide stable storage
Volatile: Lost on failure
Stable: Not affected (lost or corrupted) by failure; can be read by any processor
Benefit: Recover work of failed process
Drawback: Minimize interactions since slow
Can only build an approximation of a fail-stop processor
Finite hardware -> a finite number of failures could disable all error-detection hardware
k-fail-stop processor: behaves fail-stop unless k+1 or more failures occur

8 Implementation of k-FSP: Overview
Two components:
k+1 p-processes (program)
2k+1 s-processes (storage)
Each process runs on its own processor, all connected by a network
P-processes (k+1)
Each runs the program for the state machine
Interacts with s-processes to read and write data
If any fail (if any disagreement), then all STOP
Cannot necessarily detect k+1 failures
S-processes (2k+1)
Each replicates the contents of stable storage for this FSP
Provides reliable data with up to k failures (cannot just stop)
Detects disagreements/failures across p-processes. How???
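
As a quick sanity check on the counts above (the helper name is ours, not Schneider's):

```python
def kfsp_sizes(k):
    """Process counts for one k-fail-stop processor, per the overview above."""
    return {"p_processes": k + 1, "s_processes": 2 * k + 1}

print(kfsp_sizes(2))   # {'p_processes': 3, 's_processes': 5}: the k=2 example used later
```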

9 Interactive Consistency Requirements
IC1. If the p-process is nonfaulty, then every nonfaulty s-process receives its request within δ seconds (as measured on the s-process clock)
IC2. Nonfaulty s-processes in the same k-FSP agree on every request from p-process j
S-processes must agree even when p-process j is faulty
To provide IC1 and IC2: Assuming the sender of messages can be authenticated, use the signed message (SM) protocol for byzantine agreement (sketched below)
Need just k+1 processes for agreement
IC3. For each k-FSP, the clocks of all p-processes are synchronized
All nonfaulty p-processes must send requests to the s-processes at the same time
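
A simplified sketch of the relay rule in a signed-message agreement protocol of the kind the slide invokes. The class and helper names are illustrative, real digital signatures and the message transport are stubbed out, and this is a sketch of the general SM(m) idea rather than the exact code from the paper.

```python
def verify(value, sig_chain):
    # Placeholder: a real implementation would check one signature per id in
    # sig_chain over `value`; here every message is accepted.
    return True

class SignedMessageReceiver:
    """One receiver in SM(m): accept a value only with a valid signature chain,
    record it, and relay it with this receiver's signature appended."""
    def __init__(self, my_id, all_ids, m):
        self.my_id, self.all_ids, self.m = my_id, all_ids, m
        self.values = set()                    # V_i: values seen so far

    def handle(self, value, sig_chain, send):
        """sig_chain = [sender, relay1, ...]; send(dest, value, chain) stubs the network."""
        if not verify(value, sig_chain) or value in self.values:
            return
        self.values.add(value)
        if len(sig_chain) <= self.m:           # relay until the chain carries m+1 signatures
            for dest in self.all_ids:
                if dest != self.my_id and dest not in sig_chain:
                    send(dest, value, sig_chain + [self.my_id])

    def decide(self):
        # All nonfaulty receivers end with the same set V_i and apply the same
        # deterministic rule; more than one value exposes a faulty sender.
        return next(iter(self.values)) if len(self.values) == 1 else "SENDER_FAULTY"
```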

10 FSP Algorithm Details: Writes
Each p-process, on a write:
Broadcast write to all s-processes
Byzantine agreement across all s-processes (all s-processes must agree on the same input value from a particular p-process)
Each s-process, on a write (Fig 1; sketched below):
Ensure each p-process writes the same value and is received within the time bound
Initial code: Handle messages only after at least time δ has transpired since receipt (every s-process should have received them by then)
If write requests are received from all k+1 p-processes (|M| = k+1), then update the value in stable storage
If not, then halt all p-processes:
Set the failed variable to true
Do not allow future writes
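
A sketch of the s-process write check summarized above (our own class, not the Fig. 1 code): after the δ-second window, and after agreement on the request set M, commit only if all k+1 p-processes sent the same write; otherwise mark the FSP failed and refuse further writes.

```python
class SProcess:
    """One storage replica of a k-fail-stop processor (illustrative sketch)."""
    def __init__(self, k):
        self.k = k
        self.stable = {}          # this replica's copy of stable storage
        self.failed = False       # the `failed` flag that p-processes poll

    def handle_write(self, requests):
        """requests: dict mapping p-process id -> (var, value), holding every
        write received within the delta-second window, after agreement on it."""
        if self.failed:
            return                               # no writes after the FSP halts
        writes = set(requests.values())
        if len(requests) == self.k + 1 and len(writes) == 1:
            var, value = writes.pop()
            self.stable[var] = value             # all k+1 p-processes sent the same write
        else:
            self.failed = True                   # missing or conflicting write:
                                                 # halt the FSP, forbid future writes
```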

11 FSP Algorithm Details: Reads
Each p-process, on a read:
Broadcast request to all s-processes
Use the result returned by a majority (k+1 out of 2k+1), as sketched below
Can read from other FSPs as well
Useful if an FSP failed and work is being re-balanced
Each p-process, to determine if halted/failed:
Read the failed variable from the s-processes (use majority)
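
A matching sketch of the read path just described (the function name is ours): the p-process broadcasts the read, collects one response per s-process, and keeps the value a majority returns; the failed flag is read the same way. The comment shows how it would combine with the SProcess sketch above.

```python
from collections import Counter

def majority_read(responses, k):
    """responses: one value per s-process reply (2k+1 expected). Return the
    value that a majority (at least k+1) of s-processes agree on."""
    value, count = Counter(responses).most_common(1)[0]
    if count < k + 1:
        raise RuntimeError("no majority: more than k s-processes are faulty")
    return value

# Reading variable `a` and the failed flag both go through the same vote:
#   a      = majority_read([sp.stable.get("a") for sp in s_processes], k)
#   halted = majority_read([sp.failed for sp in s_processes], k)
```

With k=2, two faulty s-processes that answer a=5 are outvoted by the three correct ones; a third faulty s-process could break the majority, which is exactly the limit the example slides below explore.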

12 FSP Example
k=2, SM code: "b=a+1"; How many p and s processes?
[Diagram: p-processes and s-processes; each s-process holds stable-storage cells a = 6, b, and failed = 0]
How do p-processes read a?
1) Broadcast the request to each s-process
2) Each s-process responds to the read request
3) Each p-process uses the majority of the responses from the s-processes

13 FSP Example
k=2, SM code: "b=a+1"
[Diagram: p-processes and s-processes with stable-storage cells a, b, failed]
How do p-processes read a?
What if 2 s-processes fail? E.g., they think a=5?
What if 3 s-processes fail?

14 FSP Example
k=2, SM code: "b=a+1"
[Diagram: p-processes and s-processes with stable-storage cells a, b, failed]
How do p-processes write b?
Each p-process j performs byzantine agreement using the signed message protocol SM(2) across the s-processes
Each s-process must agree on what p-process j is doing, even if j is faulty
Each s-process looks at the requests after time δ has elapsed
If it sees the same write from all k+1 p-processes, it performs the write
Otherwise, halt all p-processes; forbid future writes

15 FSP Example
k=2, SM code: "b=a+1"
[Diagram: p-processes and s-processes with stable-storage cells a, b, failed]
How do p-processes write b?
What if 1 p-process (or the network) is very slow?
What if 1 p-process gives an incorrect request to all s-processes?
What if 1 p-process gives an incorrect request to only some?
Byzantine agreement catches this: all s-processes agree that the p-process is faulty (giving different requests) and agree to treat it the same way
When they see it doesn't agree with the other p-processes, they will halt
What if 3 p-processes give a bad result?

16 3) Higher-Level Example
Goal: Service handling k faults; N nodes for performance
Solution: Use N+k k-fail-stop processors
Example: N=2, k=3
[Diagram: five fail-stop processors FSP0..FSP4, each with its own stable storage SS0..SS4]
What happens if:
3 p-processes in FSP0 fail?
4 p-processes in FSP0 fail?
1 p-process in each of FSP0, FSP1, and FSP2 fails? also in FSP3?
2 p-processes in each of FSP0, FSP1, and FSP2 fail?
1 s-process in SS0 fails? also in SS1, SS2, and SS3?
4 s-processes in SS0 fail?

17 Should we use Fail-Stop Processors?
Metric: Hardware cost for state machines
Fail-stop components:
Worst case (assuming 1 process per processor): (N+k) * [(2k+1) + (k+1)] = (N+k) * (3k+2) processors
Best case (assuming s-processes from different FSPs share the same processors): (N+k)(k+1) + (2k+1) processors
Byzantine components: N * (2k+1) processors
Fail-stop can be better if s-processes share and N > k... (see the arithmetic sketched below)
Metric: Frequency of the byzantine agreement protocol
Fail-stop: on every access to stable storage
Byzantine: on every input read
Probably fewer input reads
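
Plugging numbers into the formulas above (the helper names are ours; the formulas are straight from the slide) shows where the crossover lies:

```python
def fail_stop_worst(N, k):   # one processor per process
    return (N + k) * (3 * k + 2)

def fail_stop_best(N, k):    # s-processes of different FSPs share processors
    return (N + k) * (k + 1) + (2 * k + 1)

def byzantine(N, k):
    return N * (2 * k + 1)

for N, k in [(2, 3), (10, 3)]:
    print(N, k, fail_stop_worst(N, k), fail_stop_best(N, k), byzantine(N, k))
# N=2,  k=3: 55 / 27 / 14 processors -> byzantine is cheaper
# N=10, k=3: 143 / 59 / 70 processors -> fail-stop with sharing wins once N > k
```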

18 Summary
Why build fail-stop components?
Easier for higher layers to model and deal with
Matches assumptions of many distributed protocols
Why not?
Usually more hardware
Usually more agreements needed
Higher levels may be able to cope with "slightly faulty" components
Violates the end-to-end argument
Conclusion: Probably shouldn't assume fail-stop components

19 Byzantine Werewolves
Previous: Too easy for villagers to identify werewolves
Villager A had reliable information that Z was a werewolf
Villager B could validate that A was a villager
Hard for Z to lie that C was a werewolf, because D could have checked C too
Signed protocol: Many could hear what one said
Difficult for werewolves to tell different lies to different people
Have to tell everyone the same thing
New changes to give more advantage to werewolves:
Unknown number of werewolves (1 <= w < N/2)
Night: Werewolves convert multiple villagers to wolves (1 <= v <= w)
Key: Info told by the moderator will then be stale and wrong!
Day: Villagers can vote to lynch multiple victims

20 Byzantine-Werewolf Game Rules
Everyone is secretly assigned as a werewolf or villager
W werewolves; the rest are "seeing" villagers
I am the moderator
Night round (changed order): "Close your eyes"; make noises with one hand to hide activity
For all: "NAME, open your eyes"; "Pick someone to ask about"
Useless for werewolves, but hides their identity...
Point to another player
Moderator signals thumbs up for werewolf, down for villager
"NAME, close your eyes"
"Werewolves, open your eyes": the W werewolves can see who is who
"Werewolves, pick villagers to convert"
Moderator picks a secret number between 1 and W
Werewolves silently agree on villagers by pointing
Moderator taps the converts on the shoulder; they should open their eyes to see the other werewolves
"Werewolves, close your eyes"

21 Rules: Day Time
Day time: "Everyone open your eyes; it's daytime"
Agreement time: Everyone talks and votes on who should be "decommissioned"
Villagers try to decommission werewolves
Werewolves try to trick villagers with bad info
Someone must propose who should be killed
Vote until someone is killed, there are no more proposals, or there is no majority
Werewolves really spread at night, so there is a large incentive to kill as many as possible now
Moderator: Uses majority voting to determine who is decommissioned
"Okay, NAME is dead"
That person is out of the game (can't talk anymore) and shows their card
Repeat the cycle until all werewolves are dead OR werewolves >= villagers

