View Change Protocols and Reconfiguration

Slides:

Advertisements

Similar presentations

Replication techniques Primary-backup, RSM, Paxos Jinyang Li.

Advertisements

CS 542: Topics in Distributed Systems Diganta Goswami.

CS425 /CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.

CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.

The SMART Way to Migrate Replicated Stateful Services Jacob R. Lorch, Atul Adya, Bill Bolosky, Ronnie Chaiken, John Douceur, Jon Howell Microsoft Research.

Distributed Systems Fall 2010 Replication Fall 20105DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

CS 582 / CMPE 481 Distributed Systems

Fault-tolerance techniques RSM, Paxos Jinyang Li.

EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering.

Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

EEC 688 Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Byzantine fault tolerance

Byzantine Fault Tolerance CS 425: Distributed Systems Fall Material drived from slides by I. Gupta and N.Vaidya.

Byzantine fault-tolerance COMP 413 Fall Overview Models –Synchronous vs. asynchronous systems –Byzantine failure model Secure storage with self-certifying.

From Viewstamped Replication to BFT Barbara Liskov MIT CSAIL November 2007.

Byzantine fault tolerance

Paxos: Agreement for Replicated State Machines Brad Karp UCL Computer Science CS GZ03 / M st, 23 rd October, 2008.

Byzantine Fault Tolerance CS 425: Distributed Systems Fall 2012 Lecture 26 November 29, 2012 Presented By: Imranul Hoque 1.

Fault Tolerant Services

Consensus and leader election Landon Cox February 6, 2015.

EEC 688/788 Secure and Dependable Computing Lecture 15 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Systems Research Barbara Liskov October Replication Goal: provide reliability and availability by storing information at several nodes.

EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Reliable multicast Tolerates process crashes. The additional requirements are: Only correct processes will receive multicasts from all correct processes.

Primary-Backup Replication COS 418: Distributed Systems Lecture 5 Kyle Jamieson.

Consensus 2 Replicated State Machines, RAFT

Primary-Backup Replication

CPS 512 midterm exam #1, 10/7/2016 Your name please: ___________________ NetID:___________ /60 /40 /10.

Replication State Machines via Primary-Backup

COS 418: Distributed Systems

View Change Protocols and Reconfiguration

EECS 498 Introduction to Distributed Systems Fall 2017

Replication and Consistency

EECS 498 Introduction to Distributed Systems Fall 2017

Strong consistency and consensus

Providing Secure Storage on the Internet

EECS 498 Introduction to Distributed Systems Fall 2017

Outline Announcements Fault Tolerance.

Distributed Systems, Consensus and Replicated State Machines

EEC 688/788 Secure and Dependable Computing

Distributed Systems CS

Fault-tolerance techniques RSM, Paxos

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

From Viewstamped Replication to BFT

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

IS 651: Distributed Systems Fault Tolerance

EEC 688/788 Secure and Dependable Computing

Consensus, FLP, and Paxos

Scalable Causal Consistency

Replication State Machines via Primary-Backup

EEC 688/788 Secure and Dependable Computing

EECS 498 Introduction to Distributed Systems Fall 2017

Causal Consistency and Two-Phase Commit

Replication State Machines via Primary-Backup

Our Final Consensus Protocol: RAFT

Replicated state machine and Paxos

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

The SMART Way to Migrate Replicated Stateful Services

EEC 688/788 Secure and Dependable Computing

Distributed Systems CS

Sisi Duan Assistant Professor Information Systems

Presentation transcript:

View Change Protocols and Reconfiguration COS 418: Distributed Systems Lecture 9 Wyatt Lloyd

Housekeeping Midterm finally scheduled Final also scheduled 10/24, 7-9pm, Computer Science 104 Talk to me after if you have a conflict Final also scheduled 1/23, 730pm, Friend Center 101 Assignment 2 due Thursday Where I was last week Global tables in DynamoDB!

Today More primary-backup replication View changes Reconfiguration

Review: Primary-Backup Replication Nominate one replica primary Clients send all requests to primary Primary orders clients’ requests Clients shl Logging Module State Machine Logging Module State Machine Log Log Servers add jmp mov shl add jmp mov shl

From Two to Many Replicas Clients shl Logging Module State Machine Logging Module State Machine Logging Module State Machine Log Log Log Servers add jmp mov shl add jmp mov shl add jmp mov shl Last time: Primary-Backup case study Today: State Machine Replication with many replicas Survive more failures

Intro to “Viewstamped Replication” State Machine Replication for any number of replicas Replica group: Group of 2f + 1 replicas Protocol can tolerate f replica crashes Viewstamped Replication Assumptions: Handles crash failures only Replicas fail only by completely stopping Unreliable network: Messages might be lost, duplicated, delayed, or delivered out-of-order

... Replica State configuration: identities of all 2f + 1 replicas In-memory log with clients’ requests in assigned order ... ⟨op1, args1⟩ ⟨op2, args2⟩ ⟨op3, args3⟩ ⟨op4, args4⟩

Normal Operation Client A (Primary) B C (f = 1) Request Prepare PrepareOK Reply Client Execute A (Primary) B C Time  Primary adds request to end of its log Replicas add requests to their logs in primary’s log order Primary waits for f PrepareOKs  request is committed Discuss why f instead of 2f here provides the fault tolerance!

Normal Operation: Key Points (f = 1) Request Prepare PrepareOK Reply Client Execute A (Primary) B C Time  Protocol provides state machine replication On execute, primary knows request in f + 1 = 2 nodes’ logs Even if f = 1 then crash, ≥ 1 retains request in log

Piggybacked Commits Client A (Primary) B C (f = 1) Request Prepare PrepareOK Reply Client +Commit previous Execute A (Primary) B C Time  Previous Request’s commit piggybacked on current Prepare No client Request after a timeout period? Primary sends Commit message to all backups

The Need For a View Change So far: Works for f failed backup replicas But what if the f failures include a failed primary? All clients’ requests go to the failed primary System halts despite merely f failures

Today More primary-backup replication View changes Reconfiguration With Viewstamped Replication Using a View Server Reconfiguration

Views P P P Let different replicas assume role of primary over time System moves through a sequence of views View = (view number, primary id, backup id, ...) P P View #3 P View #1 View #2

View Change Protocol Backup replicas monitor primary If primary seems faulty (no Prepare/Commit): Backups execute the view change protocol to select new primary View changes execute automatically, rapidly Need to keep clients and replicas in sync: same local state of the current view Same current view at replicas Same current view at clients

Correctly Changing Views View changes happen locally at each replica Old primary executes requests in the old view, new primary executes requests in the new view Want to ensure state machine replication So correctness condition: Executed requests Survive in the new view Retain the same order in the new view

Replica State (for View Change) configuration: sorted identities of all 2f + 1 replicas In-memory log with clients’ requests in assigned order view-number: identifies primary in configuration list status: normal or in a view-change

View Change Protocol B (New Primary) C (f = 1) Start-View-Change view # Do-View-Change log Start-View log (!) ++view # B (New Primary) C Time  B notices A has failed, sends Start-View-Change C replies Do-View-Change to new primary, with its log B waits for f replies, then sends Start-View On receipt of Start-View, C replays log, accepts new ops

View Change Protocol: Correctness (f = 1) Execute A (Old Primary) Start-View-Change view # Do-View-Change log Start-View log B (New Primary) C PrepareOK Time  Executed request, previous view Old primary A must have received one or two PrepareOK replies for that request (why?) Request is in B’s or C’s log (or both): so it will survive into new view

Principle: Quorums (f = 1) et cetera Any group of f + 1 replicas is called a quorum Quorum intersection property: Two quorums in 2f + 1 replicas must intersect in at least one replica

Applying the Quorum Principle Normal Operation: Quorum that processes one request: Q1 ...and 2nd request: Q2 Q1 ∩ Q2 has at least one replica  Second request reads first request’s effects

Applying the Quorum Principle View Change: Quorum processes previous (committed) request: Q1 ...and that processes Start-View-Change: Q2 Q1 ∩ Q2 has at least one replica  View Change contains committed request

Split Brain Client 1 A (Primary) Network partition B (New Primary) C (not all protocol messages shown) Request Request Client 1 Execute Execute A (Primary) Network partition Execute Request Execute Request B (New Primary) Start-View Start-VC C Client 2 What’s undesirable about this sequence of events? Why won’t this ever happen? What happens instead?

Today More primary-backup replication View changes Reconfiguration With Viewstamped Replication Using a View Server Reconfiguration

Would Centralization Simplify Design? A single View Server could decide who is primary Clients and servers depend on view server Don’t decide on their own (might not agree) Goal in designing the View Server: Only one primary at a time for correct state machine replication

View Server Protocol Operation For now, assume View Server never fails Each replica periodically pings the View Server VS declares replica dead if missed N pings in a row VS considers replica alive after a single ping received Problem: Replica can be alive but because of network connectivity, be declared “dead”

View Server: Split Brain (1, S1, S2) S1 View Server S2 (1, S1, S2) (2, S2, −) (2, S2, −) Client

One Possibility: S2 in Old View View Server S2 (1, S1, S2) (2, S2, −) (2, S2, −) (1, S1, S2) (2, S2, −) (1, S1, S2) Client

Also Possible: S2 in New View View Server S2 (1, S1, S2) (2, S2, −) (2, S2, −) (1, S1, S2) (2, S2, −) Client

Split Brain and View Changes Take-away points: Split Brain problem can be avoided both: In a decentralized design (Viewstamped Replication) With centralized control (View Server) But protocol must be designed carefully so that replica state does not diverge

Today More primary-backup replication View changes Reconfiguration

The Need for Reconfiguration What if we want to replace a faulty replica with a different machine? For example, one of the backups may fail permanently What if we want to change the replica group size? Decommission a replica Add another replica (increase f, possibly) Protocol that handles these possibilities is called the reconfiguration protocol

Replica State (for Reconfiguration) configuration: sorted identities of all 2f + 1 replicas In-memory log with clients’ requests in assigned order view-number: identifies primary in configuration list status: normal or in a view-change epoch-number: indexes configurations

Reconfiguration (1) Primary immediately stops accepting new requests Prepare PrepareOK Client new-config A (Primary) B C (remove) D (add) Time  Primary immediately stops accepting new requests

Reconfiguration (2) Primary immediately stops accepting new requests Reply Client new-config Prepare, PrepareOK A (Primary) B C (remove) D (add) Time  Primary immediately stops accepting new requests No up-call to RSM for executing this request

Reconfiguration (3) Primary sends Commit messages to old replicas Reply Client new-config StartEpoch Prepare, PrepareOK A (Primary) B Commit C (remove) D (add) Time  Primary sends Commit messages to old replicas Primary sends StartEpoch message to new replica(s)

Reconfiguration in New Group {A, B, D} Reply EpochStarted Client new-config StartEpoch Prepare, PrepareOK A (Primary) B Commit C (remove) D (add) Time  Update state with new epoch-number Fetch state from old replicas, update log Send EpochStarted msgs to replicas being removed

Reconfiguration at Replaced Replica {C} Reply EpochStarted Client new-config StartEpoch Prepare, PrepareOK A (Primary) B Commit C (remove) D (add) Time  Respond to state transfer requests from others Waits until it receives f’ + 1 EpochStarteds, f’ is fault tolerance of new epoch Send StartEpoch messages to new replicas if they don’t hear EpochStarted (not shown above)

Shutting Down Old Replicas If admin doesn’t wait for reconfiguration to complete and decommissions old nodes, may cause > f failures in old group Can’t shut down replicas on receiving Reply at client Must ensure committed requests survive reconfiguration! Fix: A new type of request CheckEpoch reports the current epoch Goes thru normal request processing (again no upcall) Return indicates reconfiguration is complete Q: Why not have reconfigure wait for this to complete? Return indicates reconfig is complete because the reply to the reconfiguration message told the admin that a quorum of nodes in the old system were starting a new epoch and no longer accepting requests. So the only for a new request to be processed is in the new epoch!

Conclusion: What’s Useful When Backups fail or has network connectivity problems? Minority partitioned from primary? Quorums allow primary to continue Primary fails or has network connectivity problems? Majority partitioned from primary? Rapidly execute view change Replica permanently fails or is removed? Replica added?  Administrator initiates reconfiguration protocol