Replication techniques Primary-backup, RSM, Paxos Jinyang Li

Fault tolerance => replication
How to recover a single node from a power failure?
–Wait for reboot: data is durable, but the service is temporarily unavailable
–Use multiple nodes to provide the service: another node takes over when one fails

Replicated state machine (RSM)
RSM is a general replication method
–Lab 6: apply RSM to the lock service
RSM rules:
–All replicas start in the same initial state
–Every replica applies operations in the same order
–All operations must be deterministic
Result: all replicas end up in the same state
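To make the rules concrete, here is a minimal sketch in Python, assuming a toy lock-service state machine (the Replica class and operation names are illustrative, not Lab 6's actual interface): because both replicas start from the same state and apply the same deterministic ops in the same order, they end in the same state.

```python
# Sketch: replicas that apply the same deterministic ops in the same order
# end up in the same state. Names here are illustrative, not Lab 6's API.
class Replica:
    def __init__(self):
        self.locks = {}             # same initial state on every replica

    def apply(self, op):
        kind, lock_id, client = op  # ops must be deterministic: no clocks, no randomness
        if kind == "acquire" and lock_id not in self.locks:
            self.locks[lock_id] = client
        elif kind == "release" and self.locks.get(lock_id) == client:
            del self.locks[lock_id]

log = [("acquire", "L1", "c1"), ("acquire", "L1", "c2"), ("release", "L1", "c1")]
r1, r2 = Replica(), Replica()
for op in log:                      # same order on both replicas
    r1.apply(op)
    r2.apply(op)
assert r1.locks == r2.locks         # identical final state
```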

RSM
How to maintain a single order in the face of concurrent client requests?
(Figure: without coordination, one replica might apply op A before op B while another applies op B before op A.)

RSM: primary/backup
Primary/backup ensures a single order of ops:
–Primary orders operations
–Backups execute operations in that order
(Figure: clients send op A and op B to the primary, which forwards them to the backup in a single order.)
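A small sketch of the ordering mechanism, with invented class names and an in-process "network": the primary stamps each op with a sequence number, and the backup applies ops strictly in that order even if they arrive out of order.

```python
# Sketch (illustrative names): the primary assigns sequence numbers,
# the backup applies ops strictly in the primary's order.
import heapq

class Backup:
    def __init__(self):
        self.next_seq = 1
        self.pending = []                    # min-heap of (seq, op) not yet applicable
        self.applied = []

    def receive(self, seq, op):
        heapq.heappush(self.pending, (seq, op))
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.pending)
            self.applied.append(ready)       # execute in the primary's order
            self.next_seq += 1

class Primary:
    def __init__(self, backup):
        self.seq = 0
        self.backup = backup

    def submit(self, op):
        self.seq += 1
        self.backup.receive(self.seq, op)    # in reality: send over the network
        return self.seq

b = Backup()
p = Primary(b)
p.submit("op A"); p.submit("op B")
assert b.applied == ["op A", "op B"]
```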

Case study: Hypervisor [Bressoud and Schneider]
Goal: fault-tolerant computing
–Banks, NASA, etc. need it
–CPUs are the components most likely to fail, due to their complexity
Hypervisor: primary/backup replication
–If the primary fails, the backup takes over
–Caveat: assumes failure detection is perfect

Hypervisor replicates at the VM level
Why replicate at the VM level?
–Hardware fault-tolerant machines were big in the 1980s; a software solution is more economical
–Replicating at the O/S level is messy (many interfaces)
–Replicating at the app level requires programmer effort
–Replicating at the VM level has a cleaner interface (and no need to change the O/S or apps)
Primary and backup execute the same sequence of machine instructions

A strawman design
–Two identical machines
–Same initial memory/disk contents
–Start executing on both machines
Will they perform the same computation?

Strawman flaws
To see the same effect, operations must be deterministic
What are deterministic ops?
–ADD, MUL, etc.
–Read time-of-day register, cycle counter, privilege level?
–Read memory? Read disk?
–Interrupt timing?
–External input devices (network, keyboard)?

Hypervisor's architecture
(Figure: primary and backup machines connected to shared devices over a SCSI bus and Ethernet.)
The strawman replicates disks at both machines
–Problem: disks might not behave identically (e.g., fail at different sectors)
The Hypervisor connects devices to both machines
–Only the primary reads/writes the devices; the primary sends read values to the backup
–Only the primary handles interrupts from the hardware; the primary sends interrupts to the backup

Hypervisor executes in epochs
Challenge: interrupts must be delivered at the same point in the instruction stream on both nodes
Strawman: execute one instruction at a time
–Backup waits for the primary to send interrupts at the end of each instruction
–Very slow...
Hypervisor instead executes in epochs, as sketched below
–The CPU hardware interrupts every N instructions (so both nodes stop at the same point)
–The primary delays all interrupts until the end of an epoch
–The primary sends all interrupts to the backup
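The sketch below illustrates the epoch idea only; the epoch length, message format, and helper names are invented. Interrupts that arrive mid-epoch are buffered and are delivered, and forwarded to the backup, only at the epoch boundary, so both replicas see them at the same point in the instruction stream.

```python
# Sketch of epoch-based interrupt delivery (epoch length, message format,
# and helpers are invented; real epochs are far longer than 4 instructions).
EPOCH_LEN = 4

def execute(insn):
    pass                                      # stand-in for running one machine instruction

def deliver(irq):
    print("deliver", irq)                     # stand-in for delivering an interrupt to the O/S

def run_primary(instructions, pending_irqs, send_to_backup):
    buffered = []
    for i, insn in enumerate(instructions, start=1):
        execute(insn)
        buffered.extend(pending_irqs.pop(i, []))           # interrupts raised mid-epoch are delayed
        if i % EPOCH_LEN == 0:
            for irq in buffered:
                deliver(irq)                                # deliver only at the epoch boundary
            send_to_backup(("end", i // EPOCH_LEN, buffered))  # backup delivers the same interrupts
            buffered = []

run_primary(["insn"] * 8, {3: ["disk-irq"]}, send_to_backup=print)
```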

Hypervisor failover
If the primary fails, the backup must take over I/O
Suppose the primary fails during epoch E+1 (sketch below):
–At the end of epoch E, the backup times out waiting for [end, E+1]
–The backup delivers all interrupts buffered during epoch E
–The backup executes epoch E+1
–The backup becomes the primary at epoch E+2
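A sketch of the failover rule under the same simplifications; the return values, timeout handling, and helper names here are invented.

```python
# Sketch of the backup's failover decision at the end of epoch E (names invented).
def on_epoch_end(E, got_end_msg, buffered_interrupts):
    if got_end_msg:                                  # normal case: primary sent [end, E+1]
        return ("stay-backup", E + 1)
    # Timed out: assume the primary failed during epoch E+1.
    for irq in buffered_interrupts:                  # deliver interrupts buffered for epoch E
        print("deliver", irq)
    # Execute epoch E+1 from local state, then take over I/O as primary from E+2 on.
    return ("become-primary-at", E + 2)

print(on_epoch_end(7, got_end_msg=False, buffered_interrupts=["net-irq"]))
```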

Hypervisor failover
The backup does not know whether the primary executed the I/O of epoch E+1
–It relies on the O/S to retry the I/O
Devices need to tolerate repeated ops
–OK for disk reads/writes
–OK for the network (TCP will sort it out)
–What about a keyboard, printer, or ATM cash machine?

Hypervisor implementation
The Hypervisor must trap every non-deterministic instruction
–Time-of-day register
–HP TLB replacement
–HP branch-and-link instruction
–Memory-mapped I/O loads/stores
Performance penalty is reasonable
–About a factor-of-two slowdown
–How would it perform on modern hardware?

Caveats in Hypervisor
The Hypervisor assumes failure detection is perfect
What if the network between the primary and backup fails?
–The primary is still running
–The backup becomes a new primary
–Two primaries at the same time!
Can timeouts detect failures correctly?
–Pings from the backup to the primary may be lost
–Pings from the backup to the primary may be delayed

Paxos: fault-tolerant agreement
Paxos lets all nodes agree on the same value despite node failures, network failures, and delays
Extremely useful:
–e.g., nodes agree that X is the primary
–e.g., nodes agree that Y is the last operation executed

Paxos: general approach
–One (or more) node decides to be the leader
–The leader proposes a value and solicits acceptance from the others
–The leader announces the result, or tries again

Paxos requirements
Correctness (safety):
–All nodes agree on the same value
–The agreed value X was proposed by some node
Fault tolerance:
–If fewer than N/2 nodes fail, the remaining nodes eventually reach agreement w.h.p.
–Liveness is not guaranteed

Why is agreement hard?
–What if more than one node becomes leader simultaneously?
–What if there is a network partition?
–What if a leader crashes in the middle of solicitation?
–What if a leader crashes after deciding but before announcing the result?
–What if a new leader proposes a value different from the already decided value?

Paxos setup
Each node acts as a proposer, an acceptor, and a learner
–The proposer (leader) proposes a value and solicits acceptance from the acceptors
–The leader announces the chosen value to the learners

Strawman
Designate a single node X as the acceptor (e.g., the one with the smallest id)
–Each proposer sends its value to X
–X decides on one of the values
–X announces its decision to all learners
Problem?
–Failure of the single acceptor halts the decision
–Need multiple acceptors!

Strawman 2: multiple acceptors
–Each proposer (leader) proposes its value to all acceptors
–Each acceptor accepts the first proposal it receives and rejects the rest
–If the leader receives positive replies from a majority of acceptors, it chooses its own value
–There is at most one majority, hence only a single value is chosen (see the quorum sketch below)
–The leader sends the chosen value to all learners
Problem:
–What if multiple leaders propose simultaneously, so that no value gains a majority?
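The majority rule works because any two majorities of the same acceptor set overlap, so two different values can never both be chosen; a tiny illustration (the helper function is ours, not part of Paxos):

```python
# Any two majorities of N acceptors intersect in at least one acceptor, so at
# most one value can be accepted by a majority (illustrative check only).
def is_majority(quorum, n_acceptors):
    return len(quorum) > n_acceptors // 2

N = 5
q1 = {"A0", "A1", "A2"}          # acceptors that accepted value v1
q2 = {"A2", "A3", "A4"}          # acceptors that accepted value v2
assert is_majority(q1, N) and is_majority(q2, N)
assert q1 & q2                   # the overlapping acceptor would have had to accept both
```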

Paxos solution
–Proposals are ordered by proposal #
–Each acceptor may accept multiple proposals
–If a proposal with value v is chosen, all higher-numbered proposals have value v

Paxos operation: node state
Each node maintains (as sketched below):
–n_a, v_a: highest proposal # accepted and its corresponding value
–n_h: highest proposal # seen
–my_n: my proposal # in the current Paxos round
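A minimal encoding of this state in Python; representing a proposal # as a (round, node_id) pair that compares lexicographically is our assumption for breaking ties between leaders, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Proposal = Tuple[int, str]           # (round, node_id), compared lexicographically

@dataclass
class PaxosState:
    n_a: Optional[Proposal] = None   # highest proposal # accepted so far
    v_a: object = None               # value accepted together with n_a
    n_h: Optional[Proposal] = None   # highest proposal # seen in any message
    my_n: Optional[Proposal] = None  # my own proposal # in the current round
```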

Paxos operation: three-phase protocol
Phase 1 (Prepare):
–A node decides to be the leader (and to propose)
–The leader chooses my_n > n_h
–The leader sends prepare(my_n) to all nodes
–Upon receiving prepare(n):
  If n < n_h, reply prepare-reject
  Else set n_h = n and reply prepare-ok(n_a, v_a)
  (this node will not accept any proposal lower than n)
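The acceptor's side of Phase 1 as a sketch, reusing the PaxosState class above; the reply tuples are an invented wire format, named after the slides' prepare-ok / reject vocabulary.

```python
# Acceptor handler for prepare(n) -- sketch only.
def on_prepare(state, n):
    if state.n_h is not None and n < state.n_h:
        return ("prepare-reject", state.n_h)
    state.n_h = n                     # promise: never accept a proposal lower than n
    return ("prepare-ok", state.n_a, state.v_a)

s = PaxosState()
print(on_prepare(s, (1, "N1")))       # -> ('prepare-ok', None, None)
```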

Paxos operation
Phase 2 (Accept):
–If the leader gets prepare-ok from a majority:
  V = the non-null value corresponding to the highest n_a received
  If V = null, the leader can pick any V
  Send accept(my_n, V) to all nodes
–If the leader fails to get a majority of prepare-ok:
  Delay and restart Paxos
–Upon receiving accept(n, V):
  If n < n_h, reply with accept-reject
  Else set n_a = n; v_a = V; n_h = n and reply with accept-ok
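The acceptor's side of Phase 2, in the same sketch style and reusing the state class above:

```python
# Acceptor handler for accept(n, V) -- sketch only, reusing PaxosState from above.
def on_accept(state, n, V):
    if state.n_h is not None and n < state.n_h:
        return ("accept-reject", state.n_h)
    state.n_a, state.v_a, state.n_h = n, V, n
    return ("accept-ok",)
```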

Paxos operation
Phase 3 (Decide):
–If the leader gets accept-ok from a majority:
  Send decide(V) to all nodes
–If the leader fails to get accept-ok from a majority:
  Delay and restart Paxos
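Putting the three phases together from the leader's point of view; this sketch calls the acceptor handlers above in-process instead of over a real network, and the back-off/restart logic is omitted.

```python
# One Paxos round as seen by the leader (sketch: in-process acceptors, no RPC).
def run_round(leader_id, rnd, my_value, acceptor_states):
    n = (rnd, leader_id)
    majority = len(acceptor_states) // 2 + 1

    # Phase 1: prepare
    replies = [on_prepare(s, n) for s in acceptor_states]
    oks = [r for r in replies if r[0] == "prepare-ok"]
    if len(oks) < majority:
        return None                                   # delay and restart with a higher n
    accepted = [(na, va) for _, na, va in oks if na is not None]
    V = max(accepted, key=lambda p: p[0])[1] if accepted else my_value

    # Phase 2: accept
    acks = [on_accept(s, n, V) for s in acceptor_states]
    if sum(1 for a in acks if a[0] == "accept-ok") < majority:
        return None                                   # delay and restart
    # Phase 3: decide -- announce the chosen value to all learners
    return V

states = [PaxosState() for _ in range(3)]
print(run_round("N1", 1, "val1", states))             # -> val1
```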

Paxos operation: an example
(Figure: three nodes N0, N1, N2; N1 acts as the leader with proposal # N1:1.)
–N1 sends Prepare(N1:1) to N0 and N2; each sets n_h = N1:1 and replies ok with n_a = v_a = null
–N1 sends Accept(N1:1, val1); N0 and N2 set n_a = N1:1, v_a = val1 and reply ok
–N1 sends Decide(val1) to all nodes

Paxos properties
When is the value V chosen?
1. When the leader receives a majority of prepare-ok and proposes V
2. When a majority of nodes accept V
3. When the leader receives a majority of accept-ok for value V

Understanding Paxos
What if more than one leader is active?
–Suppose two leaders use different proposal numbers, N0:10 and N1:11
–Can both leaders see a majority of prepare-ok?

Understanding Paxos
What if the leader fails while sending accept?
What if a node fails after receiving accept?
–If it doesn't restart …
–If it reboots …
What if a node fails after sending prepare-ok?
–If it reboots …

Using Paxos for RSM
Fault-tolerant RSM requires consistent replica membership
–Membership: the set of nodes in a view, identified by a view id (vid)
–RSM goes through a series of membership changes (views)
–Use Paxos to agree on the membership for a particular vid

Lab5: Using Paxos for RSM
Example sequence of views: vid1: N1 -> vid2: N1,N2 -> vid3: N1,N2,N3 -> vid4: N1,N2
–All nodes start with the static config vid1: N1
–N2 joins: a majority in vid1 (N1) accepts vid2: N1,N2
–N3 joins: a majority in vid2 (N1,N2) accepts vid3: N1,N2,N3
–N3 fails: a majority in vid3 (N1,N2,N3) accepts vid4: N1,N2

Lab5: Using Paxos for RSM
(Figure: N1 and N2 know vid1: N1 and vid2: N1,N2; N3 only knows vid1: N1.)
–N3 joins and sends Prepare(vid2, N3:1)
–N1/N2 reply oldview, vid2 = N1,N2

Lab5: Using Paxos for RSM
(Figure: N1 and N2 know vid1: N1 and vid2: N1,N2.)
–N3 joins, learns vid2: N1,N2, and then sends Prepare(vid3, N3:1) to propose the next view

Lab6: re-configurable RSM
Use RSM to replicate lock_server
The primary in each view assigns a viewstamp to each client request
–A viewstamp is a tuple (vid:seqno)
–e.g. (0:0)(0:1)(0:2)(0:3)(1:0)(1:1)(1:2)(2:0)(2:1)
All replicas execute client requests in viewstamp order
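Since a viewstamp is just a (vid, seqno) pair, ordinary tuple comparison gives exactly the required execution order; a tiny check (illustrative, not the lab's actual types):

```python
# Viewstamps order ops by vid first, then by seqno within the view (sketch).
viewstamps = [(1, 0), (2, 1), (0, 2), (0, 0), (1, 2), (0, 1), (2, 0), (1, 1), (0, 3)]
print(sorted(viewstamps))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1)]
```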

Lab6: Viewstamp replication
To execute an op with viewstamp vs, a replica must have executed all ops < vs
A newly joined replica needs to transfer state so that its state reflects the execution of all ops < vs
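A sketch of both rules with invented names: a replica refuses to execute an op unless it has already executed everything below its viewstamp, and a lagging replica first copies state (and the highest executed viewstamp) from an up-to-date one.

```python
# Sketch (names invented): execute ops strictly in viewstamp order; a newly
# joined replica catches up by copying an up-to-date replica's state.
class VSReplica:
    def __init__(self):
        self.last_vs = (0, -1)          # highest viewstamp executed so far
        self.state = []

    def execute(self, vs, op):
        vid, seq = vs
        last_vid, last_seq = self.last_vs
        in_order = (vid == last_vid and seq == last_seq + 1) or (vid > last_vid and seq == 0)
        if not in_order:
            raise RuntimeError("missing ops below %r; state transfer needed" % (vs,))
        self.state.append(op)           # stand-in for really applying the op
        self.last_vs = vs

    def transfer_from(self, other):
        self.state = list(other.state)  # copy state that reflects all ops < other.last_vs
        self.last_vs = other.last_vs

old, new = VSReplica(), VSReplica()
old.execute((0, 0), "acquire L1"); old.execute((0, 1), "release L1")
new.transfer_from(old)                  # newly joined replica catches up
new.execute((0, 2), "acquire L2")       # now it can execute the next op in order
```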

Lab5: Using Paxos for RSM
(Figure: N1 starts with vid1: N1; N2 joins.)
–N2 joins: prepare, accept, decide, etc. …
–N1 reports its largest viewstamp, myvs: (1:50)
–N1 transfers state to bring N2 up to date through viewstamp (1:50)