HP: Hybrid Paxos for WANs
Dan Dobre, Matthias Majuntke, Marco Serafini and Neeraj Suri
Dependable Embedded Systems & SW Group, TU Darmstadt, Germany
EDCC, Valencia
Resilience of Critical Services
- Safety-critical systems require resilience against catastrophic failures.
- State Machine Replication (SMR): n ≥ 2t+1 replicas create the illusion of a single server that never fails; even when a single server stops replying, clients' requests still receive replies.
- Wide-area replication: delays in WANs are large and unpredictable, which calls for a latency-optimal protocol.
Which Consensus Protocol?
- State Machine Replication (SMR)
  - Clients propose commands to the replicas.
  - The replicas agree on a sequence of commands, so they are in a consistent state when executing that command sequence.
  - A consensus protocol is needed.
- Latency-optimal protocols
  - Latency: the number of message delays between the moment a client proposes a command and the moment the command is learned by a learner (to be executed).
- Two protocols by Lamport
  - Classic Paxos (CP): 3 message delays during normal operation (client → leader → acceptors → client); majority quorum for recovery.
  - Fast Paxos (FP): 2 message delays during normal operation (client → acceptors → client); 2 + 4 message delays in the presence of collisions; larger quorum for recovery.
Paxos vs. Fast Paxos Compared
- Latency experiments on PlanetLab: simulation of the CP and FP message patterns over different topologies.
- FP is not always faster than CP; some clients prefer CP, some FP.
- A single crash can change which protocol is faster in a given setting.
Motivation for a Hybrid Protocol
- No clear winner between CP and FP with respect to latency.
- Hybrid protocol: Hybrid Paxos (HP)
  - Runs CP and FP in parallel and chooses the quicker outcome of the two protocols.
  - Implements Generalized Consensus: commuting commands may be chosen in any order.
  - Does not negatively affect throughput: the FP mode is switched off when it is not beneficial.
Outline of the Talk
- Contribution
- System Model
- Background on Paxos and Generalized Consensus
- Hybrid Paxos protocol
- Evaluation
- Discussion
- Conclusion
Contribution
- Hybrid Paxos (HP): CP with an additional "fast mode"
  - Fast learning in the absence of collisions; 3 message delays (as CP) in the presence of collisions.
  - Latency optimal.
  - 2f+1 servers, of which f may crash (optimal).
  - Linear number of messages (optimal).
  - First efficient implementation of Generalized Consensus.
- Experiments using Emulab
  - HP reaches the theoretical minimum of latency.
  - HP does not negatively affect throughput.
System Model
- Distributed system: n servers and any number of clients (which may crash).
- Communication via reliable FIFO channels.
- Crash-stop model: at most a minority of the servers fails (n ≥ 2f+1, where f is the number of crashes).
- Asynchrony; Ω failure detector (eventually outputs the same correct leader everywhere).
- Generalized Consensus operates on command histories.
  - A command history is an equivalence class of command sequences.
  - Sequences c1 and c2 are equivalent iff executing them produces the same outputs and the same state (commuting commands); see the sketch below.
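A minimal Python sketch (mine, not from the paper) of this equivalence notion, using the banking commands from the later evaluation; the command encoding and apply() are illustrative assumptions:

```python
# Illustrative only: two command sequences are treated as equivalent if running
# them from the same initial state yields the same final state and the same
# per-command outputs, which holds exactly when they differ by a reordering of
# commuting commands.

def apply(state, cmd):
    """Apply a banking command to the balance; returns (new_state, output)."""
    op, amount = cmd
    if op == "deposit":
        return state + amount, "ok"
    if op == "withdraw":
        return (state - amount, "ok") if state >= amount else (state, "insufficient funds")
    raise ValueError(op)

def run(state, seq):
    outputs = []
    for cmd in seq:
        state, out = apply(state, cmd)
        outputs.append((cmd, out))
    return state, outputs

# Two deposits commute: both orders give the same state and per-command outputs.
s1, o1 = run(100, [("deposit", 10), ("deposit", 20)])
s2, o2 = run(100, [("deposit", 20), ("deposit", 10)])
assert s1 == s2 and sorted(o1) == sorted(o2)

# Deposit and withdraw need not commute when the balance is low.
s3, o3 = run(5, [("deposit", 10), ("withdraw", 10)])
s4, o4 = run(5, [("withdraw", 10), ("deposit", 10)])
assert (s3, sorted(o3)) != (s4, sorted(o4))
```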
Background on Generalized Consensus
- The protocol operates on a command history = an equivalence class of command sequences.
- Terms on histories (formalized below)
  - Prefix relation ⊑ on histories.
  - glb of histories: largest common prefix ("intersection").
  - lub of histories: smallest common extension ("union").
  - h and h' are compatible iff there exists g with h ⊑ g and h' ⊑ g.
- Definition of Generalized Consensus
  - Consistency: every two learned histories are compatible.
  - Nontriviality: if a history is chosen, then all contained commands have been proposed.
  - Conservatism: if history h is learned, then h was chosen.
  - Progress: if command c is proposed, eventually a history containing c is learned.
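A hedged LaTeX formalization of these terms (notation mine; the paper may use different symbols):

```latex
% \sqsubseteq denotes the prefix relation on command histories.
\begin{align*}
  \mathrm{glb}(h, h') &= \text{largest } g \text{ with } g \sqsubseteq h \text{ and } g \sqsubseteq h'
      && \text{(largest common prefix)}\\
  \mathrm{lub}(h, h') &= \text{smallest } g \text{ with } h \sqsubseteq g \text{ and } h' \sqsubseteq g
      && \text{(smallest common extension)}\\
  h \text{ compatible with } h' &\iff \exists\, g:\ h \sqsubseteq g \,\wedge\, h' \sqsubseteq g
      && \text{(equivalently, } \mathrm{lub}(h, h') \text{ exists)}
\end{align*}
```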
Background on the Paxos Family
- The following holds for CP, FP, and HP:
  - Clients act as proposers and learners; servers act as acceptors.
  - They cooperate to choose a single command history.
  - Acceptors query Ω and elect a leader among themselves; a unique leader is needed for progress only.
- Paxos protocols operate in rounds; each leader is preassigned a set of round numbers.
- Operation modes
  - Recovery, to change rounds (must ensure Consistency).
  - Normal operation.
- Quorums of acceptors
  - CP: any two quorums intersect.
  - FP: requires larger fast quorums; the intersection of a quorum and a fast quorum FQ must be larger than n - |FQ| (worked out below).
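A worked example (my arithmetic, assuming classic majority quorums of size |Q| = n - f) of what this intersection condition implies for fast-quorum sizes:

```latex
% In the worst case |Q \cap FQ| = |Q| + |FQ| - n, so the condition
% |Q \cap FQ| > n - |FQ| becomes:
\[
  |Q| + |FQ| - n \;>\; n - |FQ|
  \quad\Longrightarrow\quad
  |FQ| \;>\; n - \frac{|Q|}{2} \;=\; \frac{n+f}{2}.
\]
% With n = 2f+1: for f = 1 (n = 3) a fast quorum needs all 3 acceptors;
% for f = 2 (n = 5) it needs at least 4 of them, i.e. strictly more than a majority of 3.
```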
CP and FP Message Patterns
[Figure: message pattern diagrams between client (cl), leader (ld), and acceptors (acc)]
- Recovery (all protocols): Phase 1 (messages 1a, 1b), then Phase 2 (messages 2a, 2b).
- Normal operation of CP: propose, 2a, 2b, chosen.
- Normal operation of FP (fast mode): propose, 2bfast, chosen; recovery from a collision uses 1a, 1b, 2a, 2b.
Ideas behind the Message Patterns
- Normal operation of CP
  - The client sends its proposal (a command) to the leader.
  - The leader appends the command to its history and sends the history to the acceptors (2a).
  - The acceptors accept the history as their local history and send it back to the client (2b).
- Normal operation of FP
  - The client sends its proposal directly to the acceptors.
  - The acceptors optimistically append the command to their local fast history.
  - The acceptors send the history back to the client and to the leader (2bfast).
  - Collision recovery is triggered by the leader.
- Recovery, to start a new round (the core of the protocol)
  - Phase 1: initiated by the new leader (1a); the acceptors send their local histories to the leader (1b); the leader determines the chosen history.
  - Phase 2: the leader synchronizes the acceptors to the chosen history (2a); reply to the clients (2b).
(A sketch of the acceptor-side handlers follows.)
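A minimal Python sketch of the acceptor-side normal operation; the message names follow the slides, everything else (state layout, stale-round check) is my simplification:

```python
class Acceptor:
    """Acceptor keeping both a classic and a fast history (simplified)."""

    def __init__(self, acceptor_id):
        self.acceptor_id = acceptor_id
        self.round = 0        # highest round joined so far
        self.classic = []     # classic history, dictated by the leader (2a)
        self.fast = []        # fast history, appended optimistically

    def on_2a(self, rnd, history):
        """CP normal operation: adopt the leader's history as classic history."""
        if rnd < self.round:
            return None                                     # stale round, ignore
        self.round = rnd
        self.classic = list(history)
        return ("2b", self.acceptor_id, rnd, self.classic)  # back to the client

    def on_fast_propose(self, cmd):
        """FP normal operation: optimistically append the client's command."""
        if cmd not in self.fast:
            self.fast.append(cmd)
        return ("2bfast", self.acceptor_id, self.round, list(self.fast))
```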
Combining the Two Protocols
[Figure: HP message pattern, running propose, 2a, 2b and propose, 2bfast in parallel until a history is chosen]
- Execute the CP and FP message patterns in parallel: HP is CP with an additional FP mode.
- Acceptors locally maintain a fast and a classic history:
  - The history received from the leader becomes the classic history.
  - Commands received from clients are appended to the fast history.
- This is not a naïve combination. Clients learn either by receiving
  - a quorum of equal 2b messages (learn), or
  - a fast quorum of equal 2bfast messages plus one 2b message (hybrid learn).
- (Also needed in FP for speculative execution.)
(See the learning-rule sketch below.)
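A minimal Python sketch (mine) of the two learning rules; I additionally require the confirming 2b history to equal the fast history, which fits the recovery argument on the next slide but is my reading of the rule:

```python
from collections import Counter

def classic_learn(msgs_2b, quorum_size):
    """Return a history carried by at least quorum_size equal 2b messages, if any."""
    counts = Counter(tuple(h) for h in msgs_2b)
    for hist, count in counts.items():
        if count >= quorum_size:
            return list(hist)
    return None

def hybrid_learn(msgs_2bfast, msgs_2b, fast_quorum_size):
    """Return a history carried by a fast quorum of equal 2bfast messages
    and confirmed by at least one equal 2b message, if any."""
    counts = Counter(tuple(h) for h in msgs_2bfast)
    classic = {tuple(h) for h in msgs_2b}
    for hist, count in counts.items():
        if count >= fast_quorum_size and hist in classic:
            return list(hist)
    return None

# Example with n = 3, f = 1: classic quorum = 2, fast quorum = 3.
assert classic_learn([["a"], ["a"]], quorum_size=2) == ["a"]
assert hybrid_learn([["a"], ["a"], ["a"]], [["a"]], fast_quorum_size=3) == ["a"]
```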
Hybrid Recovery
- Same message pattern as CP recovery; acceptors maintain separate histories (classic history and fast history).
- The leader performs CP-like and FP-like recoveries in parallel:
  - It determines history fh from the FP recovery and history h from the CP recovery.
- Problem: h and fh might be incompatible (no common extension); neither h nor fh alone is sufficient, and the goal is (roughly) the lub of h and fh.
  - Determine the largest prefix pfh of fh that is compatible with h.
  - Pick the lub of pfh and h (their smallest common extension).
- Why is this correct (sufficient for Consistency)?
  - To show: any history lh learned by hybrid learn is a prefix of pfh.
  - Since lh ⊑ fh, and all prefixes of fh compatible with h are prefixes of pfh, it suffices to show that lh is compatible with h.
  - By hybrid learning, some acceptor holds lh as its classic history, so both lh and h have been sent by a leader; hence lh and h are compatible.
(See the recovery-merge sketch below.)
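A minimal Python sketch (my simplification) of the recovery merge; histories are plain command sequences here, so compatibility degenerates to the prefix relation, whereas real HP histories also identify reorderings of commuting commands:

```python
def is_prefix(p, h):
    return len(p) <= len(h) and h[:len(p)] == p

def compatible(h1, h2):
    # For plain sequences, a common extension exists iff one extends the other.
    return is_prefix(h1, h2) or is_prefix(h2, h1)

def lub(h1, h2):
    # Smallest common extension; for compatible sequences this is the longer one.
    assert compatible(h1, h2)
    return h1 if len(h1) >= len(h2) else h2

def hybrid_recovery_merge(h, fh):
    # Largest prefix pfh of the fast history fh that is compatible with h ...
    pfh = fh
    while not compatible(pfh, h):
        pfh = pfh[:-1]
    # ... joined with the classic history h.
    return lub(pfh, h)

# fh diverges from h after the common prefix [a, b]; the merge keeps h's suffix.
assert hybrid_recovery_merge(["a", "b", "c"], ["a", "b", "d"]) == ["a", "b", "c"]
# When h is a prefix of fh, the fast suffix survives.
assert hybrid_recovery_merge(["a", "b"], ["a", "b", "c", "d"]) == ["a", "b", "c", "d"]
```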
Implementation Optimizations
- Optimization 1 (message complexity): the leader does not send the entire history to the acceptors in 2a messages; thanks to FIFO channels, sending only the newly appended commands suffices.
- Optimization 2 (execution): implementing the state machine at the servers.
  - Only the leader executes commands (speculatively), which prevents rollbacks at the acceptors; clients receive history digests plus the result. (This also holds for FP.)
- Optimization 3 (latency): diverging fast and classic histories during normal mode prevent hybrid learning, so acceptors periodically align fh to h locally (as in hybrid recovery).
- Optimization 4 (throughput): the FP mode is switched off during high load; the leader monitors the load.
(A small sketch of the delta sending from Optimization 1 follows.)
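A minimal Python sketch (hypothetical, illustrating Optimization 1 under the FIFO assumption) of per-acceptor leader bookkeeping that ships only the commands appended since its last 2a message:

```python
class LeaderLink:
    """Tracks, per acceptor, how much of the classic history was already sent."""

    def __init__(self):
        self.sent_upto = 0      # length of the history prefix already shipped

    def make_2a(self, rnd, history):
        # FIFO delivery guarantees the acceptor already has the earlier prefix,
        # so only the new suffix needs to travel.
        delta = history[self.sent_upto:]
        self.sent_upto = len(history)
        return ("2a", rnd, delta)   # acceptor appends delta to its classic history

link = LeaderLink()
assert link.make_2a(1, ["a", "b"]) == ("2a", 1, ["a", "b"])
assert link.make_2a(1, ["a", "b", "c"]) == ("2a", 1, ["c"])
```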
Evaluation
- Experimental setting: a banking system with two operations, deposit and withdraw.
  - deposit operations commute with each other (exploited by Generalized Consensus), whereas withdraw operations cause collisions.
- Emulab testbed: 20 ms link delay between clients and servers, 100 Mbps links.
- Topology similar to the "Europe" topology from the beginning of the presentation.
- Servers: 600 MHz PCs running Fedora 6.
Latency
[Figure: latency of HP for varying withdraw rate, i.e. varying probability of collisions]
[Figure: latency vs. throughput, with and without batching]
Throughput
[Figure: throughput for an increasing number of clients]
[Figure: throughput for increasing f (number of tolerated failures)]
Related Work
- [Lamport, ACM TOCS 1998] The Part-Time Parliament
- [Lamport, Distributed Computing 2006] Fast Paxos
- [Lamport, MSR Tech Report 2005] Generalized Consensus and Paxos
- [Dobre, Suri, DSN 2006] One-step Consensus with Zero-degradation
- [Charron-Bost, Schiper, PRDC 2006] Improving Fast Paxos: Being Optimal with no Overhead
  - Minimum latency of FP and CP, but only in failure-free runs.
- [Camargos, Schmidt, Pedone, NCA 2008] Multicoordinated Agreement Protocols for Higher Availability
  - Improves the availability of CP via multiple leaders; collision resolution required.
- [Zielinski, DISC 2005] Optimistic Generic Broadcast
  - Parallel execution of CP and FP; not resilience optimal; quadratic message complexity.
- [Mao, Junqueira, Marzullo, OSDI 2008] Mencius: Building Efficient Replicated State Machines for WANs
  - Based on CP; partitions consensus instances among several leaders (throughput); each client has a LAN connection to one leader (latency); perfect failure detector needed.
Discussion
- Comparison to CP
  - HP implements CP and is never worse than CP; the FP mode is switched off when the leader is highly loaded.
- Comparison to FP
  - HP and FP both need 2 message delays in the absence of collisions; HP needs 3 and FP needs 6 message delays in the presence of collisions.
  - Experiments: the collision rate grows faster than the server utilization rate. Servers are still underutilized when the hybrid-learning rate falls below 50%, a point at which FP would spend more than 50% of the time recovering from collisions.
- Optimizations: batching is possible and increases throughput by an order of magnitude.
Summary
- HP: Hybrid Paxos
  - Idea: add fast learning to Paxos.
  - A Generalized Consensus protocol.
  - The first protocol with 2 message delays in the absence of collisions and 3 message delays otherwise.
  - Optimal latency, resilience, and number of messages.
- Generalized Consensus is a practical approach to WAN replication.
- HP can outperform state-of-the-art protocols.
- HP is a Generalized Consensus protocol that features minimal latency and maximum throughput in most situations!
Thank you for your attention! Questions?