Clock-RSM: Low-Latency Inter-Datacenter State Machine Replication Using Loosely Synchronized Physical Clocks
Jiaqing Du, Daniele Sciascia, Sameh Elnikety, Willy Zwaenepoel, Fernando Pedone
EPFL, University of Lugano, Microsoft Research
Replicated State Machines (RSM)
Strong consistency
– Execute the same commands in the same order
– Reach the same state from the same initial state
Fault tolerance
– Store data at multiple replicas
– Failure masking / fast failover
Geo-Replication
[Figure: replicas deployed in geographically distributed data centers]
High latency among replicas
Messaging dominates replication latency
Leader-Based Protocols
Order commands by a leader replica
Require extra ordering messages at followers
[Figure: a follower forwards the client request to the leader for ordering, then replication proceeds, before the client reply]
High latency for geo-replication
Clock-RSM
Orders commands using physical clocks
Overlaps ordering and replication
[Figure: between client request and client reply, ordering and replication proceed as a single combined step]
Low latency for geo-replication
Outline
– Clock-RSM
– Comparison with Paxos
– Evaluation
– Conclusion
Properties and Assumptions
Provides linearizability
Tolerates failure of a minority of replicas
Assumptions
– Asynchronous FIFO channels
– Non-Byzantine faults
– Loosely synchronized physical clocks
Protocol Overview
Each replica timestamps client commands with its physical clock: cmd1.ts = Clock(), cmd2.ts = Clock()
[Figure: two clients submit cmd1 and cmd2 at different replicas; after the PrepOK exchange, every replica holds both commands and executes them in timestamp order before replying to the clients]
Major Message Steps
Prep: ask every replica to log a command
PrepOK: tell every replica after logging a command
[Figure: R0 assigns cmd1.ts = 24 and broadcasts Prep; the replicas log cmd1 and broadcast PrepOK; concurrently, R4 assigns cmd2.ts = 23 to another client request; R0 then asks whether cmd1 is committed]
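A minimal sketch of the Prep/PrepOK exchange under these assumptions; the class, the send(dst, msg) transport, and the field names are illustrative, not the paper's implementation, and details such as FIFO delivery, tie-breaking by replica id, and periodic ClockTime messages are omitted.

```python
import time

class Replica:
    def __init__(self, replica_id, peers, network):
        self.replica_id = replica_id
        self.peers = peers          # ids of the other replicas
        self.network = network      # assumed transport with a send(dst, msg) method
        self.log = []               # commands logged at this replica

    def clock_time(self):
        # Loosely synchronized physical clock (e.g. disciplined by NTP).
        return time.time()

    def on_client_request(self, cmd):
        # Timestamp the command with the local physical clock, log it,
        # and ask every other replica to log it too (Prep).
        cmd.ts = self.clock_time()
        cmd.origin = self.replica_id
        self.log.append(cmd)
        for peer in self.peers:
            self.network.send(peer, ("Prep", cmd))

    def on_prep(self, sender, cmd):
        # Log the command, then tell every replica it is logged (PrepOK).
        self.log.append(cmd)
        for peer in self.peers:
            self.network.send(peer, ("PrepOK", cmd.ts, cmd.origin, self.replica_id))
```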
Commit Conditions
A command is committed if
– It is replicated by a majority
– All commands ordered before it are committed
Wait until three conditions hold
– C1: Majority replication
– C2: Stable order
– C3: Prefix replication
C1: Majority Replication
More than half of the replicas log cmd1
[Figure: R0 assigns cmd1.ts = 24 and broadcasts Prep; PrepOKs show cmd1 replicated by R0, R1, R2]
Latency: 1 RTT between R0 and a majority
C2: Stable Order
A replica knows all commands ordered before cmd1
– It has received a greater timestamp from every other replica (carried on Prep, PrepOK, or ClockTime messages)
[Figure: once R0 has seen a timestamp greater than cmd1.ts = 24 from each peer, cmd1 is stable at R0]
Latency: 0.5 RTT between R0 and the farthest peer
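One way to track the stable-order condition, sketched below; StableOrderTracker and its method names are hypothetical, and the check relies on the stated assumptions that channels are FIFO and clocks only move forward.

```python
# Hypothetical tracking of the "stable order" condition (C2): a command with
# timestamp cmd_ts is stable once every other replica has sent a larger timestamp.

class StableOrderTracker:
    def __init__(self, peers):
        # Largest timestamp observed from each peer so far.
        self.max_ts_seen = {peer: float("-inf") for peer in peers}

    def on_message(self, sender, msg_ts):
        # Called for every Prep / PrepOK / ClockTime received from a peer.
        if msg_ts > self.max_ts_seen[sender]:
            self.max_ts_seen[sender] = msg_ts

    def is_stable(self, cmd_ts):
        # cmd_ts is stable if every peer has already sent something newer,
        # so no peer can later introduce a command ordered before it
        # (channels are FIFO and physical clocks are monotonic).
        return all(ts > cmd_ts for ts in self.max_ts_seen.values())
```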
C3: Prefix Replication
All commands ordered before cmd1 are replicated by a majority
[Figure: cmd2 (cmd2.ts = 23) originates at R4 and is replicated by R1, R2, R3; R0 must learn this before committing cmd1 (cmd1.ts = 24)]
Latency: 1 RTT (R4 to a majority, plus that majority to R0)
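Putting C1–C3 together, the commit check for one command might look like the sketch below; acks, tracker, and replicated_upto are illustrative bookkeeping structures, and how a replica learns the per-peer replication progress needed for C3 is simplified away.

```python
# Hypothetical commit predicate combining C1, C2, C3 for one command.

def is_committed(cmd, acks, tracker, replicated_upto, num_replicas):
    """
    cmd.ts          - physical-clock timestamp of the command
    acks            - set of replicas known to have logged cmd (incl. ourselves)
    tracker         - StableOrderTracker from the previous sketch (C2)
    replicated_upto - per-replica timestamp up to which that replica's
                      commands are known to be majority-replicated (C3)
    """
    majority = num_replicas // 2 + 1

    # C1: majority replication of cmd itself.
    c1 = len(acks) >= majority

    # C2: stable order -- no command with a smaller timestamp can still appear.
    c2 = tracker.is_stable(cmd.ts)

    # C3: prefix replication -- every command ordered before cmd
    # (smaller timestamp, from any replica) is already majority-replicated.
    c3 = all(ts >= cmd.ts for ts in replicated_upto.values())

    return c1 and c2 and c3
```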
Overlapping Steps
[Figure: R0 assigns cmd1.ts = 24, logs cmd1, and broadcasts Prep; the PrepOKs and concurrent Preps let majority replication, stable order, and prefix replication complete within the same message exchange, after which R0 replies to the client]
Latency of cmd1: about 1 RTT to the majority
Commit Latency
Step                   Latency
Majority replication   1 RTT (majority1)
Stable order           0.5 RTT (farthest)
Prefix replication     1 RTT (majority2)
Overall latency = MAX{ 1 RTT (majority1), 0.5 RTT (farthest), 1 RTT (majority2) }
If 0.5 RTT (farthest) < 1 RTT (majority), then overall latency ≈ 1 RTT (majority).
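A small worked instance of the formula with made-up RTTs (not measurements from the paper):

```python
# Illustrative RTTs in milliseconds; the numbers are invented for this example.
rtt_majority1 = 80    # RTT to the majority that replicates the command (C1)
rtt_farthest  = 120   # RTT to the farthest peer (C2 costs half of this)
rtt_majority2 = 90    # RTT for learning the prefix is majority-replicated (C3)

overall = max(rtt_majority1, 0.5 * rtt_farthest, rtt_majority2)
print(overall)  # 90.0 ms: 0.5 * 120 = 60 is below both majority RTTs, so they dominate
```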
Topology Examples
[Figure: two example placements of five replicas R0–R4 around the requesting replica R0, showing which replicas form majority1 and which peer is the farthest in each case]
Paxos 1: Multi-Paxos
A single leader orders commands
– Logical clock: 0, 1, 2, 3, ...
[Figure: a follower forwards the client request to the leader R2; the leader sends Prep, collects PrepOKs from a majority, and then sends Commit before the follower can reply to the client]
Latency at followers: 2 RTTs (leader & majority)
Paxos 2: Paxos-bcast
Every replica broadcasts PrepOK
– Trades off message complexity for latency
[Figure: a follower forwards the client request to the leader R2; the leader sends Prep and every replica broadcasts PrepOK, so the follower learns the outcome without a separate Commit message]
Latency at followers: 1.5 RTTs (leader & majority)
Clock-RSM vs. Paxos
With realistic topologies, Clock-RSM has
– Lower latency at Paxos follower replicas
– Similar or slightly higher latency at the Paxos leader

Protocol      Latency
Clock-RSM     All replicas: 1 RTT (majority), if 0.5 RTT (farthest) < 1 RTT (majority)
Paxos-bcast   Leader: 1 RTT (majority); Follower: 1.5 RTTs (leader & majority)
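To make the table concrete, here is a hypothetical follower-side comparison; the RTT values and the decomposition of the 1.5 RTTs into three one-way delays are illustrative assumptions, not numbers from the evaluation.

```python
# Hypothetical RTTs in milliseconds, seen from one follower; invented numbers.
rtt_follower_majority = 80    # follower's RTT to its closest majority
rtt_follower_farthest = 120   # follower's RTT to the farthest peer
rtt_follower_leader   = 60    # follower's RTT to the Paxos leader
rtt_leader_majority   = 80    # leader's RTT to its closest majority

# Clock-RSM: ~1 RTT to the majority when 0.5 RTT (farthest) is smaller.
clock_rsm = max(rtt_follower_majority, 0.5 * rtt_follower_farthest)

# Paxos-bcast at a follower, counted as three one-way delays (~1.5 RTTs):
# forward to the leader, leader's Prep out, PrepOKs back to the follower.
paxos_bcast = 0.5 * (rtt_follower_leader + rtt_leader_majority + rtt_follower_majority)

print(clock_rsm, paxos_bcast)  # 80.0 110.0 -> lower latency at this follower
```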
Experiment Setup
Replicated key-value store
Deployed on Amazon EC2 in five regions:
– California (CA)
– Virginia (VA)
– Ireland (IR)
– Singapore (SG)
– Japan (JP)
Latency (1/2)
All replicas serve client requests
[Figure: commit latency measured at each replica]
Overlapping vs. Separate Steps
[Figure: per-replica latency breakdown for CA, VA (leader), IR, SG, JP]
Clock-RSM latency: max of the three steps
Paxos-bcast latency: sum of the three steps
Latency (2/2)
The Paxos leader is changed to CA
[Figure: commit latency measured at each replica with the leader at CA]
Throughput
Five replicas on a local cluster
Message batching is key
Also in the Paper
– A reconfiguration protocol
– Comparison with Mencius
– Latency analysis of the protocols
Conclusion
Clock-RSM: low-latency geo-replication
– Uses loosely synchronized physical clocks
– Overlaps ordering and replication
Leader-based protocols can incur high latency