S-Paxos: Eliminating the Leader Bottleneck

Martin Biely, Zarko Milosevic, Nuno Santos, André Schiper
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
October 9, 2012

Context: State Machine Replication
Fault tolerance by replication: multiple independent replicas run the service, so that if one fails the others continue providing it. The state of the replicas must remain consistent even in the presence of faults.
Consistency among replicas is ensured by:
  Deterministic service
  Same initial state
  Same sequence of requests
System model:
  Partially synchronous
  Crash-stop (at most 𝑓 crashes)
[Figure: clients, replicated service (one service instance per replica), ordering protocol (Paxos)]
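
The three consistency conditions above can be illustrated with a minimal sketch. The `Replica` class and its `add`/`mul` requests are hypothetical and are not taken from the talk or from JPaxos; they only show that deterministic replicas starting from the same state and applying the same ordered requests end in the same state, and that a different order can diverge.

```python
class Replica:
    """Hypothetical deterministic service holding a single integer."""

    def __init__(self):
        self.value = 1  # same initial state on every replica

    def apply(self, request):
        # A request is an illustrative (operation, argument) pair.
        op, arg = request
        if op == "add":
            self.value += arg
        elif op == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown operation: {op}")
        return self.value


def run_replicas(ordered_requests, n=3):
    """Apply the same ordered request sequence to n replicas; return their states."""
    replicas = [Replica() for _ in range(n)]
    for request in ordered_requests:
        for replica in replicas:
            replica.apply(request)
    return [replica.value for replica in replicas]


# Same order everywhere: all replicas agree.
same_order = run_replicas([("add", 3), ("mul", 2)])
# A replica that saw the requests in another order would end in a different state.
other_order = run_replicas([("mul", 2), ("add", 3)], n=1)
```

Because `add` and `mul` do not commute, the two orders give different results, which is why the ordering protocol has to fix a single sequence for all replicas.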

The Paxos Protocol
Paxos is a leader-based protocol: a distinguished process (the leader) coordinates the others (the followers).
A leader-election oracle elects a leader. Once elected, the leader runs Phase 1, in which it prepares to order requests. The purpose of this phase is for the leader to learn of any previous decisions. Once Phase 1 completes, the leader is ready to order client requests, which clients send directly to the leader. To order them, it executes several Phase 2 instances, each ordering one or more requests. When the leader has a request that it wants to order, it assigns it a tentative position and sends a Phase 2a message to all replicas, containing the chosen position and the request. In the good case, the other replicas reply with a Phase 2b message accepting the order, and once the leader has received a majority of replies it decides the order. The request can then be executed and the reply sent to the client.
Observation: the leader receives and sends more messages than the followers, which makes it a potential system bottleneck.
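
To make the observation concrete, here is a rough per-instance message count under a simplified model that is an assumption of this sketch, not a measurement from the talk: one client request per Phase 2 instance, Phase 2b replies sent only to the leader, and no batching. The function name `phase2_message_counts` is hypothetical.

```python
def phase2_message_counts(n):
    """Count messages handled per replica for one Phase 2 instance.

    Simplified model (assumption): the leader receives one client request,
    sends Phase 2a to the n-1 followers, receives their Phase 2b replies,
    and sends one reply to the client.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    followers = n - 1
    leader = {
        "received": 1 + followers,  # client request + Phase 2b replies
        "sent": followers + 1,      # Phase 2a messages + client reply
    }
    follower = {
        "received": 1,  # Phase 2a from the leader
        "sent": 1,      # Phase 2b to the leader
    }
    return leader, follower


leader, follower = phase2_message_counts(3)
```

In this model the leader handles 2n messages per instance while each follower handles 2, so the gap grows with the number of replicas. The Phase 2a messages also carry the full request, which the count alone does not show.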

Paxos Performance
Experimental settings:
  JPaxos, an implementation of Paxos in Java (the protocol shown previously)
  n=3, request size = 20 bytes, CPU: 2x2 cores @ 2.2 GHz
The bottleneck in Paxos is typically the leader.

Paxos is Leader-centric
Leader-centric protocol:
  The leader does considerably more work than the followers
  Therefore, the leader is prone to being the system bottleneck
Paxos and most leader-based protocols are also leader-centric.

Leader-based vs Leader-centric
Note that leader-based ≠ leader-centric:
  Leader-based: an algorithmic concept, the leader is a distinguished process
  Leader-centric: a matter of resource usage, the leader is a bottleneck
The problem is unbalanced resource usage.
Question: must leader-based protocols like Paxos also be leader-centric?

S-Paxos Overview
Leader-based but not leader-centric
(Speaker note: say that S stands for scalable, or just omit it.)

Why Paxos is Leader-centric
The leader does the following:
  Receives requests from clients
  Coordinates the protocol to order requests
  Replies to clients
Followers do much less:
  Receive client requests from the leader
  Acknowledge the order proposed by the leader
Underlying problem: unbalanced resource utilization
  The leader runs out of resources (CPU, network bandwidth)
  While the followers are lightly loaded
(Speaker note: mention the semantics of the arrows.)

S-Paxos: A Balanced Paxos Variant
S-Paxos balances the workload across replicas:
  Leader and followers have similar resource usage
  The full resources of all replicas become available to the ordering protocol
S-Paxos is leader-based but not leader-centric.
It combines several well-known ideas in a novel way:
  All replicas handle client communication
  All replicas disseminate requests
  Ordering is done on IDs

S-Paxos key ideas: Distribute client communication
All replicas handle client communication.
  Commonly used in practice, for instance in ZooKeeper
But by itself, this is still leader-centric:
  The leader runs the ordering protocol on requests (Phase 2a messages of Paxos)
  Followers have to forward requests to the leader
  And hence, the leader sends the requests to the other followers

S-Paxos key ideas: Distribute request dissemination
Note that Phase 2a messages have a dual purpose:
  Dissemination of requests
  Establishing order
S-Paxos separates dissemination from ordering:
  All replicas disseminate requests
  Ordering is performed on IDs

S-Paxos Architecture and Data Flow

S-Paxos balances work among replicas
Client communication and request dissemination are usually the bulk of the load.
  In S-Paxos this task is performed by all replicas
The leader still has to coordinate the ordering protocol.
  But IDs are small messages
  So the leader has minimal additional overhead
Batching: a single instance of the ordering protocol can potentially order many client requests, tens or even hundreds.
Two levels of batching further reduce the load on the leader:
  Dissemination layer: batch client requests and use the ordering layer to order the IDs of batches
  Ordering layer: usual Paxos batching, in this case batches of batch IDs
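
The two levels of batching can be sketched as follows. This is an illustrative model, not the JPaxos or S-Paxos code: the helper names are hypothetical, the greedy packing policy is an assumption, and so is the 8-byte size of a batch ID. The 20-byte requests and the 1450-byte and 50-byte batch limits are the values reported in the evaluation settings of this talk.

```python
def make_batches(items, size_of, max_bytes):
    """Greedily pack items into batches whose total size stays within max_bytes."""
    batches, current, used = [], [], 0
    for item in items:
        size = size_of(item)
        # Close the current batch if adding this item would exceed the limit.
        if current and used + size > max_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += size
    if current:
        batches.append(current)
    return batches


REQUEST_BYTES = 20        # request size from the evaluation settings
DISSEMINATION_MAX = 1450  # dissemination-layer batch size from the evaluation
ORDERING_MAX = 50         # ordering-layer batch size from the evaluation
ID_BYTES = 8              # assumption: size of one batch ID


def two_level_batching(num_requests):
    """Return (request batches, batches of batch IDs) for num_requests requests."""
    requests = [bytes(REQUEST_BYTES) for _ in range(num_requests)]
    # Level 1 (dissemination layer): batch client requests.
    request_batches = make_batches(requests, len, DISSEMINATION_MAX)
    # Each request batch is identified by a small ID (here simply its index).
    batch_ids = list(range(len(request_batches)))
    # Level 2 (ordering layer): batch the IDs, as in usual Paxos batching.
    id_batches = make_batches(batch_ids, lambda _id: ID_BYTES, ORDERING_MAX)
    return request_batches, id_batches


request_batches, id_batches = two_level_batching(1000)
```

Under these assumed sizes, 1000 requests are packed into 14 dissemination batches of at most 72 requests each, and their 14 IDs fit into 3 ordering-layer batches, so the leader coordinates 3 ordering instances for 1000 requests.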

Benefits in the presence of faults
Faster view change:
  Since IDs are small, Phase 1 of Paxos completes quickly
Failures affecting the leader (the leader crashes or its links start losing messages) have less impact on throughput:
  The ordering protocol is interrupted, but the dissemination protocol continues among working replicas
  When a correct leader emerges, it can quickly order the IDs of the requests that were disseminated while there was no leader

Dissemination Layer Protocol
The ordering layer uses Paxos with only minor modifications, so I will focus instead on the dissemination layer.

Dissemination Layer Overview
Dissemination layer tasks:
  Receive requests from clients
  Disseminate requests and IDs to all replicas
  Initiate ordering of IDs
  Execute requests in the order established for IDs
Challenges (contrary to conventional Paxos, IDs and requests are handled separately, which creates a few new challenges):
  Once an ID is decided, the corresponding request must remain available in the system
  Coordinate view change between the ordering and dissemination layers to ensure that IDs are ordered once and only once

Overview of the Protocol
Disseminating requests:
  Optimistic implementation of reliable broadcast
  When a replica receives a request from a client, it broadcasts <request,ID>
  Replicas acknowledge reception of forwarded requests by broadcasting <Ack,ID>
Proposing IDs:
  The leader proposes an ID once the corresponding request is stable
  That is, when it has received 𝑓+1 acknowledgements for the ID
Executing requests:
  A replica must have both the request and the decision for the corresponding ID
  If the ID is decided before the request is received, poll other replicas for the request after a small delay
  The request is stable, so at least one correct replica has the request
(Speaker notes: explain why the broadcast of requests is done unreliably: efficiency. If request does not become stable)
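
The stability rule and the execution condition above can be sketched as a small bookkeeping class. This is a hedged illustration only: `DisseminationState` and its method names are hypothetical, networking, timers and view changes are left out, and counting acknowledgements from distinct senders is an assumption about how the 𝑓+1 threshold is applied.

```python
class DisseminationState:
    """Hypothetical per-replica bookkeeping for the dissemination layer."""

    def __init__(self, f):
        self.f = f          # maximum number of crashes tolerated
        self.requests = {}  # request ID -> request payload received so far
        self.acks = {}      # request ID -> set of replicas that sent <Ack,ID>
        self.decided = []   # request IDs in the order decided by the ordering layer

    def on_request(self, req_id, payload):
        """Record a <request,ID> broadcast (or a request obtained by polling)."""
        self.requests[req_id] = payload

    def on_ack(self, req_id, sender):
        """Record an <Ack,ID>; acks from the same sender count only once."""
        self.acks.setdefault(req_id, set()).add(sender)

    def is_stable(self, req_id):
        """A request is stable once f+1 acknowledgements for its ID have arrived."""
        return len(self.acks.get(req_id, ())) >= self.f + 1

    def on_decide(self, req_id):
        """Record that the ordering layer decided this ID."""
        self.decided.append(req_id)

    def next_action(self, req_id):
        """Return what the replica can do for this ID: wait, execute, or poll."""
        if req_id not in self.decided:
            return "wait"     # no decision yet for this ID
        if req_id in self.requests:
            return "execute"  # both the request and the decision are available
        return "poll"         # decided but request missing: ask other replicas


state = DisseminationState(f=1)
state.on_ack("id-1", "replica-A")
state.on_ack("id-1", "replica-B")
```

With 𝑓+1 acknowledgements and at most 𝑓 crashes, at least one of the acknowledging replicas is correct, which is why polling for a stable request eventually succeeds.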

Performance Evaluation

Performance Evaluation: Setup
S-Paxos is implemented on top of JPaxos, a Java implementation of Paxos.
Experiments compare:
  JPaxos (leader-centric)
  S-Paxos (not leader-centric)
Testbed: Grid 5000 (helios cluster)
  CPU: 2x2 cores @ 2.2 GHz
  Network: 1 Gbit Ethernet
Experimental parameters:
  Request size: 20 bytes
  Batch size:
    S-Paxos: dissemination layer 1450 bytes, ordering layer 50 bytes
    JPaxos: 1450 bytes
  Null service
(Speaker note: mention the request size and the null service. Do not mention hardware details.)

Load Distribution: Average CPU utilization
[Figure panels: JPaxos, S-Paxos]

Performance with Increasing Number of Clients (n=3)
[Figure panels: Throughput, Response time]

Scalability
[Figure: Throughput]
Resource-intensive tasks are distributed.

Throughput with crashes
Request size: 1 KB, batch size: 8 KB, 𝑛=5
[Figure annotation: crash of the leader]

False suspicions
The leader is (wrongly) suspected every 10 seconds.

Conclusion
A leader-based protocol does not need to be leader-centric.
S-Paxos balances the workload across replicas.
Benefits:
  Better performance for the same number of replicas
  Better scalability with the number of replicas
  Better performance in the presence of faults

Additional slides

Dissemination Layer Protocol: Discussion
Broadcast of <request,ID>: best effort, no retransmission
  Avoids the cost of reliable broadcast on requests
  Recovering from partial delivery (message loss or crashes):
    Request does not become stable: the client times out and retransmits
    Request becomes stable: after the ID is decided, replicas poll other replicas for the request
Broadcast of <Ack,ID>: with retransmission
  Ensures that once a request is stable, it will be proposed
  Almost free in practice: acks are small and can be piggybacked on other messages
  Once a request becomes stable, all replicas learn about it, and so the corresponding ID is proposed