Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Jun Rao, Eugene J. Shekita, Sandeep Tata
IBM Almaden Research Center
PVLDB, Jan. 2011, Vol. 4, No. 4
Presented by Yongjin Kwon

Outline
- Introduction
- Spinnaker
  – Data Model and API
  – Architecture
- Replication Protocol
  – Leader Election
- Recovery
  – Follower Recovery
  – Leader Takeover
- Experiments
- Conclusion

Introduction
- Cloud computing applications have aggressive requirements:
  – Scalability
  – High and continuous availability
  – Fault tolerance
- The CAP theorem [Brewer 2000] argues that among Consistency, Availability, and Partition tolerance, only two of the three can be provided at once.
- Recent distributed systems such as Dynamo and Cassandra provide high availability and partition tolerance by sacrificing consistency:
  – They guarantee only eventual consistency.
  – Replicas may temporarily diverge into different versions.

Introduction (Cont'd)
- Most applications desire stronger consistency guarantees, e.g. in a single datacenter where network partitions are rare.
- How can consistency be preserved?
  – Two-phase commit: blocks when the coordinator fails.
  – Three-phase commit [Skeen 1981]: seldom used because of poor performance.
  – Paxos: generally perceived as too complex and slow.

Introduction (Cont'd)
- Timeline consistency [Cooper 2008]:
  – Stops short of full serializability.
  – All replicas of a record apply all updates in the same order.
  – At any point in time, a replica holds some version from the record's timeline, possibly a stale one.
- With some modifications to Paxos, it is possible to provide high availability while ensuring at least timeline consistency, with very little loss of performance.
  [Figure: a record's update timeline — Insert → Update → Delete]

Spinnaker
- An experimental datastore:
  – Designed to run on a large cluster of commodity servers in a single datacenter
  – Key-based range partitioning
  – 3-way replication
  – Strong or timeline consistency
  – Paxos-based protocol for replication
  – An example of a CA system

Data Model and API
- Data model
  – Similar to Bigtable and Cassandra.
  – Data is organized into tables of rows; each row is uniquely identified by its key.
  – A row may contain any number of columns, each with a value and a version number.
- API (see the interface sketch below)
  – get(key, colname, consistent)
  – put(key, colname, colvalue)
  – delete(key, colname)
  – conditionalPut(key, colname, value, version)
  – conditionalDelete(key, colname, version)
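The API listed above maps naturally onto a small key-value interface. A minimal Java sketch follows; the type choices (String keys and column names, byte[] values, long versions) are assumptions for illustration, not Spinnaker's actual signatures.

```java
// Hypothetical Java rendering of the client API listed on this slide.
// Types (String keys/columns, byte[] values, long versions) are assumptions.
public interface SpinnakerClient {

    // Read one column of a row. If 'consistent' is true, the read is routed to
    // the cohort's leader (strong consistency); otherwise it may go to any
    // replica (timeline consistency) and can return a slightly stale value.
    byte[] get(String key, String colname, boolean consistent);

    // Insert or update a column of a row.
    void put(String key, String colname, byte[] colvalue);

    // Delete a column of a row.
    void delete(String key, String colname);

    // Conditional variants: succeed only if the column's current version
    // matches the supplied version (optimistic concurrency control).
    boolean conditionalPut(String key, String colname, byte[] value, long version);

    boolean conditionalDelete(String key, String colname, long version);
}
```

The consistent flag on get corresponds to the strong vs. timeline read routing discussed later in the deck.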

Architecture
- System architecture
  – Rows of a table are distributed across the cluster using key-range partitioning.
  – The group of nodes replicating a key range is called a cohort, e.g.:
    – Cohort for [0, 199]: { A, B, C }
    – Cohort for [200, 399]: { B, C, D }
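A minimal sketch of routing a key to its cohort under range partitioning, using the two example ranges above. This is illustration only (integer keys, in-memory map); Spinnaker's real routing metadata is managed through Zookeeper and is more involved.

```java
// Minimal sketch of key-range partitioning into cohorts (illustration only).
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeRouter {
    // Maps the lower bound of each key range to the cohort (list of nodes)
    // responsible for that range.
    private final NavigableMap<Integer, List<String>> cohorts = new TreeMap<>();

    public void addRange(int lowerBound, List<String> cohort) {
        cohorts.put(lowerBound, cohort);
    }

    // Returns the cohort whose range contains the key.
    public List<String> cohortFor(int key) {
        return cohorts.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeRouter router = new RangeRouter();
        router.addRange(0, List.of("A", "B", "C"));     // keys [0, 199]
        router.addRange(200, List.of("B", "C", "D"));   // keys [200, 399]
        System.out.println(router.cohortFor(150));      // [A, B, C]
        System.out.println(router.cohortFor(250));      // [B, C, D]
    }
}
```

TreeMap.floorEntry returns the greatest lower bound, which is exactly the lookup a range-partitioned router needs.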

Architecture (Cont'd)
- Node architecture
  – All components are thread safe.
  – Logging:
    – A single write-ahead log is shared by all cohorts on a node, for performance.
    – Each log record is uniquely identified by an LSN (log sequence number).
    – Each cohort on a node uses its own logical LSNs.
  [Figure: node components — commit queue, memtables, SSTables, logging and local recovery, replication and remote recovery, failure detection / group membership / leader selection]
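The recovery examples later in this deck write LSNs as "1.20" or "2.30". A hedged sketch of such a logical LSN follows, under the assumption that the first component is an epoch that increases with each leader election and the second is a per-cohort sequence number; this interpretation is mine, drawn from those examples.

```java
// Hypothetical logical LSN matching the "1.20", "2.30" style used in the
// recovery examples: an assumed epoch plus a per-cohort sequence number.
public record LogicalLsn(long epoch, long sequence) implements Comparable<LogicalLsn> {

    @Override
    public int compareTo(LogicalLsn other) {
        // Order first by epoch, then by sequence within the epoch.
        int byEpoch = Long.compare(this.epoch, other.epoch);
        return byEpoch != 0 ? byEpoch : Long.compare(this.sequence, other.sequence);
    }

    @Override
    public String toString() {
        return epoch + "." + sequence;   // e.g. "1.20"
    }
}
```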

Replication Protocol
- Each cohort consists of an elected leader and two followers.
- Spinnaker's replication protocol is a modification of basic Multi-Paxos:
  – A shared write-ahead log, with no missing log entries
  – Reliable, in-order messages over TCP sockets
  – A distributed coordination service (Zookeeper) for leader election
- The protocol has two phases:
  – Leader election phase: a leader is chosen among the nodes of the cohort.
  – Quorum phase: the leader proposes a write and the followers accept it.

Replication Protocol (Cont'd)
- Quorum phase
  – A client submits a write W.
  – The leader, in parallel:
    – appends a log record for W and forces it to disk,
    – appends W to its commit queue, and
    – sends a propose message for W to its followers.
  [Figure: the client sends write W to the cohort's leader, which sends "propose W" to its followers]

Replication Protocol (Cont'd)
- Quorum phase
  – After receiving the propose message, each follower:
    – appends a log record for W and forces it to disk,
    – appends W to its commit queue, and
    – sends an ACK to the leader.
  [Figure: the followers in the cohort send ACKs back to the leader]

Replication Protocol (Cont'd)
- Quorum phase
  – Once the leader gets an ACK from at least one follower (together with the leader itself, a majority of the three replicas), it:
    – applies W to its memtable, effectively committing W, and
    – sends a response to the client.
  – There is no separate commit record that needs to be logged.
  [Figure: W is committed at the leader]

Replication Protocol (Cont'd)
- Quorum phase
  – Periodically, the leader sends an asynchronous commit message carrying an LSN to the followers, asking them to apply all pending writes up to that LSN to their memtables.
  – For recovery, the leader and the followers save this LSN, referred to as the last committed LSN.
  [Figure: the leader periodically sends "commit up to LSN" to the followers]
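Pulling the quorum-phase slides together, here is a hedged, single-threaded sketch of the leader-side write path: force the log, enqueue the write, propose to the followers, and commit once any one follower ACKs. All the collaborator types below (Write, Log, CommitQueue, Memtable, FollowerChannel) are hypothetical stand-ins, not Spinnaker's actual classes.

```java
// Minimal sketch of the quorum phase on the leader, under the assumptions in
// this deck. Collaborator types are hypothetical placeholders.
import java.util.List;

public class QuorumPhaseLeader {
    private final Log log;                         // shared write-ahead log
    private final CommitQueue commitQueue;         // proposed but not yet committed writes
    private final Memtable memtable;               // in-memory committed state
    private final List<FollowerChannel> followers; // TCP channels to the two followers
    private long lastCommittedLsn;

    public QuorumPhaseLeader(Log log, CommitQueue q, Memtable m, List<FollowerChannel> f) {
        this.log = log; this.commitQueue = q; this.memtable = m; this.followers = f;
    }

    // Handle one client write W and return only after it is committed.
    public void write(Write w) throws Exception {
        long lsn = log.appendAndForce(w);          // log record for W, forced to disk
        commitQueue.add(lsn, w);
        for (FollowerChannel follower : followers) {
            follower.propose(lsn, w);              // propose W to both followers
        }
        awaitAcks(lsn, 1);                         // one follower ACK => majority of 3
        memtable.apply(commitQueue.remove(lsn));   // W is now committed
        lastCommittedLsn = lsn;                    // no separate commit log record
        // A periodic task (not shown) sends asynchronous commit(lastCommittedLsn)
        // messages so the followers can apply their pending writes too.
    }

    private void awaitAcks(long lsn, int needed) throws InterruptedException {
        // Placeholder: block until 'needed' followers have ACKed this LSN.
    }

    // Hypothetical collaborator types, just enough to make the sketch self-contained.
    interface Log { long appendAndForce(Write w); }
    interface CommitQueue { void add(long lsn, Write w); Write remove(long lsn); }
    interface Memtable { void apply(Write w); }
    interface FollowerChannel { void propose(long lsn, Write w); }
    record Write(String key, String colname, byte[] value) {}
}
```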

Replication Protocol (Cont'd)
- For strong consistency:
  – Reads are always routed to the cohort's leader.
  – Reads are guaranteed to see the latest value.
- For timeline consistency:
  – Reads can be routed to any node in the cohort.
  – Reads may see a stale value.
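A small sketch of the read-routing decision just described; Cohort and the node names are hypothetical placeholders, and only the leader-vs-any-replica choice reflects the slide.

```java
// Sketch of read routing under the two consistency levels described above.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ReadRouter {
    record Cohort(String leader, List<String> followers) {}

    // Strong consistency: always read from the leader, which has applied every
    // committed write. Timeline consistency: read from any replica, which may
    // lag behind the leader and return a stale (but timeline-ordered) value.
    public static String chooseReplica(Cohort cohort, boolean consistent) {
        if (consistent) {
            return cohort.leader();
        }
        List<String> all = new ArrayList<>(cohort.followers());
        all.add(cohort.leader());
        return all.get(ThreadLocalRandom.current().nextInt(all.size()));
    }
}
```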

Leader Election
- The leader election protocol must guarantee that a majority of the cohort (i.e. at least two nodes) participates, and that the new leader is chosen so that no committed writes are lost.
- With the aid of Zookeeper, this task is simplified; each node runs a Zookeeper client.
- Zookeeper [Hunt 2010]:
  – A fault-tolerant, distributed coordination service.
  – It is used only to exchange messages between nodes.

Leader Election (Cont'd)
- Zookeeper's data model
  – Resembles a directory tree in a file system.
  – Each node, called a znode, is identified by its path from the root, e.g. /a/b/c.
  – A znode can be created with a sequential attribute.
  – Znodes are either persistent or ephemeral (removed when the creating session ends).
  [Figure: a small znode tree with /a, /a/b, /a/b/c]

Leader Election (Cont'd)
- The information needed for leader election is stored in Zookeeper under the cohort's znode "/r".
- Leader election phase (see the Zookeeper sketch below):
  – One of the cohort's nodes cleans up any old state under /r.
  – Each node of the cohort adds a sequential ephemeral znode under /r/candidates whose value is its last LSN.
  – Once a majority appears under /r/candidates, the new leader is the candidate with the maximum last LSN.
  – The leader adds an ephemeral znode /r/leader whose value is its hostname, and executes leader takeover.
  – The followers learn about the new leader by reading /r/leader.
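A hedged sketch of this election phase using the standard ZooKeeper Java client API (create, getChildren, getData). The znode layout (/r, /r/candidates, /r/leader) follows the slide; cleanup of /r, watches, retries, error handling, and the followers' side are omitted, and the simplifications are noted in comments.

```java
// Sketch of the leader election phase over ZooKeeper, as described on this
// slide. Simplified: no cleanup step, no watches, no tie-breaking.
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CohortLeaderElection {
    private final ZooKeeper zk;
    private final String root;        // this cohort's election znode, e.g. "/r"
    private final String hostname;

    public CohortLeaderElection(ZooKeeper zk, String root, String hostname) {
        this.zk = zk; this.root = root; this.hostname = hostname;
    }

    // Announce this node as a candidate by publishing its last LSN under
    // /r/candidates as a sequential ephemeral znode.
    public void announceCandidate(String lastLsn) throws Exception {
        zk.create(root + "/candidates/n_",
                  lastLsn.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // Once a majority of candidates is present, the candidate with the maximum
    // last LSN claims leadership by creating /r/leader with its hostname.
    public boolean tryBecomeLeader(int cohortSize, String myLastLsn) throws Exception {
        List<String> candidates = zk.getChildren(root + "/candidates", false);
        if (candidates.size() <= cohortSize / 2) {
            return false;                               // no majority under /r/candidates yet
        }
        String maxLsn = myLastLsn;
        for (String child : candidates) {
            byte[] data = zk.getData(root + "/candidates/" + child, false, null);
            String lsn = new String(data, StandardCharsets.UTF_8);
            // LSNs are compared as strings here for brevity; a real implementation
            // would compare them numerically and break ties, e.g. using the
            // sequential suffix of the candidate znode.
            if (lsn.compareTo(maxLsn) > 0) {
                maxLsn = lsn;
            }
        }
        if (!maxLsn.equals(myLastLsn)) {
            return false;                               // some other candidate wins
        }
        zk.create(root + "/leader",
                  hostname.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
        return true;                                    // caller now runs leader takeover
    }
}
```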

Leader Election (Cont'd)
- Why no committed writes are lost:
  – A committed write has been forced to the logs of at least 2 nodes.
  – At least 2 nodes have to participate in leader election.
  – Hence, at least one of the nodes participating in leader election has the last committed write in its log.
  – Choosing the candidate with the maximum last LSN ensures that the new leader has this committed write in its log.
- If a committed write is still unresolved on the other nodes, leader takeover makes sure that it is re-proposed.

Recovery
- When a cohort's leader or a follower fails, recovery is performed from the log records after the node comes back up.
- Two recovery processes:
  – Follower recovery: when a follower (or even the leader) fails, how is the node recovered after it comes back up?
  – Leader takeover: when the leader has failed, what must the new leader do after leader election?

Follower Recovery
- Follower recovery is executed whenever a node comes back up after a failure (see the sketch below).
- Two phases of follower recovery:
  – Local recovery phase:
    – Re-apply log records from the most recent checkpoint through the last committed LSN.
    – If the follower has lost all of its data due to a disk failure, it moves to the catch-up phase immediately.
  – Catch-up phase:
    – Send the follower's last committed LSN to the leader.
    – The leader responds by sending all committed writes after that LSN.
  [Figure: log timeline — checkpoint → last committed LSN (local recovery) → last LSN (catch up)]
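A hedged sketch of the two-phase follower recovery just described. Log, Memtable, and LeaderChannel are hypothetical stand-ins; only the control flow (local recovery from the checkpoint, then catch up from the leader) mirrors the slide.

```java
// Sketch of follower recovery: local recovery phase, then catch-up phase.
import java.util.List;

public class FollowerRecovery {
    interface Log {
        List<Write> recordsBetween(long fromLsn, long toLsn);  // replayable log records
    }
    interface Memtable { void apply(Write w); }
    interface LeaderChannel {
        // Ask the leader for all committed writes after the given LSN.
        List<Write> committedWritesAfter(long lastCommittedLsn);
    }
    record Write(long lsn, String key, String colname, byte[] value) {}

    public static long recover(Log log, Memtable memtable, LeaderChannel leader,
                               long checkpointLsn, long lastCommittedLsn,
                               boolean dataLost) {
        // Phase 1: local recovery, skipped entirely if the disk was lost.
        if (!dataLost) {
            for (Write w : log.recordsBetween(checkpointLsn, lastCommittedLsn)) {
                memtable.apply(w);
            }
        } else {
            lastCommittedLsn = 0;   // nothing survives locally; catch up from scratch
        }
        // Phase 2: catch up, pulling everything committed after our last committed LSN.
        for (Write w : leader.committedWritesAfter(lastCommittedLsn)) {
            memtable.apply(w);
            lastCommittedLsn = w.lsn();
        }
        return lastCommittedLsn;    // the follower is now up to date
    }
}
```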

Follower Recovery (Cont'd)
- If the leader went down and a new leader was elected, the rejoining follower may have log records after its last committed LSN that the new leader knows nothing about.
- These discarded log records must be removed so that they are never re-applied by a future recovery.
- Logical truncation of the follower's log (see the sketch below):
  – Because the write-ahead log is shared by all cohorts on a node, it cannot simply be truncated physically; instead, the LSNs of the discarded log records are stored in a skipped-LSN list.
  – Before re-applying a log record, the skipped-LSN list is checked to decide whether the record should be discarded.
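A minimal sketch of logical truncation with a skipped-LSN list. It assumes the per-cohort logical LSNs are dense (consecutive longs) so the truncated range can be enumerated; that assumption and the plain long LSN are mine, for brevity.

```java
// Sketch of logical truncation: records are not deleted from the shared log;
// their LSNs are remembered and filtered out during local recovery.
import java.util.HashSet;
import java.util.Set;

public class SkippedLsnList {
    private final Set<Long> skipped = new HashSet<>();

    // Called during logical truncation: mark every LSN in (lastCommittedLsn, lastLsn]
    // as discarded so those records are never re-applied.
    public void truncateAfter(long lastCommittedLsn, long lastLsn) {
        for (long lsn = lastCommittedLsn + 1; lsn <= lastLsn; lsn++) {
            skipped.add(lsn);
        }
    }

    // Called while replaying log records during local recovery.
    public boolean shouldReplay(long lsn) {
        return !skipped.contains(lsn);
    }
}
```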

Leader Takeover
- When the leader fails, the corresponding cohort becomes unavailable for writes, so leader election is executed to choose a new leader.
- After the new leader is elected, leader takeover occurs (see the sketch below):
  – Catch up each follower to the new leader's last committed LSN.
    – This step may be ignored by the follower.
  – Re-propose the writes between the leader's last committed LSN and its last LSN, and commit them using the normal replication protocol.
  [Figure: log timeline — follower's last committed LSN → leader's last committed LSN (catch up) → leader's last LSN (re-proposal)]
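A hedged sketch of leader takeover on the newly elected leader, following the two steps above. FollowerChannel, Log, ReplicationProtocol, and Write are hypothetical stand-ins in the same spirit as the earlier quorum-phase sketch.

```java
// Sketch of leader takeover: catch up the followers, then re-propose the
// leader's uncommitted tail through the normal quorum phase.
import java.util.List;

public class LeaderTakeover {
    interface Log { List<Write> recordsBetween(long fromLsn, long toLsn); }
    interface FollowerChannel {
        long lastCommittedLsn();                        // reported by the follower
        void sendCommittedWrites(List<Write> writes);   // catch-up traffic
    }
    interface ReplicationProtocol {
        void proposeAndCommit(Write w);                 // normal quorum phase
    }
    record Write(long lsn, String key, String colname, byte[] value) {}

    public static void takeover(Log log, List<FollowerChannel> followers,
                                ReplicationProtocol protocol,
                                long leaderLastCommittedLsn, long leaderLastLsn) {
        // Step 1: catch each follower up to the leader's last committed LSN.
        for (FollowerChannel f : followers) {
            long from = f.lastCommittedLsn();
            if (from < leaderLastCommittedLsn) {
                f.sendCommittedWrites(log.recordsBetween(from, leaderLastCommittedLsn));
            }
        }
        // Step 2: re-propose the writes between the leader's last committed LSN
        // and its last LSN, committing them with the normal replication protocol.
        for (Write w : log.recordsBetween(leaderLastCommittedLsn, leaderLastLsn)) {
            protocol.proposeAndCommit(w);
        }
        // The cohort is now available for new writes again.
    }
}
```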

Recovery (Cont'd)
- Follower recovery example:
  – A follower goes down while the other nodes are still alive.
  – The cohort keeps accepting new writes.
  – When the follower comes back up, it is recovered.
  [Figure: cohort states before and after, showing each node's last committed LSN (cmt) and last LSN (lst) — e.g. leader at cmt 1.20 / lst 1.21 and followers at cmt 1.10 before the failure, converging toward cmt 1.25 / lst 1.25 after recovery]

Recovery (Cont'd)
- Leader takeover example:
  – The leader goes down while the other nodes are still alive.
  – A new leader is elected and leader takeover is executed.
  – The cohort accepts new writes again.
  – When the old leader comes back up, it is recovered as a follower, with logical truncation of LSN 1.21.
  [Figure: cohort states before and after — old leader at cmt 1.20 / lst 1.21, followers at cmt 1.10; after takeover the new leader advances to cmt 2.30 / lst 2.30, and the recovered old leader catches up to 2.30 after logically truncating LSN 1.21]

Experiments
- Experimental setup
  – Two 10-node clusters (one for the datastore, the other for the clients); each node has:
    – Two quad-core 2.1 GHz AMD processors
    – 16 GB memory
    – 5 SATA disks, with 1 disk dedicated to logging (without write-back cache)
    – A rack-level 1 Gbit Ethernet switch
  – Cassandra trunk as of October 2009
  – Zookeeper version

Experiments (Cont'd)
- In these experiments, Spinnaker was compared with Cassandra.
- In common:
  – Implementation of SSTables, memtables, and the log manager
  – 3-way replication
- Different:
  – Replication protocol, recovery algorithms, commit queue
- Cassandra's weak/quorum reads:
  – A weak read accesses just 1 replica.
  – A quorum read accesses 2 replicas to check for conflicts.
- Cassandra's weak/quorum writes:
  – Both are sent to all 3 replicas.
  – A weak write waits for an ACK from just 1 replica.
  – A quorum write waits for ACKs from any 2 replicas.

Experiments (Cont'd)
  [Figure: experimental results]

Conclusion
- Spinnaker
  – A Paxos-based replication protocol
  – A scalable, consistent, and highly available datastore
- Future work
  – Support for multi-operation transactions
  – Load balancing
  – Detailed comparison to other datastores

References
- [Brewer 2000] E. A. Brewer, "Towards Robust Distributed Systems," In PODC, pp. 7-7, 2000.
- [Cooper 2008] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni, "PNUTS: Yahoo!'s Hosted Data Serving Platform," In PVLDB, 1(2), 2008.
- [Hunt 2010] P. Hunt, M. Konar, F. P. Junqueira, B. Reed, "ZooKeeper: Wait-Free Coordination for Internet-Scale Systems," In USENIX, 2010.
- [Skeen 1981] D. Skeen, "Nonblocking Commit Protocols," In SIGMOD, 1981.