State Machine Replication through transparent distributed protocols



Paxos Made Transparent
Heming Cui, Rui Gu, Cheng Liu, Tianyu Chen, and Junfeng Yang
Presented by Hassan Shahid Khan

Motivation
▪ High availability
▪ Fault tolerance
[Diagram: clients connecting to a server over a network]

State Machine Replication (SMR)
▪ Model the program as a 'state machine'
  - States: program data
  - Transitions: deterministic executions of the program under input requests
▪ Ensure correct replicas step through the same sequence of state transitions
▪ Need distributed consensus (e.g., Paxos) to ensure all replicas see the same sequence of input requests
[Diagram: clients sending requests over the network to multiple replicas]
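To make the idea concrete, here is a minimal C++ sketch (with a hypothetical key-value request type, not anything from the paper) of why determinism matters: if every replica applies the same consensus-ordered log to a deterministic state machine, every replica ends in the same state.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical request type: a key-value update or read.
struct Request {
    enum Op { PUT, GET } op;
    std::string key, value;
};

// A deterministic state machine: given the same ordered log of
// requests, every replica computes exactly the same state.
class KvStateMachine {
public:
    std::string apply(const Request& r) {
        if (r.op == Request::PUT) { data_[r.key] = r.value; return "OK"; }
        auto it = data_.find(r.key);
        return it == data_.end() ? "" : it->second;
    }
private:
    std::map<std::string, std::string> data_;  // the replicated state
};

// Each replica replays the consensus-ordered log in sequence.
void replay(KvStateMachine& sm, const std::vector<Request>& log) {
    for (const auto& r : log) sm.apply(r);
}
```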

Some problems with SMR systems
1. Cannot handle multi-threaded programs (which most server programs are!)
   - Sources of non-determinism: thread interleaving, scheduling, environment variables, etc.
2. Narrowly defined consensus interfaces that require program modification
   - Often tailor-made to fit specific requirements, e.g., Chubby, a locking service

Solution: CRANE (Correctly ReplicAting Nondeterministic Executions)
▪ A 'transparent' SMR system
▪ Lets developers focus on functionality (not replication): run any program on top of CRANE without any modification

Contributions
1. Chooses a socket-level consensus interface (the POSIX socket API)
   ▪ Implements Paxos to reach consensus on the same sequence of socket calls across replicas (see the sketch below)
   ▪ CRANE architecture: one primary replica; all others are backups
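A rough illustration of what a socket-level interface buys you: the replication layer can interpose on POSIX socket calls (e.g., via an LD_PRELOAD shared object) and run consensus before the unmodified server ever sees the event. This is not CRANE's actual code; agree_on_event is a hypothetical placeholder for the Paxos round.

```cpp
// Sketch of socket-call interposition; build as a preloaded shared object.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <sys/socket.h>

extern void agree_on_event(const char* call, int fd);  // hypothetical Paxos hook

// Interpose accept(): agree on the order of incoming connections
// before the server program observes them.
extern "C" int accept(int fd, struct sockaddr* addr, socklen_t* len) {
    static auto real_accept =
        reinterpret_cast<int (*)(int, struct sockaddr*, socklen_t*)>(
            dlsym(RTLD_NEXT, "accept"));
    int conn = real_accept(fd, addr, len);
    if (conn >= 0)
        agree_on_event("accept", conn);  // replicas agree this accept is next
    return conn;
}
```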

Contributions
2. Handles application-level non-determinism via deterministic multithreading (DMT)
   ▪ Based on Parrot [SOSP '13], which schedules pthread synchronizations
   ▪ Maintains a logical time that advances deterministically on each thread's synchronization operation; DMT serializes synchronization operations to make the execution deterministic
[Diagram: two threads' synchronization operations ordered by logical time, e.g., Thread 1: 0: Lock(X), 2: Unlock(X); Thread 2: 1: Lock(Y), 4: Unlock(Y)]
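A minimal sketch of the DMT intuition, assuming a simplified round-robin turn order (real schedulers such as Parrot are far more sophisticated, e.g., they skip blocked threads): synchronization operations are admitted strictly in logical-time order, so the interleaving is a deterministic function of the program alone.

```cpp
#include <condition_variable>
#include <mutex>

class DeterministicScheduler {
public:
    explicit DeterministicScheduler(int nthreads) : nthreads_(nthreads) {}

    // Each thread calls this before every sync operation. Threads are
    // admitted strictly in turn order, so the interleaving of sync
    // operations is the same on every run and on every replica.
    void sync_turn(int tid) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return turn_ % nthreads_ == tid; });
        ++turn_;                 // advance the logical clock
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    long turn_ = 0;              // logical time
    int nthreads_;
};
```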

Is Paxos + DMT enough to keep replicas in sync?
▪ No! The physical time at which each client request arrives may differ across replicas, causing their executions to diverge!
▪ Example:
  ▪ A web server with two replicas: a primary and a backup
  ▪ Two clients simultaneously send HTTP PUT and GET requests for the same URL ('a.php')

Example: Primary Replica
▪ There is a large difference between the arrival times of the input requests
[Diagram: on the primary, the PUT (P) arrives well before the GET (G), so DMT schedules P's thread first, then G's after the delay]

Example: Backup Replica
▪ There is a small difference between the arrival times of the input requests
[Diagram: on the backup, P and G arrive almost together, so DMT may interleave the two threads differently than on the primary]

Need to agree upon a 'logical admission time'
▪ Observation: traffic is bursty!
▪ If requests arrive in a burst, their admission times are already deterministic, because Paxos has already ordered the sequence of requests!

Time-bubbling
▪ If no new request arrives at the primary within a delay threshold W_timeout (100 µs), the primary inserts a 'time bubble' of N_clock logical clock ticks (1000) into the global sequence of socket calls
▪ Primary and backups consume the same global sequence, so the bubbles advance each replica's DMT logical clock identically during idle periods
[Diagram: primary and backup consuming the same global sequence of socket calls (P, G) interleaved with time bubbles]
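A minimal sketch of the primary's admission loop under time-bubbling, using the paper's parameters W_timeout (100 µs) and N_clock (1000 ticks). poll_request and broadcast_via_paxos are hypothetical helpers standing in for CRANE's input proxy and Paxos layer.

```cpp
#include <chrono>
#include <optional>
#include <string>
#include <thread>

std::optional<std::string> poll_request();            // hypothetical input proxy
void broadcast_via_paxos(const std::string& event);   // hypothetical Paxos layer

constexpr auto W_timeout = std::chrono::microseconds(100);
constexpr int N_clock = 1000;  // logical clock ticks per bubble

// If no request arrives within W_timeout, insert a time bubble so
// every replica's DMT scheduler advances by the same N_clock ticks.
void admission_loop() {
    for (;;) {
        auto deadline = std::chrono::steady_clock::now() + W_timeout;
        std::optional<std::string> req;
        while (!(req = poll_request()) &&
               std::chrono::steady_clock::now() < deadline) {
            std::this_thread::yield();
        }
        if (req)
            broadcast_via_paxos(*req);  // a real socket event
        else
            broadcast_via_paxos("BUBBLE:" + std::to_string(N_clock));
    }
}
```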

Time-bubbling
▪ Take-away: you get consistent DMT schedules across all replicas!

Checkpointing & Recovery
▪ Storage checkpointing (file system, including installation and working directories): LXC (Linux Containers)
▪ Process state checkpointing (memory and register state): CRIU (Checkpoint/Restore In Userspace)
▪ Checkpoints are taken only when the server is idle (no alive connections), because in-flight TCP stack state cannot be captured
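A minimal sketch of the idle-time checkpoint policy, driven from a monitor process. has_live_connections is a hypothetical check; the criu flags shown (-t, -D, --leave-running) are standard CRIU options, but treat the exact invocation as illustrative rather than CRANE's actual command.

```cpp
#include <cstdlib>
#include <string>
#include <sys/types.h>

bool has_live_connections(pid_t server);  // hypothetical idleness check

void maybe_checkpoint(pid_t server, const std::string& image_dir) {
    if (has_live_connections(server))
        return;  // skip: in-flight TCP state would be lost on restore
    std::string cmd = "criu dump -t " + std::to_string(server) +
                      " -D " + image_dir + " --leave-running";
    std::system(cmd.c_str());  // dump process state, keep server running
}
```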

Evaluation
▪ Setup: 3 replica machines, each with 12 cores (with hyper-threading) and 64 GB of memory
▪ Five multi-threaded server programs tested:
  ▪ Web servers: Apache, Mongoose
  ▪ Database server: MySQL
  ▪ Anti-virus server: ClamAV
  ▪ Multimedia server: MediaTomb
▪ None of the programs required any modification to run with CRANE

Performance Overhead
▪ Mean overhead for CRANE is 34%
[Chart: per-program performance overhead of CRANE and of DMT alone]

Time-bubbling Overhead
▪ Ratio of time bubbles inserted among all Paxos consensus requests
[Chart: time bubbles inserted per 1000 requests, per program]

Handling Failures
▪ Primary killed: leader election is invoked, which took 1.97 s
▪ One backup killed: negligible performance overhead, as long as the other replicas remained consistent

Final thoughts
▪ General-purpose, but best suited for programs with a low rate of pthread synchronization operations
▪ Limited to programs that use the POSIX socket API and the pthreads library
▪ Does not report overhead for more than 3 replicas
▪ Does not cover all sources of non-determinism, e.g., it does not make IPC deterministic

CRANE Architecture (backup)

Checkpoint & Recovery Performance (backup)