EECS 498 Introduction to Distributed Systems Fall 2017

Slides:



Advertisements
Similar presentations
COS 461 Fall 1997 Group Communication u communicate to a group of processes rather than point-to-point u uses –replicated service –efficient dissemination.
Advertisements

Replication techniques Primary-backup, RSM, Paxos Jinyang Li.
CS 542: Topics in Distributed Systems Diganta Goswami.
CS425 /CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.
Replication. Topics r Why Replication? r System Model r Consistency Models r One approach to consistency management and dealing with failures.
ITIS 3110 Jason Watson. Replication methods o Primary/Backup o Master/Slave o Multi-master Load-balancing methods o DNS Round-Robin o Reverse Proxy.
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.
Distributed Systems Fall 2010 Replication Fall 20105DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
CS 582 / CMPE 481 Distributed Systems
Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 18: Replication Control All slides © IG.
Scalability By Alex Huang. Current Status 10k resources managed per management server node Scales out horizontally (must disable stats collector) Real.
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Replication with View Synchronous Group Communication Steve Ko Computer Sciences and Engineering.
SFWR ENG 3KO4 Software Development for Computer/Electrical Engineering Fall 2009 Instructor: Dr. Kamran Sartipi Software Requirement Specification (SRS)
Practical Byzantine Fault Tolerance
Operations Master / FSMO Roles in Active Directory : Suhail Ashfaq Butt.
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
Ch 10 Shared memory via message passing Problems –Explicit user action needed –Address spaces are distinct –Small Granularity of Transfer Distributed Shared.
Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.
Paxos: Agreement for Replicated State Machines Brad Karp UCL Computer Science CS GZ03 / M st, 23 rd October, 2008.
Replication (1). Topics r Why Replication? r System Model r Consistency Models r One approach to consistency management and dealing with failures.
Consensus and leader election Landon Cox February 6, 2015.
Distributed Systems CS Consistency and Replication – Part IV Lecture 13, Oct 23, 2013 Mohammad Hammoud.
The Raft Consensus Algorithm Diego Ongaro and John Ousterhout Stanford University.
Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.
VIRTUAL NETWORK COMPUTING SUBMITTED BY:- Ankur Yadav Ashish Solanki Charu Swaroop Harsha Jain.
Reliable multicast Tolerates process crashes. The additional requirements are: Only correct processes will receive multicasts from all correct processes.
Primary-Backup Replication COS 418: Distributed Systems Lecture 5 Kyle Jamieson.
REPLICATION & LOAD BALANCING
Primary-Backup Replication
Dynamic Modeling of Banking System Case Study - I
Cluster Communications
Replication State Machines via Primary-Backup
Ivy Eva Wu.
Replication Middleware for Cloud Based Storage Service
View Change Protocols and Reconfiguration
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
آزمايشگاه سيستمهای هوشمند علی کمالی زمستان 95
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
7.1. CONSISTENCY AND REPLICATION INTRODUCTION
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
Active replication for fault tolerance
Fault-tolerance techniques RSM, Paxos
PERSPECTIVES ON THE CAP THEOREM
EECS 498 Introduction to Distributed Systems Fall 2017
Lecture 21: Replication Control
Replication State Machines via Primary-Backup
Lecture 25: Multiprocessors
EECS 498 Introduction to Distributed Systems Fall 2017
Replication State Machines via Primary-Backup
Distributed Systems CS
Replicated state machine and Paxos
View Change Protocols and Reconfiguration
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Lecture 24: Multiprocessors
Lecture 21: Replication Control
Presentation transcript:

EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha

Primary Backup Replication Client Primary Backup Backup September 25, 2017 EECS 498 – Lecture 6

Primary Backup Replication Promote one of the backups if primary fails Replace any failed backup When should primary sync with backups? Before making state change externally visible Primary and backups must be externally consistent What to sync? Entire state when bootstrapping new backup Thereafter, all sources of non-determinism September 25, 2017 EECS 498 – Lecture 6

RSM with Primary Backup Application Application Server1 Operating System Operating System Server2 Virtual Machine Monitor Virtual Machine Monitor Hardware Hardware September 25, 2017 EECS 498 – Lecture 6

Example: Bank Server Deposit(user, amount) { balance[user] += amount show new balance } AddInterest() { while(currtime != 12am) { sleep(1 hour) add 1% interest for all users September 25, 2017 EECS 498 – Lecture 6

Example Execution Primary Backup Timer interrupt fires Wake up AddInterest Read current time T1 Go to sleep Receive deposit request Show new balance Read current time T2 Add interest Receive deposit request Wait for timer interrupt Wake up AddInterest Return current time as T1 Go to sleep Timer interrupt fires Show new balance Return current time as T2 Add interest September 25, 2017 EECS 498 – Lecture 6

Primary Backup Replication Client Primary Backup Backup September 25, 2017 EECS 498 – Lecture 6

Client perspective What does a worker need to know in order to register itself with replicated master? Needs to know which machine is primary September 25, 2017 EECS 498 – Lecture 6

Primary Backup Replication Client Primary Backup Backup September 25, 2017 EECS 498 – Lecture 6

Client perspective What does a worker need to know in order to register itself with replicated master? Needs to know which machine is primary Can primary be hard coded into client code? No, primary gets replaced when it fails How does client discover current primary? September 25, 2017 EECS 498 – Lecture 6

Primary Backup Replication View service Backup Client Primary Backup Backup September 25, 2017 EECS 498 – Lecture 6

View service Maintains current membership of primary-backup service (called view) View number, primary, backup When does view service change view? When primary or any backup fails Periodically exchange heartbeat messages to detect failures What if view service is down or not reachable? September 25, 2017 EECS 498 – Lecture 6

Transitioning between views How to ensure new primary is up-to-date? Only promote a previous backup This is why view service needs to pick backups How does view service know if a backup is up-to-date? Two scenarios for ill-timed primary failure: Primary applies operation but fails before syncing with backup Primary fails before new backup is initialized September 25, 2017 EECS 498 – Lecture 6

Transitioning between views View service broadcasts view change to all Primary must ACK new view once backup is up-to-date Two implications: Liveness detection timeout > State transfer time Cannot change view if primary fails during sync September 25, 2017 EECS 498 – Lecture 6

View service View change has three steps: View service announces new view Primary syncs with new backup if there is one Primary acknowledges new view Stuck if primary fails in the midst of this process September 25, 2017 EECS 498 – Lecture 6

Scalability of View service Does every client need to contact view service before any operation? Clients can cache view across operations When to invalidate cached view? Client invalidates cache when no response or negative response from primary September 25, 2017 EECS 498 – Lecture 6

Split Brain View service Client S2 S1 (1,S1, _) (2,S1,S2) (3,S2, _) September 25, 2017 EECS 498 – Lecture 6

Avoiding Split Brain Primary must forward all operations to backups Goal: Get ACKs from backups that they too recognize primary Why can’t backups be mistaken about who is primary? Only a backup can be promoted as primary September 25, 2017 EECS 498 – Lecture 6

View service Valid sequence of views: (1, S1, _)  (2, S1, S2)  (3, S1, S3)  (4, S3, S4)  (5, S4, _) Examples of invalid transitions between views? (1, S1, S2)  (2, S3, S4) (1, S1, S2)  (2, _, S2) (1, S1, _)  (2, S2, S1) September 25, 2017 EECS 498 – Lecture 6