EECS 498 Introduction to Distributed Systems Fall 2017

Slides:



Advertisements
Similar presentations
Distributed Snapshots: Non-blocking checkpoint coordination protocol Next: Uncoordinated Chkpnt.
Advertisements

Distributed Snapshots: Determining Global States of Distributed Systems - K. Mani Chandy and Leslie Lamport.
Global States.
Global States in a Distributed System By John Kor and Yvonne Cheng.
Global States and Checkpoints
Lecture 8: Asynchronous Network Algorithms
Uncoordinated Checkpointing The Global State Recording Algorithm.
CS542 Topics in Distributed Systems Diganta Goswami.
Distributed Computing 5. Snapshot Shmuel Zaks ©
Distributed Systems Dinesh Bhat - Advanced Systems (Some slides from 2009 class) CS 6410 – Fall 2010 Time Clocks and Ordering of events Distributed Snapshots.
CS 582 / CMPE 481 Distributed Systems
Distributed Systems Fall 2009 Logical time, global states, and debugging.
Slides for Chapter 10: Time and Global State
EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.
Cloud Computing Concepts
Causal Logging : Manetho Rohit C Fernandes 10/25/01.
Computer Science Lecture 10, page 1 CS677: Distributed OS Last Class: Clock Synchronization Physical clocks Clock synchronization algorithms –Cristian’s.
1 Distributed Systems CS 425 / CSE 424 / ECE 428 Global Snapshots Reading: Sections 11.5 (4 th ed), 14.5 (5 th ed)  2010, I. Gupta, K. Nahrtstedt, S.
Distributed Computing 5. Snapshot Shmuel Zaks ©
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Lecture 6-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) September 12, 2013 Lecture 6 Global Snapshots Reading:
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
Distributed Systems Fall 2010 Logical time, global states, and debugging.
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.
Distributed Snapshot. One-dollar bank Let a $1 coin circulate in a network of a million banks. How can someone count the total $ in circulation? If not.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Rollback-Recovery Protocols I Message Passing Systems Nabil S. Al Ramli.
Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb
CSE 486/586 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.
EEC 688/788 Secure and Dependable Computing Lecture 5 Wenbing Zhao Cleveland State University
Distributed Systems Lecture 6 Global states and snapshots 1.
Global state and snapshot
Consistent cut A cut is a set of events.
Advanced Topics in Concurrency and Reactive Programming: Time and State Majeed Kassis.
Global state and snapshot
EEC 688/788 Secure and Dependable Computing
Lecture 3: State, Detection
Vector Clocks and Distributed Snapshots
CSE 486/586 Distributed Systems Global States
Theoretical Foundations
Lecture 9: Asynchronous Network Algorithms
Distributed Snapshot.
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Distributed Snapshot.
湖南大学-信息科学与工程学院-计算机与科学系
EEC 688/788 Secure and Dependable Computing
Time And Global Clocks CMPT 431.
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
Slides for Chapter 11: Time and Global State
EEC 688/788 Secure and Dependable Computing
ITEC452 Distributed Computing Lecture 8 Distributed Snapshot
EEC 688/788 Secure and Dependable Computing
Distributed Snapshot.
EEC 688/788 Secure and Dependable Computing
CSE 486/586 Distributed Systems Global States
Jenhui Chen Office number:
Distributed algorithms
CIS825 Lecture 5 1.
Consistent cut If this is not true, then the cut is inconsistent
COT 5611 Operating Systems Design Principles Spring 2014
Chandy-Lamport Example
Distributed Snapshot.
Presentation transcript:

EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha

Replicated State Machine Are we done now that we have logical clocks? Failures! Clients September 18, 2017 EECS 498 – Lecture 4

RSMs with Logical Clocks Any replica can execute an update only after confirming clock is higher on all other replicas Implication: If any one replica is down, all other replicas cannot progress September 18, 2017 EECS 498 – Lecture 4

Types of failures Crash failures Fail stop Can resume with saved state All state is lost upon failure September 18, 2017 EECS 498 – Lecture 4

Recovering from failures Checkpoint process state periodically Where to store checkpoint? Persistent storage vs. volatile memory Local machine vs. remote machine Where you store checkpoint depends on Types of failures you want to tolerate Performance overhead you are willing to bear Resume from last checkpoint after restart September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m8 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m8 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m1 m4 P2 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m1 m4 P2 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m1 m4 P2 m2 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m1 P2 m2 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m1 P2 m2 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m1 P2 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing September 18, 2017 EECS 498 – Lecture 4

Checkpoint-based Recovery Problem: Domino effect Cause: Uncoordinated checkpointing Coordinated checkpointing Example: Chandy-Lamport snapshot Global snapshot of a distributed system September 18, 2017 EECS 498 – Lecture 4

Example September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport algorithm How to take a snapshot of a distributed system? Example use cases: Deadlock detection Garbage collection Evaluation of any stable property September 18, 2017 EECS 498 – Lecture 4

Example: Token ring P1 m1 P2 m2 m3 P3 September 18, 2017 EECS 498 – Lecture 4

Example snapshot 1 P1 m1 P2 m2 m3 P3 September 18, 2017 EECS 498 – Lecture 4

Example snapshot 2 P1 m1 P2 m2 m3 P3 September 18, 2017 EECS 498 – Lecture 4

Global snapshot Captured state must satisfy “happens before” If event b in snapshot and a  b, then event a must be in snapshot September 18, 2017 EECS 498 – Lecture 4

Example desired snapshot September 18, 2017 EECS 498 – Lecture 4

Global snapshot Goals of capturing global snapshot Capture instantaneous state of every process Capture relevant messages in transit What comprises a distributed system’s state? State of individual processes State of every communication channel September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport Snapshot N processes in the system Processes don’t fail while taking snapshot Any process may initiate collection of snapshot One unidirectional channel in either direction between each pair of processes All channels ensure FIFO delivery Lossless and no duplication Some of these assumptions tackled by later work September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport Snapshot September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport: Intuition Enable snapshot by broadcasting marker message Distinct from the application’s messages Marker messages serve two purposes: Enable processes to discover need for snapshot Serve as a barrier on every channel Record all messages received before marker September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport Snapshot September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport Snapshot Initiator process does the following: Records its local state Sends out marker messages on every outgoing channel Starts recording messages on every incoming channel Two cases for how Pi should handle receipt of marker message from Pj September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport Snapshot If first marker message received by Pi Record local state Record state of channel from Pj to Pi as empty Send marker messages on all outgoing channels If duplicate marker message Stop recording channel from Pj to Pi Record state of channel as all messages received since marker Snapshot complete when every process has received marker on every incoming channel September 18, 2017 EECS 498 – Lecture 4

Chandy-Lamport in action September 18, 2017 EECS 498 – Lecture 4

Coordinated checkpointing When to initiate global snapshot? External environment does not checkpoint Cannot roll back output Cannot re-issue inputs Implication: Need coordinated checkpoint upon input or output Too slow! September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m8 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m8 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Challenge in Checkpointing m3 m6 m7 m1 m4 P2 m8 m2 m5 P3 September 18, 2017 EECS 498 – Lecture 4

Message logging Between checkpoints, log all events that lead to non-determinism For example, log every message received Log everything necessary to ensure same messages are sent as in original execution September 18, 2017 EECS 498 – Lecture 4