Scalable Algorithms for Global Snapshots in Distributed Systems


Scalable Algorithms for Global Snapshots in Distributed Systems
Rahul Garg, IBM India Research Lab
Vijay K. Garg, Univ. of Texas at Austin
Yogish Sabharwal, IBM India Research Lab

Motivation for Global Snapshot
- Checkpointing to tolerate faults
  - Take a global snapshot periodically
  - On failure, restart from the last checkpoint
- Global property detection
  - Detecting deadlock, loss of a token, etc.
- Distributed debugging
  - Inspecting the global state

Global Snapshot
- Global state
  - A set of local states
  - States of the channels between processes
    - Messages in transit in the global snapshot
- Key requirement: consistency

Consistent and inconsistent cuts
[Figure: processes P1, P2, P3 exchanging messages m1, m2, m3, with cuts G1 and G2]
- G1 is not consistent
- G2 is consistent, but m3 must be recorded
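The consistency condition on this slide can be checked mechanically. The following is a minimal sketch, assuming each message is described by the positions of its send and receive events on the sender's and receiver's timelines, and a cut is given as the number of events included per process. The `Message` type, the `classify` helper, and the event positions are hypothetical and are not read off the slide's figure. A cut is inconsistent if it contains the receipt of a message but not its send; a message sent inside the cut and received outside it is in transit and has to be recorded.

```python
from typing import NamedTuple


class Message(NamedTuple):
    name: str
    sender: str
    send_pos: int   # 1-based position of the send event on the sender's timeline
    receiver: str
    recv_pos: int   # 1-based position of the receive event on the receiver's timeline


def classify(cut: dict[str, int], messages: list[Message]):
    """Return (is_consistent, in_transit_names) for a cut.

    cut[p] is the number of events of process p that lie inside the cut.
    """
    consistent = True
    in_transit = []
    for m in messages:
        sent_inside = m.send_pos <= cut[m.sender]
        recv_inside = m.recv_pos <= cut[m.receiver]
        if recv_inside and not sent_inside:
            consistent = False          # receive recorded without its send
        elif sent_inside and not recv_inside:
            in_transit.append(m.name)   # belongs to the channel state
    return consistent, in_transit


# Hypothetical event positions chosen for illustration only.
msgs = [
    Message("m1", "P1", 1, "P2", 1),
    Message("m2", "P2", 2, "P3", 1),
    Message("m3", "P2", 3, "P1", 2),
]
g1 = {"P1": 2, "P2": 2, "P3": 1}   # contains the receive of m3 but not its send
g2 = {"P1": 1, "P2": 3, "P3": 1}   # contains the send of m3 but not its receive

result_g1 = classify(g1, msgs)
result_g2 = classify(g2, msgs)
```

With these made-up positions, `g1` plays the role of an inconsistent cut and `g2` the role of a consistent cut with one in-transit message.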

Model of the System
- No shared clock
- No shared memory
- Processes communicate using messages
- Messages are reliable
- No upper bound on delivery of messages

Classification of Messages
[Figure: processes P and Q around a checkpoint, with messages labeled ww, wr, rw, rr]
- w: white process (before recording its local state)
- r: red process (after recording)
- e.g. rw: sent by a red process, received by a white process
- A process must be red to receive a red message
  - A white process turns red on receiving a red message
- Any white message received by a red process must be recorded as an in-transit message
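The two rules on this slide can be sketched as a small simulation. This is an illustrative sketch only: the `Process` class, its integer state, and the delivery order are assumptions made for the example, not part of the original deck. Every application message carries the sender's color; a white receiver records its state before handling a red message, and a red receiver logs every white message as channel state.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state            # application state (an integer here)
        self.color = "white"
        self.recorded_state = None
        self.in_transit = []          # white messages received after recording

    def record(self):
        # Record the local state once and turn red.
        if self.color == "white":
            self.recorded_state = self.state
            self.color = "red"

    def send(self, payload):
        # Each application message is tagged with the sender's current color.
        return (self.color, payload)

    def receive(self, message):
        color, payload = message
        if color == "red":
            self.record()                     # must be red before receiving a red message
        elif self.color == "red":
            self.in_transit.append(payload)   # white message at a red process
        self.state += payload


p, q = Process("P", 10), Process("Q", 20)
early = p.send(1)      # white message, delivered before Q records
late = p.send(2)       # white message, delivered after Q records (non-FIFO)
p.record()
red_msg = p.send(3)    # sent after P recorded its state

q.receive(early)
q.receive(red_msg)     # Q records its state, then processes the message
q.receive(late)        # recorded as an in-transit message
```

The non-FIFO delivery order (the red message overtakes a white one) is deliberate: it is the case where the receiver cannot rely on a marker to know that the white messages have ended.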

Previous Work
- Chandy and Lamport's algorithm
  - Assumes FIFO channels
  - Requires one message (marker) per channel
  - Marker indicates the end of white messages
- Mattern's algorithm; Schulz, Bronevetsky et al.
  - Work for non-FIFO channels
  - Require a message that indicates the total number of white messages sent on the channel
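For the non-FIFO approaches, the receiver-side bookkeeping for a single channel might look like the sketch below. The `ChannelRecorder` class and its method names are hypothetical; the only idea taken from the slide is that the sender announces the total number of white messages it sent on the channel, and the receiver knows the channel state is complete once it has seen that many.

```python
class ChannelRecorder:
    """Receiver-side bookkeeping for one incoming non-FIFO channel."""

    def __init__(self):
        self.white_received = 0   # white messages seen so far on this channel
        self.expected = None      # total announced by the sender, once known

    def on_white_message(self):
        self.white_received += 1

    def on_count(self, total_white_sent):
        # Control message carrying the number of white messages sent on the channel.
        self.expected = total_white_sent

    def complete(self):
        # The channel state is complete when every announced white message has arrived.
        return self.expected is not None and self.white_received == self.expected


rec = ChannelRecorder()
rec.on_white_message()
rec.on_count(3)                 # the count may arrive before the remaining white messages
done_early = rec.complete()
rec.on_white_message()
rec.on_white_message()
done_late = rec.complete()
```

One such count per channel is what leads to the O(N^2) control messages that the later slides set out to reduce.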

Results
Algorithm | Message Complexity | Message Size | Space
CLM | O(N^2) | O(1)
Grid-based | O(N^(3/2)) | O(N)
Tree-based | O(N log N log(W/n))
Centralized | O(N log(W/n))

Grid-based Algorithm
- Idea 1
  - Previously: send the number of white messages per channel
  - This algorithm: send the total number of white messages destined to a process
- Idea 2
  - Previously: send N messages of size O(1)
  - Now: send N messages of size N

Grid-based Algorithm
whiteSent =
  [ 1 0 3 ]
  [ 2 1 0 ]
  [ 4 0 0 ]
Algorithm for P(r,c)
- Step 1: send row i of the matrix to P(r,i)
- Step 2: compute the cumulative count for row c
  - Send this count to P(c,c)
- Step 3: if (r = c) // diagonal entry
  - Receive the counts from all processes in the column
  - Send the jth entry to P(c,j)

Grid-based Algorithm
  [ 1 2 3 ]
  [ 2 1 0 ]
  [ 1 4 1 ]

Grid-based Algorithm
For each processor of the second row: count of messages sent to it from processors in the third row
  [ 1 2 3 ] + [ 2 1 0 ] + [ 1 4 1 ] = [ 4 7 4 ]

Grid-based Algorithm
  [ 4 7 4 ]

Grid-based Algorithm
  [ 2 1 2 ] + [ 1 0 1 ] + [ 4 7 4 ]

Grid-based Algorithm
For each processor of the second row: count of messages sent to it from all processors
  [ 7 8 6 ]
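The three steps of the grid-based algorithm can be traced with a short simulation. This is a sketch under stated assumptions: processes are arranged in a k x k grid with 0-based indices, `white_sent[(r, c)][i][j]` is the number of white messages P(r,c) sent to P(i,j), messages a process sends to itself are counted like any other, and the sample matrices are made up rather than taken from the slides. The function name `grid_white_counts` is hypothetical.

```python
def grid_white_counts(k, white_sent):
    """Return (expected, messages).

    expected[(i, j)] is the total number of white messages destined to P(i,j);
    messages is the number of control messages used (self-sends included).
    """
    messages = 0

    # Step 1: P(r,c) sends row i of its matrix to P(r,i).
    inbox1 = {(r, i): [] for r in range(k) for i in range(k)}
    for r in range(k):
        for c in range(k):
            for i in range(k):
                inbox1[(r, i)].append(white_sent[(r, c)][i])
                messages += 1

    # Step 2: P(r,c) adds up the "row c" vectors it received from its own row
    # and sends the cumulative vector to the diagonal process P(c,c).
    inbox2 = {c: [] for c in range(k)}
    for r in range(k):
        for c in range(k):
            cumulative = [sum(column) for column in zip(*inbox1[(r, c)])]
            inbox2[c].append(cumulative)
            messages += 1

    # Step 3: the diagonal process P(c,c) adds up the vectors from its column
    # and sends the jth entry to P(c,j).
    expected = {}
    for c in range(k):
        total = [sum(column) for column in zip(*inbox2[c])]
        for j in range(k):
            expected[(c, j)] = total[j]
            messages += 1

    return expected, messages


# Made-up send counts for a 3 x 3 grid (N = 9 processes).
k = 3
white_sent = {
    (r, c): [[(r + 2 * c + i * j) % 4 for j in range(k)] for i in range(k)]
    for r in range(k) for c in range(k)
}
expected, messages = grid_white_counts(k, white_sent)
```

In this sketch the control-message count is k^3 + 2k^2, which is N^(3/2) + 2N for N = k^2 processes, in line with the O(N^(3/2)) entry in the results table.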

Tree/Centralized Algorithms
- Idea
  - Previously: maintain white messages sent for every destination
  - These algorithms: nodes maintain local deficits
    - Local deficit = white messages sent - white messages received
    - Total deficit = sum of all local deficits
- Distributed Message Counting Problem
  - W in-transit messages destined for N processors
  - Detect when all messages have been received
  - W tokens: a token is consumed when a message is received

Tree/Centralized Algorithms
- Distributed Message Counting Algorithm
  - Arrange nodes in a suitable data structure
  - Distribute tokens equally to all processors at the start: w = W/n
  - Each node has a color:
    - Green (Rich): has more than w/2 tokens
    - Yellow (Debt-free): has <= w/2 tokens
    - Orange (Poor): has no tokens and has received a white message
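The color rule can be written as a small function. This is a sketch: the function name `node_color` and the `unpaid_white` parameter are assumptions introduced here, modeling "has no tokens and has received a white message" as a count of white messages that arrived while the node held no tokens.

```python
def node_color(tokens: int, w: int, unpaid_white: int = 0) -> str:
    """Classify a node given its token count and the per-node share w = W/n.

    unpaid_white: white messages received while the node had no tokens left
    (an assumption about how the "poor" state is tracked).
    """
    if tokens == 0 and unpaid_white > 0:
        return "orange"          # poor
    if 2 * tokens > w:
        return "green"           # rich: more than w/2 tokens
    return "yellow"              # debt-free: at most w/2 tokens


W, n = 64, 8
w = W // n                        # every node starts with w tokens

start = node_color(w, w)          # full share
half = node_color(w // 2, w)      # exactly w/2 tokens left
empty = node_color(0, w)          # no tokens, no unpaid message yet
poor = node_color(0, w, unpaid_white=1)
```

Comparing `2 * tokens > w` avoids fractional arithmetic when w is odd.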

Tree-based Algorithm: High-level idea
- Arrange nodes as a binary tree
- Progresses in rounds
  - In each round, all the nodes start off rich
  - A token is consumed on receiving a message
  - A debt-free node cannot have a rich child
    - Ensured by transfer of tokens
- Starting a new round
  - When the root is no longer rich => 1/2 of the tokens have been consumed

Tree-based Algorithm
- Invariants
  - I1: Yellow process cannot have green child
  - I2: Root is always green
  - I3: Any orange node eventually becomes yellow

Tree-based Algorithm - Example

Tree-based Algorithm - Example
[Figure labels: Violates I1; Swap Request; Swap Accept]

Tree-based Algorithm - Example

Tree-based Algorithm - Example
[Figure labels: Violates I3; Split Request; Split Accept]

Tree-based Algorithm - Example
[Figure label: Violates I2]

Tree-based Algorithm - Example
- Violates I2
- Reset round
  - Recalculate remaining tokens W' ( <= nw/2 = W/2 )
  - Start a new round with W'
  - Redistribute tokens equally => all nodes turn green

Tree-based Algorithm - Analysis
- Number of rounds
  - If W < 2n, only O( n ) messages are required
  - Tokens reduce by half in every round
  - # of rounds = O( log W/n )
- Number of control messages per round
  - O( log n ) control messages per color change
  - Whenever color changes, some green node turns yellow => O( n ) color changes per round
  - # of control messages per round = O( n log n )
- Total control messages = O( n log n log W/n )
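The round count in this analysis can be checked numerically. The sketch below assumes the worst case for the number of rounds, namely that each round ends with exactly half of the tokens consumed, and that the halving stops once fewer than 2n tokens remain. The function names are hypothetical, and the message estimate ignores constant factors.

```python
import math


def halving_rounds(W: int, n: int) -> int:
    """Rounds until fewer than 2n tokens remain, if each round consumes exactly half."""
    rounds = 0
    while W >= 2 * n:
        W //= 2
        rounds += 1
    return rounds


def control_message_estimate(W: int, n: int) -> int:
    """rounds * (n color changes per round) * (log n control messages per change)."""
    return halving_rounds(W, n) * n * math.ceil(math.log2(n))


rounds_small = halving_rounds(15, 8)       # W < 2n: no halving round needed
rounds_large = halving_rounds(1024, 8)     # W/n = 128
estimate = control_message_estimate(1024, 8)
```

For W >= 2n the loop runs floor(log2(W/n)) times, which is the O(log W/n) bound on the slide.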

Centralized Algorithm
- Idea
  - In the tree-based algorithm, every color change requires a search for a green node to split/swap tokens with
    - Requires O( log n ) control messages
  - Can we find a green node with O(1) control messages?
  - Master node (tail) maintains a list of all green nodes

Centralized Algorithm - Example
[Figure labels: Swap Request; Master; Swap Accept]

Centralized Algorithm - Example
[Figure labels: Split Request; Master; Split Accept]

Centralized Algorithm - Analysis
- Number of rounds
  - If W < 2n, only O( n ) messages are required
  - Tokens reduce by half in every round
  - # of rounds = O( log W/n )
- Number of control messages per round
  - O( 1 ) control messages per color change
  - Whenever color changes, some green node turns yellow => O( n ) color changes per round
  - # of control messages per round = O( n )
- Total control messages = O( n log W/n )

Lower Bound
- Observation
  - Suppose there are W outstanding tokens
  - Some process must generate a control message on receiving W/n white messages
- Send W/n white messages to that processor
  - Remaining tokens = (n-1)W/n
- Repeat the argument recursively
  - Tokens remaining after i control messages >= ((n-1)/n)^i · W
- # of control messages = Ω( n log W/n )
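The recursion in the lower-bound argument can be evaluated numerically. This sketch makes one assumption that the slide does not state explicitly: the adversary keeps repeating the step as long as at least n tokens remain, so that W/n is at least one message. The function name `forced_control_messages` is hypothetical.

```python
import math


def forced_control_messages(W: int, n: int) -> int:
    """Count adversary steps: each step delivers a 1/n share of the remaining
    tokens to one process and forces one control message."""
    remaining = float(W)
    steps = 0
    while remaining >= n:
        remaining *= (n - 1) / n   # remaining tokens after this step
        steps += 1
    return steps


W, n = 10**6, 100
steps = forced_control_messages(W, n)
lower = (n - 1) * math.log(W / n)
upper = n * math.log(W / n) + 1
```

Because 1/n <= ln(n/(n-1)) <= 1/(n-1), the step count lies between (n-1) ln(W/n) and n ln(W/n) + 1, which is the Ω( n log W/n ) growth stated on the slide.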

Experimental Results

Experimental Results

Conclusions
- Global snapshots in distributed systems
  - Distributed Message Counting problem
- Optimal algorithm
  - Message complexity O( n log W/n )
  - Matching lower bound
  - Centralized algorithm
- Open problem
  - Decentralized algorithm?

Thank You Questions?