Add Fault Tolerance – order & time Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport Optimal Clock Synchronization T.K. Srikanth.

Slides:



Advertisements
Similar presentations
Time, Clocks, and the Ordering of Events in a Distributed System
Advertisements

Chapter 8 Fault Tolerance
Synchronization.
Failure detector The story goes back to the FLP’85 impossibility result about consensus in presence of crash failures. If crash can be detected, then consensus.
Computer Science 425 Distributed Systems CS 425 / ECE 428 Consensus
The Byzantine Generals Problem Boon Thau Loo CS294-4.
Time and Clock Primary standard = rotation of earth De facto primary standard = atomic clock (1 atomic second = 9,192,631,770 orbital transitions of Cesium.
Byzantine Generals Problem: Solution using signed messages.
Computer Science 425 Distributed Systems CS 425 / ECE 428  2013, I. Gupta, K. Nahrtstedt, S. Mitra, N. Vaidya, M. T. Harandi, J. Hou.
Distributed Systems Spring 2009
Time in Embedded and Real Time Systems Lecture #6 David Andrews
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Distributed systems Module 2 -Distributed algorithms Teaching unit 1 – Basic techniques Ernesto Damiani University of Bozen Lesson 3 – Distributed Systems.
Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley Copyright © George Coulouris, Jean Dollimore, Tim.
Randomized Byzantine Agreements (Sam Toueg 1984).
Synchronization Clock Synchronization Logical Clocks Global State Election Algorithms Mutual Exclusion.
Replication Management using the State-Machine Approach Fred B. Schneider Summary and Discussion : Hee Jung Kim and Ying Zhang October 27, 2005.
2/23/2009CS50901 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat.
Clock Synchronization Ken Birman. Why do clock synchronization?  Time-based computations on multiple machines Applications that measure elapsed time.
Josef WidderBooting Clock Synchronization1 The  - Model, and how to Boot Clock Synchronization in it Josef Widder Embedded Computing Systems Group
Time, Clocks and the Ordering of Events in a Distributed System - by Leslie Lamport.
EEC-681/781 Distributed Computing Systems Lecture 10 Wenbing Zhao Cleveland State University.
Distributed Algorithms: Agreement Protocols. Problems of Agreement l A set of processes need to agree on a value (decision), after one or more processes.
EEC-681/781 Distributed Computing Systems Lecture 10 Wenbing Zhao Cleveland State University.
Time Supriya Vadlamani. Asynchrony v/s Synchrony Last class: – Asynchrony Event based Lamport’s Logical clocks Today: – Synchrony Use real world clocks.
State Machines CS 614 Thursday, Feb 21, 2002 Bill McCloskey.
Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport (1978) Presented by: Yoav Kantor.
Distributed Systems Foundations Lecture 1. Main Characteristics of Distributed Systems Independent processors, sites, processes Message passing No shared.
TIME, CLOCKS AND THE ORDERING OF EVENTS IN A DISTRIBUTED SYSTEM L. Lamport Computer Science Laboratory SRI International.
1 Clock Synchronization Ronilda Lacson, MD, SM. 2 Introduction Accurate reliable time is necessary for financial and legal transactions, transportation.
Lecture 2-1 CS 425/ECE 428 Distributed Systems Lecture 2 Time & Synchronization Reading: Klara Nahrstedt.
1 Physical Clocks need for time in distributed systems physical clocks and their problems synchronizing physical clocks u coordinated universal time (UTC)
1 A Modular Approach to Fault-Tolerant Broadcasts and Related Problems Author: Vassos Hadzilacos and Sam Toueg Distributed Systems: 526 U1580 Professor:
Fault Tolerance via the State Machine Replication Approach Favian Contreras.
Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Parallel and Distributed Simulation Synchronizing Wallclock Time.
Synchronization. Why we need synchronization? It is important that multiple processes do not access shared resources simultaneously. Synchronization in.
Global Time in Distributed Real-Time Systems Dr. Konstantinos Tatas.
CS603 Clock Synchronization February 4, What is the best we can do? Lundelius and Lynch ‘84 Assumptions: –No failures –No drift –Fully connected.
Agenda Fail Stop Processors –Problem Definition –Implementation with reliable stable storage –Implementation without reliable stable storage Failure Detection.
Communication & Synchronization Why do processes communicate in DS? –To exchange messages –To synchronize processes Why do processes synchronize in DS?
Distributed Systems Principles and Paradigms Chapter 05 Synchronization.
Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport Massachusetts Computer Associates,Inc. Presented by Xiaofeng Xiao.
Time This powerpoint presentation has been adapted from: 1) sApr20.ppt.
CSE 60641: Operating Systems Implementing Fault-Tolerant Services Using the State Machine Approach: a tutorial Fred B. Schneider, ACM Computing Surveys.
Synchronization CSCI 4900/6900. Importance of Clocks & Synchronization Avoiding simultaneous access of resources –Cooperate to grant exclusive access.
D u k e S y s t e m s Asynchronous Replicated State Machines (Causal Multicast and All That) Jeff Chase Duke University.
Physical clock synchronization Question 1. Why is physical clock synchronization important? Question 2. With the price of atomic clocks or GPS coming down,
Time and global states Chapter 11. Outline Introduction Clocks, events and process states Synchronizing physical clocks Logical time and logical clocks.
CS 3471 CS 347: Parallel and Distributed Data Management Notes13: Time and Clocks.
CS 347Notes 121 CS 347: Parallel and Distributed Data Management Notes12: Time and Clocks Hector Garcia-Molina.
Hwajung Lee. Primary standard = rotation of earth De facto primary standard = atomic clock (1 atomic second = 9,192,631,770 orbital transitions of Cesium.
Lecture 1: Logical and Physical Time with some Applications Anish Arora CSE 6333 Notes include material from Dr. Jeff Brumfield.
Ordering of Events in Distributed Systems UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau.
Distributed Systems Lecture 5 Time and synchronization 1.
CSE 486/586 CSE 486/586 Distributed Systems Time and Synchronization Steve Ko Computer Sciences and Engineering University at Buffalo.
Fail-Stop Processors UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau One paper: Byzantine.
1 AGREEMENT PROTOCOLS. 2 Introduction Processes/Sites in distributed systems often compete as well as cooperate to achieve a common goal. Mutual Trust/agreement.
Proof of liveness: an example
Distributed Computing
Time and Clock.
Agreement Protocols CS60002: Distributed Systems
Time and Clock.
Maya Haridasan April 15th
TIME, CLOCKS AND THE ORDERING OF EVENTS IN A DISTRIBUTED SYSTEM
Chapter 5 (through section 5.4)
Physical clock synchronization
Outline Theoretical Foundations
Presentation transcript:

Add Fault Tolerance – order & time Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport Optimal Clock Synchronization T.K. Srikanth and Sam Toueg Presenter: Feng Shao (Some slides borrowed from Lamport)

Why do we care about the “Time” in a distributed system? l May need to know the time of day at which some event happens on a specific computer l external clock synchronization l For two events that happened on different computers l May need to know the relative order l May need to know time interval l internal clock synchronization

Physical Clocks l Every computer contains a physical clock l A clock is an electronic device that counts oscillations in a crystal at a particular frequency l Count is typically divided and stored in a computer register l Clock can be programmed to generate interrupts at regular intervals. l This value can be used to timestamp an event on that computer l Two events will have different timestamps only if clock resolution is sufficiently small l Many applications are interested only in the order of events, not the exact time of day at which they occurred.

Physical Clocks in Distributed Systems l Does this work? l Synchronize all the clocks to some known high degree of accuracy, and then l Measure time relative to each local clock to determine order between two events l Well, there are some problems… l It’s difficult to synchronize the clocks l Crystal-based clocks tend to drift over time-count time at different rates, and diverge from each other l Physical variations in the crystals, temperature variations, etc. l Drift is small, but adds up over time l For quartz crystal time, typical drift rate is about one second every 10 6 seconds=11.6days l Best atomic clocks have drift rate of one second in seconds = 300,000 years

Logical Clocks l Idea — abandon idea of physical time l For many purposes, it is sufficient to know the order in which events occurred l Lamport (1978) — introduce logical (virtual) time, to provide consistent event ordering

TIME, CLOCKS AND THE ORDERING OF EVENTS IN A DISTRIBUTED SYSTEM Leslie Lamport

THE PAPER l Handles the problem of clock drift in distributed systems l Identify main function of computer clocks l How to order events l Indicates which conditions clocks must satisfy to fulfill their role l Introduces logical clocks

ORDERING EVENTS l Event ordering linked with concept of causality: l Saying that event a happened before event b is same as saying that event a could have affected the outcome of event b l If events a and b happen on processes that do not exchange any data, their exact ordering is not important

Relation “has happened before” (I) l Smallest relation satisfying the three conditions: l If a and b are events in the same process and a comes before b, then a  b l If a is the sending of a message by a process and b its receipt by another process then a  b l If a  b and b  c then a  c.

Example (I) Process i Process k Process jX X X X X a c b d e

Example (II) l From first condition l a  d l c  e l From second condition l a  c l b  e l From third condition l a  e

Relation “has happened before” (II) l We cannot always order events: relation “has happened before” is only a partial order l If a did not happen before b, it cannot causally affect b.

Logical clocks l Verify the clock condition: l if a  b then C and the two sub-conditions: l if a and b are events in process P i and a comes before b, then C i, l if a is the sending of a message by P i and b its receipt by P j then C i,

Implementation rules l Each process P i increments its clock C i between two consecutive events, l If a is the sending of a message m by P i then m includes a timestamp T m = C i when P j receives m, it sets its clock to a value greater than or equal to its present value and greater than T m.

Defining a total order l We can define a total ordering on the set of all system events a  b if either C i or C i = C j and P i < P j. l This ordering is not unique

Anomalous behaviors l Logical clocks have anomalous behaviors in the presence of outside interactions l carrying a diskette from one machine to another l dictating file changes over the phone l Must use physical clocks

Example Process i Process k Process jX X X X X a c b d e outside interaction

Strong clock condition Let S be set of all systems events plus the relevant external events For any events a, b in S, if a  b then C

Physical clock conditions l There is a constant k << 1 such that for all i: |d C i (t)/dt - 1| < k The clock is neither too fast nor too slow l There is a constant  such that for all i, j: l |C i (t) - C j (t)| <  The clocks are more or less synchronized

Observations l Like logical clocks, physical clocks cannot be rolled back l Required accuracy of a physical clock depends on the minimum transmission delay of outside interactions l If it takes 20 minutes to carry a diskette between two machines their clocks can be off by up to 20 minutes

Example Process i Process jX X X 11:30 am d OK 11:15 am X 11:30 am NO 20 minutes

Optimal Clock Synchronization T. K. Srikanth and Sam Toueg

Why do clock synchronization? l Time-based computations on multiple machines l Applications that measure elapsed time l Agreeing on deadlines l Real time processes may need accurate timestamps l Many applications require that clocks advance at similar rates l Real time scheduling events based on processor clock l Setting timeouts and measuring latencies l Ability to infer potential causality from timestamps

Famous example l Scud rockets launched by Iraq towards Israel l Ground-based Patriot missiles fire back l But missiles always missed the warhead! l Why?

Famous example l Scud rockets launched by Iraq towards Israel l Ground-based Patriot missiles fire back l But missiles always missed the warhead! l Why? l After 72 hours of waiting control system was out of sync relative to Patriot guidance system l “be at (x,y,z) at time t” was misinterpreted!

Synchronization with failures A process is faulty if its behavior deviates from that prescribed by the algorithm it is running. 1. Crash: The process stops and does nothing from that point. 2. Send omission: The process crashes or omits to send messages that it is supposed to send. 3. Receive omission: The process crashes or does not receive messages sent to it. 4. General omission: The faulty process is subject to send omissions, receive omissions, or both. 5. Arbitrary (sometimes called Byzantine ): The faulty process can exhibit any behavior, including malicious actions that will cause the system to fail.

The System Model l Hardware clocks l Physical clock of process q designated R q (t) l Clocks have a drift rate ρ: l (1+ ρ) -1 (t 2 -t 1 )  R p (t 2 )- R p (t 1 )  (1+ ρ) (t 2 -t 1 ) l Implies that rate of drift is bounded by d r = ρ(2+ ρ)/(1+ ρ) l For time t, general bounds: (1- ρ)t  (1+ ρ) -1 t  R(t)  (1+ ρ)t  (1- ρ) -1 t l There is a limit t del on message latency

Clock synchronization goals l A clock synchronization protocol implements a virtual clock function mapping real time t to C p (t) l Agreement condition: l |C p (t) - C q (t)|  D max for all correct p, q l D max bounds the difference between two virtual clocks running on different processors l Accuracy condition: l (1+  ) -1 t + a  C p (t)  (1+  )t +b, for constants a, b,  l Says that p’s clock must be within a linear envelope of “real time”

Clocks and True Time True Time  Clock Time  Ideal Clock Virtual Clock: C p (t) (1+  ) -1 t + a (1+  )t +b ab

Authenticated Algorithm //(not a sequential program) if received f+1 signed messages (round k) (“accept”)  C k (t):=kP+a; relay all f+1 signed messages to all fi coend cobegin if C k-1 (t) = kP  sign and broadcast (round k) fi l Solution for system of n processes, at most f of which are faulty

Observations l Why relay? l Faulty processes do not necessarily broadcast. l Why N > 2f? faulty processes correct processes N = 4, f = 2, suppose faulty processes get stuck and p, q want to resynchronize p q p, q cannot resynchronize !

Achieving Optimal Accuracy l Bound on accuracy: for any synchronization, even in the absence of faults, accuracy cannot exceed that of the underlying hardware clocks l Why algorithm 1 is not optimal? l Uncertainty of t del introduces a difference in the logical time between resyn.

Optimality (informal description) l Solution: compensate for the uncertainty of t del : If a process accepts a (round k) message early, it delays the starting of the kth clock by t del /2(1+ ρ). If it accepts the message late, it advances the starting of kth clock by t del /2(1+ ρ). l Suppose process i accepts (round k) message at time t, and let T=C k-1 (t), ß = t del /2(1+ ρ) l early: T <= kP + ß l late: T > kP+ ß l Proof of correctness: remarkably tricky, ignored here

Unauthenticated algorithm l The authenticated algorithm relies on properties of the message system: l Correctness: If at least f+1 correct processes broadcast round k messages by time t, then every correct process accepts a message by time t+t del l Unforgeability: If no correct process broadcasts a round k message by time t, then no correct process accepts the message by time t or earlier l Relay: If a correct process accepts the message round k at time t, then every correct process does so by time t+t del

Unauthenticated algorithm (II) l A broadcast primitive which has the three properties To broadcast a (round k) message, a correct process sends (init, round k) to all. for each correct process: if received (init, round k) from at least f+ 1 distinct processes  send (echo, round k) to all; received (echo, round k) from at least f+ 1 distinct processes  send (echo, round k) to all; fi if received (echo, round k) from at least 2f+ 1 distinct processes  accept (round k) fi l Requires n > 3f+1, in order to accept

N > 3f +1 faulty processes correct processes N = 5, f = 2, suppose faulty processes get stuck, all three correct processes want to resynchronize p q p, q, r never receive 2f +1 ( echo, round k), thus not accept r

Simulating Authentication l Nonauthenticated algorithm for clock synchronization for process p for round k cobegin if C k-1 (t) = kP /* ready to start Ck */  broadcast (round k) fi /* using the broadcast primitive*/ // if accepted the message (round k) /* according to the primitive */  C k (t) := kP + a fi /* start Ck */ coend Message overhead : O(n 2 )

Restricted Models of failure l Now assume arbitrary failure l For other types of failures, including crash, sr-omission, the algorithm can be easily modified to achieve the optimality in the number of fault processes.

Summary l A unified solution for synchronizing clocks. l In practice, quality of synchronization remains relatively poor l At best synchronization will be limited by quality of physical clocks, rates of physical clock drift, and uncertainty in latencies

??? //