The Θ-Model Ulrich Schmid Josef Widder Martin Hutle Daniel Albeseder


The Θ-Model. Ulrich Schmid, Josef Widder, Martin Hutle, Daniel Albeseder (Vienna University of Technology, Embedded Computing Systems Group, http://www.ecs.tuwien.ac.at); Gérard Le Lann, Jean-François Hermant (INRIA Rocquencourt, Project Novaltis, http://www.inria.fr). 2/22/2019. The Theta-Model (Version 1.3)

Motivation

Timed Algorithms: Most FT algorithms for distributed RTS have explicit time values (unit "seconds") in their code / variables. Toy example: local real-time clock for timing out a crashed process:
    msg_pong = do_roundtrip(msg_ping, p)
        send msg_ping to p
        TIMEOUT := C(t) + 2τ+    /* max. e.-t.-e. delay τ+ (sec) */
        while C(t) < TIMEOUT do nothing
        if msg_pong did not arrive then msg_pong := NIL
        return msg_pong
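
A minimal Python sketch of the timed toy example above. The names `do_roundtrip_timed`, `send` and `inbox` are hypothetical stand-ins for the slide's pseudocode, and a blocking queue read replaces the busy-wait on the local clock C(t). The only point is that the real-time value 2τ+ (in seconds) appears explicitly in the code.

```python
import queue

NIL = None  # stands in for the slide's NIL value


def do_roundtrip_timed(send, inbox, tau_plus):
    """Send msg_ping, then wait at most 2 * tau_plus seconds for msg_pong.

    send:     hypothetical callable that transmits a message to process p
    inbox:    queue.Queue on which p's reply (if any) is delivered
    tau_plus: assumed maximum end-to-end delay in seconds
    """
    send("msg_ping")
    try:
        # Blocks until a reply arrives or the real-time timeout 2*tau_plus expires.
        return inbox.get(timeout=2 * tau_plus)
    except queue.Empty:
        return NIL  # no msg_pong within the timeout: p is treated as crashed


# A responsive peer answers immediately; a crashed peer never answers.
live_inbox = queue.Queue()
reply = do_roundtrip_timed(lambda m: live_inbox.put("msg_pong"), live_inbox, 0.01)
no_reply = do_roundtrip_timed(lambda m: None, queue.Queue(), 0.01)
```

If the assumed τ+ is too small for the real system, this function returns NIL for a live peer, which is exactly the coverage problem the following slides discuss.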

Implications? Safety properties like consistency of replicated data may depend upon non-NIL operation of do_roundtrip. Usual assumption: real-time systems must always meet their timeliness properties. This is only possible if all end-to-end delays satisfy δ ≤ τ+; safety properties are also guaranteed in this case. BUT: bounds like τ+ that always hold are very difficult to determine for real systems. Fail-operational systems might be allowed to sometimes lose timeliness, but never to lose consistency.

Why is determining τ+ difficult? Queuing phenomena: simultaneous messages from different peers (CPU); multiple processes (CPU); multiple messages (link). End-to-end delays hence depend upon: message and computational complexity of algorithms; interaction ("blocking factors"); load conditions; scheduling disciplines.

Importance of Scheduling? τ+ can be huge in real systems, since all messages [including application-level] must be taken into account. The maximum determines the synchronous round duration ⇒ too conservative for most messages. Escape: appropriate scheduling. Fast Failure Detectors by Hermant & Le Lann [HLL02]: use head-of-the-line scheduling for FD-level processes and messages. Only blocking factors due to non-preemptible resources can lead to priority inversion phenomena on FD-level. The τ+ relevant for failure detection latency is reduced by orders of magnitude.

But still … Hermant & Le Lann [HLL02]: τ+ = γ(n) with [formula not preserved in the transcript] (note that woutQ, woutq and winq are the problematic parts here). Do you trust a real system to always obey this, during the whole mission time? Do you really want your safety and liveness properties to depend on this?

Alternatives? Are there ways to guarantee logical safety & liveness properties independently of the timing properties of the underlying system? YES: asynchronous algorithms (time-free, message-driven). Are there suitable time-free computational models and algorithms? YES: the Θ-Model.

Roadmap of our Presentation: Overview of Computational Models; The Θ-Model; First Experimental Results; Applications.

Overview of Computational Models

The FLP Asynchronous Model (I): Fischer, Lynch & Paterson [FLP85]. System of n processes communicating via a reliable point-to-point network. Every message sent is eventually delivered. No bounded-drift clocks are available. Computational step times are non-negative, finite but unbounded (i.e., can exceed any a priori given bound). Message transmission delays are non-negative, finite but unbounded.

The FLP Asynchronous Model (II): The FLP model has no timing assumption at all ⇒ it cannot be violated at runtime. BUT: In the FLP-Model, it is impossible to distinguish a slow from a crashed process. Important DC problems like consensus are impossible to solve in the FLP-Model in the presence of failures. For solvability, some property/properties must be added to the pure FLP model.

The FLP Asynchronous Model (III): Resulting spectrum of models: FLP → partially synchronous → synchronous. Clearly: the stronger the added property, the lower the assumption coverage in real systems. Usually: add explicit timeliness properties to the FLP-Model. Sometimes: add implicit timeliness properties to the FLP-Model (time-free models).

(Close to) Synchronous Models: Synchronous model ⇒ allows simulation of lock-step rounds: transmission delay bound Δ; computing step time bound σ; bounded-drift local clocks available. Timed Asynchronous Model by Cristian & Fetzer [CF99]: BUT: fail awareness allows the bounds Δ and σ to be violated arbitrarily often ⇒ fail-safe behavior.

Partially Synchronous Models (I): Dwork, Lynch & Stockmeyer [DLS88], Ponzio & Strong [PS92], Attiya, Dwork, Lynch & Stockmeyer [ADLS94]. Transmission delay bound Δ. Bounded ratio Φ of max. over min. computing step times. Bounds are unknown, or known but hold only from some unknown time GST on. Every process can locally time-out messages: [PS92, ADLS94]: the semi-synchronous model assumes availability of bounded-drift local clocks. [DLS88]: computing steps of the fastest processor are used as real-time units [= unit of Δ!] ⇒ a local clock with bounded rate ∈ [1/Φ, 1] is implementable via spin-loop.

Partially Synchronous Models (II): Archimedean model by Vitányi [Vit85]: bound s ≥ u/m on the ratio of max. computing step time + max. transmission delay (u) to min. computing step time (m); s is dimensionless. Every process can again locally time-out messages [via spinning for s steps]. Finite Average Round-Trip-Time Model by Fetzer & Schmid [FS04]: unknown lower bound for computing step time; stubborn links with unknown average round-trip time bound; every process can implement a "weak clock" via spin-loop.

FLP-Model with Failure Detectors: Replace explicit timeliness properties by unreliable failure detectors. FDs are local oracles based upon a list of suspected processes. Completeness: every crashed process is eventually suspected. Accuracy: no correct process is suspected. FLP-Model + FDs allow the most important distributed computing problems to be solved. BUT: implementing FDs in a real system necessarily requires a system model stronger than FLP ⇒ back at the initial problem.

The Θ-Model

Time-Free Message-Timeout in ParSync? Implementation of do_roundtrip(msg_ping, p) using a spin-loop in the parsync models of [DLS88] or [Vit85]:
    send msg_ping to p
    for i=1 to x do no-op    /* x=f(Δ, Φ) resp. x=f(s) is dimensionless! */
    if msg_pong did not arrive then msg_pong := NIL
    return msg_pong
The algorithm is time-free, since neither code nor variables contain real-time values (unit "seconds"), but it is not message-driven.

But … There is the ([DLS88]: hidden, [Vit85]: explicit) assumption that all timing values/bounds are multiples of the min. computing step time (m). The algorithm would be time-free only if m could vary arbitrarily. Since there is no physically evident correlation between transmission delay and computing step time, however, m cannot vary arbitrarily without violating the physical (real-time) transmission delay bound [since Δ resp. s are fixed]. Assuming fixed Δ resp. s hence makes sense for essentially constant m only. Not time-free in reality, since m then acts as the unit of real time!

Still: Can we make this idea work? The problem with the previous algorithm is that computing step times and transmission delays are uncorrelated. Key idea: replace the unit of time "fastest computing step" of [DLS88], [Vit85] by "fastest end-to-end delay". Just assume that, during any round-trip, there may not be more than Θ other successive round-trips (anywhere in the system).

Time-free implementation of do_roundtrip(.):
    send msg_ping to p
    for i=1 to Θ do    /* Θ is dimensionless! */
    begin    /* do additional roundtrips for waiting */
        send delay_ping(i) to process q
        wait for delay_pong(i) from process q
    end
    if msg_pong did not arrive then msg_pong := NIL
    return msg_pong
The algorithm is time-free, since Θ is dimensionless, and fully message-driven, since all events are triggered by message receptions only.
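
The message-driven timeout can be illustrated with a small event-driven simulation in Python. This is only a sketch under stated assumptions: `simulate_do_roundtrip`, `pong_delay` and `delay_rtts` are hypothetical names, and the simulated times exist only in the modelled network. The algorithm itself reacts to message receptions and counts Θ round-trips.

```python
import heapq

NIL = None  # stands in for the slide's NIL value


def simulate_do_roundtrip(theta, pong_delay, delay_rtts):
    """Simulate the Θ-based do_roundtrip at the calling process.

    theta:      number of successive delay round-trips with q (theta >= 1)
    pong_delay: simulated time until msg_pong arrives, or None if p crashed
    delay_rtts: simulated durations of the theta round-trips with q
    """
    events = []  # pending receptions as (arrival time, message)
    if pong_delay is not None:
        heapq.heappush(events, (pong_delay, "msg_pong"))
    # delay_ping(1) is sent right after msg_ping, at simulated time 0.
    heapq.heappush(events, (delay_rtts[0], "delay_pong"))
    got_pong, rounds_done = False, 0
    while events:
        now, msg = heapq.heappop(events)
        if msg == "msg_pong":
            got_pong = True
        else:
            rounds_done += 1
            if rounds_done == theta:
                break  # waiting ends after theta receptions, no clock involved
            # Receiving delay_pong(i) triggers sending delay_ping(i+1).
            heapq.heappush(events, (now + delay_rtts[rounds_done], "delay_pong"))
    return "msg_pong" if got_pong else NIL


on_time = simulate_do_roundtrip(5, 7.0, [2.0] * 5)
scaled = simulate_do_roundtrip(5, 7000.0, [2000.0] * 5)  # same system, 1000x slower
crashed = simulate_do_roundtrip(5, None, [2.0] * 5)
```

Scaling every delay by the same factor leaves the outcome unchanged, which is the sense in which the waiting time adapts to the actual speed of the system.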

Time-free implementation of do_roundtrip(.): [Figure: time diagram with processes q and p, messages msg_ping and msg_pong, round-trips 1 to 5 (Θ = 5), and the resulting waiting time D.] Timing behavior solely emerges from the underlying system [D adapts automatically to actual speed]. Consider execution in a synchronous system: end-to-end delays δ satisfy τ− ≤ δ ≤ τ+ with τ+ / τ− ≤ Θ = 5 ⇒ termination within 10 τ− ≤ D ≤ 10 τ+. Examples: τ+ = 100 µs ⇒ D ≤ 1 ms; τ+ = 1 s ⇒ D ≤ 10 s.
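
The bounds on D follow from Θ round-trips of two messages each. A short Python sketch of that arithmetic, with the hypothetical helper `waiting_time_bounds` and with τ− values that are assumed here (chosen so that τ+/τ− = Θ = 5), since the slide only gives τ+:

```python
def waiting_time_bounds(theta, tau_minus, tau_plus):
    """Bounds on the waiting time D of theta successive round-trips.

    Each round-trip consists of two messages with delay in
    [tau_minus, tau_plus], so D lies in
    [2 * theta * tau_minus, 2 * theta * tau_plus].
    """
    if tau_plus > theta * tau_minus:
        raise ValueError("tau_plus / tau_minus exceeds theta")
    return 2 * theta * tau_minus, 2 * theta * tau_plus


# All delays in microseconds; the tau_minus values are assumptions.
fast = waiting_time_bounds(5, 20, 100)               # tau_plus = 100 µs
slow = waiting_time_bounds(5, 200_000, 1_000_000)    # tau_plus = 1 s
```

With these inputs the upper bounds are 1000 µs (1 ms) and 10,000,000 µs (10 s), matching the two examples on the slide.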

Performance? Is doing continuous successive round-trips for delay purposes prohibitively expensive? NO! (a) Reasonably large delay × bandwidth product: τ+ = 1 ms with 1 Mbit/sec peer-to-peer bandwidth allows sending 1000 bits per message; do_roundtrip(.) needs only a few bits of message data; only a few % overhead for continuous round-trips! (b) Small delay × bandwidth product: use a timer to separate multiple instances of do_roundtrip(.); no bounded-drift timer is required here ⇒ implementable without a hardware clock by counting some local events.
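
The overhead estimate in case (a) can be reproduced with a few lines of Python. The helper names and the 32-bit message size are assumptions made for illustration; the slide only says "a few bits".

```python
def bits_in_flight(tau_plus_us, bandwidth_bps):
    """Bits that fit on a link during one delay tau_plus (in microseconds)."""
    return tau_plus_us * bandwidth_bps // 1_000_000


def overhead_percent(message_bits, tau_plus_us, bandwidth_bps):
    """Share of the delay * bandwidth product used by one delay message."""
    return 100 * message_bits / bits_in_flight(tau_plus_us, bandwidth_bps)


capacity = bits_in_flight(1_000, 1_000_000)         # tau_plus = 1 ms at 1 Mbit/s
overhead = overhead_percent(32, 1_000, 1_000_000)   # assumed 32-bit delay message
```

Under these assumptions the link carries 1000 bits per delay, and a 32-bit delay message uses about 3.2 % of it.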

The Θ-Model (Simple Version): FLP-Model + end-to-end delays δ of all messages in transit at t: minimum τ−(t), maximum τ+(t). τ+(t) and τ−(t) may vary arbitrarily with time, but the ratio Θ(t) = τ+(t)/τ−(t) must remain bounded by some [known or even unknown] Θ for every time t.
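
A minimal Python sketch of the definition above, assuming a message trace given as hypothetical (send time, receive time) pairs. It computes Θ(t) over the messages in transit at t and checks a candidate bound Θ at a set of sample times.

```python
def theta_at(t, messages):
    """Ratio tau_plus(t) / tau_minus(t) over all messages in transit at t.

    messages: list of (send_time, receive_time) pairs with receive > send.
    Returns None if no message is in transit at t.
    """
    delays = [recv - sent for sent, recv in messages if sent <= t < recv]
    if not delays:
        return None
    return max(delays) / min(delays)


def theta_holds(theta, times, messages):
    """Check Θ(t) <= theta at every sampled time with messages in transit."""
    ratios = [theta_at(t, messages) for t in times]
    return all(r is None or r <= theta for r in ratios)


# Hypothetical trace: delays 10, 5 and 1 time units.
trace = [(0, 10), (2, 7), (8, 9)]
ratio_at_3 = theta_at(3, trace)   # delays 10 and 5 in transit
ratio_at_8 = theta_at(8, trace)   # delays 10 and 1 in transit
```

The second ratio shows how a single very fast message in transit inflates Θ(t), which is the shortcoming the generalized version addresses.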

Key Question: Can we indeed expect a (positive) correlation between τ+(t) and τ−(t) in a real system? Shared channel-type networks [Deterministic Ethernet]: theoretical analysis by Hermant & Widder [HW04] has shown that Θ close to 1 can be achieved. Fully connected systems: a first experimental evaluation of a simple Θ clock synchronization algorithm by Albeseder [Alb04] confirms the correlation.

Reason for such a correlation? Restriction to broadcast communication (shared channel or multiple point-to-point sends in a fully connected network). (Part of) the messages populating the queues from p → q are also sure/likely to populate the queues from p → r, and even from s → r. [Figure: senders p and s, receivers q and r, each with CPU and channel queues on the links p → q, p → r, s → r, q → x and r → y; timeline t with arrival at p, processing at q (δpq = 10) and processing at r (δpr = 7).]

Correlation ⇒ Coverage Expansion: Given some bounds τ+ and τ− assumed during system design (also used in synchronous systems), compute Θ = τ+ / τ−. Unanticipated overload: τ+(t) > τ+. If τ+(t) ≤ Θ τ−(t), however, the Θ-system is still OK. [Figure: end-to-end delays δ over time t, with the overload period marked "Synchronous system out of spec".] Note: τ+(t) = τ+ + α(t) and τ−(t) = τ− + α(t)/Θ are sufficient for Θ to hold.
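
The sufficiency note can be checked numerically. The Python sketch below uses exact fractions and a hypothetical helper `overloaded_delays`; the design bounds and the overload α are made-up numbers. With Θ = τ+/τ−, shifting τ+ by α and τ− by α/Θ keeps the ratio at Θ even though τ+(t) exceeds the design bound τ+.

```python
from fractions import Fraction


def overloaded_delays(tau_minus, tau_plus, alpha):
    """Return (tau_minus(t), tau_plus(t)) under an overload term alpha.

    Uses tau_plus(t) = tau_plus + alpha and
    tau_minus(t) = tau_minus + alpha / theta, with theta = tau_plus / tau_minus.
    """
    theta = Fraction(tau_plus, tau_minus)
    return tau_minus + alpha / theta, tau_plus + alpha


# Hypothetical design bounds tau_minus = 2, tau_plus = 10 (so theta = 5),
# and an unanticipated overload of alpha = 30.
lo, hi = overloaded_delays(2, 10, 30)
ratio = hi / lo
```

Here τ+(t) = 40 violates the synchronous bound τ+ = 10, while the ratio τ+(t)/τ−(t) stays at exactly 5.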

Still: Shortcomings of the Simple Θ-Model: The predicted correlation need not exist for every fast message but only for some. Some very fast messages [even τ− = 0] may be in transit somewhere in the system even during a slow message. Correlation, and hence coverage expansion, does not exist in such cases. A more relaxed definition of the relation between slow and fast messages is needed: all that is actually needed is to constrain the number of fast messages during a slow one. There is no need for a correlation at every point in time t.

The Θ-Model (Generalized Version): Consider a chain of k ≥ 1 successive messages: the longest chain of "covered" causal messages has length ≤ kΘ. [Figure: k = 2 successive (slow) messages with delays τ+(t1) and τ+(t2), covering ≤ kΘ = 9 causally dependent (fast) messages ⇒ Θ = 4.5.] Advantage: messages with τ−(t) = 0 are allowed here!
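
The counting condition of the generalized version reduces to simple arithmetic. A small Python sketch with hypothetical helper names, using the numbers from the figure (k = 2 slow messages, 9 fast messages):

```python
def chain_is_covered(fast_chain_length, k, theta):
    """Generalized condition: at most k * theta causally dependent fast
    messages may occur during a chain of k successive slow messages."""
    return fast_chain_length <= k * theta


def smallest_theta(fast_chain_length, k):
    """Smallest theta for which the observed chain satisfies the condition."""
    return fast_chain_length / k


needed = smallest_theta(9, 2)
ok = chain_is_covered(9, 2, 4.5)
too_long = chain_is_covered(10, 2, 4.5)
```

Note that the condition only counts messages, so a fast message with zero delay poses no problem, unlike in the ratio τ+(t)/τ−(t).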

Partial Order of Partially Synchronous Models: DLS … [DLS88] with a priori known Δ, Φ. Θ … Θ-Model with a priori known Θ. DLSu … [DLS88] with a priori unknown Δ, Φ. Θu … Θ-Model with a priori unknown Θ. FLP … FLP-Model. [Figure: partial-order diagram relating FLP, Θu, DLSu, Θ and DLS.]

Existing Θ-Algorithms: Perfect failure detectors [Schmid and Le Lann 2003]; clock synchronization (+ system booting) [Widder 2003], [Widder and Schmid 03]; eventually perfect failure detectors / system booting [Widder, Le Lann and Schmid 2003]; fast failure detectors atop of Deterministic Ethernet [Widder and Hermant 2004]; self-stabilizing failure detectors & impossibility results [Hutle and Widder 2004]; synchronizer, SDD problem, atomic commitment, etc. [Widder's PhD 2004].

Thanks! http://www.ecs.tuwien.ac.at/~widder/Theta/

First Experimental Results

Remember the Key Question: Can we indeed expect a (positive) correlation between τ+(t) and τ−(t) in a real system? Alternatively: let Θ = τ+ / τ−, with τ− = min_t τ−(t) being the total minimum over all t and τ+ = max_t τ+(t) being the total maximum over all t. Is it the case that Θ(t) < Θ? How often, and how large is the gain Θ/Θ(t)?

Evaluation Setup: Master's thesis by Daniel Albeseder [Alb04]. Pentium 4 workstations (2.4 GHz, FSB533). Fully switched Fast-Ethernet over two Cisco Catalyst 2950 switches (connected over a fiber Gigabit-Ethernet backbone). Red Hat Linux 7.2 with 2.4.20 kernel, patched with High-Resolution-Timers and Kernel-Preemption.

Evaluation Parameter Settings: n = 4 processors with at most f = 1 faulty ones. Head-of-line process scheduling (Linux RT priorities). High message priority (low-latency bit in TOS byte), but no head-of-the-line message scheduling. Simulated broadcast (= multiple point-to-point sends). Fixed message length: 36 bytes. Inter-round delay: 1 ms. Duration of an evaluation run: 10 … 100 s range.

System Design: [Figure: ctrlpsa and evalpsa workstations connected by fully switched Fast-Ethernet.] The ctrlpsa workstation controls the network of evalpsa clients. Each evalpsa runs the algorithm to be evaluated. The fully connected network is simulated by a fully switched Fast-Ethernet.

Control Communication: [Figure: message timelines (t) between ctrlpsa and evalpsa. Labels: Phases, boot, init, done booting, stop, start running, change parameters, …, run algorithm, collecting, store, done.]

Evalpsa Structure

Data Analysis: Consider only clock synchronization messages. τ−(t), τ+(t), Θ(t) etc. are only evaluated at times t where some rule of the algorithm fires ("effective Θ"). One-way delays are approximated via round-trip delays for simplicity (i.e., we assume that both messages of a round-trip have the same delay). The clock of one designated processor is used as the global timebase; all timestamps are adjusted a posteriori to this global timebase.

Glossary of variables: τ−(t), τ+(t): min. and max. delay of all messages in transit at some time t. Θ(t) = τ+(t)/τ−(t). Θmax = max_t Θ(t). τ−, τ+: min. and max. delay of all messages in transit at all times during the evaluation run. Θ = τ+/τ−. Gain = Θ/Θmax.
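
The glossary quantities can be computed from a list of per-time samples. The Python sketch below assumes hypothetical integer samples (τ−(t), τ+(t)) and reads the gain as the ratio of τ+/τ− (global extremes) to the maximum of Θ(t) over the run; this reading is an interpretation, since the distinguishing notation was lost in the transcript.

```python
from fractions import Fraction


def theta_statistics(samples):
    """Return (max_t Θ(t), τ+/τ−, gain) for integer samples.

    samples: list of (tau_minus_t, tau_plus_t) pairs, one per evaluated time t.
    """
    effective = max(Fraction(hi, lo) for lo, hi in samples)   # max_t Θ(t)
    tau_minus = min(lo for lo, _ in samples)                  # global minimum
    tau_plus = max(hi for _, hi in samples)                   # global maximum
    worst_case = Fraction(tau_plus, tau_minus)                # τ+ / τ−
    return effective, worst_case, worst_case / effective


# Hypothetical delay samples in microseconds.
effective, worst_case, gain = theta_statistics([(10, 20), (40, 100), (50, 60)])
```

In this made-up run the slowest and the fastest message never coexist, so the effective ratio (2.5) is four times smaller than the ratio of the global extremes (10).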

Θ: Every test run was repeated five times. The maximum over these five test runs is shown here.

Gain: ratio of Θ = τ+/τ− to max_t Θ(t)

Continuously Increasing Network Load: The first and last seconds are cut off from the calculation to compensate for errors during these phases. The load was increased in 1% steps every 2 seconds. Low Θ values are visible in two periods. We speculate that this is caused by special network-improvement functions inside the Linux kernel as well as inside the network interface card itself. Overall, Θ does not increase with network load.

Conclusions from First Experiments: There is definitely a positive correlation between τ+(t) and τ−(t) in the evaluation setting, with significant gain always achieved. Although we cannot infer from this that there is always a correlation between τ+(t) and τ−(t), it is very likely that there are scenarios where some assumed Θ holds despite the fact that some assumed τ+ is violated, so the Θ-model is very likely to have higher coverage than a synchronous solution. More thorough experimental and theoretical evaluation [of more suitable systems] will follow.

Applications

"Exotic" Application: VLSI Chips: Interconnect delays dominate over switching delays. Signals cannot traverse the entire chip within a single clock cycle. Increasing susceptibility to transient failures (particles, cross-talk, …). High power consumption. Shrinking feature size. Increasing complexity. Increasing clock speed.

Clock Generation in Systems-on-a-Chip: The illusion of chip-wide synchrony is increasingly difficult to maintain. Extend every functional unit (FU) with a simple local CS algorithm. CS algorithms communicate via dedicated clocking signals. CS algs guarantee |Ci(t) − Cj(t)| ≤ π(Θ). Next tick happens every max delay. Data sent by fui by tick k is available at fuj by tick k+Ξ(Θ) at the latest. Division by Ξ provides a global macro tick abstraction. [Figure: functional units fu1, fu2, fu3 on a data bus; a clock tree compared with a distributed clock built from CS algs connected by a CS network.]

Benefits: CS algs simulate a global clock: the synchronous design abstraction is maintained. Self-clocking feature: the chip runs as fast as routing delays allow. Θ is estimated by place-and-route tools; explicit dependence upon routing only via Θ [required for determining the macro-tick division factor Ξ(Θ) only]. Distributed clocks tolerate transient failures: need n > 6 f_l FUs for tolerating up to f_l transient failures (affecting clocking signals) per FU in every tick. Additional (data) fault-tolerance is possible via replicated FUs employing synchronous Byzantine agreement algorithms etc. [WS03]: CS algs also work for non-simultaneous reset.
