 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2007 Principles of Reliable Distributed Systems Lecture 1: Introduction.


Principles of Reliable Distributed Systems
Lecture 1: Introduction
Spring 2007
Idit Keidar

Staff
Lecturer: Idit Keidar
– Office hours: Mon 14:30-15:30, Meyer 902
TA: Alex Shraer

Material
Textbooks:
– Distributed Systems, 2nd edition, Sape Mullender (Editor), ACM Press Frontier Series, Addison Wesley
– Distributed Computing: Fundamentals, Simulations and Advanced Topics, Hagit Attiya and Jennifer Welch, McGraw Hill
Research papers
– See links on course web page
Lecture slides
– Do NOT cover all the material!

Grading and Requirements
Final exam – 50%-80%
– Allowed material: annotated lecture slides
Dry homework assignments (4 of 5) – 30% MAGEN
– Will count only if difference from exam score is < 30
– Good practice for exam!
– Submit individually; you may discuss with others, but write by yourself
Wet homework assignments – 20% TAKEF
– Two assignments, larger one in Passover
– Submit in pairs or individually

Prerequisites
You need background in algorithms and operating systems
You need some programming experience
If you do not have the prerequisites (or CS equivalents), you need explicit permission from me to take the course

Birdseye View of Course Syllabus
Distributed systems, models, basic concepts
Replication, atomic broadcast
The consensus problem in different models
Shared memory models and storage-based systems
Peer-to-peer computing

Distributed Systems
Characteristics, Issues, Availability
Material: Chapter 1 of Mullender

Characteristics
Multiple computers
– each having CPU, local memory, stable storage (disk), I/O to the environment
Interconnections (networked system)
Shared state
– correct operation of the system described in terms of global invariants
– maintaining these requires coordination

Issues to Address
Independent failure
Unreliable communication
Costly communication
Insecure communication
Software bugs
Malicious intrusion

Interesting System Properties
Performance
– metrics: latency (time complexity), overhead (message complexity), throughput
Scalability
Reliability
– under what circumstances the system is reliable (i.e., satisfies its specification)
– the probability that the system is reliable
– can the user know when the system is not reliable (fail-awareness)

Advantages of Distributed (Networked) Systems
Sharing of information and resources over wide geographical spread
Small cost-effective computers close to data
Incremental growth
Management autonomy for components
Independent failure

Advantages of Centralized Systems
All information and resources equally accessible
Functions always work the same way
Object names are always the same
Easy management
Goal for distributed system: provide the above abstractions

What About Availability? Security?
Is it better to put all the eggs in one basket?
– independent failure vs. fate-sharing
Independent failure can be a problem: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” [Lamport 87]
– but one can exploit independent failure to provide better availability
And communication can fail too…

How do you survive failures and achieve high availability?

Replication
Multiple copies of data/service
– synchronize for consistency

Availability
Overcoming independent failure with redundancy
Spatial redundancy: multiple servers for the same service
– failing independently
– degree of replication defines availability level
Temporal redundancy: repeat operations
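To see how the degree of replication sets the availability level, here is a minimal sketch. It assumes (this is not stated on the slide) that each server is down independently with the same probability `p_fail`, and that the service is available as long as at least one replica is up. The function name and parameters are illustrative.

```python
def availability(p_fail: float, replicas: int) -> float:
    """Probability that at least one of `replicas` servers is up.

    Assumption: servers fail independently, each with probability p_fail.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is unavailable only if every replica is down at once.
    return 1.0 - p_fail ** replicas


# With a 10% chance of each server being down, every extra replica
# removes another factor of ten from the unavailability.
for n in (1, 2, 3):
    print(n, availability(0.1, n))
```

Under these assumptions, availability grows quickly with the number of replicas; if failures are correlated (fate-sharing), the formula no longer applies.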

Examples of Reliable Distributed Computing Paradigms

Primary-Backup (Passive) Replication
“Hot” standby
Client talks to primary server
Primary updates backup(s)
Client detects server failure using timeout
– performs “fail-over” to backup server
– may need to repeat last operation(s)
Can be a problem with “false suspicions”
Works with benign servers only
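The fail-over behavior above can be sketched in a few lines. This is a toy, single-process simulation under stated assumptions: a crashed server is modeled by raising `TimeoutError`, the primary pushes its whole state to live backups after each update, and the class names `Server` and `Client` are hypothetical.

```python
class Server:
    """A benign server holding one integer as its state."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = 0
        self.backups = []

    def handle(self, amount):
        if not self.alive:
            # Stand-in for the client's timeout expiring.
            raise TimeoutError(f"{self.name} did not respond")
        self.state += amount
        for backup in self.backups:
            if backup.alive:
                backup.state = self.state  # primary updates backup(s)
        return self.state


class Client:
    """Talks to the first server in the list; fails over on timeout."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.current = 0

    def add(self, amount):
        while self.current < len(self.servers):
            try:
                return self.servers[self.current].handle(amount)
            except TimeoutError:
                self.current += 1  # fail-over, then repeat the operation
        raise RuntimeError("no server available")


primary, backup = Server("primary"), Server("backup")
primary.backups = [backup]
client = Client([primary, backup])
client.add(5)
primary.alive = False      # primary crashes
print(client.add(2))       # served by the backup: 7
```

In this sketch a timeout always means a real crash. With false suspicions, a client could fail over while the primary is still running, and the two servers could then diverge.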

State Machine Replication
Replicas are identical deterministic state machines
Process operations in the same order ⇒ remain consistent

State Machine (Active) Replication
Clients send updates to all servers
All servers are identical deterministic state machines
– perform operations in the same order to remain consistent
May be slower than primary backup, but provides quicker, smoother fail-over
Can overcome false suspicions and tolerate malicious servers
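A small sketch of why order matters for deterministic state machines. The `Replica` class and its two operations (`add`, `mul`) are made up for illustration; any pair of non-commuting operations would do.

```python
class Replica:
    """A deterministic state machine over a single integer."""

    def __init__(self):
        self.value = 0

    def apply(self, op):
        kind, arg = op
        if kind == "add":
            self.value += arg
        elif kind == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown operation {kind!r}")


def run(ops):
    """Apply ops, in the given order, to a fresh replica."""
    replica = Replica()
    for op in ops:
        replica.apply(op)
    return replica.value


ops = [("add", 5), ("mul", 2), ("add", 1)]
reordered = [("mul", 2), ("add", 5), ("add", 1)]

print(run(ops), run(ops))   # same order on two replicas: 11 11
print(run(reordered))       # different order: 6
```

Two replicas that see the same operations in the same order end in the same state; a replica that sees them in a different order may not.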

Supporting State Machine Replication
Send update operations in messages so that
– messages are reliable
– messages arrive in the same order to all replicas
This is called Atomic Broadcast
It requires the receivers to agree on the message order
Consensus is a service that lets processes reach agreement
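One simple way to get a common delivery order is a fixed sequencer that stamps every message with a sequence number. This is only a sketch under strong assumptions that the slides do not make: the sequencer never fails and no message is lost. Tolerating failures is where consensus comes in. The class names are hypothetical.

```python
class Sequencer:
    """Assigns consecutive sequence numbers to messages (assumed never to fail)."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped


class Receiver:
    """Buffers stamped messages and delivers them in sequence-number order."""

    def __init__(self):
        self.pending = {}
        self.next_to_deliver = 0
        self.delivered = []

    def receive(self, stamped):
        seq, msg = stamped
        self.pending[seq] = msg
        # Deliver every message whose predecessors have all been delivered.
        while self.next_to_deliver in self.pending:
            self.delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(m) for m in ("a", "b", "c")]

r1, r2 = Receiver(), Receiver()
for s in stamped:
    r1.receive(s)            # network delivers in order
for s in reversed(stamped):
    r2.receive(s)            # network reorders the messages

print(r1.delivered, r2.delivered)   # both: ['a', 'b', 'c']
```

Both receivers deliver the messages in the same order even though the network reordered them for one of them.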

Peer-to-Peer Systems
Decentralized, self-organizing, distributed systems in which most communication is symmetric
Popularized by music swapping software
– Napster, Gnutella, KaZaA
Lots of nodes (e.g., millions)
Dynamic: frequent join, leave, failure
Little or no infrastructure (no central server)
All nodes are “peers” – have same role; don’t have lots of resources

Data-Centric Systems
Ephemeral processes (clients) are not all around at the same time
Clients share state (data)
Clients synchronize among each other using the shared data
Data can be stored at “dumb” shared disks

Example: Coordinated Attacks

Example: Coordinated Attack
[Figure: generals A and B; caption “Let’s attack”]

The Coordinated Attack Problem
Requirements:
– both generals must decide the same: either to attack or not to attack
– if both are not ready to attack they must not attack
– if both are ready to attack then they must attack
Motivation: atomic transaction commit in distributed databases [Gray 78]

Properties of Coordinated Attack
Agreement: If both generals decide, they decide the same
Termination: Every general eventually decides
Validity: If both inputs are “not ready” the decision is “no attack”; if both inputs are “ready” then the decision is “attack”

A Simple Solution
General A sends vote (“yes” or “no”)
General B responds with his vote
If both say yes, they attack
Otherwise they do not
Aka 2-phase commit
Problems?
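A toy simulation of the vote exchange hints at the problem. It assumes details the slide leaves open: B decides as soon as it holds both votes, A decides only once B's reply arrives, and a captured messenger is modeled by a boolean flag. `None` means the general is still undecided.

```python
def simple_exchange(ready_a, ready_b, a_to_b_delivered=True, b_to_a_delivered=True):
    """Return (decision_a, decision_b) for one run of the vote exchange."""
    vote_a = "yes" if ready_a else "no"
    vote_b = "yes" if ready_b else "no"
    outcome = "attack" if vote_a == "yes" and vote_b == "yes" else "no attack"

    # B can decide only if A's vote reached it.
    decision_b = outcome if a_to_b_delivered else None

    # B replies only after hearing from A; A can decide only if that reply arrives.
    decision_a = outcome if (a_to_b_delivered and b_to_a_delivered) else None

    return decision_a, decision_b


print(simple_exchange(True, True))                          # ('attack', 'attack')
print(simple_exchange(True, False))                         # ('no attack', 'no attack')
print(simple_exchange(True, True, b_to_a_delivered=False))  # (None, 'attack')
```

In the last run B attacks while A never learns B's vote. If A gives up and decides “no attack”, Agreement is violated; if A waits forever, Termination is violated.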

Failure Model 1
Generals may die
Subordinates eventually replace them, need to know correct result
Crash-recovery model

Failure Model 2
Any number of messengers can be captured (message loss)
Proof on board

Coordinated Attack Definition: Take II
Revised requirements:
– both generals must decide the same: either to attack or not to attack
– if both are not ready to attack they must not attack
– if both are ready to attack and no messages are lost then they must attack
Note: this is not an assumption about the model. It’s a conditional requirement that has to hold only in runs in which no messages are lost.
Proof on the board!

Models
Material: Chapter 2 of Mullender

Theory vs. Practice
Distributed systems are not as intuitive as centralized ones
Two approaches to understanding them:
– Experimental observation (“practice”)
– Modeling and analysis (“theory”); to be useful, models need to characterize reality
Complement each other

Theory Needs Models
In order to design an algorithm, we need a model of the system where the algorithm will be deployed
Model captures assumptions on process capabilities, timing, types and number of failures.
– E.g., assume that at most one server crashes
– This means that the system is allowed to be unreliable if other failures occur (two servers crash, one server is infiltrated)

Good Models
Accurate – analysis yields truths about the analyzed system/object
Tractable – analysis is possible
Accurate and tractable models are hard to define
– need to abstract away issues that do not affect the phenomena of interest
– include exactly those attributes that do

Questions We’d Like to Answer
1. Feasibility – what classes of problems can be solved
2. Cost – how expensive must the solution be
Computation and complexity models for centralized systems do not help us answer these questions for distributed systems

Synchronous vs. Asynchronous
Synchrony assumptions:
– message latency is bounded
– processes have synchronized clocks
– processing times are bounded
Asynchrony: non-assumption

Asynchronous Model
Unbounded message delay, processor speed
Desirable: an algorithm for this model works also in synchronous model
Alas, too strong
Consensus impossible even when one process can crash [FLP85]

Round Lock-Step Synchronous Model
Algorithm runs in synchronous rounds:
– send messages to any set of processes,
– receive messages from previous round,
– do local processing (possibly decide, halt)
Easy to work with
But, may lead to inefficiency when implemented over slow network
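The round structure can be sketched with a toy algorithm that is not from the slides: processes arranged in a ring, each forwarding the smallest value it has seen to its right-hand neighbor. The sketch assumes no failures and that every message sent in a round is received before the next round starts.

```python
def flood_min_on_ring(inputs, rounds):
    """Run `rounds` lock-step rounds; return each process's current minimum."""
    n = len(inputs)
    best = list(inputs)
    for _ in range(rounds):
        # Send step: process i sends its current value to process (i + 1) % n.
        # Snapshotting models that all sends use the start-of-round state.
        outbox = list(best)
        # Receive step and local processing.
        for i in range(n):
            received = outbox[(i - 1) % n]
            best[i] = min(best[i], received)
    return best


print(flood_min_on_ring([5, 3, 8, 1], 1))   # [1, 3, 3, 1]
print(flood_min_on_ring([5, 3, 8, 1], 3))   # [1, 1, 1, 1]
```

Because rounds advance in lock step, it is easy to argue that after n - 1 rounds every process holds the global minimum. The price is that each round must wait for the slowest message.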

Stronger Models
Is the result still valid if we assume each of the following?
– bounded loss: at most 10 messages are lost on each channel
– eventual delivery: an unknown finite number of messages are lost on each channel

We’ll talk a lot about models later in the course…

Specifications
Material: Chapter 3 of Mullender

Specifications
A specification of a module is an abstraction
To use a module, all we need to know is its specification
– abstract away implementation
Managing complex systems via modularity requires clear component specifications

Step 1: Define Interfaces
Example: Coordinated Attack
There are two generals A, B
Each has an input variable inp_A, inp_B ∈ {“ready”, “not ready”}
Possible actions for P ∈ {A, B}:
– decide_P(v), v ∈ {“attack”, “no attack”} (Output); writes v to a write-once variable dec_P
– send_P(m) (Output)
– deliver_P(m) (Input)

System Example: A Shared Counter
Interface: inc_P(), P ∈ {A, B}
Code:
– int x, initially 0
– on inc_P() do: x++

Traces and Runs
A run is an alternating sequence of events and states the system goes through
– example run of the counter: 0, inc_A(), 1, inc_B(), 2, inc_B(), 3, inc_B(), 4
A trace is the sequence of events in a run
– trace of the above run: inc_A() inc_B() inc_B() inc_B()
– e.g., trace of a coordinated attack algorithm: send_A(“yes”) send_B(“no”) deliver_B(“yes”) deliver_A(“no”) decide_B(“no attack”) decide_A(“no attack”)
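The counter's run and trace can be generated mechanically. A minimal sketch, with events represented as plain strings (a representation chosen here for illustration):

```python
def run_counter(events):
    """Build the run of the shared counter: states alternate with events."""
    x = 0                # int x, initially 0
    run = [x]
    for event in events:
        x += 1           # on inc_P() do: x++
        run.extend([event, x])
    return run


def trace_of(run):
    """The trace keeps only the events, which sit at the odd positions of a run."""
    return run[1::2]


events = ["inc_A()", "inc_B()", "inc_B()", "inc_B()"]
run = run_counter(events)
print(run)            # [0, 'inc_A()', 1, 'inc_B()', 2, 'inc_B()', 3, 'inc_B()', 4]
print(trace_of(run))  # ['inc_A()', 'inc_B()', 'inc_B()', 'inc_B()']
```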

Sequences, Prefixes, and Predicates
Sequence: a_1 a_2 a_3 a_4 a_5, …
Prefixes of this sequence: a_1, a_1 a_2, a_1 a_2 a_3, etc.
A predicate is a formula evaluated to a boolean value (true or false)
The predicate “if m is delivered it was previously sent” evaluates to true over the trace: send_A(“yes”) send_B(“no”) deliver_B(“yes”) deliver_A(“no”) decide_B(“no attack”) decide_A(“no attack”)
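Such a predicate is easy to evaluate mechanically. In this sketch each event is a tuple `(action, general, payload)`, and a message delivered at one general is taken to have been sent by the other one; both the encoding and that reading of deliver are assumptions made for the example.

```python
OTHER = {"A": "B", "B": "A"}


def delivered_was_sent(trace):
    """True iff every delivered message was previously sent by the other general."""
    sent = set()
    for action, general, payload in trace:
        if action == "send":
            sent.add((general, payload))
        elif action == "deliver":
            if (OTHER[general], payload) not in sent:
                return False
    return True


trace = [
    ("send", "A", "yes"),
    ("send", "B", "no"),
    ("deliver", "B", "yes"),
    ("deliver", "A", "no"),
    ("decide", "B", "no attack"),
    ("decide", "A", "no attack"),
]

print(delivered_was_sent(trace))                      # True
print(delivered_was_sent([("deliver", "B", "yes"),
                          ("send", "A", "yes")]))     # False: delivered before sent
```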

Step 2: Specify Properties
Concurrent systems can be specified using trace properties
A trace property is a predicate evaluated over a trace of the concurrent system
Example property: every message that is received was previously sent
Not a property: the average number of messages sent in a run is 34

Safety Properties
A safety property is of the form nothing bad happens (that is, all states are safe).
Examples:
– The number of processes in a critical section is always less than 2
– No two processes decide on different values
– Every delivered message was previously sent

Liveness Properties
A liveness property is of the form something good happens (that is, an interesting state is eventually achieved)
Examples:
– A process that wishes to enter the critical section eventually does so
– p grows without bound
– Every general eventually decides

More Formally
A safety property is prefix-closed:
– if it holds in a run, it holds in every prefix
– you can’t “fix” it after it’s “broken”
Every run can be extended to satisfy a given liveness property:
– no matter how “broken”, you can always “fix” it
Any property is either a safety property, a liveness property, or equivalent to a conjunction of a safety and a liveness property
– e.g., Critical Section is a conjunction of Mutual Exclusion and Progress
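The difference shows up when the two kinds of property are checked on finite traces. In this sketch events are tuples `(action, general, payload)`, an encoding chosen for illustration, and Termination is approximated on a finite trace as “both generals have decided by the end”.

```python
def prefixes(trace):
    """All non-empty prefixes of a finite trace."""
    return [trace[:k] for k in range(1, len(trace) + 1)]


def agreement(trace):
    """Safety: no two decide events carry different values."""
    decisions = {payload for action, _, payload in trace if action == "decide"}
    return len(decisions) <= 1


def termination(trace):
    """Liveness, finite-trace approximation: both generals have decided."""
    deciders = {general for action, general, _ in trace if action == "decide"}
    return deciders == {"A", "B"}


trace = [
    ("send", "A", "yes"),
    ("send", "B", "no"),
    ("deliver", "B", "yes"),
    ("deliver", "A", "no"),
    ("decide", "B", "no attack"),
    ("decide", "A", "no attack"),
]

# Safety is prefix-closed: it holds in the trace, hence in every prefix.
print(all(agreement(p) for p in prefixes(trace)))   # True

# Liveness can fail on a prefix and be "fixed" by extending it.
print(termination(trace[:5]), termination(trace))   # False True

# A broken safety property stays broken however the trace is extended.
broken = [("decide", "A", "attack"), ("decide", "B", "no attack")]
print(agreement(broken + [("send", "A", "yes")]))   # False
```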

Timing Properties
If a message is sent, then it arrives within five minutes.
Safety or liveness?
Can be expressed only in a timed model

Properties of Coordinated Attack
Agreement: If both generals decide, they decide the same
Termination: Every general eventually decides
Validity: If both inputs are “not ready” the decision is “no attack”; if both inputs are “ready” and every message sent is delivered then the decision is “attack”
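The conditional form of Validity can be checked over a finite trace as follows. This is a sketch under assumptions: events are tuples `(action, general, payload)`, a message delivered at one general was sent by the other, and “every message sent is delivered” is tested by comparing the multisets of sent and delivered messages.

```python
OTHER = {"A": "B", "B": "A"}


def validity(inp_a, inp_b, trace):
    """Check the revised Validity property for one finite trace."""
    decisions = [m for action, _, m in trace if action == "decide"]
    sends = sorted((g, m) for action, g, m in trace if action == "send")
    delivers = sorted((OTHER[g], m) for action, g, m in trace if action == "deliver")
    no_loss = sends == delivers

    if inp_a == "not ready" and inp_b == "not ready":
        return all(d == "no attack" for d in decisions)
    if inp_a == "ready" and inp_b == "ready" and no_loss:
        return all(d == "attack" for d in decisions)
    return True  # mixed inputs or lost messages: Validity does not constrain the decision


lossless = [
    ("send", "A", "yes"), ("deliver", "B", "yes"),
    ("send", "B", "yes"), ("deliver", "A", "yes"),
    ("decide", "A", "attack"), ("decide", "B", "attack"),
]
lossy = [
    ("send", "A", "yes"),                      # messenger captured
    ("decide", "A", "no attack"), ("decide", "B", "no attack"),
]

print(validity("ready", "ready", lossless))   # True
print(validity("ready", "ready", lossy))      # True: the requirement is conditional
```

The second run does not attack even though both generals are ready, and Validity still holds because a message was lost.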