
1 Impossibility of Consensus in Distributed Systems… and other tales about distributed computing theory Nancy Lynch MIT Adriaan van Wijngaarden lecture CWI 60th anniversary, February 9, 2006

2 1. Prologue Thank you! Adriaan van Wijngaarden: Numerical analysis, programming languages, CWI leadership. My contributions: Distributed computing theory. This talk: A general description of (what I think are) my main contributions, with history + perspective. Highlight a particular result: Impossibility of reaching consensus in a distributed system, in the presence of failures [Fischer, Lynch, Paterson 85]. I am very honored to be one of the first two recipients of the Adriaan van Wijngaarden award. Van Wijngaarden was indeed a pioneer of computer science, renowned especially for his contributions to numerical analysis and programming languages, and for his early leadership of CWI. It is a great honor to be considered worthy to receive an achievement award bearing his name. I thank everyone at CWI. I have been told that the award is for my entire body of contributions to computer science and mathematics, in particular, to distributed computing theory, rather than any single achievement. I will, therefore, try to give you a general description of what I think my main contributions have been (so far), with some history and perspective. I will highlight my best-known result---the one with Fischer and Paterson, showing the impossibility of reaching consensus in a distributed system in the presence of failures. In describing it, I will try to place it in some historical context.

3 2. My introduction to distributed computing theory
Earlier: Complexity theory. 1978, Georgia Tech: Distributed computing theory. Dijkstra’s mutual exclusion algorithm [Dijkstra 65] Several processes run, with arbitrary interleaving of steps, as if concurrently. Share read/write memory. Arbitrate the usage of a single higher-level resource: Mutual exclusion: Only one process can “own” the resource at a time. Progress: Someone should always get the resource, when it’s available and someone wants it. My early work, which I am sure this award is NOT for, was in the area of complexity theory. I started working in distributed computing theory around 1978, when I was an associate professor at Georgia Tech. I was first influenced to work in this field by reading results on the mutual exclusion problem by Edsger Dijkstra (who, I have learned, was once hired by van Wijngaarden). Dijkstra was working on operating systems, which essentially let processes run with arbitrary interleavings of steps---with behavior similar to what they would exhibit if they were running concurrently. They were allowed to share memory, which the processes could read and write, using individual read and write steps. The problem was to arbitrate the usage of some higher-level resource, in such a way that only one process could “own” it at a time (the mutual exclusion condition). Also, someone should always get the resource when it’s available and someone wants it (a progress condition).

4 Dijkstra’s Mutual Exclusion algorithm
Initially: All flags = 0, turn is arbitrary. To get the resource, process i does the following: Phase 1: Set flag(i) := 1. Repeatedly: if turn = j and flag(j) = 0, set turn := i. When turn = i, move on to Phase 2. Phase 2: Set flag(i) := 2. Check everyone else’s flag to see if any = 2. If so, go back to Phase 1. If not, move on and get the resource. To return the resource: Set flag(i) := 0. Dijkstra’s mutual exclusion algorithm works like this: A process named i attempts to get access to the resource by performing two phases: Phase 1: It sets flag(i) := 1. Then it repeatedly looks at the shared “turn” variable, waiting for it to be set to its own index i. It can set turn := i if it sees turn = someone else’s index, j, and flag(j) = 0. When it sees turn = i, it moves on to Phase 2. Phase 2: It sets flag(i) := 2. Then it checks everyone else’s flag to see if any others = 2. If so, it goes back to Phase 1. If not, it moves on and gets the resource.
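To make the two phases concrete, here is a minimal Python sketch of the protocol as just described. The flag values and the turn variable follow the slide; the module structure and busy-wait loops are my own rendering, not Dijkstra’s original code, and the sketch ignores memory-model subtleties.

    N = 3                       # number of processes (illustrative)
    flag = [0] * N              # flag(i): 0 = idle, 1 = in Phase 1, 2 = in Phase 2
    turn = 0                    # shared turn variable; initial value is arbitrary

    def acquire(i):
        """Dijkstra's two-phase entry protocol for process i (sketch)."""
        global turn
        while True:
            # Phase 1: announce interest, then wait until turn = i.
            # While waiting, grab the turn if its current holder looks idle.
            flag[i] = 1
            while turn != i:
                j = turn
                if flag[j] == 0:
                    turn = i
            # Phase 2: raise the flag to 2; if nobody else is at 2, take the resource.
            flag[i] = 2
            if all(flag[j] != 2 for j in range(N) if j != i):
                return          # process i now owns the resource
            # Someone else also reached Phase 2: retry from Phase 1.

    def release(i):
        """Return the resource."""
        flag[i] = 0

Arguing that at most one call to acquire can return at a time, under an arbitrary interleaving of these individual read and write steps, is exactly the kind of reasoning discussed on the next slide.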

5 Dijkstra’s Mutual Exclusion algorithm
It is not obvious that this algorithm is correct: Mutual exclusion, progress. Properties must hold regardless of order of read and write steps. Interleaving complications don’t arise in sequential algorithms. In general, how should we go about arguing correctness of such algorithms? This got me interested in learning how to prove properties of: Algorithms for systems of parallel processes that share memory. Algorithms in which processes communicate by channels (with possible delay). And led to work on general techniques for: Modeling distributed algorithms precisely, using interacting state-machine models. Proving their correctness. Notice that this algorithm is not obviously correct. It takes some argument to convince ourselves that this in fact guarantees that (a) at most one process has the resource at once, and (b) if the resource is available, and someone wants it, then someone will eventually get it. These properties must hold regardless of the order in which the processes take their individual little read and write steps. This sort of complication---having to consider arbitrary interleaving of steps---doesn’t arise when we study traditional sequential algorithms. It is not even obvious how one should go about arguing the correctness of such algorithms. This algorithm, and others like it, got me interested in learning how one should go about understanding the behavior of algorithms for systems of parallel processes that share memory, or for even harder-to-understand “distributed systems”, where the parallel processes communicate by longer-distance channels (with some delay) rather than (instantaneously accessible) shared memory. The need to argue about the correctness of such algorithms led me (and many others) to develop general techniques for modeling distributed algorithms precisely (using interacting state machines) and techniques for proving their correctness.

6 Impossibility results
Distributed algorithms have inherent limitations, because they must work in badly-behaved settings: Arbitrary interleaving of process steps. Action based only on local knowledge. With precise models, we could hope to prove impossibility results, saying that certain problems cannot be solved, in certain settings. First example: [Cremers, Hibbard 76] Mutual exclusion with fairness: Every process that wants the resource eventually gets it. Not solvable for two processes with one shared variable, two values. Even if processes can use operations more powerful than reads/writes. Burns, Fischer, and I started trying to identify other cases where problems could provably not be solved in distributed settings. That is, to understand the nature of computability in distributed settings. These early results suggested something else to me: Distributed algorithms have inherent limitations because of the fact that they are required to work in such badly-behaved settings---for example, with arbitrary interleavings of process steps. Also, processes must decide on what to do based only on their own “local knowledge”---the contents of their own state and what they read in variables (or, receive in messages). Once we had precise state-machine models for distributed algorithms, we could hope to prove such inherent limitations: that certain problems cannot be solved, in certain settings. The first result of this kind that I encountered was one by Cremers and Hibbard, in an unpublished memo from USC. In fact, I think it was probably the first impossibility result proved in distributed computing theory. They showed that mutual exclusion cannot be solved for two processes, with one shared variable that can take on only two values. This is so even if the processes can access the variable using steps more powerful than just reads and writes. With my student Jim Burns and my frequent collaborator Michael Fischer, I soon got interested in trying to identify cases where problems could provably NOT be solved in distributed settings. That is---in trying to understand the nature of computability in distributed settings.

7 3. The next 20 years Lots of work on algorithms: Mutual exclusion, resource allocation, clock synchronization, distributed consensus, leader election, reliable communication… And even more work on impossibility results. And on modeling and verification methods. These early influences set the tone for the work my students, postdocs, and I did over the next 20 years. Lots of work on algorithms for problems like mutual exclusion, resource allocation, clock synchronization, consensus, leader election, reliable communication. And even more work on impossibility results. And on modeling and verification methods.

8 Example impossibility result [Burns, Lynch 93]
Mutual exclusion for n processes, using read/write shared memory, requires at least n shared variables. Even if: No fairness is required, just progress. Everyone can read and write all the variables. The variables can be of unbounded size. Example: n = 2. Suppose two processes solve mutual exclusion, with progress, using only one read/write shared variable x. Suppose process 1 arrives alone and wants the resource. By the progress requirement, it must be able to get it. Along the way, process 1 writes to the shared variable x: If not, process 2 wouldn’t know that process 1 was there. Then process 2 could get the resource too, contradicting mutual exclusion. The Burns, Lynch impossibility result for mutual exclusion says that mutual exclusion with n processes, using read/write shared memory, requires at least n shared variables. Even if everyone is allowed to read and write all the variables, and even if the variables can be of unbounded size. You can get the idea why by seeing what happens for the case where n = 2. Suppose that 2 processes could somehow solve mutual exclusion, with progress (no fairness assumption), using just a single read/write shared variable. Then we can argue directly from the problem requirements to get a contradiction: Suppose process 1 arrives alone and wants the resource. The progress requirement says that it must be able to get it. Along the way, it has to write to the shared variable. For if not, process 2 wouldn’t know that process 1 was there, and could arrive and get the resource also, at the same time---contradicting the mutual exclusion requirement.

9 Impossibility for mutual exclusion
So, now consider an execution in which process 1 pauses just before it writes, for the first time. Then let process 2 arrive. Since process 1 hasn’t written, process 2 thinks it’s alone, and gets the resource. Along the way, it writes to x. Now, after process 2 gets the resource, resume process 1. The first thing it does is write to the shared variable x, thereby OVERWRITING whatever process 2 wrote there. Then process 1 cannot tell that process 2 is there. So it behaves just as it did when it operated alone, and gets the resource. Again, contradicting mutual exclusion.
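To make this bad schedule concrete, here is a small Python illustration. The “protocol” below, in which a process reads x and, on seeing 0, writes 1 and enters, is a deliberately naive one of my own; the point of the argument above is that the same schedule defeats any protocol using a single read/write variable, because process 1’s first write hides everything process 2 did.

    # Single shared variable x: 0 = "looks free", 1 = "taken".
    x = 0
    holders = []                 # who currently holds the resource

    def read_x(i):
        return x

    def write_x(i):
        global x
        x = 1
        holders.append(i)        # process i enters the critical section

    # The adversarial interleaving from the argument:
    assert read_x(1) == 0        # p1 arrives alone, reads x = 0, decides to enter,
                                 #   but pauses just before its first write
    assert read_x(2) == 0        # p2 arrives; since p1 has not written, p2 also
                                 #   sees x = 0 and thinks it is alone
    write_x(2)                   # p2 writes x and gets the resource
    write_x(1)                   # p1 resumes: its write overwrites p2's write, so
                                 #   p1 still cannot tell p2 is there, and enters
    assert holders == [2, 1]     # both hold the resource: mutual exclusion fails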

10 Impossibility for mutual exclusion
Mutual exclusion with n processes, using read/write shared memory, requires n shared variables: Argument for n > 2 is more intricate. Proofs done in terms of math models. Example shows the key ideas: A write operation to a shared variable overwrites everything previously in the shared variable. Process sees only its own state, and values of the variables it reads---its action depends on “local knowledge”. Of course, the argument for more processes is more intricate. And the proof is really done in terms of math models, not in terms of a story about processes. But the example above shows the key ideas: A write operation to a shared variable overwrites everything previously in the shared variable. A process can see only its own state, and the values of the variables it reads---this “local knowledge” determines what the process does.

11 Modeling and proof techniques
More and more clever, complex algorithms: [Gallager, Humblet, Spira 83] Minimum Spanning Tree algorithm. Communication algorithms in networks with changing connectivity [Awerbuch]. Concurrency control algorithms for distributed databases. Atomic memory algorithms [Burns, Peterson 87], [Vitanyi, Awerbuch 87], [Kirousis, Kranakis, Vitanyi 88],… We needed: A simple, general math foundation for modeling algorithms precisely, and Usable, general techniques for proving their correctness. We worked on these… The algorithms that people in the distributed computing theory (and practice) communities designed got more and more clever and complicated. For example: Gallager, Humblet, Spira Minimum Spanning Tree algorithm. Awerbuch’s algorithms for various communication tasks in networks with changing connectivity. Concurrency control algorithms for distributed databases. Atomic memory implementations (Vitanyi, Kranakis,…). It became clear we needed a general math foundation for modeling these algorithms precisely, and general techniques for proving their correctness. We worked on these.

12 Modeling techniques I/O Automata framework [Lynch, Tuttle, CWI Quarterly 89] I/O automaton: A state machine that can interact, using input and output actions, with other automata or with an external environment. Composition: Compose I/O automata to yield other I/O automata. Model a distributed system as a composition of process and channel automata. Levels of abstraction: Model a system at different levels of abstraction. Start from a high-level behavior specification. Refine, in stages, to detailed algorithm description. The main math foundation that arose was the I/O Automata framework of [Lynch, Tuttle]. An I/O automaton is a kind of state machine, which can interact by means of input and output actions, with other automata or with an external environment. I/O automata can be composed to yield other I/O automata. Thus, the I/O automata framework supports the description of a complicated distributed system or distributed algorithm as a composition of individual automata. E.g., automata can represent processes or channels. The framework also supports description of systems at different levels of abstraction, starting from a high-level behavior specification, and refining successively until we obtain a fully detailed algorithm description.
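To give a rough flavor of the framework, here is an informal Python rendering of my own (not the formal Lynch-Tuttle definitions): an I/O automaton is a state machine with named input and output actions, and composition synchronizes automata that share an action name.

    class IOAutomaton:
        """Very informal sketch: a state plus a transition function over
        named input/output actions."""
        def __init__(self, state, inputs, outputs, transition):
            self.state = state
            self.inputs = set(inputs)       # actions controlled by the environment
            self.outputs = set(outputs)     # actions controlled by this automaton
            self.transition = transition    # (state, action) -> new state

        def step(self, action):
            if action in self.inputs or action in self.outputs:
                self.state = self.transition(self.state, action)

    def compose_step(automata, action):
        """One step of a composition: every component that has the action
        as an input or output moves on it; the others are unaffected."""
        for a in automata:
            a.step(action)

    # Example: a sender composed with a FIFO channel via the shared action "send".
    sender = IOAutomaton(state=0, inputs=set(), outputs={"send"},
                         transition=lambda s, act: s + 1)
    channel = IOAutomaton(state=[], inputs={"send"}, outputs={"deliver"},
                          transition=lambda s, act: s + ["msg"] if act == "send" else s[1:])
    compose_step([sender, channel], "send")     # both move: sender counts, channel enqueues
    compose_step([sender, channel], "deliver")  # only the channel moves
    assert sender.state == 1 and channel.state == []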

13 Proof techniques Invariant assertions, statements about the system state. Prove by induction on the number of steps in an execution. Entropy functions, to argue progress. Simulation relations: Construct abstract version of the algorithm. Need not be a distributed algorithm. Proof breaks into two pieces: Prove correctness of the abstract algorithm. Interesting, involves the deep logical ideas behind the algorithm. Tractable, because the abstract version is simple. Prove the real algorithm emulates the abstract version. A simulation relation. Tractable, generally a simple step-by-step correspondence. Does not involve the logical ideas behind the algorithm. But, how exactly should we prove correctness of complicated distributed algorithms, expressed in terms of I/O automata or other kinds of interacting state machines? We and others developed nice, systematic methods, for example: Invariant assertions, statements about the state of the system, proved by induction on the number of steps in an execution. Entropy functions, to argue progress. One method I helped to develop is the “simulation relation” method. The idea is to try to avoid reasoning about a complex algorithm in all its detail (with messages, particular values of state variables, etc.). Rather, construct a more abstract version of the algorithm---which need not even be a distributed algorithm. Then separate the work of proving the algorithm into two pieces: Carrying out a rigorous proof of the abstract version. This is the interesting part of the proof, where the deep logical ideas behind the algorithm are explained. But it’s tractable, because the abstract version is simpler than the fully detailed version. Proving a precise correspondence between the real algorithm and the abstract version. This is generally called a “simulation relation”. It’s tractable, because it generally involves a simple step-by-step correspondence between the two algorithms, and does not involve the deep logical ideas behind the algorithm.
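Here is a toy instance of the simulation-relation idea (my own example, chosen only to show the shape of the argument): a “concrete” automaton keeps a full integer counter, an “abstract” one keeps only the parity, and the relation between their states is checked step by step---exactly the kind of simple correspondence described above.

    def concrete_step(c, action):        # concrete state: an integer counter
        return c + 1 if action == "inc" else c

    def abstract_step(a, action):        # abstract state: just a parity bit
        return 1 - a if action == "inc" else a

    def related(c, a):                   # the simulation relation R(c, a)
        return a == c % 2

    # Step-by-step check along one execution: if R holds before a step,
    # it holds again after the corresponding abstract step.
    c, a = 0, 0
    for action in ["inc", "noop", "inc", "inc", "noop"]:
        assert related(c, a)
        c, a = concrete_step(c, action), abstract_step(a, action)
    assert related(c, a)

Any property proved of the simple abstract machine then transfers to the concrete one through the relation R.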

14 Example: Mutual exclusion in a tree network
From [Lynch, Tuttle, CWI Quarterly 89] Allocate a resource (fairly) among processes at the nodes of a tree: Algorithm: Use token to represent the single resource. Token traverses subtree of active requests systematically. Describe abstract version: Graph with moving token. Prove the abstract version yields the needed properties. Prove a simulation relation between the real algorithm and the abstract version. Example: [Lynch, Tuttle, CWI Quarterly 89] A mutual exclusion algorithm for tree networks. It is supposed to allocate a resource (fairly) among processes at the nodes of a tree, with communication channels between processes at adjacent nodes. One way such an algorithm could work is to have an explicit token represent the single resource. The token can travel around the tree, in response to requests. For example, it can carry out a systematic traversal of the subtree of active requests. Mutual exclusion can be based on the fact that there is exactly one token. Progress and fairness may be trickier---we may have to argue that the token travels in productive directions. To argue these properties carefully, it’s messy to talk about low-level details like state variables and messages. It’s much easier to talk about a token moving around a graph. But that’s not what the actual algorithm code looks like. So, we are led to describe an abstract version that involves a graph with a moving token. We prove that that version yields the needed properties. Then we prove a formal relationship between the real algorithm and the abstract version (a tedious, but easy, case analysis).
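A possible Python rendering of the abstract version (my own simplification: the token is a single marker on a tree, requests are a set of nodes, and the order in which requests are served is chosen arbitrarily rather than by the real algorithm’s systematic traversal):

    from collections import deque

    tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # adjacency lists of a small tree
    token_at = 0                                     # the single token = the resource
    requests = {2, 3}                                # nodes currently requesting

    def next_hop(src, dst):
        """First edge on the unique tree path from src to dst (found by BFS)."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in tree[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        while parent[dst] != src:
            dst = parent[dst]
        return dst

    while requests:
        target = min(requests)                       # serve some outstanding request
        while token_at != target:
            token_at = next_hop(token_at, target)    # the token moves one edge at a time
        print(f"node {token_at} holds the token and uses the resource")
        requests.discard(token_at)

Mutual exclusion is immediate at this level, since there is exactly one token; the work of the simulation relation is to show that the real message-passing code never behaves in a way this abstract picture cannot.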

15 4. FLP [Fischer, Lynch, Paterson 83]
Impossibility of consensus in fault-prone distributed systems. My best-known result… Dijkstra Prize, 2001 Now I would like to tell you a little about my best-known result. This was awarded the Dijkstra Prize in 2001, which is an annual prize for influential papers in distributed computing theory. Technically, it was not called the Dijkstra Prize then. The award was renamed in honor of Dijkstra after his death in 2002.

16 Distributed Consensus
A set of processes in a distributed network, operating at arbitrary speeds, want to reach agreement. E.g., about: The value of a sensor reading. Whether to accept/reject the results of a database transaction. Abstractly, on a value in some set V. Each process starts with initial value in V, and they want to decide on a value in V: Agreement: Decide on the same value. Validity: It should be some process’ initial value. The twist: A (presumably small) number of processes might be faulty, and might not participate correctly in the algorithm. Problem appeared as: Database commit problem [Gray 78]. Byzantine agreement problem [Pease, Shostak, Lamport 80]. The problem involves a collection of processes in a distributed network, operating at arbitrary speeds, and wanting to reach agreement about something. The “something” might be, for example, the value of a sensor reading, or a decision about whether to accept or reject the results of a database transaction. Let’s just say, abstractly, that they are trying to agree on a value in some set V of values. Let’s say each process starts with some initial value in V, and they want to decide on some value in V. They want to all decide on the same value. And, it should be some process’ initial value. But now, the twist is that some (presumably small) number of the processes might be faulty, and might not participate correctly in the algorithm. Still, the other processes would like to reach agreement. The problem first surfaced in the form of the “database commit problem”, identified by Gray and others in the database community. Then it appeared in another form---as the Byzantine agreement problem---studied and popularized by Pease, Shostak, and Lamport.
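Stated as a check on a finished execution (a sketch of the requirements only, not of any algorithm; the function and variable names are mine), the conditions look like this:

    def satisfies_consensus(initial_values, decisions):
        """initial_values[i] is process i's input; decisions[i] is its decision,
        or None if it never decided (for example, because it failed)."""
        decided = [d for d in decisions if d is not None]
        agreement = len(set(decided)) <= 1                    # all decisions equal
        validity = all(d in initial_values for d in decided)  # decision is someone's input
        return agreement and validity

    assert satisfies_consensus([0, 1, 1], [1, 1, None])       # fine: one process failed, the rest agree on 1
    assert not satisfies_consensus([0, 0, 0], [1, 1, 1])      # violates validity
    assert not satisfies_consensus([0, 1, 1], [0, 1, 1])      # violates agreement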

17 FLP Impossibility Result
[Fischer, Lynch, Paterson 83] proved an impossibility result for distributed consensus. Proof works even for very limited failures: At most one process ever fails, and everyone knows this. The process may simply stop, without warning. Original result: Processes communicate using channels (with possible delays). Same result (essentially same proof) for read/write shared memory. Result seemed counter-intuitive: If there are many processes, and at most one can fail, then it seems like the rest could agree, and tell the faulty process the decision later… But nonfaulty processes don’t know that the other process has failed. But still, it seems like all but one of the processes could agree, then later tell the other process the decision (whether or not it has failed). But no, this doesn’t work! We [Fischer, Lynch, Paterson] proved an impossibility result for the distributed consensus problem. We proved it even for very limited kinds of failures: At most ONE process will ever fail, and everyone knows this. And, the only type of failure is that the process might just stop, without any warning. We originally proved the impossibility result in a setting where the processes communicate over channels, with delays. But it turns out that the same impossibility result holds (and essentially the same proof works), even if the processes can communicate using the stronger shared read/write memory model that I described earlier. And the proof is actually somewhat simpler. So I’ll talk about that version here. At the time we proved the result, it seemed somewhat counter-intuitive to many people. It seemed that, if there were lots of processes, and they knew that at most one of them could fail, then surely the rest could manage to agree without the faulty one, and tell it the decision later. Of course, a problem is that there is no way for the nonfaulty processes to KNOW that a faulty process has in fact stopped---this is an asynchronous setting (arbitrary timings), so they can’t use any information about time to infer that the process has failed. But still, it seems that all but one of the processes should be able to agree on a value, then later, whether or not the other process has failed, tell it the decision. But no, this doesn’t work…

18 FLP Impossibility proof
Proceed by contradiction---assume an algorithm exists to solve consensus, argue based on the problem requirements that it can’t work. Assume V = {0,1}. Notice that: In an “extreme” execution, in which everyone starts with 0, the only allowed decision is 0. Likewise, if everyone starts with 1, the only allowed decision is 1. For “mixed inputs”, the requirements don’t say. The idea of the proof is to proceed by contradiction, assuming that an algorithm exists to solve the problem, and then arguing based just on the problem requirements that this cannot work. Assume WLOG that V = {0,1}, so the processes are trying just to agree on a simple bit. Notice that, in an extreme execution in which everyone starts with 0, the only allowed decision is 0. Likewise, if everyone starts with 1, the only allowed decision is 1. For “mixed inputs”, the requirements don’t say.

19 FLP Impossibility proof
First prove that the algorithm must have the following pattern of executions, a “Hook”: a finite execution α, with a branch involving two processes i and j. If i takes the next step after α, then the only possible decision thereafter is 0. If j takes the next step, followed by i, then the only possible decision is 1. Thus, we can “localize” the decision point to a particular pattern of executions. For, if not, we can maneuver the algorithm to continue executing forever, everyone continuing to take steps, and no one ever deciding. Contradicts requirement that all the nonfaulty processes should eventually decide. The heart of the proof is a pattern of executions called a “hook”. The proof first shows that, if the algorithm is always going to reach an allowable decision, then there must be a pattern of executions of the following kind: Some finite execution alpha, then a branch involving two processes, i and j. If i takes the next step, then the only possible decision value in any extending execution is 0. On the other hand, if j takes the next step, followed by i, then the only possible decision value in any extending execution is 1. Thus, we can “localize” the decision point of the algorithm to a particular pattern of executions. Whatever algorithm we use, it has to contain some such hook pattern of executions. For, if this doesn’t exist, then we can maneuver the algorithm to continue executing forever, with everyone continuing to take steps, and no one ever deciding---this contradicts the requirement that all the nonfaulty processes should eventually decide.

20 FLP Impossibility proof
Now get a contradiction based on what processes j and i do in their respective steps. Each reads or writes a shared variable. They must access the same variable x: If not, then their steps are independent, so the order can’t matter. So different orders can’t result in different decisions, contradiction. Can’t both read x: Order of reads can’t matter, since reads don’t change x. That leaves three cases: i reads x and j writes x. i writes x and j reads x. Both i and j write x. But now we can get a contradiction based on what the two processes might be doing in their respective steps, in order to cause the decision to be made one way or the other. Remember, in the simple setting I’m considering here, all they could be doing is reading or writing a shared variable. It’s easy to see that they must be accessing the same variable x. For if not, then their steps are completely independent, so the order doesn’t matter. Therefore, different orders can’t result in different decisions. Similarly, if both are reading the variable x, the order can’t matter, since reads don’t change the variable. The rest of the proof simply considers the remaining cases: i reads x and j writes x; i writes x and j reads x; and both write x.

21 FLP Impossibility proof
Case 3: Both write x. What is different after α i vs. α j i? In one case, j writes to the variable x before i does. But in that case, i immediately overwrites what j wrote. So, the only difference is internal to j. If we fail j, we can run the rest of the processes after α i and after α j i, and they will do exactly the same thing. But this contradicts the fact that they must decide differently in the two cases! Case 1: i reads x and j writes x. Similar argument. Case 2: i writes x and j reads x. Consider, for example, the case where both write x. Then, what can be different after alpha i vs. alpha j i? Well, j writes to the variable x before i does, in one case and not in the other. However, in the case where j writes first, i immediately OVERWRITES what j wrote. So, the only difference is internal to j. Now, if we fail j, we can run the rest of the processes after alpha i and after alpha j i, and they will do exactly the same thing. But this contradicts the fact that they must decide differently in the two cases! The other two cases, where one reads and the other writes, have similar arguments.

22 Significance of FLP Significance for distributed computing practice:
Reaching agreement is sometimes important in practice: For agreeing on aircraft altimeter readings. Database transaction commit. FLP shows limitations on the kind of algorithm one can look for. Cannot hope for a timing-independent algorithm that tolerates even one process stopping failure. Main impact: Distributed computing theory 1. Variations on the result: FLP proved for distributed networks, with reliable broadcast communication. [Loui, Abu-Amara 87] extended FLP to read/write shared memory. [Herlihy 91] considered consensus with stronger fault-tolerance requirements: Any number of failures. Simpler proof. New proofs of FLP are still being produced. The impossibility result has had significance for distributed computing practice: Reaching agreement is sometimes important in practice, say for agreeing on aircraft altimeter readings or for agreeing on database transaction commit. The FLP impossibility result shows limitations on the kind of algorithm one can look for in trying to reach agreement in such cases. Namely, one cannot hope for a timing-independent algorithm that can tolerate even one simple stopping failure. The main importance of this result, however, is that it spawned a wealth of work in the theoretical distributed computing community. First, many variations on the result have been proved. For instance, our original result was for a distributed network setting, with reliable broadcast communication. Later, Loui and Abu-Amara strengthened the result by proving it for the better-behaved read/write shared memory setting. Also, Herlihy considered the same problem, still in the read/write shared-memory setting, with stricter requirements: tolerating any number of failures, not just one. For this stricter fault-tolerance requirement, he was able to come up with a simpler version of the proof. New proofs of the result are being produced even now.

23 Significance of FLP 2. Ways to circumvent the impossibility result:
Using limited timing information [Dolev, Dwork, Stockmeyer 87]. Using randomness [Ben-Or 83][Rabin 83]. Weaker guarantees: Small probability of a wrong decision, or Probability of not terminating approaches 0 as time approaches infinity. Second, a series of papers were written on ways to circumvent the impossibility result, using, for example, limited timing information [Dolev, Dwork, Stockmeyer], or random choice [Ben-Or][Rabin]. However, when randomness is used, something is sacrificed in the problem guarantees: for example, there might be a small probability of a wrong decision, or else the probability of not yet terminating is nonzero at any finite time---and approaches 0 as time approaches infinity.

24 Significance of FLP 3. New, “stabilizing” version of the requirements:
Agreement, validity must hold always. Termination required only if system behavior “stabilizes” for a while: No new failures. Timing (of process steps, messages) within “normal” bounds. Has good solutions, both theoretically and in practice. [Dwork, Lynch, Stockmeyer 88] algorithm: Keeps trying to choose a leader, who tries to coordinate agreement. Many attempts can fail. Once system stabilizes, unique leader is chosen, coordinates agreement. The tricky part: Ensuring failed attempts don’t lead to inconsistent decisions. [Lamport 89] Paxos algorithm. Improves on [DLS] by allowing more concurrency, and by having a funny story. Refined, engineered for practical use. [Chandra, Hadzilacos, Toueg 96] Failure detectors. Services that encapsulate use of time in stabilizing algorithms. Developed algorithms like [DLS], [Lamport], using failure detectors. Studied properties of failure detectors, identified weakest FD to solve consensus. Third, a new version of the requirements was defined, which turned out to have good solutions, both theoretically and in practice. The new version requires agreement and validity to hold in all cases. But termination is guaranteed to hold only if the behavior of the underlying distributed system “stabilizes” for a while. For example, if no one fails for a while, and the timing of everything (process steps, message deliveries) is within some “normal” limits. The first algorithm of this kind was by [Dwork, Lynch, Stockmeyer]. Basically, the algorithm keeps attempting to choose a leader, who tries to coordinate agreement. Many attempts can fail. But, once the system stabilizes, a unique leader is guaranteed to be chosen and to succeed in coordinating agreement. The tricky part of the algorithm is to make sure that the failed attempts don’t cause any inconsistencies. Later, Lamport discovered and developed a similar algorithm, the now-well-known Paxos algorithm. This improved on DLS by allowing more concurrency, and by having a funny story. The Paxos algorithm has since been refined and engineered for practical use. Along the same lines, [Chandra, Hadzilacos, Toueg] defined the notion of a “failure detector”, a service that encapsulates the use of timing in these stabilizing algorithms. They developed algorithms similar to those of [DLS] and [Lamport Paxos], expressed in terms of failure detectors. This led to an entire branch of research on its own---studying the properties and capabilities of various kinds of failure detectors, and trying to identify the “weakest failure detector” that is capable of solving consensus.
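To give a flavor of the failure-detector abstraction (an informal sketch only; the class and method names are mine, not from the Chandra-Hadzilacos-Toueg papers), the detector hides all use of clocks and timeouts behind a simple query, so a consensus algorithm built on top of it can itself remain timing-independent:

    import time

    class EventualLeaderDetector:
        """Omega-flavored detector sketch: trust the lowest-numbered process
        heard from recently.  Once the system stabilizes (no new failures,
        timely messages), everyone eventually trusts the same process."""
        def __init__(self, my_id, all_ids, timeout=1.0):
            self.my_id = my_id
            self.all_ids = sorted(all_ids)
            self.timeout = timeout
            self.last_heard = {p: time.monotonic() for p in self.all_ids}

        def heartbeat(self, sender):
            """Called whenever any message from `sender` arrives."""
            self.last_heard[sender] = time.monotonic()

        def leader(self):
            """Queried by the algorithm to pick the coordinator of a round."""
            now = time.monotonic()
            alive = [p for p in self.all_ids
                     if now - self.last_heard[p] <= self.timeout]
            return min(alive) if alive else self.my_id

    # Usage sketch: in each round, the process the detector currently trusts tries
    # to coordinate agreement; rounds that fail are harmless and are simply retried.
    fd = EventualLeaderDetector(my_id=2, all_ids=[1, 2, 3])
    fd.heartbeat(1)
    coordinator_for_this_round = fd.leader()   # 1, until 1 stops being heard from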

25 Significance of FLP 4. Characterizing computability in distributed systems, in the presence of failures. E.g., k-consensus: At most k different decisions occur overall. Problem defined by [Chaudhuri 93]. Characterization of computability in distributed settings: Solvable for k-1 process failures but not for k failures. Algorithm for k-1 failures: [Chaudhuri 93]. Matching impossibility result: [Chaudhuri 93] Partial progress, using arguments like FLP. [Herlihy, Shavit 93], [Borowsky, Gafni 93], [Saks, Zaharoglou 93] Gödel Prize, 2004. Techniques from algebraic topology: Sperner’s Lemma. Used to obtain k-dimensional analogue of the Hook. Fourth and finally, FLP opened up an area of theoretical study: trying to characterize what problems are and are not computable in distributed systems, in the presence of various numbers of failures. For example, a notable generalization is [Chaudhuri]’s problem of k-consensus, where the processes are allowed to disagree a little---as long as at most k different decisions appear overall. Chaudhuri defined the problem, and got some partial results about its computability in distributed settings. Subsequent work settled its computability questions completely: It can be solved in systems admitting k-1 process failures but not in systems with k failures. The solution for k-1 failures is a cute algorithm, by Chaudhuri. But she made only partial progress towards the impossibility part of the characterization: she tried to use combinatorial arguments like those of FLP, and they didn’t quite work. So, others completed the impossibility result. In fact, this result was proved concurrently by three separate groups of researchers: [Herlihy, Shavit 93], [Borowsky, Gafni 93], and [Saks, Zaharoglou 93]. This work won the prestigious Gödel Prize a couple of years ago. The techniques that ended up working were derived from algebraic topology, specifically, Sperner’s Lemma. This was used to obtain k-dimensional analogues of the “hook” construction.

26 Open questions related to FLP
Characterize exactly what problems can be solved in distributed systems: Based on problem type, number of processes, and number of failures. Which problems can be used to solve which others? Exactly what information about timing and/or failures must be provided to processes in order to make various unsolvable problems solvable? For example, what is the weakest failure detector that allows solution of k-consensus with k failures? What questions remain open, related to FLP? We still don’t have a complete characterization of what problems can be computed in distributed systems with various numbers of failures (Classify by type of problem, number n of processes, and number f of failures). What about relative computability? Which problems can be used to solve which others? Exactly what kind of information about timing and/or failures needs to be provided to the processes in a distributed system in order to make various unsolvable problems solvable? For example, what is the weakest failure detector that allows solution of k-consensus with k failures?

27 5. Modeling Frameworks Recall I/O automata [Lynch, Tuttle 87].
State machines that interact using input and output actions. Good for describing asynchronous distributed systems: no timing assumptions. Components take steps at arbitrary speeds Steps can interleave arbitrarily. Supports system description and analysis using composition and levels of abstraction. I/O Automata are adequate for much of distributed computing theory. But not for everything… Before finishing, I would like to mention briefly another rather large body of work that my collaborators and I have produced, much of it in the past few years: a collection of mathematical modeling frameworks for distributed systems, with various kinds of features. I have already mentioned the I/O automata framework of [Lynch, Tuttle]. It consists of state machines called I/O automata that interact using input and output actions. They are good for describing asynchronous distributed systems, that is, systems with no timing assumptions. The components take steps at arbitrary speeds, and the steps can interleave arbitrarily. The framework supports system description and analysis using composition and levels of abstraction. This framework is adequate as a basis for much of distributed computing theory. But not for everything.

28 Timed I/O Automata We need also to model and analyze timing aspects of systems. Timed I/O Automata, extension of I/O Automata [Lynch, Vaandrager 92, 94, 96], [Kaynar, Segala, L, V 05]. Trajectories describe evolution of state over a time interval. Can be used to describe: Time bounds, e.g., on message delay, process speeds. Local clocks, used by processes to schedule steps. Used for time performance analysis. Used to model hybrid systems: Real-world objects (vehicles, airplanes, robots,…) + computer programs. Hybrid I/O Automata [Lynch, Segala, Vaandrager 03] Also allows continuous interactions between components. Applications: Timing-based distributed algorithms, hybrid systems. Sometimes, we want to include the ability to model and reason about timing aspects of systems. [Kaynar, Lynch, Segala, Vaandrager 05], based on earlier work by [Lynch, Vaandrager] and [Lynch, Segala, Vaandrager], have developed the Timed I/O Automata framework---an extension of I/O automata that has features for modeling timing. TIOAs have “trajectories”, which describe the evolution of an automaton’s state over an interval of time, in addition to the usual discrete transitions. TIOAs can express time bounds, e.g., an upper bound on the time to deliver a message; or bounds on the speeds of processes. TIOAs can be used to describe distributed algorithms whose processes have local clocks, and use them to schedule some of their steps. They can be used for time analysis. TIOAs can also be used to model components of “hybrid systems”, which consist of real-world objects like vehicles, airplanes, robots, chemical solutions, as well as computer programs. Extending this theme some more, [Lynch, Segala, Vaandrager] have developed a “Hybrid I/O Automata” (HIOA) framework, which generalizes TIOA to allow continuous interactions as well as discrete interactions between components. We have used TIOA and HIOA to model and analyze many timing-based distributed algorithms and hybrid systems.
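As a very small illustration of the trajectory idea (illustrative Python only, not the TIOA formalism): between discrete transitions, part of the state---here a drifting local clock---evolves as a function of real time.

    def clock_trajectory(clock0, drift, duration, step=0.25):
        """Sample the evolution of a local clock over a time interval:
        the clock runs at rate (1 + drift) relative to real time."""
        samples = []
        t = 0.0
        while t <= duration + 1e-9:
            samples.append((t, clock0 + (1 + drift) * t))
            t += step
        return samples

    # Trajectory: a clock that runs 1% fast, followed for 2 real-time units...
    traj = clock_trajectory(clock0=5.0, drift=0.01, duration=2.0)
    # ...then a discrete "synchronize" transition resets it to a reference value.
    clock_after_sync = 0.0
    assert abs(traj[-1][1] - 7.02) < 1e-9    # 5.0 + 1.01 * 2.0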

29 Probabilistic I/O Automata,…
[Segala 94] Probabilistic I/O Automata, Probabilistic Timed I/O Automata. Express random choices, random system behavior. Current work: Improving PIOA: composition, simulation relations. Current work: Integrating PIOA with TIOA and HIOA. The combination should allow modeling and analysis of any kind of distributed system we can think of. [Segala] has also developed Probabilistic I/O Automata and Probabilistic Timed I/O Automata, which include the ability to talk about explicit random choices and other random system behavior. We are still working on improving the PIOA model, and integrating it better with TIOA and HIOA. The combination should allow modeling of any kind of distributed system we can think of. All of these frameworks support description of systems in pieces, both as compositions of interacting components and using levels of abstraction. They all support nice, general proof methods.

30 6. New Challenges [Distributed Algorithms 96]:
Summarizes basic results of distributed computing theory, ca. 1996: Asynchronous algorithms, plus a few timing-dependent algorithms. Fixed, wired networks. Still some open questions, e.g., general characterizations of computability. New frontiers in distributed computing theory: E.g., algorithms for mobile wireless networks. Much worse behaved than traditional wired networks. No one knows who the participating processes are. The set of participants may change. Mobility. Much harder to program. So, this area needs a theory! New algorithms. New modeling and analysis methods. New impossibility results, giving the limits of what is possible in such networks. The entire area is wide open for new theoretical work. A few years ago, I wrote a book, “Distributed Algorithms”, which summarizes most of the basic results of the field at that time. Basically, it’s about standard asynchronous distributed algorithms, plus a few timing-dependent algorithms. All in fixed, wired networks. There are still some open questions in these areas, e.g., the characterization questions I mentioned earlier. But now, entirely new frontiers are opening up for research in distributed computing theory. For example, the area of distributed algorithms for mobile wireless networks is just getting started. Such networks are very much worse behaved than traditional wired networks. No process knows who the other participating processes are. The set of participants may change, as nodes enter and leave an area. So it is much harder to program such networks to do interesting things, than it is for traditional networks. So, I think this area needs a theory. That means new algorithms. And new modeling and analysis methods. And new impossibility results, giving the limits of what is possible in such networks. The entire area is wide open for new theoretical work.

31 Distributed algorithms for mobile wireless networks
My group (and others) are now working in this area, developing algorithms, proving impossibility results. Clock synchronization, consensus, reliable communication,… One approach to algorithm design: Virtual Node Layers. Use the existing network to implement (emulate) a better-behaved network, as a higher level of abstraction. Use the Virtual Node Layer to implement applications. We are exploring VNLs, both theoretically and experimentally*. *Note: Using CWI’s Python language… My group is now working in this area, developing some algorithms and proving some impossibility results. For example, one promising general approach we are taking right now: building Virtual Node Layers. That is, we use the existing network to implement (emulate) a better-behaved network, as a higher level of abstraction. The Virtual Node Layer is easier to program, for many applications, including applications involving communication, data-management, and control. We are exploring this approach, both theoretically and via experiments. (Note: Our implementations are being done using CWI’s Python language…)

32 7. Epilogue Overview of our work in distributed computing theory, especially Impossibility results. Models and proof methods. Emphasis on FLP impossibility result, for consensus in fault-prone distributed systems. It was fun overviewing this collection of results; I hope you enjoyed some of it also.

33 Thanks to my collaborators:
Yehuda Afek, Myla Archer, Eshrat Arjomandi, James Aspnes, Paul Attie, Hagit Attiya, Ziv Bar-Joseph, Bard Bloom, Alan Borodin, Elizabeth Borowsky, James Burns, Ran Canetti, Soma Chaudhuri, Gregory Chockler, Brian Coan, Ling Cheung, Richard DeMillo, Murat Demirbas, Roberto DePrisco, Harish Devarajan, Danny Dolev, Shlomi Dolev, Ekaterina Dolginova, Cynthia Dwork, Rui Fan, Alan Fekete, Michael Fischer, Rob Fowler, Greg Frederickson, Eli Gafni, Stephen Garland, Rainer Gawlick, Chryssis Georgiou, Seth Gilbert, Kenneth Goldman, Nancy Griffeth, Constance Heitmeyer, Maurice Herlihy, Paul Jackson, Henrik Jensen, Frans Kaashoek, Dilsun Kaynar, Idit Keidar, Roger Khazan, Jon Kleinberg, Richard Ladner, Butler Lampson, Leslie Lamport, Hongping Lim, Moses Liskov, Carolos Livadas, Victor Luchangco, John Lygeros, Dahlia Malkhi, Yishay Mansour, Panayiotis Mavrommatis, Michael Merritt, Albert Meyer, Sayan Mitra, Calvin Newport, Tina Nolte, Michael Paterson, Boaz Patt-Shamir, Olivier Pereira, Gary Peterson, Shlomit Pinter, Anna Pogosyants, Stephen Ponzio, Sergio Rajsbaum, David Ratajczak, Isaac Saias, Russel Schaffer, Roberto Segala, Nir Shavit, Liuba Shrira, Alex Shvartsman, Mark Smith, Jorgen Sogaaard-Andersen, Ekrem Soylemez, John Spinelli, Eugene Stark, Larry Stockmeyer, Joshua Tauber, Mark Tuttle, Shinya Umeno, Frits Vaandrager, George Varghese, Da-Wei Wang, William Weihl, H.P.Weinberg, Jennifer Welch, Lenore Zuck,……and others I have forgotten to list. I want to thank all my collaborators over the years, on all of my work. Here are some of them:

34 Thank you! Again, I want to thank everyone at CWI for the great honor of receiving the Adriaan van Wijngaarden prize.

