
Distributed Resilient Consensus. Nitin Vaidya, University of Illinois at Urbana-Champaign. Some research reported here is supported in part by …

This is a draft version of the slides. The final version will be made available on the webpage below:

Distributed Resilient Consensus. Scope: distributed consensus, resilient to node failures (with an emphasis on Byzantine faults) and to lossy communication links. The coverage is biased toward my own research; there is a much larger body of work by the community.

Consensus … Dictionary Definition: general agreement; majority opinion.

Many Faces of Consensus  What time is it?  Network of clocks … agree on a common notion of time 5

Many Faces of Consensus  Commit or abort ?  Network of databases … agree on a common action

Many Faces of Consensus  What is the temperature?  Network of sensors … agree on current temperature

Many Faces of Consensus  Where to go for dinner?  Network of friends… agree on a location Photo credit: Jon Hanson, Wiki Commons

Many Faces of Consensus  Should we trust ?  Web of trust … agree whether is good or evil 9

Consensus. Correctness requirements:
- Agreement … the participating nodes agree
- Validity … several variations: output = input of a designated node; output = within the range of all (good) nodes' inputs; output = average of all inputs; …
- Termination … several variations

Many Faces of Consensus  No failures / failures allowed (node/link)  Synchronous/asynchronous  Deterministically correct / probabilistically correct  Exact agreement / approximate agreement  Amount of state maintained 11

Failure Models  Crash failure  Faulty node stops taking any steps at some point  Byzantine failure  Faulty node may behave arbitrarily  Drop, tamper messages  Send inconsistent messages 12

Timing Models  Synchronous  Bound on time required for all operations  Asynchronous  No bound 13

Correctness  Deterministic  Always results in correct outcome (agreement, validity)  Probabilistic  Correctness only with high probability  Non-zero probability of incorrect outcome 14

Agreement  Exact  All nodes agree on exactly identical value (output)  Approximate  Nodes agree on approximately equal values  Identical in the limit as time  ∞ 15

State. How much state may a node maintain? Either no constraint, or only "minimal" state.

Many Faces of Consensus  No failures / failures allowed (node/link)  Synchronous/asynchronous  Deterministically correct / probabilistically correct  Exact agreement / approximate agreement  Amount of state maintained 17

Consensus in Practice: fault-tolerant servers; distributed control.

Client-Server Model. [Figure: a client sends command A to a server whose state is 0.]

Client-Server Model. [Figure: the server applies command A, moves from state 0 to state A, and returns response(A) to the client.]

Client-Server Model. A crash failure makes the service unavailable. [Figure: the client's command A reaches a crashed server.]

Replicated Servers. [Figure: clients send command A and command B to a set of replicated servers, each with state 0.]

Replicated Servers  All (fault-free) servers must execute identical commands in an identical order  Despite crash failures 23

Client-Server Model. A Byzantine failure can result in an incorrect response. [Figure: the client sends command A; the faulty server returns garbage (%*$&#).]

Replicated Servers. [Figure: the client sends command A; every server, starting in state 0, receives A.]

Replicated Servers. [Figure: the servers, now in state A, return response(A); the faulty server returns garbage (%*$&#), which the client's majority vote masks.]

Replicated Servers  All (fault-free) servers must execute identical commands in an identical order  Despite crash failures 28

Coordination  Different nodes must take consistent action  As a function of input from the various nodes 29

Binary Consensus

Consensus. Each node has a binary input.
- Agreement: all outputs identical
- Validity: if all inputs are 0, the output is 0; if all inputs are 1, the output is 1; else the output may be 0 or 1
- Termination: after a bounded interval

Synchronous Fault-Free Systems. Trivial algorithm: each node sends its input to all other nodes and chooses the majority of the received values as its output.

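A minimal sketch of this trivial algorithm, assuming a fault-free synchronous round in which every node ends up seeing every input (inputs below are illustrative):

```python
# A minimal sketch of the trivial algorithm: with no failures, every node
# receives every input, so all nodes compute the same majority.
def fault_free_consensus(inputs):
    """inputs: one binary input per node; returns each node's output."""
    majority = 1 if sum(inputs) > len(inputs) / 2 else 0
    return [majority] * len(inputs)   # identical output at every node

print(fault_free_consensus([0, 1, 1, 1]))  # -> [1, 1, 1, 1]
```
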
Synchronous + Crash Failures. The previous algorithm does not work under a crash failure. Suppose four nodes with inputs A: 0, B: 0, C: 1, D: 1, and node A fails after sending its value to B and C but before sending it to D. Then B and C see 0, 0, 1, 1 while D sees 0, 1, 1, so their majority votes can differ.

Synchronous + up to f Crash Failures.
Initialization: S = {input}.
For f + 1 rounds: send any new values in S to all other nodes; receive values from the other nodes and add them to S.
After f + 1 rounds: output v if all values in S equal v, else output 0.

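A simulation sketch of this f + 1-round algorithm, assuming a crash model in which a node failing in round r reaches an arbitrary subset of nodes in that round and is silent afterwards (the crash schedule and the default output 0 follow the slide; the rest is illustrative):

```python
# A sketch of the f+1-round flooding algorithm under crash failures.
import random

def crash_tolerant_consensus(inputs, f, crashed_in_round=None):
    n = len(inputs)
    crashed_in_round = crashed_in_round or {}      # node -> round it crashes
    S = [{v} for v in inputs]                      # each node's value set
    alive = set(range(n))
    for r in range(f + 1):                         # f+1 rounds
        msgs = [set() for _ in range(n)]
        for i in list(alive):
            targets = range(n)
            if crashed_in_round.get(i) == r:       # reaches only some nodes
                targets = random.sample(range(n), random.randrange(n))
                alive.remove(i)
            for j in targets:
                msgs[j] |= S[i]                    # flood known values
        for j in alive:
            S[j] |= msgs[j]
    # After f+1 rounds, all surviving nodes hold the same set S.
    return {j: (S[j].pop() if len(S[j]) == 1 else 0) for j in alive}

print(crash_tolerant_consensus([0, 0, 1, 1], f=1, crashed_in_round={0: 0}))
```
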
Synchronous + Crash Failures  Lower bound: at least f+1 rounds necessary to tolerate f crash failures 35

Asynchronous + Crash Failure  Consensus impossible in asynchronous systems in presence of even a single (crash) failure FLP [Fischer, Lynch, Paterson]  Intuition … asynchrony makes it difficult to distinguish between a faulty (crashed) node, and a slow node 36

How can we achieve consensus under asynchrony? Relax the agreement or termination requirements. We will discuss "approximate" agreement.

Synchronous + Byzantine Faults

Synchronous + Byzantine Consensus. Each node has a binary input.
- Agreement: all outputs identical
- Validity: if all fault-free inputs are 0, the output is 0; if all fault-free inputs are 1, the output is 1; else the output may be 0 or 1
- Termination: after a bounded interval

Byzantine Consensus via Byzantine Broadcast. Byzantine broadcast with source S:
- Agreement: all outputs identical
- Validity: if the source is fault-free, then output = source's input; else the output may be 0 or 1
- Termination: after a bounded interval

Byzantine Consensus via Byzantine Broadcast  Each node uses Byzantine Broadcast to send its value to all others  Each node computed output = majority of values thus received 41

Byzantine Broadcast. Example algorithm [Lamport, Shostak, Pease 1982]: n = 4 nodes, at most f = 1 faulty node.

Byzantine Broadcast with a fault-free source. [Figure sequence, n = 4, one faulty peer:]
- Source S sends its input v to nodes 1, 2, 3.
- Each node relays the value it received to the other two; the faulty node may relay arbitrary values (shown as ?).
- Each good node thus holds [v, v, ?], and the majority vote yields v. Correct.

Byzantine Broadcast with a faulty source. A bad source may attempt to diverge the state at the good nodes. [Figure sequence:]
- The faulty source S sends different values v, w, x to nodes 1, 2, 3.
- The good nodes relay honestly, so every good node ends up holding the same vector [v, w, x].
- The vote is therefore taken over identical vectors and is identical at all good nodes.

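A sketch of both cases of this n = 4, f = 1 example; the message plumbing and the deterministic tie-break rule are illustrative assumptions, not the exact LSP bookkeeping:

```python
# Each peer decides by majority over the three values it holds for the
# source; ties are broken by a fixed rule so good peers decide identically.
from collections import Counter

def byzantine_broadcast_n4(from_source, relayed):
    """from_source[i]: value the source sent peer i.
    relayed[k][i]: value peer k relayed to peer i (honest peers relay
    exactly what they received from the source)."""
    decisions = {}
    for i in range(3):
        held = [from_source[i]] + [relayed[k][i] for k in range(3) if k != i]
        counts = Counter(held)
        # Majority value; deterministic tie-break keeps good peers aligned.
        decisions[i] = max(counts, key=lambda v: (counts[v], str(v)))
    return decisions

# Case 1: good source sends v; faulty peer 2 relays garbage '?'.
relay = {0: {1: 'v', 2: 'v'}, 1: {0: 'v', 2: 'v'}, 2: {0: '?', 1: '?'}}
print(byzantine_broadcast_n4(['v', 'v', 'v'], relay))   # all decide 'v'

# Case 2: faulty source sends v, w, x; all peers relay honestly.
relay = {0: {1: 'v', 2: 'v'}, 1: {0: 'w', 2: 'w'}, 2: {0: 'x', 1: 'x'}}
print(byzantine_broadcast_n4(['v', 'w', 'x'], relay))   # identical decisions
```
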
Known Bounds on Byzantine Consensus
- n ≥ 3f + 1 nodes to tolerate f failures
- Connectivity ≥ 2f + 1
- Ω(n²) messages in the worst case
- f + 1 rounds of communication

Ω(n²) Message Complexity. Since each message carries at least 1 bit, this means Ω(n²) bits of "communication complexity" to agree on just a 1-bit value.

Can we do better?

Lower Overhead. Two solutions: (1) probabilistically correct consensus (for a single instance); (2) amortizing the cost over many instances of consensus. We will illustrate the second approach for multi-valued consensus.

LSP Algorithm: a replication code. [Figure: the source's value v is sent and relayed along every path, so each node receives several copies (v, v, ?); in effect, a repetition code.]

Multi-Valued Consensus. L-bit inputs.
- Agreement: all nodes must agree
- Validity: if all fault-free nodes have the same input, the output is that input; else the output may be any value
- Termination

L-bit Inputs. Potential solution: agree on each bit separately. This costs Ω(n²L) bits of communication, i.e., Ω(n²) bits per agreed bit. It is, however, possible to achieve O(n) bits per agreed bit for large L.

Intuition  Make the common case fast  Common case: No failure  Failure detection in “normal” operation, not tolerance  Learn from past misbehavior, and adapt 63

Make the Common Case Fast. [Figure sequence, n = 4, f = 1, two-bit value (a, b):]
- Source S sends symbol a to node 1, symbol b to node 2, and the parity a+b to node 3.
- The nodes exchange their symbols, so each node holds the vector [a, b, a+b].
- If the parity check passes at all nodes, they agree on (a, b).
- The parity check fails at a node if node 1 misbehaves toward it (sending ? instead of a).
- The check also fails at a good node if S itself sends a bad codeword (a, b, z) with z ≠ a+b.

Make the Common Case Fast. The example uses n = 4 nodes, f = 1 (at most 1 failure), and a (3, 2) code over binary symbols. The general algorithm for f faults uses an (n, n−f) error-detecting code, slightly differently, with a symbol size that is a function of n (the number of nodes).

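A sketch of the fast round for n = 4, f = 1 with the (3, 2) parity code; the tamper parameter and the message plumbing are illustrative assumptions:

```python
# The source splits its 2-bit value into symbols a, b and the parity a ^ b,
# sends one symbol to each peer, peers exchange symbols, and each peer runs
# a parity check on the vector it holds.

def fast_round(a, b, tamper=None):
    """tamper: optional (peer, forged_symbol) modelling a faulty peer."""
    symbols = {0: a, 1: b, 2: a ^ b}              # one symbol per peer
    ok = True
    for i in range(3):
        view = dict(symbols)                       # peer i's own symbol kept
        for k in range(3):
            if k != i:
                sym = symbols[k]
                if tamper and tamper[0] == k:
                    sym = tamper[1]                # faulty peer relays junk
                view[k] = sym
        if view[0] ^ view[1] != view[2]:           # parity check
            ok = False                             # failure detected at peer i
    return ('agree', (a, b)) if ok else ('failure detected', None)

print(fast_round(1, 0))                   # fault-free: agree on (1, 0)
print(fast_round(1, 0, tamper=(2, 0)))    # tampering trips the parity check
```
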
Algorithm Structure  Fast round (as in the example)

Algorithm Structure  Fast round (as in the example) 73 S a b a+b b b a, b a a [a,b,a+b] a+b [a,b,a+b] a+b

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected S a b a+b b b a, b ? ? [?,a,a+b] a+b

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected  Expensive round to learn new info about failure  Fast round …  Expensive round to learn new info about failure. 76

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected  Expensive round to learn new info about failure  Fast round …  Expensive round to learn new info about failure. After a small number of expensive rounds, failures completely identified

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected  Expensive round to learn new info about failure  Fast round …  Expensive round to learn new info about failure. After a small number of rounds failures identified  Only fast rounds hereon 78

Algorithm "Analysis". There are many fast rounds and few expensive rounds. When averaged over all rounds, the cost of the expensive rounds is negligible; the average cost depends only on the fast round, which is very efficient. For large L, the communication complexity is L · n(n−1)/(n−f) = O(nL) bits.

Algorithm Structure  Fast round (as in the example)  Fast round …  Fast round in which failure is detected  Expensive round to learn new info about failure 80

Life After Failure Detection  Learn new information about identity of the faulty node How ?  Ask everyone to broadcast what they did during the previous “faulty fast round”  This is expensive 81

Life After Failure Detection  “He-said-she-said”  S claims sending a+b, C claims receiving z ≠ a+b 82 S a b z b b a a [a,b,z] z z

Life After Failure Detection  Diagnosis graph to track trust relationship  Initial graph : 83 S 3 2 1

Life After Failure Detection  S and 3 disagree  Remove edge S-3 84 S 3 2 1

Life After Failure Detection  The fast algorithm adapted to avoid using the link between S and 3  Key: Only use links between nodes that still trust each other  If more than f nodes do not trust some node, that node must be faulty 85

Impact of the Network Capacity Region

Impact of Network  Traditional metrics ignore network capacity region  Message complexity  Communication complexity  Round complexity  How to quantify the impact of capacity ? 88

Throughput  Borrow notion of throughput from networking  b(t) = number of bits agreed upon in [0,t] 89

Impact of Network  How does the network affect Byzantine broadcast/consensus ? 90

Questions  What is the maximum achievable throughput of Byzantine broadcast?  How to achieve it ? 91 S bit/second 10

In the example network, the throughput of the LSP algorithm is 1 bit/second, whereas the optimal throughput is 11 bits/second.

Multi-Valued Byzantine Broadcast. A long sequence of input values (input 1, input 2, input 3, input 4, …), each input containing many bits.

Proposed Algorithm  Multiple phases for each input value  Trust relation 94

For each input:
- Phase 1: Unreliable broadcast … the sender sends its input to all. This is the dominant use of the network.
- Phase 2: Failure detection … check whether all good nodes got the same value, and agree on whether a failure is detected (maintaining consistent knowledge).
- Phase 3 (optional): Dispute control … learn new information about the faulty nodes (the trust relation), and agree on which nodes trust each other (maintaining consistent knowledge).
Dispute control is expensive, but it is executed only a few times, so its amortized cost is negligible when the input size is large.

Phase 1: Unreliable Broadcast. Broadcast without fault tolerance:
- Only links between trusting node pairs are used.
- Transmission proceeds along directed spanning trees rooted at the source.
- Rate for Phase 1 = the number of directed unit-capacity spanning trees rooted at S.

Phase 1: Unreliable Broadcast. [Figure sequence: a network with unit-capacity links contains two directed spanning trees rooted at S. Each link of a spanning tree carries one symbol per round: the first tree carries a, the second carries b. A faulty node may tamper with the packets it forwards, e.g., passing on z in place of a or b.]

Phase 2: Failure Detection  Equality checking for every subset of n−f nodes  Faulty nodes should not be able to make unequal values appear equal 113

Failure Detection (Equality Checking). [Figure: with n = 4 and f = 1, equality is checked within each of the four subsets of n−f = 3 nodes.]

Checking with Undirected Spanning Trees. Each undirected spanning tree checks one data symbol. [Figure: nodes hold data symbols (X_S, Y_S, Z_S), (X_2, Y_2, Z_2), (X_3, Y_3, Z_3), …; the X symbols are compared along one spanning tree, the Y symbols along another, and the Z symbols along a third.]

Shortcoming  Each link used by multiple sets of n−f nodes 118

Failure Detection (Equality Checking) 119 S 2 31 S 2 31 S 2 31 S 2 31

Shortcoming  Each link used by multiple sets of n−f nodes Inefficient to check separately  Reuse symbols for solving all equality checks simultaneously 120

Phase 2: Failure Detection (Equality Checking)  Using local coding 121

Local Coding. Each node sends linear combinations of its data symbols. [Figure: node 3 holds X_3, Y_3, Z_3 and sends combinations such as X_3 + 3·Y_3 + 9·Z_3 and X_3 + 8·Y_3 + 64·Z_3.] Each node checks the consistency of the received packets against its own data, e.g., node 2 checks whether the received X_S + 3·Y_S + 9·Z_S equals its own X_2 + 3·Y_2 + 9·Z_2.

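A sketch of the local-coding check; the prime field and the Vandermonde-style coefficients (1, c, c²) are illustrative assumptions. A single combination can miss differences lying in its null space, which is why the actual algorithm sends enough combinations (one per spanning tree):

```python
# Each node sends linear combinations of its data symbols; the receiver
# recomputes the same combination over its own data and compares.
P = 97                                             # a small prime field

def combine(symbols, c):
    """Linear combination x + c*y + c^2*z (mod P)."""
    x, y, z = symbols
    return (x + c * y + c * c * z) % P

def check(own_symbols, received_combo, c):
    """A node verifies a neighbour's combination against its own data."""
    return combine(own_symbols, c) == received_combo

data_S = (10, 20, 30)                              # source's symbols X, Y, Z
data_2 = (10, 20, 30)                              # node 2 got the same data
print(check(data_2, combine(data_S, 3), 3))        # True: values match
print(check((10, 21, 30), combine(data_S, 3), 3))  # False: mismatch detected
```
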
Failure Detection with Local Coding. All the equality checks can be solved simultaneously if (# data symbols) ≤ (minimum # of undirected spanning trees over all sets of n−f nodes).

Throughput Analysis

Throughput Analysis. The overhead of Phases 1 and 2 is dominant. Let B = the minimum rate of Phase 1 and U = the minimum rate of Phase 2. Pushing L bits through both phases takes time about L/B + L/U, so

Throughput ≥ 1 / (1/B + 1/U) = BU / (B + U).

Throughput Analysis. Upper bound on the optimal throughput (proof in the paper): optimal ≤ min(B, 2U). Our algorithm achieves ≥ BU/(B + U), which is ≥ optimal/3 in general, and ≥ optimal/2 if B ≤ U.

Proof Sketch (1). Consider any n−f nodes, labeled 1, 2, …, n−f. Each node i has value X_i = [X_{i,1}, X_{i,2}, …, X_{i,r}], where r = # data symbols. Define the difference D_i = X_i − X_{n−f}. Then D_1 = D_2 = … = D_{n−f−1} = 0 if and only if X_1 = X_2 = … = X_{n−f}.

Proof Sketch (2). The checking in Phase 2 is equivalent to checking [D_1 D_2 … D_{n−f−1}] C = 0, where C is a matrix filled with the coding coefficients. C has (n−f−1)r rows, and each column corresponds to one unit-capacity link and the linear combination sent over it.

Proof Sketch (3). If r ≤ # undirected spanning trees among these nodes, then #columns ≥ #rows = (n−f−1)r. Let M be the (n−f−1)r × (n−f−1)r submatrix of C corresponding to the links along r spanning trees. We can make M invertible; in that case [D_1 … D_{n−f−1}] C = 0 implies [D_1 … D_{n−f−1}] M = 0, hence [D_1 … D_{n−f−1}] = 0, i.e., X_1 = X_2 = … = X_{n−f}. This solves equality checking for one set of n−f nodes.

Proof Sketch (4). If r ≤ the minimum # of undirected spanning trees over all sets of n−f nodes (the rate for Phase 2), we can make all the M matrices invertible simultaneously, which solves equality checking for all sets of n−f nodes.

So far …

Summary. Exact consensus: consensus with crash failures, consensus with Byzantine failures, and the impact of network capacity.

Iterative Algorithms for Approximate Consensus

Outline  Iterative average consensus (no failures)  Impact of lossy links  Iterative Byzantine consensus  With and without synchrony 135

Average Consensus. [Figure: a flock deciding which way to fly.] The members want to decide which way to fly: each proposes a direction, and consensus ≈ the average of the inputs.

Centralized Solution  A leader collects inputs & disseminates decision 137 leader

Distributed Iterative Solution  Initial state a, b, c = input 138 c b a

Distributed Iterative Solution  State update (iteration) 139 a = 3a/4+ c/4 b = 3b/4+ c/4 c = a/4+b/4+c/2 c b a

In matrix form, with v = (a, b, c)ᵀ:

  [a]      [3/4  0   1/4] [a]
  [b]  :=  [0   3/4  1/4] [b]   , i.e., v := M v.
  [c]      [1/4 1/4  1/2] [c]

After one iteration the state is M v, after two iterations M² v, and after k iterations Mᵏ v.

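A sketch of this iteration using the matrix M from the example (the inputs are illustrative):

```python
# Iterate x(k) = M^k x(0) for the three-node example; M encodes the updates
# a = 3a/4 + c/4, b = 3b/4 + c/4, c = a/4 + b/4 + c/2 from the slides.
import numpy as np

M = np.array([[3/4, 0,   1/4],
              [0,   3/4, 1/4],
              [1/4, 1/4, 1/2]])

x = np.array([9.0, 3.0, 0.0])          # inputs a, b, c
for _ in range(50):                     # 50 iterations of x := M x
    x = M @ x
print(x)                                # all entries close to 4.0 = (9+3+0)/3
```
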
Reliable Nodes & Links. Consensus is achievable iff at least one node can reach all other nodes (row-stochastic M). Average consensus is achievable iff the graph is strongly connected, with a suitably chosen doubly stochastic transition matrix M.

With a doubly stochastic M, every entry of Mᵏ tends to 1/3 here, so each node's state converges to the average (a + b + c)/3.

An Implementation: Mass Transfer + Accumulation. Each node "transfers mass" to its neighbors via messages; the next state is the total mass received. [Figure: c sends c/4 to a and c/4 to b and keeps c/2; a keeps 3a/4 and b keeps 3b/4, matching the updates a = 3a/4 + c/4, b = 3b/4 + c/4, c = a/4 + b/4 + c/2.] Conservation of mass: a + b + c is constant after each iteration.

Wireless Transmissions Are Unreliable. [Figure: the message carrying c/4 is lost.] If the mass c/4 simply vanishes with the lost message, the updates no longer add up, and conservation of mass is violated.

Average consensus over lossy links?

Solution 1. Assume that the transmitter and the receiver both know whether a message is delivered or not.

Solution 2. When mass is not transferred to a neighbor, keep it to yourself.

Convergence follows … if the nodes are intermittently connected. [Figure: when c's transfer to b is lost, c keeps the mass: c = a/4 + b/4 + c/2 + c/4.]

Solution 1 requires common knowledge of whether a message is delivered: S knows, R knows that S knows, S knows that R knows that S knows, and so on. [Figure: sender S and receiver R.]

Reality in Wireless Networks. Common knowledge of whether a message is delivered is unattainable. Two scenarios the sender cannot distinguish: A's message to B is lost, so B sends no Ack; or A's message is received by B, but the Ack is lost.

We need solutions that tolerate the lack of common knowledge; theoretical models need to be more realistic.

Solution 2  Average consensus without common knowledge … using additional per-neighbor state 160

Solution Sketch  S = mass C wanted to transfer to node A in total so far  R = mass A has received from node C in total so far 161 C A S R S R

Solution Sketch  Node C transmits quantity S …. message may be lost  When it is received, node A accumulates (S-R) 162 C A S R S-R S R

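A simulation sketch of the per-neighbor counters for one link; the loss probability is an illustrative assumption:

```python
# Sender C keeps S, the total mass it has wanted to send to A so far;
# receiver A keeps R, the total it has absorbed. Each message carries S;
# on delivery, A absorbs S - R, so no mass is ever lost.
import random

S = 0.0          # cumulative mass C intended for A
R = 0.0          # cumulative mass A has absorbed from C
a_mass = 0.0     # node A's accumulated mass from C

for step in range(20):
    S += 0.25                      # C wants to transfer 0.25 this iteration
    if random.random() < 0.5:      # message carrying S may be lost
        continue                   # nothing absorbed; S - R waits in a buffer
    a_mass += S - R                # A absorbs everything outstanding
    R = S

print(S, R, a_mass)  # a_mass == R; the gap S - R is mass in the virtual buffer
```
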
What Does That Do? It implements virtual buffers: the outstanding mass S − R sits in a buffer associated with each link. [Figure: the node graph augmented with one buffer node per link.]

Dynamic Topology  When C  B transmission unreliable, mass transferred to buffer (d)  d = d + c/4 165 a b d c

Dynamic Topology  When C  B transmission unreliable, mass transferred to buffer (d)  d = d + c/4 166 a b d c No loss of mass even with message loss

Dynamic Topology  When C  B transmission reliable, mass transferred to b  b = 3b/4 + c/4 + d 167 a b d c No loss of mass even with message loss

Does This Work? [Plot: node states over time; the states do not settle at the average.]

Why Doesn't It Work? After k iterations, the state of each node has the form z(k) · (sum of inputs), where z(k) changes with each iteration k; this does not converge to the average.

Solution  Run two iterations in parallel  First : original inputs  Second : input = 1 171

Solution  Run two iterations in parallel  First : original inputs  Second : input = 1  After k iterations … first algorithm: z(k) * sum of inputs second algorithm: z(k) * number of nodes

Solution  Run two iterations in parallel  First : original inputs  Second : input = 1  After k iterations … first algorithm: z(k) * sum of inputs second algorithm: z(k) * number of nodes ratio = average

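A simulation sketch of the ratio fix, in the spirit of push-sum: both runs are driven by the same random seed so they experience identical loss patterns, modeling the fact that in practice both quantities travel in the same message (the loss model and inputs are illustrative):

```python
# Run the same lossy, mass-conserving iteration twice: once on the real
# inputs and once on the all-ones input, then output the per-node ratio.
import random

def lossy_iterations(values, steps=200, loss=0.3, seed=1):
    random.seed(seed)                              # same loss pattern per run
    n = len(values)
    x = list(values)
    for _ in range(steps):
        nxt = [xi / 2 for xi in x]                 # keep half of own mass
        for i in range(n):
            share = x[i] / (2 * (n - 1))
            for j in range(n):
                if j == i:
                    continue
                if random.random() < loss:
                    nxt[i] += share                # lost: keep mass yourself
                else:
                    nxt[j] += share
        x = nxt
    return x

inputs = [9.0, 3.0, 0.0]
num = lossy_iterations(inputs)                     # first run: real inputs
den = lossy_iterations([1.0] * len(inputs))        # second run: all ones
print([a / b for a, b in zip(num, den)])           # each ratio ~ 4.0
```
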
[Plot: the numerator, the denominator, and their ratio over time; the ratio converges.]

Byzantine Faults  Faulty nodes may misbehave arbitrarily 175

Iterative Consensus with Byzantine Failures

[Figure: nodes a and b update as a = 3a/4 + c/4 and b = 3b/4 + c/4, but the Byzantine node c reports c = 1 to one node and c = 2 to the other. No consensus!]

Iterative Consensus [Dolev and Lynch, 1983]: tolerates up to f faults in complete graphs.

Iteration at each node  Receive current state from all other nodes  Drop largest f and smallest f received values  New state = average of remaining values & old state 179

Iteration at each node  Receive current state from all other nodes  Drop largest f and smallest f received values  New state = average of remaining values & old state i A D C B f = 1

Iteration at each node  Receive current state from all other nodes  Drop largest f and smallest f received values  New state = average of remaining values & old state i A D C B /3 f = 1

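A sketch of one such update with f = 1; the neighbor values are illustrative and match the worked example later in the deck:

```python
# One iteration at node i: sort the received values, drop the f smallest
# and f largest, then average the survivors together with the old state.

def trimmed_update(own, received, f):
    vals = sorted(received)
    kept = vals[f:len(vals) - f]          # drop f smallest and f largest
    pool = kept + [own]                   # remaining values + old state
    return sum(pool) / len(pool)

# Node i holds 4; neighbors report 3, 5, 0, 1 (one of them faulty).
print(trimmed_update(4, [3, 5, 0, 1], f=1))   # (1 + 3 + 4) / 3 ≈ 2.67
```
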
Recall that, without faults, consensus is possible if a single node can influence all others. How does this generalize to Byzantine faults?

Necessary Condition. Partition the nodes into four sets L, C, R, F, where F is a potential fault set and L and R are non-empty. [Figure sequence.] Then there must exist either a node i ∈ L with at least f+1 incoming links from R ∪ C, or a node j ∈ R with at least f+1 incoming links from L ∪ C: node i or node j must exist.

Intuition: proof by contradiction. [Figure: an example with f = 2 and nodes L1, R1, R2, C1, F1, F2.] From L1's perspective, the execution in which F1 and F2 are faulty is indistinguishable from the one in which R1 and R2 are faulty. If F1, F2 are faulty, L1 must choose a value ≥ 0; if R1, R2 are faulty, L1 must choose a value ≤ 0. These conflicting requirements lead to the contradiction.

Equivalent Necessary Condition. Define a reduced graph: remove any potential fault set F, and then remove up to f additional links from each remaining node. In every reduced graph, some node must influence all others.

Sufficiency: the previous algorithm works. At each node: receive the current state from all neighbors; drop the largest f and the smallest f received values; new state = average of the remaining values and the old state.

Sufficiency Proof  State of the fault-free nodes converges to a value in the convex hull of their inputs 195

Proof Outline  V[t] = state of fault-free nodes after t-th iteration 196

Proof Outline  V[t] = state of fault-free nodes after t-th iteration  Represent t-th iteration as V[t] = M[t] V[t-1] 197

Proof Outline  V[t] = state of fault-free nodes after t-th iteration  Represent t-th iteration as V[t] = M[t] V[t-1] where M[t] depends on  V[t-1]  Behavior of faulty nodes but has nice properties 198

Proof Outline  V[t] = state of fault-free nodes after t-th iteration  Represent t-th iteration as V[t] = M[t] V[t-1] i-th row of M[t] corresponds to update of node i’s state 199

Node i. [Figure: node i with state 4 and f = 1; its neighbors A, D, C, B hold values 3, 1, 0, 5, with D faulty.] The values used in the state update lie in the convex hull of the states of the good neighbors: node i drops the extreme values 0 and 5 and computes (1/3)·1 + (1/3)·3 + (1/3)·4, and the surviving faulty value can be written as a convex combination of good values, 1 = (4/5)·0 + (1/5)·5. Thus V[t] = M[t] V[t−1] holds with the contribution of the faulty node D replaced by an equivalent contribution from good nodes. If care is taken in this replacement, M[t] has the following properties: it is row stochastic, and its non-zero entries correspond to links in a reduced graph.

Reduced graph  Remove any potential fault set F  Remove additional ≤ f links from each remaining node In each reduced graph, some node must influence all others

Iterative Byzantine Consensus. The results extend naturally to asynchrony, time-varying topologies, and arbitrary fault sets.

Summary. Consensus with faulty nodes and with lossy links.