A Fusion-based Approach for Tolerating Faults in Finite State Machines


A Fusion-based Approach for Tolerating Faults in Finite State Machines. Vinit Ogale, Bharath Balasubramanian (Parallel and Distributed Systems Lab, Electrical and Computer Engineering Dept., University of Texas at Austin) and Vijay K. Garg (IBM India Research Lab).

Outline. Motivation; Related Work; Questions and Issues Addressed; Model; Partition Lattice; Fault Graphs; Fault Tolerance in FSMs and (f,m)-fusion; Algorithms: Generating Backups and Recovery; Implementation Results; Conclusion and Future Work.

Motivation. Many real applications are modeled as FSMs: embedded systems (traffic controllers, home appliances) and sensor networks, e.g. hundreds of sensors (temperature, pressure, etc.) that need to be backed up. (Speaker note: In distributed systems it is often necessary to maintain the execution state of servers in case of faults; we provide a space-efficient solution to this. A program consists of two components; earlier work fused data structures, and this is the natural progression. We mainly target settings where space is at a premium.)

Problem. Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults). [Figure: machine A, a counter counting 'a's with states a0, a1, a2, and machine B, a counter counting 'b's with states b0, b1, b2.]

Existing Solution: Replication. n·f extra FSMs to tolerate f crash faults; 2·n·f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs). [Figure: a 1-crash-fault-tolerant setup, with the 'a' counter (states a0, a1, a2) and the 'b' counter (states b0, b1, b2) each duplicated.]

Related Work. Traditional approach, redundancy: n·k backup machines to tolerate k faults in n machines. Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08): an exponential algorithm for generating machines that tolerate crash faults; the number of faults equals the number of machines. Fusible Data Structures (Garg, Ogale 06): fuse common data structures such as linked lists, hash tables, etc.; the fused structure is smaller than the sum of the original structures. Erasure coding for fault tolerance in data: fusions are erasure codes.

Reachable Cross Product. [Figure: the 'a' counter A, the 'b' counter B (states b0, b1, b2), and R(A, B), the reachable cross product of {A, B}, with joint states such as <a1, b0>, <a1, b1>, <a1, b2>, <a2, b0>, <a2, b1>, <a2, b2>.]

Can We Do Better? [Figure: on input "a a b", the 'a' counter (mod 3) with states a0, a1, a2, the 'b' counter (mod 3) with states b0, b1, b2, and a fused machine F1 computing (a + b) modulo 3.]

Can We Do Better? [Figure: a 2-crash-fault-tolerant setup: the 'a' counter (mod 3), the 'b' counter (mod 3), F1 computing (a + b) modulo 3, and F2 computing (a - b) modulo 3.]
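To make the fused counters concrete, here is a small Python sketch (the function names and the test input are ours, not from the slides): it runs A, B, F1 = (a + b) mod 3 and F2 = (a - b) mod 3 on the same event sequence, crashes any two of the four machines, and rebuilds both original counters from the two survivors.

    from itertools import combinations

    def run_counters(events):
        """Run the four machines of the slide on the same event sequence."""
        a = sum(1 for e in events if e == 'a') % 3
        b = sum(1 for e in events if e == 'b') % 3
        return {'A': a, 'B': b, 'F1': (a + b) % 3, 'F2': (a - b) % 3}

    def recover(surviving):
        """Recover (a mod 3, b mod 3) from the states of any two surviving machines."""
        s = dict(surviving)
        if 'A' in s and 'B' in s:
            return s['A'], s['B']
        if 'A' in s:
            a = s['A']
            b = (s['F1'] - a) % 3 if 'F1' in s else (a - s['F2']) % 3
            return a, b
        if 'B' in s:
            b = s['B']
            a = (s['F1'] - b) % 3 if 'F1' in s else (s['F2'] + b) % 3
            return a, b
        # only F1 and F2 survive; 2 is the inverse of 2 modulo 3
        a = (2 * (s['F1'] + s['F2'])) % 3
        b = (2 * (s['F1'] - s['F2'])) % 3
        return a, b

    if __name__ == '__main__':
        state = run_counters("aababba")       # 4 'a's and 3 'b's: a mod 3 = 1, b mod 3 = 0
        for alive in combinations(state, 2):  # every possible pair of survivors (2 crashes)
            kept = {m: state[m] for m in alive}
            assert recover(kept) == (state['A'], state['B'])
        print("recovered (a mod 3, b mod 3) =", (state['A'], state['B']))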

Questions and Issues Addressed. Can we do better than the cross product? How many faults can be tolerated? What is the minimum number of machines required to tolerate f crash faults? Can these machines tolerate Byzantine faults? (For example, on the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault.) Main aims: develop the theory to understand and define this problem, and efficient algorithms based on it to generate backup machines.

Application Scenario: Sensor Network. 1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.); the sensors will be collected later and their data analyzed offline. 10 sensors are expected to crash. Replication requires 1000 x 10 backup sensors to ensure fault-tolerant operation. Can we use just 10 extra sensors instead of 10,000?

Model. FSMs (machines) execute independently (in parallel); the inputs to an FSM are not determined by any other FSM; the FSMs act concurrently on the same set of events. Fail-stop (crash) faults: loss of the current state, with the underlying FSM intact. Byzantine faults: machines can 'lie' about their current state.

Join of Two FSMs. Join (⊔): the reachable cross product; 4 states in this case instead of 9.
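A rough sketch of the reachable cross product (the helper names are ours): both machines consume the same event stream, so only joint states reachable from the joint initial state are kept. For the two independent mod-3 counters every pair is reachable, but for machines whose behaviours are correlated the reachable product can be much smaller than the full product, which is the point of this slide.

    from collections import deque

    def reachable_cross_product(delta_a, start_a, delta_b, start_b, events):
        """delta_a, delta_b: dicts mapping (state, event) to the next state."""
        start = (start_a, start_b)
        states, trans = {start}, {}
        queue = deque([start])
        while queue:
            sa, sb = queue.popleft()
            for e in events:
                nxt = (delta_a[(sa, e)], delta_b[(sb, e)])
                trans[((sa, sb), e)] = nxt
                if nxt not in states:
                    states.add(nxt)
                    queue.append(nxt)
        return states, trans

    if __name__ == '__main__':
        E = {'a', 'b'}
        # the mod-3 'a' counter and mod-3 'b' counter from the earlier slides
        A = {(s, e): (s + 1) % 3 if e == 'a' else s for s in range(3) for e in E}
        B = {(s, e): (s + 1) % 3 if e == 'b' else s for s in range(3) for e in E}
        states, _ = reachable_cross_product(A, 0, B, 0, E)
        print(len(states), "reachable joint states")  # 9 for these two independent counters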

Less Than or Equal To Relation (≤). Given FSMs A and B, A ≤ B iff A ⊔ B = B; i.e., given the state of B, we can determine the current state of A.
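One way to picture this relation: a machine formed by partitioning T is just a map from each state of T to the block containing it, and A ≤ B holds exactly when B's partition refines A's, i.e. knowing B's block pins down A's block. A minimal sketch (the representation and the example partitions are ours):

    def refines(block_of_B, block_of_A):
        """True iff every block of B lies inside a single block of A, i.e. A <= B."""
        image = {}
        for t, b_block in block_of_B.items():
            a_block = block_of_A[t]
            if image.setdefault(b_block, a_block) != a_block:
                return False
        return True

    if __name__ == '__main__':
        # T has states t0..t3; A keeps {t0,t3} together, top keeps every state apart
        A   = {'t0': 0, 't1': 1, 't2': 2, 't3': 0}
        top = {'t0': 0, 't1': 1, 't2': 2, 't3': 3}
        print(refines(top, A))  # True: A <= top, the state of top determines the state of A
        print(refines(A, top))  # False: top is not determined by A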

Partitions. Given any FSM, we can partition its states into blocks such that the transitions of all states in a block are consistent. E.g., suppose states t0 and t3 have to be combined into one block. [Figure: a machine with states t0, t1, t2, t3 and transitions on Input 0 and Input 1.]

Largest Consistent Partition Containing {t0,t3}
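A sketch of the closure behind this slide (the transition table below is our own toy machine, not the one in the figure): merge the requested pair of states and keep merging successors until every block is transition-consistent; the result is the largest, i.e. finest, consistent partition that keeps the pair together.

    def largest_consistent_partition(states, events, delta, seed_pair):
        """Finest transition-consistent partition in which the seed pair shares a block."""
        parent = {s: s for s in states}

        def find(s):
            while parent[s] != s:
                parent[s] = parent[parent[s]]
                s = parent[s]
            return s

        def union(x, y):
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[rx] = ry
                return True
            return False

        union(*seed_pair)
        changed = True
        while changed:  # states in one block must have, per event, successors in one block
            changed = False
            for x in states:
                for y in states:
                    if find(x) == find(y):
                        for e in events:
                            if union(delta[(x, e)], delta[(y, e)]):
                                changed = True
        blocks = {}
        for s in states:
            blocks.setdefault(find(s), set()).add(s)
        return list(blocks.values())

    if __name__ == '__main__':
        S, E = ['t0', 't1', 't2', 't3'], [0, 1]
        delta = {('t0', 0): 't1', ('t0', 1): 't2', ('t1', 0): 't0', ('t1', 1): 't3',
                 ('t2', 0): 't1', ('t2', 1): 't3', ('t3', 0): 't1', ('t3', 1): 't2'}
        # for this toy table the answer happens to be the blocks {t0,t3}, {t1}, {t2}
        print(largest_consistent_partition(S, E, delta, ('t0', 't3')))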

Largest Consistent Partition Containing {t0,t1}: the blocks are {t0, t1, t2} and {t3}.

Partition Lattice. The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66]. That is, for any two FSMs A and B formed by partitioning T, there exist unique FSMs C ≤ T such that: C = A ⊔ B (join): A ≤ C and B ≤ C, and C is the smallest such element; C = A ⊓ B (meet): C ≤ A and C ≤ B, and C is the largest such FSM.
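For partitions of the same machine T the join has a simple concrete form, sketched below (the representation and example partitions are ours): label every state by the pair of blocks it occupies in A and in B. The result is the coarsest common refinement of the two partitions, which matches the reachable-cross-product view of the join from the earlier slide, and A ≤ B holds exactly when A ⊔ B = B.

    def join(block_of_A, block_of_B):
        """A ⊔ B: one block per distinct pair (block in A, block in B)."""
        pairs, out = {}, {}
        for t in block_of_A:
            key = (block_of_A[t], block_of_B[t])
            out[t] = pairs.setdefault(key, len(pairs))
        return out

    if __name__ == '__main__':
        A = {'t0': 0, 't1': 1, 't2': 2, 't3': 0}  # blocks {t0,t3} {t1} {t2}
        B = {'t0': 0, 't1': 1, 't2': 0, 't3': 0}  # blocks {t0,t2,t3} {t1}
        print(join(A, B))  # same blocks as A, so A ⊔ B = A and hence B ≤ A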

[Figure: part of the partition lattice of T. At the top, ⊤ = {t0} {t1} {t2} {t3}; below it, machines F1 (A) = {t0,t3} {t1} {t2}, F2 (B), F3 and F4; below them, coarser machines S1, S2, S3, S4; at the bottom, {t0,t1,t2,t3}. Speaker note: add that the original machine can also be found in the lattice.]

Top Element (⊤). Given a set of FSMs A = {A1, …, An}, ⊤ = A1 ⊔ A2 ⊔ … ⊔ An. All FSMs we consider henceforth are less than or equal to ⊤. Intuitively, ⊤ has information about the state of every machine in the original set A; it is akin to full replication.

Bottom Element of Lattice (⊥). The single-state FSM: it has one partition block containing all the states, it transitions to itself on any input, and it conveys no information about the current state of any machine.

[Figure: the full partition lattice of T, from ⊤ = {t0} {t1} {t2} {t3} through F1, F2, F3, F4 and S1, S2, S3, S4 (partitions such as {t0,t3} {t1} {t2}, {t0} {t1,t2} {t3}, {t0,t3} {t1,t2}, {t0,t1,t2} {t3}, {t0,t2,t3} {t1}, {t0} {t1,t2,t3}) down to ⊥ = {t0,t1,t2,t3}.]

Tolerating Faults. [Figure: machines F1 and F2 alongside T, the reachable cross product with states t0, t1, t2, t3; one of the machines is crossed out to indicate a fault.]

Fault Graph: a fault tolerance indicator. [Figure: the fault graph G(A, T) for the original machines A = {F1, F2}, where T is the reachable cross product with states t0, t1, t2, t3; the edges between states carry weights 1 and 2.]

[Figure: A = {FSMs in the yellow region}; the partition lattice of T alongside the fault graph on t0, t1, t2, t3 with edge weights 1 and 2.]

Hamming Distance. The Hamming distance d(ti, tj) is the weight of the edge separating the states ti and tj in the fault graph, e.g. d(t0, t1) = 2. The minimum Hamming distance dmin(T, A) is the weight of the weakest edge in the fault graph, e.g. dmin(T, A) = 1. [Figure: fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; dmin(T, A) = 1.]
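The fault graph is easy to compute once each machine is viewed as a map from the states of T to its blocks. The sketch below assumes, consistently with the numbers on this slide, that the weight of edge (ti, tj) is the number of machines that place ti and tj in different blocks, and that F1 = {t0,t3} {t1} {t2} and F2 = {t2,t3} {t0} {t1} as read off the figures; with these assumptions it reproduces d(t0, t1) = 2 and dmin(T, A) = 1.

    from itertools import combinations

    def fault_graph_dmin(machines, states):
        """machines: list of dicts mapping each state of T to a block id."""
        d = {}
        for ti, tj in combinations(states, 2):
            d[(ti, tj)] = sum(1 for m in machines if m[ti] != m[tj])
        return d, min(d.values())

    if __name__ == '__main__':
        S = ['t0', 't1', 't2', 't3']
        F1 = {'t0': 0, 't1': 1, 't2': 2, 't3': 0}  # blocks {t0,t3} {t1} {t2}
        F2 = {'t0': 0, 't1': 1, 't2': 2, 't3': 2}  # blocks {t2,t3} {t0} {t1}
        weights, dmin = fault_graph_dmin([F1, F2], S)
        print(weights)         # d(t0, t3) = d(t2, t3) = 1, all other edges have weight 2
        print("dmin =", dmin)  # 1, so {F1, F2} on their own cannot tolerate even one crash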

Fault Tolerance in FSMs (crash faults). Theorem 1: a set of machines A can tolerate up to f crash faults iff dmin(T(A), A) > f. E.g., for A = {A, B, M1, M2}, dmin(T(A), A) = 3, so these machines can tolerate 2 crash faults. [Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(A), A) = 3.]

Fault Tolerance in FSMs (Byzantine faults). Theorem 2: a set of machines A can tolerate up to f Byzantine faults iff dmin(T(A), A) > 2f. E.g., A = {A, B, M1, M2}; let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3}. B and M1 are lying about their state (f = 2). Since dmin(T(A), A) = 3 is not greater than 2f = 4, we cannot determine the state of T. [Figure: fault graph with dmin(T(A), A) = 3.]

Fault Tolerance in FSMs (Byzantine faults). Let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3}. Only B is lying about its state (f = 1). Since dmin(T(A), A) = 3 > 2f = 2, we can determine the state of T as t3. Henceforth, dmin(T(A), A) is abbreviated as dmin(A). [Figure: fault graph with dmin(T(A), A) = 3.]

Fault Tolerance and (f,m)-fusion. Given a set of n machines A, a set of m machines F is an (f,m)-fusion of A if dmin(A ∪ F) > f. The machines in A ∪ F can then tolerate f crash faults or ⌊f/2⌋ Byzantine faults. E.g., for A = {A, B} and F = {M1, M2}, dmin(A ∪ F) = 3, so F = {M1, M2} is a (2,2)-fusion of A.

Minimal Fusion. Given a set of machines A, an (f, m)-fusion F is minimal if there is no other (f, m)-fusion F′ such that every machine of F is matched by a machine of F′ that is less than or equal to it, with at least one of these matches strictly less.

[Figure: the partition lattice with A = {FSMs in the yellow region}; one element of the lattice is marked as a (1,1) fusion and a lower element as a minimal (1,1) fusion; the bottom element is {t0,t1,t2,t3}.]

Minimal Fusion: Example. [Figure: the fault graph G(A, T) on the states t0, t1, t2, t3 with edge weights 2 and 3, alongside candidate fusion machines such as {t0,t3} {t1} {t2}, F2 = {t2,t3} {t0} {t1}, and S4 = {t0,t1,t2} {t3}.]

Algorithm: Generating Backups. Aim: add the least possible number of machines needed to tolerate f faults. Input: the set of machines A and the number of faults f. Output: a minimal fusion set of the least size. If |T| = N and the event set has size |E|, the time complexity of the algorithm is O(N^3 · |E| · f).

Algorithm overview (f: number of faults, A: the given set of machines, F: the fusion set being built):
    F := ∅
    while dmin(A ∪ F) ≤ f:
        M := ⊤
        while M ≠ ⊥:
            compute the lower cover of M, i.e. LC(M)
            if there is a machine F′ ∈ LC(M) with dmin({F′} ∪ A ∪ F) > dmin(A ∪ F):
                M := F′
            else:
                F := F ∪ {M}; exit the inner loop
    return F
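The paper's algorithm walks the partition lattice through lower covers; the Python sketch below is only a brute-force stand-in (the helper names and the toy machine are ours) that is feasible for very small machines. It enumerates every consistent partition of T and greedily adds, coarsest first, a machine that raises dmin, stopping once dmin(A ∪ F) > f.

    from itertools import combinations

    def set_partitions(items):
        """Yield every partition of a list of states as a list of blocks."""
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in set_partitions(rest):
            for i in range(len(smaller)):
                yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
            yield [[first]] + smaller

    def consistent(blocks, delta, events):
        """States sharing a block must have, for every event, successors sharing a block."""
        block_of = {s: i for i, blk in enumerate(blocks) for s in blk}
        return all(block_of[delta[(x, e)]] == block_of[delta[(y, e)]]
                   for blk in blocks for x in blk for y in blk for e in events)

    def dmin(machines, states):
        """Weakest edge of the fault graph; machines are maps state -> block id."""
        return min(sum(1 for m in machines if m[ti] != m[tj])
                   for ti, tj in combinations(states, 2))

    def generate_backups(states, events, delta, originals, f):
        candidates = [{s: i for i, blk in enumerate(blocks) for s in blk}
                      for blocks in set_partitions(list(states))
                      if consistent(blocks, delta, events)]
        candidates.sort(key=lambda m: len(set(m.values())))  # coarsest machines first
        fusion = []
        while dmin(originals + fusion, states) <= f:
            current = dmin(originals + fusion, states)
            for cand in candidates:
                if dmin(originals + fusion + [cand], states) > current:
                    fusion.append(cand)
                    break
            else:
                raise RuntimeError("unreachable: the all-singletons partition always raises dmin")
        return fusion

    if __name__ == '__main__':
        S, E = ['t0', 't1', 't2', 't3'], [0, 1]
        delta = {('t0', 0): 't1', ('t0', 1): 't2', ('t1', 0): 't0', ('t1', 1): 't3',
                 ('t2', 0): 't1', ('t2', 1): 't3', ('t3', 0): 't1', ('t3', 1): 't2'}
        A = [{'t0': 0, 't1': 1, 't2': 2, 't3': 0},   # blocks {t0,t3} {t1} {t2}
             {'t0': 0, 't1': 1, 't2': 2, 't3': 2}]   # blocks {t2,t3} {t0} {t1}
        F = generate_backups(S, E, delta, A, f=1)
        print(len(F), "backup machine(s) added; dmin is now", dmin(A + F, S))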

[Figure sequence: the algorithm walks down the lattice with A = {FSMs in the yellow region}; at each candidate machine the fault graph on t0, t1, t2, t3 is recomputed and its weakest edge weight w is shown (w = 1, 2, 2, 1, 2 across the frames).]

Algorithm: Recovery. Aim: recover the state of the faulty machines after f crash faults or ⌊f/2⌋ Byzantine faults, given the states of the remaining machines. Input: the current states of all available machines in A ∪ F. Output: the correct state of T. The time complexity of the algorithm is O((n + m) · f).

Algorithm overview:
    S: the set of current states of the machines in A ∪ F
    count: a vector of size |T|, initialized to 0
    for all s in S do
        for all ti in s do
            ++count[i]
    return tc such that 1 ≤ c ≤ N and count[c] is the maximal element of count
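A direct Python rendering of this vote (the variable names are ours): each available machine contributes one vote to every state of T consistent with the block it reports, and the state with the most votes is returned. The demo below reuses the Byzantine example from the next slide.

    def recover_state(reported_state_sets, all_states):
        """Majority vote over the T-states that each surviving machine is consistent with."""
        count = {t: 0 for t in all_states}
        for candidate_set in reported_state_sets:
            for t in candidate_set:
                count[t] += 1
        return max(count, key=count.get), count

    if __name__ == '__main__':
        # A, B, M1, M2 report these state sets; M1 is lying about its state
        reports = [{'t0', 't3'}, {'t0'}, {'t1', 't2', 't3'}, {'t0'}]
        winner, votes = recover_state(reports, ['t0', 't1', 't2', 't3'])
        print(winner, votes)  # t0 wins with 3 votes, beating t3 with 2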

Algorithm: Example. Consider machines A, B, M1, M2 with dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault. Let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0}. M1 is lying about its state. The recovery algorithm returns t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2.

Experimental Results

Original Machines                                                    | f (faults) | State space, replication | State space, fusion
MESI, Counter A and B, Shift register                                | 2          | 7,569                    | 1,521
Even and Odd Parity Checkers, Toggle Switch, Pattern Generator, MESI | 3          | 262,144                  | 32,768
Counters A and B, Divider, Machine A, Machine B                      |            | 6,724                    | 504
Pattern Generator, TCP, Machine A, Machine B                         |            | 3,136                    | 2,464

Conclusion / Future Work. It is not always necessary to have n·f backups to tolerate f faults. We give a polynomial-time algorithm to generate the smallest minimal fusion set that tolerates f faults. An implementation of this algorithm shows that many complex state machines have efficient fusions. Open questions: will machines outside the lattice give better results? The backup machines need to be given all events; can we do better?