ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja, Department of Electrical and Computer Engineering
System Diagnosis



Overview
- Introduction
- System Model
- Diagnosis Problem: PMC model
- Other Models and Comments
- Sequential Diagnosability
- Other Formulations, Algorithms, and Problems
- Summary

Introduction
Reference: [prad:96] Chapter 8; original paper in IEEE TC (Dec 1967)
- Diagnosis: an important part of recovery, maintenance, and reconfiguration
- What is system-level diagnosis: diagnosing failed components in a large, possibly multiprocessor, system
- Underlying needs: failures are inevitable, and units are intelligent enough to test other units, hence the need for a different model and a corresponding theory

System Model
Model and assumptions
- Graph model
  - Processors/processes are expressed as nodes
  - Interconnects are expressed as links between nodes
- Each processor is sufficiently powerful to test other processors comprehensively
- An example model with four nodes
- Test model: if node V_i tests V_j, draw a directed link from V_i to V_j

Diagnosis - PMC model (contd.)
Example: test model
[Figure: test graph on the four nodes v1, v2, v3, v4]

Diagnosis - PMC model (contd.)
Assumptions
- System with n units
- Tests are comprehensive
- Test results are binary: good (0) / faulty (1)
- Faulty units cannot be trusted for their test outcomes (denoted x, meaning the outcome can be 0 or 1)
- The total number of faulty units in the system is upper-bounded by t
- Example: system with four nodes and one fault

Diagnosis - PMC model (contd.)
Example: test outcomes, assuming V_2 is faulty
[Figure: the four-node test graph (v1 to v4) with test outcomes on the links; tests performed by V_2 have outcome x]

Diagnosis - PMC model (contd.)
One-step diagnosis
- Analysis problem: given a system with n units, all the interconnects, and the test outcomes, identify the faulty units subject to the constraint that no more than t units in the system are faulty.
- Design problem: design a system using the fewest possible test links such that all the faulty units can be correctly identified in one step, knowing the outcomes of the tests.
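The analysis problem can be sketched as a brute-force search over candidate fault sets. This is only an illustrative sketch, not the method of the lecture or of [prad:96]: the function names `consistent` and `diagnose` are hypothetical, the test links are the six links a_12, a_23, a_24, a_31, a_41, a_43 of the four-node example, and the observed outcomes are an assumed instance in which V_2 is faulty and happens to report 0 on both of its tests.

```python
from itertools import combinations

# Test links of the four-node example: (i, j) means V_i tests V_j.
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
UNITS = [1, 2, 3, 4]

def consistent(fault_set, outcomes):
    """True if the observed outcomes could arise when exactly fault_set is faulty."""
    for (tester, tested), result in outcomes.items():
        if tester in fault_set:
            continue  # a faulty tester may report 0 or 1 (x)
        if result != (1 if tested in fault_set else 0):
            return False
    return True

def diagnose(units, outcomes, t):
    """Return every fault set with at most t units that explains the outcomes."""
    candidates = []
    for size in range(t + 1):
        for fault_set in combinations(units, size):
            if consistent(set(fault_set), outcomes):
                candidates.append(set(fault_set))
    return candidates

# Assumed observation: V_1 reports V_2 faulty, every other test reports 0.
observed = dict(zip(TESTS, [1, 0, 0, 0, 0, 0]))
print(diagnose(UNITS, observed, t=1))
print(diagnose(UNITS, observed, t=2))
```

With the bound t = 1 the search returns a single candidate, while t = 2 leaves more than one candidate, which is the ambiguity that one-step diagnosability is meant to rule out.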

Diagnosis - PMC model (contd.)
One-step diagnosis - Example
Consider all possible outcomes (a_ij is the outcome of V_i testing V_j):

fault        a_12  a_23  a_24  a_31  a_41  a_43
none          0     0     0     0     0     0
V_1 faulty    x     0     0     1     1     0
V_2 faulty    1     x     x     0     0     0
V_3 faulty    0     1     0     x     0     1
V_4 faulty    0     0     1     0     x     x

Each row is called the syndrome of the fault.
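The syndrome rows follow mechanically from the PMC assumptions, which the short sketch below illustrates. The helper name `syndrome` is hypothetical; the test links are the six links named in the table header, and the rule applied is the one stated in the assumptions: a faulty tester yields x, and a fault-free tester yields 1 exactly when the tested unit is faulty.

```python
# Test links in table order: (i, j) means V_i tests V_j, giving outcome a_ij.
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]

def syndrome(fault_set, tests=TESTS):
    """Return the PMC syndrome of fault_set as a string of 0, 1, and x."""
    row = []
    for tester, tested in tests:
        if tester in fault_set:
            row.append("x")  # outcome of a faulty tester is unreliable
        else:
            row.append("1" if tested in fault_set else "0")
    return " ".join(row)

print("fault  a12 a23 a24 a31 a41 a43")
for label, faults in [("none", set()), ("V_1", {1}), ("V_2", {2}),
                      ("V_3", {3}), ("V_4", {4})]:
    print(f"{label:6} {syndrome(faults)}")
```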

Diagnosis - PMC model (contd.)
Observations
1. Two possible syndromes are associated with the fault V_1 (in the order a_12 a_23 a_24 a_31 a_41 a_43): 0 0 0 1 1 0 and 1 0 0 1 1 0
2. No two faults have overlapping syndromes
Hence we can correctly identify (diagnose) the faulty unit.

Diagnosis - PMC model (contd.)
Consider two faulty units, say V_1 and V_2.
Possible syndrome (a_12 a_23 a_24 a_31 a_41 a_43): x x x 1 1 0. This overlaps the syndrome of V_1 alone, so some possible outcome is consistent with both fault sets.
Therefore we cannot determine whether V_1 alone or both V_1 and V_2 are faulty. Thus two faults in this system cannot be diagnosed in one step.

Diagnosis - PMC model (contd.)
Result: A system is one-step t-fault diagnosable provided the syndromes for all faults (0-fault, 1-fault, 2-faults, ..., t-faults) are distinct (non-overlapping / non-intersecting).
More results follow, but first one more assumption: no two units test each other.
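The distinct-syndrome condition can be checked directly on small systems. The following is a minimal brute-force sketch under the PMC assumptions, not an efficient algorithm; the function names are hypothetical, and the test links are those of the four-node example. Two syndromes are treated as overlapping when, at every test link, their entries are equal or at least one of them is x.

```python
from itertools import combinations

TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
UNITS = [1, 2, 3, 4]

def syndrome(fault_set, tests):
    """PMC syndrome as a tuple of '0', '1', or 'x', one entry per test link."""
    return tuple(
        "x" if tester in fault_set else ("1" if tested in fault_set else "0")
        for tester, tested in tests
    )

def overlap(s1, s2):
    """True if some concrete outcome vector matches both syndromes."""
    return all(a == b or "x" in (a, b) for a, b in zip(s1, s2))

def is_one_step_diagnosable(units, tests, t):
    """Check that the syndromes of all fault sets of size <= t are pairwise distinct."""
    fault_sets = [set(c) for k in range(t + 1) for c in combinations(units, k)]
    syndromes = [syndrome(f, tests) for f in fault_sets]
    return not any(overlap(a, b) for a, b in combinations(syndromes, 2))

print(is_one_step_diagnosable(UNITS, TESTS, 1))
print(is_one_step_diagnosable(UNITS, TESTS, 2))
```

The number of fault sets grows combinatorially with n and t, so this check is only practical for small examples.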

Diagnosis - PMC model (contd.)
Result 1: For a system to be one-step t-fault diagnosable, n ≥ 2t + 1.
Result 2: For a system to be one-step t-fault diagnosable, each unit must be tested by at least t other units.
Theorem: A system of n units (n ≥ 2t + 1) in which no two units test each other is one-step t-fault diagnosable if and only if each unit is tested by at least t other units.
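When no two units test each other, the two results and the theorem reduce the check to counting testers. The sketch below illustrates this; `max_one_step_t` is a hypothetical helper name, and the largest diagnosable t is taken as the smaller of the minimum number of testers per unit and the bound implied by n ≥ 2t + 1.

```python
def max_one_step_t(units, tests):
    """Largest t for which the theorem's conditions hold (no mutual tests assumed)."""
    test_set = set(tests)
    if any((tested, tester) in test_set for tester, tested in test_set):
        raise ValueError("the theorem assumes that no two units test each other")
    testers = {u: 0 for u in units}
    for _, tested in test_set:
        testers[tested] += 1
    # n >= 2t + 1 gives t <= (n - 1) // 2; each unit needs at least t testers.
    return min(min(testers.values()), (len(units) - 1) // 2)

# Four-node example: links a_12, a_23, a_24, a_31, a_41, a_43.
example = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
print(max_one_step_t([1, 2, 3, 4], example))
```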

Diagnosis - PMC model (contd.)
Design Problem: one-step t-fault diagnosable system
Example: n = 7, t = 3

Diagnosis - PMC model (contd.)
Design Problem: algorithm for a simple one-step t-fault diagnosable system with n ≥ 2t + 1
1. Number the nodes from 0 to n-1.
2. Draw a link from node i to nodes i+1 (mod n), i+2 (mod n), ..., i+t (mod n).
3. The system so designed is one-step t-fault diagnosable.
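The construction steps can be sketched in a few lines. The function name `design_one_step` is hypothetical; the example uses n = 7 and t = 3 as in the design example, and the checks at the end confirm that every node has exactly t testers and that no two nodes test each other.

```python
def design_one_step(n, t):
    """Return test links (i, j): node i tests nodes i+1, ..., i+t (mod n)."""
    if n < 2 * t + 1:
        raise ValueError("the construction requires n >= 2t + 1")
    return [(i, (i + k) % n) for i in range(n) for k in range(1, t + 1)]

links = design_one_step(7, 3)
# Collect, for each node, the nodes that test it.
testers = {j: sorted(i for i, dst in links if dst == j) for j in range(7)}
# A mutual test would be a pair (i, j) whose reverse (j, i) is also a link.
mutual = [(i, j) for i, j in links if (j, i) in set(links)]

print(len(links))   # total number of test links, n * t
print(testers[0])   # nodes that test node 0
print(mutual)       # empty when no two nodes test each other
```

The design uses n * t links, so the cost of one-step diagnosability in this construction grows linearly with both n and t.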

Diagnosis - PMC model (contd.)
- Systems in which some units test each other
- One-step t-fault diagnosability conditions are somewhat complex; see [prad:96]
- How does one check whether a given system is one-step t-fault diagnosable?
  - Simple if no two units test each other
  - Somewhat complex if units test each other
  - There is a body of literature dealing with diagnosis algorithms

Other Models and Comments
Consider the possible test outcomes when a unit V_i tests unit V_j (G = good, F = faulty):

V_i   V_j
G     G
G     F
F     G
F     F

Each of the four cases can produce outcome 0 or 1, which gives 16 possible outcome assignments, numbered 0 to 15.

Other Models/Comments (contd.)
- 4, 5, 6, 7: PMC model
- 8, 9, 10, 11: PMC with complement encoding
- 0, 15: of little value
- etc.
- Some subsets of PMC are more interesting, for example 5, 7: this implies that a unit being tested is always correctly identified, if faulty, independent of the status of the testing unit. Many such variations have been studied.
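The numbering of the models can be made concrete with a small sketch. It rests on an assumption that is not stated in the transcript: model k is read as a 4-bit binary number whose bits, from most to least significant, give the test outcome for the cases (G, G), (G, F), (F, G), (F, F). That ordering is chosen because it is consistent with 4 to 7 being the PMC model and 5, 7 always flagging a faulty tested unit. The names `outcome_table` and `family` are hypothetical.

```python
# (status of testing unit V_i, status of tested unit V_j)
CASES = [("G", "G"), ("G", "F"), ("F", "G"), ("F", "F")]

def outcome_table(model):
    """Outcome per case for one model number (assumed bit order: GG, GF, FG, FF)."""
    bits = format(model, "04b")
    return {case: int(bit) for case, bit in zip(CASES, bits)}

def family(models):
    """Combine several models: a case is 'x' when the models disagree on it."""
    rows = {}
    for case in CASES:
        values = {outcome_table(m)[case] for m in models}
        rows[case] = values.pop() if len(values) == 1 else "x"
    return rows

print(family([4, 5, 6, 7]))    # PMC: good tester is reliable, faulty tester is x
print(family([8, 9, 10, 11]))  # same structure with 0 and 1 swapped
print(family([5, 7]))          # faulty tested unit always reported as 1
```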

Other Models/Comments (contd.)
- Comparison-based testing and diagnosis
  - A paper is in the IEEE Transactions on Computers, February 2009 issue
  - Basically the model is built on the PMC model

Sequential Diagnosability
Consider the following repair strategy:
- Identify one or more faulty units
- Repair them
- Test the system again, and continue until we know that there are no more faulty units
This is called sequential diagnosis.

Sequential Diagnosability (contd.)
Assumptions (same as before)
- System with n units
- Tests are comprehensive
- Test results are binary: good (0) / faulty (1)
- Faulty units cannot be trusted for their test outcomes (denoted x, meaning the outcome can be 0 or 1)
- The total number of faulty units in the system is upper-bounded by t

Sequential Diagnosability (contd.)
Result 1: For a system to be sequentially t-fault diagnosable, n ≥ 2t + 1.
It is not necessary for every unit to be tested by t units.

Sequential Diagnosability (contd.)
Example: n = 7, t =
[Figure: test graph of the example system]

Sequential Diagnosability (contd.)
- It is easy to show that the example system is sequentially 3-fault diagnosable
- The above construction requires n + 2t - 1 links
- A better solution: a system with n + 2t - 2 links can be designed that is sequentially t-fault diagnosable

Sequential Diagnosability (contd.)
Proof:
- First construct the system: n nodes form a single loop, thus containing n links
- Next choose some 2t-2 units and let these units test unit V_0
- Now show that this system is sequentially t-fault diagnosable using the following three cases. Let n_1 be the number of units that find V_0 faulty, and n_0 the number of units that find V_0 not faulty. Clearly n_1 + n_0 = 2t-1 (the 2t-2 chosen units plus the unit that tests V_0 on the loop).

Sequential Diagnosability (contd.)
Proof (contd.):
- Case 1: n_1 > t, so V_0 is faulty (otherwise more than t testers would be faulty)
- Case 2: n_1 < t, so V_0 is not faulty (otherwise V_0 and at least t testers would be faulty)
- Case 3: n_1 = t, so a fault-free unit exists that is not involved in testing V_0
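The three cases of the proof can be written as a small decision function. This is an illustrative sketch only; `classify_v0` is a hypothetical name, and the return strings are labels chosen here. It assumes the construction above, in which exactly 2t-1 units test V_0 and at most t units are faulty.

```python
def classify_v0(n1, t):
    """Decide the status of V_0 from n1, the number of testers reporting it faulty."""
    testers = 2 * t - 1  # 2t-2 chosen units plus the loop predecessor of V_0
    if not 0 <= n1 <= testers:
        raise ValueError("n1 must lie between 0 and 2t-1")
    if n1 > t:
        return "faulty"        # Case 1: a good V_0 would imply more than t faults
    if n1 < t:
        return "fault-free"    # Case 2: a faulty V_0 would imply more than t faults
    # Case 3: all t faults lie among V_0 and its testers, so any other unit is good.
    return "undetermined; every unit not testing V_0 is fault-free"

for n1 in range(6):
    print(n1, classify_v0(n1, t=3))
```

In Case 3 the diagnosis can continue from a unit known to be fault-free, whose test outcomes along the loop can be trusted.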

Sequential Diagnosability (contd.)
Sequential diagnosis: single loop system
- Example: single loop system with n = 5
- This is sequentially 2-fault diagnosable, which can be demonstrated by constructing syndromes for the different fault conditions. However, a system with n = 9 is NOT sequentially 4-fault diagnosable.
- General result: a single loop system is sequentially t-fault diagnosable if and only if
    n ≥ t + t^2/4 + 2              for even t
    n ≥ t + (t-1)(t+1)/4 + 2       for odd t
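The general result for single loop systems is easy to evaluate numerically. The sketch below assumes the relation in the bound is "greater than or equal to", which agrees with the two examples on this slide (n = 5 with t = 2, and n = 9 with t = 4); the function names are hypothetical.

```python
def min_single_loop_size(t):
    """Smallest n for which a single loop is sequentially t-fault diagnosable."""
    if t % 2 == 0:
        return t + t * t // 4 + 2           # even t: t + t^2/4 + 2
    return t + (t - 1) * (t + 1) // 4 + 2   # odd t: t + (t-1)(t+1)/4 + 2

def loop_is_sequentially_diagnosable(n, t):
    """Apply the bound to a single loop with n units and at most t faults."""
    return n >= min_single_loop_size(t)

print(loop_is_sequentially_diagnosable(5, 2))
print(loop_is_sequentially_diagnosable(9, 4))
print([min_single_loop_size(t) for t in range(1, 7)])
```

Both divisions are exact, since t^2 is a multiple of 4 for even t and (t-1)(t+1) is a multiple of 4 for odd t, so integer division loses nothing.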

Other Formulations, Algorithms, and Problems
- Generalization of sequential diagnosability
  - Diagnose s faulty units at a time, making the system t/s-sequentially diagnosable
- Allow replacing up to t units, not all of which need be faulty. In other words, non-faulty units can be replaced as long as all the faulty units are among the replaced units (t/t fault diagnosability)
  - An example in [prad:96] shows a system with 13 units in which each unit is tested by 3 other units. Clearly such a system is only one-step 3-fault diagnosable, but it is shown to be 5/5 diagnosable.
- Additional formulations also exist

Other Formulations, Algorithms, and Problems
- Diagnosis algorithms: given a syndrome, and knowing that the system is t-diagnosable, determine the set of faulty units
  - Possible solutions
    - Dictionary approach: somewhat impractical for large systems
    - Algorithmic approach: based on graph models and using a solution to the maximum matching problem
  - Central vs. distributed algorithms
- Diagnosis and reconfiguration in homogeneous and heterogeneous multicore systems
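The dictionary approach can be illustrated on the four-node example. This is a minimal sketch under the PMC assumptions, with the hypothetical helper `build_dictionary`: every fault set of size at most t is expanded into all concrete outcome vectors it can produce (each x becomes both 0 and 1), and each vector is mapped back to the fault sets that explain it.

```python
from itertools import combinations, product

TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
UNITS = [1, 2, 3, 4]

def build_dictionary(units, tests, t):
    """Map each concrete outcome vector to the fault sets that can produce it."""
    dictionary = {}
    for size in range(t + 1):
        for faults in combinations(units, size):
            fault_set = frozenset(faults)
            choices = []
            for tester, tested in tests:
                if tester in fault_set:
                    choices.append((0, 1))  # x: a faulty tester may report either
                else:
                    choices.append((1 if tested in fault_set else 0,))
            for outcome in product(*choices):
                dictionary.setdefault(outcome, []).append(fault_set)
    return dictionary

table = build_dictionary(UNITS, TESTS, t=1)
print(len(table))                   # number of stored outcome vectors
print(table[(1, 0, 0, 0, 0, 0)])    # lookup: which fault set explains this outcome
```

Each x doubles the number of entries for a fault set, and the number of fault sets itself grows combinatorially, which is one way to see why the approach becomes impractical for large systems.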

Summary
- System diagnosis model
- One-step t-fault diagnosis
- Sequential diagnosis