Nov. 2007 Malfunction Diagnosis and ToleranceSlide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.

Slides:



Advertisements
Similar presentations
Algorithms (and Datastructures) Lecture 3 MAS 714 part 2 Hartmut Klauck.
Advertisements

Great Theoretical Ideas in Computer Science
5.1 Real Vector Spaces.
Chapter 8 Topics in Graph Theory
Noise, Information Theory, and Entropy (cont.) CS414 – Spring 2007 By Karrie Karahalios, Roger Cheng, Brian Bailey.
Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 – CHAPTER 4 GRAPHS 1.
Graph Isomorphism Algorithms and networks. Graph Isomorphism 2 Today Graph isomorphism: definition Complexity: isomorphism completeness The refinement.
Complexity ©D Moshkovitz 1 Approximation Algorithms Is Close Enough Good Enough?
Prepared by Ilya Kolchinsky.  n generals, communicating through messengers  some of the generals (up to m) might be traitors  all loyal generals should.
Oct. 2007State-Space ModelingSlide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Reliable System Design 2011 by: Amir M. Rahmani
Complexity 15-1 Complexity Andrei Bulatov Hierarchy Theorem.
1 Introduction to Computability Theory Lecture12: Reductions Prof. Amos Israeli.
Feb. 2015Part V – Malfunctions: Architectural AnomaliesSlide 1 Robust Parallel Processing Resilient Algorithms.
Oct State-Space Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov Malfunction Diagnosis and Tolerance Slide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Oct. 2007Defect Avoidance and CircumventionSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
June 2007Malfunction DiagnosisSlide 1 Malfunction Diagnosis A Lecture in CE Freshman Seminar Series: Ten Puzzling Problems in Computer Engineering.
Oct Combinational Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov. 2006Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
Introduction to Graphs
Increasing graph connectivity from 1 to 2 Guy Kortsarz Joint work with Even and Nutov.
Oct Fault Masking Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
May 2012Malfunction DiagnosisSlide 1 Malfunction Diagnosis A Lecture in CE Freshman Seminar Series: Ten Puzzling Problems in Computer Engineering.
Oct Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Nov. 2007Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
Graphs, relations and matrices
Software Testing Sudipto Ghosh CS 406 Fall 99 November 9, 1999.
GRAPH Learning Outcomes Students should be able to:
Database Systems Normal Forms. Decomposition Suppose we have a relation R[U] with a schema U={A 1,…,A n } – A decomposition of U is a set of schemas.
Nov. 20, 2010 A pessimistic one-step diagnosis algorithms for cube-like networks under the PMC model Dr. C. H. Tsai Department of C.S.I.E, National Dong.
All that remains is to connect the edges in the variable-setters to the appropriate clause-checkers in the way that we require. This is done by the convey.
1 ECE 453 – CS 447 – SE 465 Software Testing & Quality Assurance Instructor Kostas Kontogiannis.
Zvi Kohavi and Niraj K. Jha 1 Memory, Definiteness, and Information Losslessness of Finite Automata.
10.4 How to Find a Perfect Matching We have a condition for the existence of a perfect matching in a graph that is necessary and sufficient. Does this.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
1 2. Independence and Bernoulli Trials Independence: Events A and B are independent if It is easy to show that A, B independent implies are all independent.
Copyright © Cengage Learning. All rights reserved. CHAPTER 10 GRAPHS AND TREES.
© by Kenneth H. Rosen, Discrete Mathematics & its Applications, Sixth Edition, Mc Graw-Hill, 2007 Chapter 9 (Part 2): Graphs  Graph Terminology (9.2)
Week 10Complexity of Algorithms1 Hard Computational Problems Some computational problems are hard Despite a numerous attempts we do not know any efficient.
Agenda Fail Stop Processors –Problem Definition –Implementation with reliable stable storage –Implementation without reliable stable storage Failure Detection.
Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering System Diagnosis.
1 Decomposition into bipartite graphs with minimum degree 1. Raphael Yuster.
CSCI 115 Chapter 8 Topics in Graph Theory. CSCI 115 §8.1 Graphs.
Matching Lecture 19: Nov 23.
Vertex Coloring Distributed Algorithms for Multi-Agent Networks
Nov. 2015Part V – Malfunctions: Architectural AnomaliesSlide 1 Robust Parallel Processing Resilient Algorithms.
D EPARTMENT /S EMESTER (ECE – III SEM) NETWORK THEORY SECTION-D Manav Rachna University 1.
Great Theoretical Ideas in Computer Science for Some.
 2004 SDU 1 Lecture5-Strongly Connected Components.
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. Fast.
1 GRAPH Learning Outcomes Students should be able to: Explain basic terminology of a graph Identify Euler and Hamiltonian cycle Represent graphs using.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
1 Lecture 5 (part 2) Graphs II (a) Circuits; (b) Representation Reading: Epp Chp 11.2, 11.3
Theory of Computational Complexity Probability and Computing Chapter Hikaru Inada Iwama and Ito lab M1.
More NP-Complete and NP-hard Problems
Static and Dynamic Fault Diagnosis
Minimum Spanning Tree 8/7/2018 4:26 AM
Introduction to Graphs
Instructor: Shengyu Zhang
High Performance Computing & Bioinformatics Part 2 Dr. Imad Mahgoub
Part V – Malfunctions: Architectural Anomalies
Copyright © Cengage Learning. All rights reserved.
Malfunction Diagnosis
Switching Lemmas and Proof Complexity
Introduction to Graphs
Introduction to Graphs
Presentation transcript:

Nov Malfunction Diagnosis and ToleranceSlide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments

Nov Malfunction Diagnosis and ToleranceSlide 2 About This Presentation EditionReleasedRevised FirstNov. 2006Nov This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Nov Malfunction Diagnosis and ToleranceSlide 3 Malfunction Diagnosis and Tolerance

Nov Malfunction Diagnosis and ToleranceSlide 4

Nov Malfunction Diagnosis and ToleranceSlide 5 Multilevel Model of Dependable Computing ComponentLogicServiceResultInformationSystem IdealDefectiveFaultyErroneousMalfunctioningDegradedFailed Level  Low-Level ImpairedMid-Level ImpairedHigh-Level ImpairedUnimpaired EntryLegend:DeviationRemedyTolerance

Nov Malfunction Diagnosis and ToleranceSlide 6 Self-Diagnosis Layered approach: A small part of a unit is tested, which then forms a trusted core The trusted core is used to test the next layer of subsystems Region of trust is gradually extended, until it covers the entire unit One approach to go/no-go testing based on self-diagnosis Tester supplies a random seed to the built-in test routine The test routine steps through a long computation that exercises nearly all parts of the system, producing a final result The tester compares the final result to the expected result Ideally, if a properly designed self-test routine returns a 32-bit value, the value will match the expected result despite the presence of faults with probability 2 –32  10 –9.6  test coverage = 1 – 10 –9.6 Core Self-diagnosis initiation Unit A Unit B Unit C

Nov Malfunction Diagnosis and ToleranceSlide 7 Malfunction Diagnosis Model Diagnosis of one unit by another The tester sends a self-diagnosis request, expecting a response The unit under test eventually sends some results to the tester The tester interprets the results received and issues a verdict Testing capabilities among units is represented by a directed graph i Tester j Testee Test capability I think j is okay (passed test) 0 i Tester j Testee Test capability I think j is bad (failed test} 1 The verdict of unit i about unit j is denoted by D ij  {0, 1} All the diagnosis verdicts constitute the n  n diagnosis matrix D The diagnosis matrix D is usually quite sparse M0M0 M1M1 M3M3 M2M2

Nov Malfunction Diagnosis and ToleranceSlide 8 More on Terminology and Assumptions Malfunction diagnosis in our terminology corresponds to “system-level fault diagnosis” in the literature The qualification “system-level” implies that the diagnosable units are subsystems with significant computational capabilities (as opposed to gates or other low-level components) We do not use the entries on the main diagonal of the diagnosis matrix D (a unit does not judge itself) and we usually do not let two units test one another A good unit always issues a correct verdict about another unit (i.e., tests have perfect coverage), but the verdict of a bad unit is arbitrary and cannot be trusted This is known as the PMC model (Preparata, Metze, & Chien) We consider the PMC model only, but other models also exist (e.g., in comparison-based models, verdicts are derived from comparing the outputs of unit pairs) -- D D 12 D 13 D D D 32 --

Nov Malfunction Diagnosis and ToleranceSlide 9 Syndromes for these two conditions may be the same Simple Example with Four Units Consider this system, with the test outcomes shown We say that the system above is 1-step 1-diagnosable (we can correctly diagnose up to 1 malfunctioning unit in a single step) M0M0 M1M1 M3M3 M2M2 D 01 D 12 D 30 D 23 D 20 D 13 Diagnosis syndromes MalfnD 01 D 12 D 13 D 20 D 30 D 32 None M 0 0/ M 1 10/10/ M /1 0 1 M /10/1 M 0,M 1 0/10/10/ M 1,M 2 10/10/10/1 0 1

Nov Malfunction Diagnosis and ToleranceSlide 10 1-Step t-Diagnosability: Requirements An n-unit system is 1-step t-diagnosable if the diagnosis syndromes for conditions involving up to t malfunctions are all distinct Necessary conditions: 1. n  2t + 1; i.e., a majority of units must be good 2. Each unit must be tested by at least t other units M0M0 M1M1 M3M3 M2M2 D 01 D 12 D 30 D 23 D 20 D 13 Sufficient condition: An n-unit system in which no two units test one another is 1-step t-diagnosable iff each unit is tested by at least t other units The system above, has each unit tested by 1 or 2 units; it is 1-step 1-diagnosable It cannot be made 1-step 2-diagnosable via adding more test connections So, each unit being tested by t other units is both necessary and sufficient

Nov Malfunction Diagnosis and ToleranceSlide 11 Analogy: “Liars and Truth-Tellers” Puzzles You visit an island whose inhabitants are from two tribes Members of one tribe (“liars”) consistently lie Members of the other tribe (“truth-tellers”) always tell the truth You encounter a person on the island What single yes/no question would you ask him to determine his tribe? More generally, how can you derive correct conclusions from info provided by members of these tribes, without knowing their tribes? How would the problem change if the two tribes were “truth-tellers” and “randoms” (whose members give you random answers) M0M0 M1M1 M3M3 M2M2 0 0 x x 0 1 In the context of malfunction diagnosis: Truth-tellers are akin to good modules Randoms correspond to bad modules You do not know whether a module is good or bad Module “opinions” about other modules must be used to derive correct diagnoses

Nov Malfunction Diagnosis and ToleranceSlide 12 1-Step Diagnosability: Analysis & Synthesis A degree-t directed chordal ring, in which node i tests the t nodes i + 1, i + 2,..., i + t (all mod n) has the required property Synthesis problem: Specify the test links (connection assignment) that makes an n-unit system 1-step t-diagnosable; use as few test links as possible Analysis problems: 1. Given a directed graph defining the test links, find the largest value of t for which the system is 1-step t-diagnosable (easy if no two units test one another; fairly difficult, otherwise) 2. Given a directed graph and its associated test outcomes, identify all the malfunctioning units, assuming there are no more than t There is a vast amount of published work dealing with Problem 1 Problem 2 arises when we want to repair or reconfigure a system using test outcomes (solved via table lookup or analytical methods)

Nov Malfunction Diagnosis and ToleranceSlide 13 An O(n 3 )-Step Diagnosis Algorithm Input: The diagnosis matrix Output: Every unit labeled G or B while some unit remains unlabeled repeat choose an unlabeled unit and label it G or B use labeled units to label other units if the new label leads to a contradiction then backtrack endif endwhile M0M0 M1M1 M3M3 M2M step 1-diagnosable system M 0 is G (arbitrary choice) M 1 is B M 2 is B (contradiction, 2 Bs) M 0 is B (change label) M 1 is G M 2 is G M 3 is G More efficient algorithms exist

Nov Malfunction Diagnosis and ToleranceSlide 14 An O(n 2.5 )-Step Diagnosis Algorithm From the original testing graph, derive an L-graph The L-graph has the same nodes There is a link from node i to node j in the L-graph iff node i can be assumed to be malfunctioning when node j is known to be good M0M0 M1M1 M3M3 M2M Theorem: The unique minimal vertex cover of the L-graph is the set of t or fewer malfunctioning units Testing graph and test results MiMi MjMj M j good  M i malfunctioning Definition – Vertex cover of a graph: A subset of vertices that contains at least one of the two endpoints of each edge M0M0 M1M1 M3M3 M2M2 Corresponding L-graph

Nov Malfunction Diagnosis and ToleranceSlide 15 Sequential t-Diagnosability: Requirements An n-unit system is sequentially t-diagnosable if the diagnosis syndromes when there are t or fewer malfunctions are such that they always identify, unambiguously, at least one malfunctioning unit Necessary condition: n  2t + 1; i.e., a majority of units must be good This is useful because some systems that are not 1-step t-diagnosable are sequentially t-diagnosable, and they can be restored by removing the identified malfunctioning unit(s) and repeating the process This system is sequentially 2-diagnosable In one step, it is only 1-diagnosable Sequential diagnosability of directed rings: An n-node directed ring is sequentially t-diagnosable for any t that satisfies  (t 2 – 1)/4  + t + 2  n M0M0 M1M1 M4M4 M2M2 M3M3

Nov Malfunction Diagnosis and ToleranceSlide 16 Sequential Diagnosability: Analysis & Synthesis An n-node ring, with n  2t + 1, with added test links from 2t – 2 other nodes to node 0 (besides node n – 1 which already tests it) has the required property Synthesis problem: Specify the test links (connection assignment) that makes an n-unit system 1-step t-diagnosable; use as few test links as possible Analysis problems: 1. Given a directed graph defining the test links, find the largest value of t for which the system is sequentially t-diagnosable 2. Given a directed graph and its associated test outcomes, identify at least one malfunctioning unit (preferably more), assuming there are no more than t These problems have been extensively studied

Nov Malfunction Diagnosis and ToleranceSlide 17 Other Types of Diagnosability An n-unit system is 1-step t/s-diagnosable if a set of no more than t malfunctioning units can always be identified to within a set of s units, where s  t An n-unit system is sequentially t/s-diagnosable if from a set of up to t malfunctioning units, s can be identified in one step, where s < t The special case of 1-step t/t-diagnosability has been widely studied Given the values of t and s, the problem of deciding whether a system is t/s-diagnosable is co-NP-complete However, there exist efficient, polynomial-time, algorithms to find the largest integer t such that the system is t/t- or t/(t + 1)-diagnosable Safe diagnosability: Up to t malfunctions are correctly diagnosed and up to u detected (no danger of incorrect diagnosis for up to u malfunctions; reminiscent of combo error-correcting/detecting codes)

Nov Malfunction Diagnosis and ToleranceSlide 18 What Comes after Malfunction Diagnosis? When one or more malfunctioning units have been identified, the system must be reconfigured to allow it to isolate those units and to function without the unavailable resources In a bus-based system, we isolate malfunctioning units, remove them, and plug in good modules (standby spares or repaired ones) In a system having point-to-point connectivity, we reconfigure by rearranging the connections in order to switch in (shared) spares, using methods similar to those developed for defect circumvention Reconfiguration may involve: 1. Recovering state info from removed modules or back-up storage 2. Reassigning tasks and reallocating data 3. Restarting the computation from last checkpoint or from scratch

Nov Malfunction Diagnosis and ToleranceSlide 19 Reconfiguring Regular Architectures Shared spare

Nov Malfunction Diagnosis and ToleranceSlide 20 Reconfiguration Switching, Revisited Question: How do we know which cells/nodes must be bypassed? Must devise a scheme in which healthy nodes set the switches

Nov Malfunction Diagnosis and ToleranceSlide 21 Diagnosability of Regular Structures Diagnosability results have been published for a variety of regular interconnection networks

Nov Malfunction Diagnosis and ToleranceSlide 22 Malfunction Tolerance with Low Redundancy Reconfigurable 4  4 mesh with one spare The following example scheme uses only one spare processor for a 2D mesh (no increase in node degree), yet it allows system reconfiguration to circumvent any malfunctioning processor, replacing it with the spare via relabeling of the nodes