Static and Dynamic Fault Diagnosis

Presentation transcript:

Static and Dynamic Fault Diagnosis. Richard Beigel, Univ. Illinois at Chicago and DIMACS.

Nonstandard computing architectures: perceptrons and small-depth circuits; optically interconnected multiprocessors; DNA computing; self-diagnosing systems.

Brief history of system-level fault diagnosis: Preparata et al 67 (static, nonadaptive); Nakajima 81 (static, adaptive, serial); Hakimi & Nakajima 84 (static, adaptive, parallel).

Recent advances in system-level diagnosis: distributed diagnosis; diagnosing intermittent faults; diagnosis with errors; fast parallel diagnosis of static faults; ongoing diagnosis and repair of dynamic faults.

Fault diagnosis problem. Given: n processors; a primitive by which each processor can test any other; a reliable external controller that observes test results. Determine which processors are good and which are faulty. Assume perfect communication in a complete network.

What’s so hard about that? “Say Ah.” “Ha Ha!” “OK, you pass.” Faulty processors may give incorrect test results.

Possible test results
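The possible outcomes of a single test can be sketched in code. This is a minimal sketch, assuming the usual model in which a good tester always reports the truth and a faulty tester may report anything. The name `run_test` and the coin-flip behavior of faulty testers are illustrative assumptions, not taken from the talk.

```python
import random

def run_test(tester, testee, faulty, rng=random):
    """Return True if `tester` reports `testee` as good.

    Assumed model: a good tester always reports the truth, while a
    faulty tester's report is arbitrary (modeled here as a coin flip).
    """
    if tester in faulty:
        return rng.random() < 0.5   # unreliable answer
    return testee not in faulty     # truthful answer

faulty = {2}
print(run_test(0, 1, faulty))  # good processor tests a good one
print(run_test(0, 2, faulty))  # good processor tests a faulty one
print(run_test(2, 0, faulty))  # faulty tester: either answer is possible
```

Only reports from good testers carry information, and the controller does not know in advance which testers those are.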

A majority of processors must be good for diagnosis to be possible. Otherwise two groups can each claim “We’re all good, they’re all faulty,” and the controller cannot tell which group to believe.
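The need for a good majority can be checked on a small example. The sketch below assumes a worst-case adversary in which every faulty tester answers as if its own group were the good one. The function `syndrome` and the six-processor split are hypothetical illustrations.

```python
def syndrome(good, pretend_good, n):
    """Collect all n*(n-1) test results.

    Good testers report the truth. Faulty testers (assumed adversarial)
    answer as if `pretend_good` were the set of good processors.
    """
    results = {}
    for tester in range(n):
        view = good if tester in good else pretend_good
        for testee in range(n):
            if tester != testee:
                results[(tester, testee)] = testee in view
    return results

n = 6
A = {0, 1, 2}
B = {3, 4, 5}
world1 = syndrome(good=A, pretend_good=B, n=n)  # A good, B faulty
world2 = syndrome(good=B, pretend_good=A, n=n)  # B good, A faulty
print(world1 == world2)  # identical test results in both worlds
```

Because both worlds produce the same test results, no algorithm can distinguish them when exactly half of the processors are faulty.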

Serial diagnosis of static faults: n processors, at most t faults, t < n/2. Nonadaptive diagnosis: n(t+1) tests are necessary and sufficient [Preparata et al 67]. Adaptive diagnosis: n+t-1 tests are necessary and sufficient [Nakajima 81].
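To give a feel for the gap between the two bounds, the following sketch simply evaluates both formulas. The function names are illustrative, and the argument check encodes the slide's assumption t < n/2.

```python
def nonadaptive_tests(n, t):
    """Tests needed by nonadaptive diagnosis: n(t+1)."""
    if not 0 <= t < n / 2:
        raise ValueError("requires 0 <= t < n/2")
    return n * (t + 1)

def adaptive_tests(n, t):
    """Tests needed by adaptive diagnosis: n+t-1."""
    if not 0 <= t < n / 2:
        raise ValueError("requires 0 <= t < n/2")
    return n + t - 1

n, t = 100, 10
print(nonadaptive_tests(n, t))  # 1100
print(adaptive_tests(n, t))     # 109
```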

Distributed diagnosis of static faults In the distributed diagnosis model there is no central controller, and all good processors must learn the status of the other processors. Distributed diagnosis is reducible to the “cooperative collect” problem, and can be solved with tests [Aspnes-Hurwood 96]

INTERMITTENT FAULTS AND ERRORS Work in progress by Beigel and Fu

Intermittent faults: An “intermittent” fault may appear faulty in some tests and good in others. We cannot hope to diagnose intermittent faults as such, because they might exhibit consistent behavior in all tests. Goal: correctly diagnose all other processors.

Errors: An error is a misdiagnosis by a good processor. Note the similarity to an intermittent fault. [Figure labels: faulty, good, good]

Results In rounds, we can perform static diagnosis assuming that a majority of the processors are good and at most t of them are intermittently faulty. In rounds, we can perform static diagnosis in the presence of errors. Assuming at most t errors per round, the results will be within of a correct diagnosis.

PARALLEL DIAGNOSIS OF STATIC FAULTS Perform many tests simultaneously

Parallel diagnosis of static faults (number of rounds):
84  Hakimi & Schmeichel                      O(n/log n)
90  Schmeichel & Hakimi & Otsuka & Sullivan  O(log n)
89  Beigel & Kosaraju & Sullivan             O(1)
93  Beigel & Margulis & Spielman             32
94  Beigel & Hurwood & Kahale                10
Best lower bound = 5

Digraphs: each edge points from tester to testee. Testing round = directed matching.
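A round of tests can be validated against this definition. The sketch below assumes that a directed matching means every processor takes part in at most one test per round, whether as tester or as testee. The helper name `is_directed_matching` is illustrative.

```python
def is_directed_matching(tests):
    """Check that a list of (tester, testee) pairs is a directed matching.

    Assumption: each processor may appear in at most one test of the
    round, in either role, and no processor tests itself.
    """
    seen = set()
    for tester, testee in tests:
        if tester == testee or tester in seen or testee in seen:
            return False
        seen.update((tester, testee))
    return True

print(is_directed_matching([(0, 1), (2, 3)]))  # True
print(is_directed_matching([(0, 1), (1, 2)]))  # False: processor 1 is used twice
```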

SHOS 90 generates a large mutual admiration society (MAS). MAS = strongly connected component with all good edges. Either all nodes are good, or all nodes are faulty.

SHOS 90 O(log n) “pairing” algorithm: Pair up processors. Pair up pairs. Pair up fours. Obtain MAS of size (which must be all good). Test rest in 1 round.
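The pairing idea can be sketched as repeated merging of groups whose representatives approve of each other. This is a simplified illustration, not the SHOS 90 algorithm itself: the helpers `make_tester` and `pairing_round`, the choice of the first member as representative, and the rule that faulty testers always answer "good" are all assumptions made to keep the example small and deterministic.

```python
def make_tester(faulty):
    """Test primitive for this sketch: good testers are truthful, faulty
    testers (an assumption, for determinism) always answer 'good'."""
    def report_good(tester, testee):
        return True if tester in faulty else testee not in faulty
    return report_good

def pairing_round(groups, report_good):
    """Merge groups two at a time when their representatives test each
    other as good in both directions; set the rest aside."""
    merged, set_aside = [], []
    for a, b in zip(groups[0::2], groups[1::2]):
        if report_good(a[0], b[0]) and report_good(b[0], a[0]):
            merged.append(a + b)        # still a mutual admiration society
        else:
            set_aside.extend([a, b])    # these two do not like each other
    if len(groups) % 2:                 # odd group out has no partner
        set_aside.append(groups[-1])
    return merged, set_aside

faulty = {3}
report_good = make_tester(faulty)
groups = [[p] for p in range(8)]
while len(groups) > 1:
    groups, set_aside = pairing_round(groups, report_good)
    print(groups, set_aside)
```

Each merge preserves the MAS invariant: if one representative is good, its report shows the other is good too, and if it is faulty, the other representative's approval shows that one is faulty as well. Every merged group is therefore all good or all faulty.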

What about processors that don’t like each other? Build one chain for each good processor we found (4 rounds). Most chains must have a good processor in each level (count!). Total: 4 + 1 rounds.

Beigel-Margulis-Spielman 94. Constructive (84 rounds): Find several MAS’s of size including at least one good MAS. Large MAS’s test each other and all remaining processors in 6 rounds. Nonconstructive (32 rounds): Find several MAS’s of size including at least one good MAS. Large MAS’s test each other and all remaining processors in 4 rounds.

Expander graphs guarantee a good big MAS. In the Cayley graphs of Margulis and LPS with p=37, every n/2-node induced subgraph contains a strong component of size (cf. Alon & Chung 88, who find long paths). Degree of undirected graph = 38. 78 directed matchings cover the graph. 78 + 6 = 84 rounds.

Random graphs guarantee a good big MAS. If G consists of 14 directed Hamiltonian paths on n vertices then, whp, every n/2-node induced subgraph contains a strong component of size. 28 directed matchings cover the graph. 28 + 4 = 32 rounds.
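The count of 28 matchings follows from splitting each directed Hamiltonian path into two directed matchings by taking alternate edges. The sketch below illustrates that step under stated assumptions: the paths are drawn as uniformly random permutations, and the helper names are illustrative, not from the talk.

```python
import random

def path_to_matchings(path):
    """Split a directed path into two directed matchings by taking
    alternate edges (even-indexed edges, then odd-indexed edges)."""
    edges = list(zip(path, path[1:]))
    return edges[0::2], edges[1::2]

def is_matching(edges):
    """True if no vertex appears in more than one edge."""
    endpoints = [v for edge in edges for v in edge]
    return len(endpoints) == len(set(endpoints))

rng = random.Random(0)
n = 20
# Assumption: each Hamiltonian path is a uniformly random ordering of the vertices.
paths = [rng.sample(range(n), n) for _ in range(14)]
matchings = [m for p in paths for m in path_to_matchings(p)]

print(len(matchings))                          # 28
print(all(is_matching(m) for m in matchings))  # True
```

Performing one matching per round takes 28 rounds, and the 4 further rounds on the slide give the total of 32.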

Beigel-Hurwood-Kahale 95 speeds up BMS 94. In k+1 rounds build MAS’s of size; also build one chain of don’t-likes. Each MAS can be in simultaneous tests. Perform G’s directed matchings in 1 round. Process chain in 2 or 3 more rounds. Constructive: 13 rounds. Nonconstructive: 10 rounds.

Lower bound Upper bound for smaller t n processors, at most t faults If 5 rounds are necessary If 4 rounds suffice algorithm uses lower-degree expanders

DIAGNOSIS AND REPAIR OF DYNAMIC FAULTS Processors fail each round, but algorithm may order repairs

Ongoing diagnosis and repair of dynamic faults: Processors may fail each round, but the algorithm may order repairs. In each round: 1. perform tests; 2. direct that up to t processors are repaired; 3. at most t processors fail. Goal: bound the number of faults at all times.
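The round structure can be sketched as a small simulation. This is only an illustration of the three steps: the testing step is omitted, the repair policy is an oracle that already knows which processors are faulty (a stand-in for real diagnosis), and failures are chosen at random rather than adversarially. The function `simulate_rounds` is hypothetical.

```python
import random

def simulate_rounds(n, t, rounds, seed=0):
    """Simulate the dynamic-fault round structure and return the number
    of faulty processors at the end of each round."""
    rng = random.Random(seed)
    faulty = set()
    fault_counts = []
    for _ in range(rounds):
        # 1. Perform tests (omitted: the oracle below replaces diagnosis).
        # 2. Direct that up to t processors are repaired.
        for p in sorted(faulty)[:t]:
            faulty.discard(p)
        # 3. At most t processors fail (chosen at random in this sketch).
        good = [p for p in range(n) if p not in faulty]
        for p in rng.sample(good, min(t, len(good))):
            faulty.add(p)
        fault_counts.append(len(faulty))
    return fault_counts

print(simulate_rounds(n=50, t=3, rounds=10))
```

With a perfect oracle the fault count never exceeds t. The difficulty addressed in the talk is approaching such a bound when the repairs must be chosen from unreliable test results.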

Results for n processors, at most t failures per round: When t > 70 and n > 376 t log t + 50 t, we can maintain n - 64 t log t - 10 t good processors at all times. This works even if the number of faults exceeds n/2. When n = 640 and t = 1, we can maintain 520 good processors at all times.
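The general bound can be evaluated numerically. The sketch below assumes that log means the base-2 logarithm, which the slide does not state. The function `guaranteed_good` is illustrative and covers only the t > 70 case, not the separate n = 640, t = 1 result.

```python
import math

def guaranteed_good(n, t):
    """Evaluate n - 64 t log t - 10 t under the slide's preconditions.

    Assumption: log is base 2 (not specified in the source).
    """
    if t <= 70:
        raise ValueError("bound stated only for t > 70")
    log_t = math.log2(t)
    if n <= 376 * t * log_t + 50 * t:
        raise ValueError("bound requires n > 376 t log t + 50 t")
    return n - 64 * t * log_t - 10 * t

print(guaranteed_good(n=400_000, t=128))
```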

Why’s this hard? We can’t determine the status of a chosen processor, because its testers might fail right before we choose them. Mutual admiration societies don’t work either.

SIFT and WINNOW: SIFT finds a large set G consisting of processors that were good when SIFT started running, and a small set F containing some faulty processors. WINNOW uses G to diagnose most of the faulty processors in F. Algorithm: SIFT, WINNOW, repair, repeat.

SIFT algorithm: Let r = 2 log t. In 2r rounds form undirected hypercubes of size. Put MAS’s into G, others into F. MAS’s must have been entirely good at start of SIFT, and are still mostly good.

WINNOW algorithm: Choose a processor P in F. For 2 log t rounds, test P and every processor that has tested P so far, using testers in G. If the tests always call P faulty but don’t call any of the others faulty, then we can be sure that P really is faulty. Most old faults are diagnosed, but 4 t log t new ones could accumulate.

Summary: We have efficient algorithms for diagnosis in the presence of a small number of intermittent faults; diagnosis with a small number of diagnosis errors; parallel fault diagnosis; ongoing diagnosis of dynamic faults.