Static and Dynamic Fault Diagnosis

Richard Beigel, Univ. Illinois at Chicago and DIMACS

Nonstandard computing architectures
Perceptrons and small-depth circuits
Optically interconnected multiprocessors
DNA computing
Self-diagnosing systems

Brief history of system-level fault diagnosis
Preparata et al 67: static, nonadaptive
Nakajima 81: static, adaptive, serial
Hakimi & Nakajima 84: static, adaptive, parallel

Recent advances in system-level diagnosis
Distributed diagnosis
Diagnosing intermittent faults
Diagnosis with errors
Fast parallel diagnosis of static faults
Ongoing diagnosis and repair of dynamic faults

Fault diagnosis problem
Given
  n processors
  a primitive by which each processor can test any other
  a reliable external controller that observes test results
Determine which processors are good and which are faulty
Assume perfect communication in a complete network

What’s so hard about that?
“Say Ah”
“Ha Ha!”
“OK, you pass”
Faulty processors may give incorrect test results

Possible test results

A majority of processors must be good for diagnosis to be possible
Group A: We’re all good. They’re all faulty.
Group B: We’re all good. They’re all faulty.
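A small sketch of why an even split cannot be diagnosed. It assumes n = 4, two groups of two, and faulty processors that lie so as to imitate the other group. The helper `syndrome` and the `liar_report` parameter are hypothetical names used only for this illustration.

```python
def syndrome(n, faulty, liar_report):
    """Map every (tester, testee) pair to the report True = "testee is faulty"."""
    results = {}
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            if u in faulty:
                # A faulty tester may report anything; liar_report picks what.
                results[(u, v)] = liar_report(u, v)
            else:
                # A good tester reports the truth.
                results[(u, v)] = v in faulty
    return results

n = 4
A, B = {0, 1}, {2, 3}

# World 1: A is good, B is faulty, and B's members accuse exactly A.
world1 = syndrome(n, faulty=B, liar_report=lambda u, v: v in A)
# World 2: B is good, A is faulty, and A's members accuse exactly B.
world2 = syndrome(n, faulty=A, liar_report=lambda u, v: v in B)

print(world1 == world2)
```

The controller sees the same test results in both worlds, so no algorithm can tell which group is the good one.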

Serial diagnosis of static faults
n processors, at most t faults, t < n/2
Nonadaptive diagnosis: n(t+1) tests are necessary and sufficient [Preparata et al 67]
Adaptive diagnosis: n+t-1 tests are necessary and sufficient [Nakajima 81]
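To get a feel for the gap between the two bounds, the counts can be evaluated directly. The function names below are hypothetical; the formulas are the ones stated on the slide.

```python
def nonadaptive_tests(n, t):
    """Test count n(t+1) for nonadaptive diagnosis; requires t < n/2."""
    if not 2 * t < n:
        raise ValueError("diagnosis requires t < n/2")
    return n * (t + 1)

def adaptive_tests(n, t):
    """Test count n+t-1 for adaptive diagnosis; requires t < n/2."""
    if not 2 * t < n:
        raise ValueError("diagnosis requires t < n/2")
    return n + t - 1

# Example: n = 10 processors with at most t = 4 faults.
print(nonadaptive_tests(10, 4))  # 50
print(adaptive_tests(10, 4))     # 13
```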

Distributed diagnosis of static faults
In the distributed diagnosis model there is no central controller, and all good processors must learn the status of the other processors. Distributed diagnosis is reducible to the “cooperative collect” problem, and can be solved with tests [Aspnes-Hurwood 96]

INTERMITTENT FAULTS AND ERRORS
Work in progress by Beigel and Fu

Intermittent faults
An “intermittent” fault may appear faulty in some tests and good in others
We cannot hope to diagnose intermittent faults as such, because they might exhibit consistent behavior in all tests
Goal: correctly diagnose all other processors

Errors
An error is a misdiagnosis by a good processor.
Note the similarity to an intermittent fault
[Figure labels: faulty, good, good]

Results
In rounds, we can perform static diagnosis assuming that a majority of the processors are good and at most t of them are intermittently faulty.
In rounds, we can perform static diagnosis in the presence of errors. Assuming at most t errors per round, the results will be within of a correct diagnosis.

PARALLEL DIAGNOSIS OF STATIC FAULTS
Perform many tests simultaneously

Parallel diagnosis of static faults
Year  Authors                        Rounds
84    Hakimi & Schmeichel            O(n/log n)
90    S & H & Otsuka & Sullivan      O(log n)
89    Beigel & Kosaraju & Sullivan   O(1)
93    Beigel & Margulis & Spielman   32
94    Beigel & Hurwood & Kahale      10
Best lower bound = 5

Digraphs
Edge: tester → testee
Testing round = directed matching
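A minimal check that a proposed set of simultaneous tests forms a directed matching. It assumes a matching means each processor takes part in at most one test per round, either as tester or as testee. The function name is hypothetical.

```python
def is_directed_matching(tests):
    """tests is a list of (tester, testee) pairs scheduled for one round."""
    busy = set()
    for tester, testee in tests:
        # No self-tests, and no processor may appear in two tests of the round.
        if tester == testee or tester in busy or testee in busy:
            return False
        busy.update((tester, testee))
    return True

print(is_directed_matching([(0, 1), (2, 3)]))  # True
print(is_directed_matching([(0, 1), (1, 2)]))  # False: processor 1 is used twice
```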

SHOS 90 generates a large mutual admiration society
MAS = strongly connected component with all good edges
Either all nodes good, or all nodes faulty
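One way to extract candidate MAS's from a set of test results is to take the strongly connected components of the digraph that keeps only the tests reporting "good". The sketch below assumes that reading and uses Kosaraju's two-pass algorithm; the function name and the input format are illustrative, not from the talk.

```python
from collections import defaultdict

def mutual_admiration_societies(n, good_edges):
    """Return the SCCs of the digraph whose edges (u, v) mean "u says v is good"."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in good_edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All successors explored: node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Processors 0, 1, 2 vouch for each other in a cycle; 3 vouches for 4 only.
societies = mutual_admiration_societies(5, [(0, 1), (1, 2), (2, 0), (3, 4)])
print(sorted(sorted(c) for c in societies))  # [[0, 1, 2], [3], [4]]
```

If any member of a component is good, its "good" verdicts are truthful, so everything it reaches through good edges is good as well. This is why a component is either all good or all faulty.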

SHOS 90 O(logn) “pairing” algorithm
Pair up processors
Pair up pairs
Pair up fours
Obtain MAS of size (which must be all good)
Test rest in 1 round
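The doubling idea can be sketched as follows. This is a simplified illustration, not the SHOS 90 procedure in detail. It assumes that two groups merge only when their representatives call each other good, and that groups which fail to merge (or are left unpaired) are simply set aside. The names `pairing_stages` and `report` are hypothetical.

```python
def pairing_stages(processors, report):
    """report(u, v) returns True when u says v is good.

    Returns (groups, set_aside): the groups that survived every merge stage,
    and the groups dropped along the way.
    """
    groups = [[p] for p in processors]
    set_aside = []
    while len(groups) > 1:
        if len(groups) % 2:
            # SIMPLIFICATION: an unpaired leftover group is set aside.
            set_aside.append(groups.pop())
        merged = []
        for a, b in zip(groups[0::2], groups[1::2]):
            # The two representatives test each other in one stage.
            if report(a[0], b[0]) and report(b[0], a[0]):
                merged.append(a + b)
            else:
                set_aside.extend([a, b])
        groups = merged
    return groups, set_aside

faulty = {7}

def report(u, v):
    # ASSUMPTION: a faulty tester calls everyone good; a good tester is truthful.
    return True if u in faulty else v not in faulty

groups, set_aside = pairing_stages(range(8), report)
print(groups)     # [[0, 1, 2, 3]]
print(set_aside)  # [[6], [7], [4, 5]]
```

Each surviving group is connected by mutual "good" verdicts, so it is either entirely good or entirely faulty.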

What about processors that don’t like each other?
Build one chain for each good processor we found (4 rounds)
Most chains must have a good processor in each level (count!)
Total: rounds

Beigel-Margulis-Spielman 94
Constructive (84 rounds)
  Find several MAS’s of size including at least one good MAS
  Large MAS’s test each other and all remaining processors in 6 rounds
Nonconstructive (32 rounds)
  Find several MAS’s of size including at least one good MAS
  Large MAS’s test each other and all remaining processors in 4 rounds

Expander graphs guarantee a good big MAS
In the Cayley graphs of Margulis and LPS with p=37, every n/2-node induced subgraph contains a strong component of size (cf Alon & Chung 88, who find long paths)
Degree of undirected graph = 38
78 directed matchings cover the graph; with the 6 final rounds this gives 84 rounds

Random graphs guarantee a good big MAS
If G consists of 14 directed Hamiltonian paths on n vertices then, whp, every n/2-node induced subgraph contains a strong component of size
28 directed matchings cover the graph; with the 4 final rounds this gives 32 rounds

Beigel-Hurwood-Kahale 95 speeds up BMS 94
In k+1 rounds build MAS’s of size
  also build one chain of don’t-likes
  each MAS can be in simultaneous tests
Perform G’s directed matchings in 1 round
Process chain in 2 or 3 more rounds
Constructive: 13 rounds. Nonconstructive: 10 rounds.

Lower bound Upper bound for smaller t
n processors, at most t faults
If rounds are necessary
If rounds suffice
  algorithm uses lower-degree expanders

DIAGNOSIS AND REPAIR OF DYNAMIC FAULTS
Processors fail each round, but algorithm may order repairs

Ongoing diagnosis and repair of dynamic faults
Processors may fail each round, but algorithm may order repairs
In each round
  1. perform tests
  2. direct that up to t processors are repaired
  3. at most t processors fail
Goal: bound number of faults at all times
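The round structure can be captured in a small simulation harness. This is a sketch under stated assumptions: failures are chosen at random rather than adversarially, a faulty tester answers with a coin flip, and the harness does not enforce the one-test-per-processor limit of a round. `run_rounds`, `strategy`, and `never_repair` are hypothetical names, and `never_repair` is only a baseline, not an algorithm from the talk.

```python
import random

def run_rounds(n, t, rounds, strategy, seed=0):
    """Simulate test / repair / fail rounds; return the largest fault count seen."""
    rng = random.Random(seed)
    faulty = set()
    worst = 0

    def test(tester, testee):
        # True means "tester reports testee as faulty".
        if tester in faulty:
            return rng.random() < 0.5  # ASSUMPTION: faulty testers answer arbitrarily
        return testee in faulty

    for _ in range(rounds):
        # Steps 1 and 2: the strategy runs tests, then names up to t repairs.
        to_repair = list(strategy(n, t, test))[:t]
        faulty.difference_update(to_repair)
        # Step 3: at most t processors fail (here exactly t, chosen at random).
        faulty.update(rng.sample(range(n), t))
        worst = max(worst, len(faulty))
    return worst

def never_repair(n, t, test):
    """Baseline strategy that orders no repairs."""
    return []

worst = run_rounds(n=100, t=2, rounds=10, strategy=never_repair)
print(worst)
```

A real strategy would use `test` to decide which processors to repair. The goal stated on the slide is to keep the returned fault count bounded no matter how many rounds are run.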

Results for n processors at most t failures per round
When t > 70 and n > 376t log t + 50t, we can maintain n - 64t log t - 10t good processors at all times
This works even if the number of faults exceeds n/2
When n = 640 and t = 1, we can maintain 520 good processors at all times.
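The general bound can be evaluated numerically. The sketch below assumes the logarithm is base 2, which the slides do not state, and the function name is hypothetical. The n = 640, t = 1 result is a separate statement and is not produced by this formula.

```python
import math

def guaranteed_good(n, t):
    """Return n - 64 t log t - 10 t when the stated preconditions hold, else None."""
    if t <= 70:
        return None
    log_t = math.log2(t)  # ASSUMPTION: base-2 logarithm
    if n <= 376 * t * log_t + 50 * t:
        return None
    return n - 64 * t * log_t - 10 * t

# Example with t = 128 (log2 t = 7): the precondition needs n > 343296.
print(guaranteed_good(400_000, 128))  # 341376.0
print(guaranteed_good(300_000, 128))  # None: n is too small for this t
```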

Why’s this hard?
We can’t determine the status of a chosen processor, because its testers might fail right before we choose them
Mutual admiration societies don’t work either

SIFT and WINNOW
SIFT finds a large set G consisting of processors that were good when SIFT started running, and a small set F containing some faulty processors
WINNOW uses G to diagnose most of the faulty processors in F
Algorithm: SIFT, WINNOW, repair, repeat

SIFT algorithm
Let r = 2 log t
In 2r rounds form undirected hypercubes of size
Put MAS’s into G, others into F
MAS’s must have been entirely good at start of SIFT, and are still mostly good

WINNOW algorithm
Choose a processor P in F
For 2 log t rounds, test P and every processor that has tested P so far, using testers in G
If the tests always call P faulty but don’t call any of the others faulty, then we can be sure that P really is faulty
Most old faults are diagnosed, but 4t log t new ones could accumulate.

Summary
We have efficient algorithms for
  diagnosis in the presence of a small number of intermittent faults
  diagnosis with a small number of diagnosis errors
  parallel fault diagnosis
  ongoing diagnosis of dynamic faults
