1
Static and Dynamic Fault Diagnosis
Richard Beigel
Univ. Illinois at Chicago and DIMACS
2
Nonstandard computing architectures
Perceptrons and small-depth circuits
Optically interconnected multiprocessors
DNA computing
Self-diagnosing systems
3
Brief history of system-level fault diagnosis
Preparata et al 67: static, nonadaptive
Nakajima 81: static, adaptive, serial
Hakimi & Nakajima 84: static, adaptive, parallel
4
Recent advances in system-level diagnosis
Distributed diagnosis
Diagnosing intermittent faults
Diagnosis with errors
Fast parallel diagnosis of static faults
Ongoing diagnosis and repair of dynamic faults
5
Fault diagnosis problem
Given:
n processors
a primitive by which each processor can test any other
a reliable external controller that observes test results
Determine which processors are good and which are faulty
Assume perfect communication in a complete network
6
What’s so hard about that?
Say Ah
Ha Ha!
OK, you pass
Faulty processors may give incorrect test results
7
Possible test results
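The possible outcomes can be enumerated in a short Python sketch. It assumes the usual reading of the model stated on the previous slides: a good tester reports the truth about the testee, while a faulty tester may report anything. The function name `possible_results` and the string labels are illustrative choices, not part of the talk.

```python
from itertools import product


def possible_results(tester_good: bool, testee_good: bool) -> set:
    """Return the set of results a single test may report.

    Assumption: a good tester is always truthful; a faulty tester
    may report either result regardless of the testee's status.
    """
    if tester_good:
        return {"good"} if testee_good else {"faulty"}
    return {"good", "faulty"}


# Print all four tester/testee combinations.
for tester_good, testee_good in product([True, False], repeat=2):
    tester = "good" if tester_good else "faulty"
    testee = "good" if testee_good else "faulty"
    outcomes = sorted(possible_results(tester_good, testee_good))
    print(f"{tester} tester, {testee} testee: {outcomes}")
```

Only tests performed by good processors carry reliable information, which is why a single "good" result proves nothing on its own.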
8
A majority of processors must be good for diagnosis to be possible
One half: We’re all good. They’re all faulty.
Other half: We’re all good. They’re all faulty.
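A small Python sketch of why a good majority is needed: with four processors split into two equal groups, one set of test results is consistent with either group being the good one, so no controller can tell the two cases apart. The helper `consistent` and the encoding of results as booleans are assumptions made for this illustration.

```python
from itertools import permutations


def consistent(syndrome, good):
    """True if every test made by a processor in `good` reports the truth.

    `syndrome` maps (tester, testee) -> True when the tester said "good".
    Tests made by processors outside `good` are unconstrained.
    """
    return all(
        result == (testee in good)
        for (tester, testee), result in syndrome.items()
        if tester in good
    )


group_a, group_b = {0, 1}, {2, 3}

# Each processor calls its own group good and the other group faulty.
syndrome = {
    (u, v): (u in group_a) == (v in group_a)
    for u, v in permutations(range(4), 2)
}

# Both explanations fit the same observations.
print(consistent(syndrome, group_a), consistent(syndrome, group_b))
```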
9
Serial diagnosis of static faults
n processors, at most t faults, t < n/2
Nonadaptive diagnosis: n(t+1) tests are necessary and sufficient [Preparata et al 67]
Adaptive diagnosis: n+t-1 tests are necessary and sufficient [Nakajima 81]
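To make the gap between the two bounds concrete, the following Python sketch evaluates both formulas from this slide for one choice of n and t. The function names and the sample values are illustrative.

```python
def nonadaptive_tests(n: int, t: int) -> int:
    """Test count n(t+1) for nonadaptive serial diagnosis."""
    if not 0 <= 2 * t < n:
        raise ValueError("requires 0 <= t < n/2")
    return n * (t + 1)


def adaptive_tests(n: int, t: int) -> int:
    """Test count n+t-1 for adaptive serial diagnosis."""
    if not 0 <= 2 * t < n:
        raise ValueError("requires 0 <= t < n/2")
    return n + t - 1


n, t = 10, 4
print(nonadaptive_tests(n, t))  # 10 * 5 = 50
print(adaptive_tests(n, t))     # 10 + 4 - 1 = 13
```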
10
Distributed diagnosis of static faults
In the distributed diagnosis model there is no central controller, and all good processors must learn the status of the other processors. Distributed diagnosis is reducible to the “cooperative collect” problem, and can be solved with tests [Aspnes-Hurwood 96]
11
INTERMITTENT FAULTS AND ERRORS
Work in progress by Beigel and Fu
12
Intermittent faults
An “intermittent” fault may appear faulty in some tests and good in others.
We cannot hope to diagnose intermittent faults as such, because they might exhibit consistent behavior in all tests.
Goal: correctly diagnose all other processors.
13
Errors
An error is a misdiagnosis by a good processor.
Note the similarity to an intermittent fault.
(figure labels: faulty, good, good)
14
Results
In rounds, we can perform static diagnosis assuming that a majority of the processors are good and at most t of them are intermittently faulty.
In rounds, we can perform static diagnosis in the presence of errors.
Assuming at most t errors per round, the results will be within of a correct diagnosis.
15
PARALLEL DIAGNOSIS OF STATIC FAULTS
Perform many tests simultaneously
16
Parallel diagnosis of static faults
84 Hakimi & Schmeichel: O(n/log n) rounds
90 Schmeichel & Hakimi & Otsuka & Sullivan: O(log n) rounds
89 Beigel & Kosaraju & Sullivan: O(1) rounds
93 Beigel & Margulis & Spielman: 32 rounds
94 Beigel & Hurwood & Kahale: 10 rounds
best lower bound = 5 rounds
17
Digraphs
edge: tester → testee
testing round = directed matching
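A round of tests can be checked against the matching condition with a few lines of Python. This sketch assumes that "directed matching" means each processor takes part in at most one test per round, either as tester or as testee; the function name `is_directed_matching` is illustrative.

```python
def is_directed_matching(tests):
    """Check that no processor appears in more than one (tester, testee) pair."""
    busy = set()
    for tester, testee in tests:
        if tester == testee or tester in busy or testee in busy:
            return False
        busy.update((tester, testee))
    return True


print(is_directed_matching([(0, 1), (2, 3)]))  # a valid round
print(is_directed_matching([(0, 1), (1, 2)]))  # processor 1 is used twice
```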
18
SHOS 90 generates a large mutual admiration society
MAS = strongly connected component with all good edges
Either all nodes are good, or all nodes are faulty
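Under one reading of this definition, mutual admiration societies are the strongly connected components of the digraph that keeps only the tests reporting "good". The Python sketch below computes them with Kosaraju's two-pass algorithm; the function name, the input encoding, and the sample data are assumptions made for illustration, and the recursion is only suitable for small examples.

```python
from collections import defaultdict


def mutual_admiration_societies(results):
    """Group processors into strongly connected components of "good" edges.

    `results` maps (tester, testee) -> True when the tester reported "good".
    """
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for (u, v), said_good in results.items():
        nodes.update((u, v))
        if said_good:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    # Pass 2: collect components on the reversed graph.
    components, assigned = [], set()

    def collect(u, comp):
        assigned.add(u)
        comp.add(u)
        for v in rev[u]:
            if v not in assigned:
                collect(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comp = set()
            collect(u, comp)
            components.append(comp)
    return components


results = {(0, 1): True, (1, 2): True, (2, 0): True, (2, 3): False, (3, 0): True}
print(mutual_admiration_societies(results))
```

In the sample data, processors 0, 1, and 2 vouch for each other in a cycle, so they form one society; processor 3 is vouched for by nobody and stays alone.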
19
SHOS 90 O(logn) “pairing” algorithm
Pair up processors
Pair up pairs
Pair up fours
Obtain MAS of size (which must be all good)
Test rest in 1 round
20
What about processors that don’t like each other?
Build one chain for each good processor we found (4 rounds)
Most chains must have a good processor in each level (count!)
Total: rounds
21
Beigel-Margulis-Spielman 94
Constructive (84 rounds):
Find several MAS’s of size including at least one good MAS
Large MAS’s test each other and all remaining processors in 6 rounds
Nonconstructive (32 rounds):
Find several MAS’s of size including at least one good MAS
Large MAS’s test each other and all remaining processors in 4 rounds
22
Expander graphs guarantee a good big MAS
In the Cayley graphs of Margulis and LPS with p = 37, every n/2-node induced subgraph contains a strong component of size (cf. Alon & Chung 88, who find long paths)
degree of undirected graph = 38
78 directed matchings cover the graph, giving 84 rounds in total
23
Random graphs guarantee a good big MAS
If G consists of 14 directed Hamiltonian paths on n vertices then, whp, every n/2-node induced subgraph contains a strong component of size
28 directed matchings cover the graph, giving 32 rounds in total
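The count of 28 follows because a directed Hamiltonian path splits into two directed matchings by taking alternate edges, and 14 paths times 2 gives 28. A minimal Python sketch of that split, with an illustrative function name and a toy path:

```python
def path_to_matchings(path):
    """Split the edges of a directed path into two directed matchings.

    Consecutive edges share a vertex, so edges at even positions and
    edges at odd positions each form a set of disjoint pairs.
    """
    edges = list(zip(path, path[1:]))
    return edges[0::2], edges[1::2]


even, odd = path_to_matchings([0, 1, 2, 3, 4])
print(even)  # [(0, 1), (2, 3)]
print(odd)   # [(1, 2), (3, 4)]

# 14 Hamiltonian paths, two matchings each.
print(14 * 2)  # 28
```

Each matching can run as one testing round, so covering G takes 28 rounds; adding the 4 rounds quoted for the nonconstructive algorithm gives the 32-round total.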
24
Beigel-Hurwood-Kahale 95 speeds up BMS 94
In k+1 rounds build MAS’s of size
also build one chain of don’t-likes
each MAS can be in simultaneous tests
Perform G’s directed matchings in 1 round
Process chain in 2 or 3 more rounds
Constructive: 13 rounds. Nonconstructive: 10 rounds.
25
Lower bound Upper bound for smaller t
n processors, at most t faults
If rounds are necessary
If rounds suffice
algorithm uses lower-degree expanders
26
DIAGNOSIS AND REPAIR OF DYNAMIC FAULTS
Processors fail each round, but algorithm may order repairs
27
Ongoing diagnosis and repair of dynamic faults
Processors may fail each round, but algorithm may order repairs
In each round:
1. perform tests
2. direct that up to t processors are repaired
3. at most t processors fail
Goal: bound number of faults at all times
28
Results for n processors at most t failures per round
When t > 70 and n > 376 t log t + 50t, we can maintain n - 64 t log t - 10t good processors at all times.
This works even if the number of faults exceeds n/2.
When n = 640 and t = 1, we can maintain 520 good processors at all times.
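The general bound can be evaluated numerically with a short Python sketch. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name and sample values are illustrative.

```python
import math


def guaranteed_good(n: int, t: int) -> float:
    """Evaluate n - 64 t log t - 10t under the slide's preconditions.

    Assumption: log means log base 2 (the slide does not say).
    """
    if t <= 70:
        raise ValueError("the general bound is stated for t > 70")
    log_t = math.log2(t)
    if n <= 376 * t * log_t + 50 * t:
        raise ValueError("requires n > 376 t log t + 50t")
    return n - 64 * t * log_t - 10 * t


# With t = 128, log t = 7, so n must exceed 376*128*7 + 50*128 = 343296.
print(guaranteed_good(400_000, 128))  # 400000 - 57344 - 1280 = 341376.0
```

The n = 640, t = 1 result on the slide is a separate special case and does not come from this formula, which requires t > 70.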
29
Why’s this hard?
We can’t determine the status of a chosen processor, because its testers might fail right before we choose them.
Mutual admiration societies don’t work either.
30
SIFT and WINNOW
SIFT finds a large set G consisting of processors that were good when SIFT started running, and a small set F containing some faulty processors.
WINNOW uses G to diagnose most of the faulty processors in F.
Algorithm: SIFT, WINNOW, repair, repeat
31
SIFT algorithm
Let r = 2 log t
In 2r rounds form undirected hypercubes of size
Put MAS’s into G, others into F
MAS’s must have been entirely good at start of SIFT, and are still mostly good
32
WINNOW algorithm
Choose a processor P in F.
For 2 log t rounds, test P and every processor that has tested P so far, using testers in G.
If the tests always call P faulty but don’t call any of the others faulty, then we can be sure that P really is faulty.
Most old faults are diagnosed, but 4 t log t new ones could accumulate.
33
Summary
We have efficient algorithms for:
diagnosis in the presence of a small number of intermittent faults
diagnosis with a small number of diagnosis errors
parallel fault diagnosis
ongoing diagnosis of dynamic faults