Static and Dynamic Fault Diagnosis Richard Beigel Univ. Illinois at Chicago and DIMACS
Nonstandard computing architectures Perceptrons and small-depth circuits Optically interconnected multiprocessors DNA computing Self-diagnosing Systems
Brief history of system-level fault diagnosis
Preparata et al 67: static, nonadaptive
Nakajima 81: static, adaptive, serial
Hakimi & Nakajima 84: static, adaptive, parallel
Recent advances in system-level diagnosis Distributed diagnosis Diagnosing intermittent faults Diagnosis with errors Fast parallel diagnosis of static faults Ongoing diagnosis and repair of dynamic faults
Fault diagnosis problem Given n processors a primitive by which each processor can test any other a reliable external controller that observes test results Determine which are good and which are faulty Assume perfect communication in a complete network
What’s so hard about that?
“Say Ah” / “Ha Ha!” / “OK, you pass”
Faulty processors may give incorrect test results
Possible test results
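A minimal sketch of the outcomes a single test can produce, assuming the usual convention that a good tester always reports the truth while a faulty tester may report anything. The name `possible_results` is illustrative and does not come from the talk.

```python
def possible_results(tester_good, testee_good):
    """Return the set of verdicts a tester could report about a testee.

    Assumption: a good tester is always correct; a faulty tester is
    unconstrained and may answer either way.
    """
    if tester_good:
        return {"good"} if testee_good else {"faulty"}
    return {"good", "faulty"}


# Enumerate all four tester/testee combinations.
for tester_good in (True, False):
    for testee_good in (True, False):
        print(tester_good, testee_good, sorted(possible_results(tester_good, testee_good)))
```

Only the two rows with a good tester carry reliable information, which is why the identity of the tester matters as much as the verdict.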
A majority of processors must be good for diagnosis to be possible. Otherwise two equal-sized groups can each claim “We’re all good. They’re all faulty,” and the test results cannot show which group is telling the truth.
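The symmetry can be checked on a four-processor example. This is a sketch under the assumption that a good tester always reports the truth and a faulty tester may report anything; `consistent`, `group_a`, and `group_b` are hypothetical names.

```python
def consistent(results, faulty):
    """True if every tester outside `faulty` reported the truth.

    `results` maps (tester, testee) to True when the tester said "good".
    Verdicts from testers inside `faulty` are unconstrained.
    """
    return all(
        verdict == (testee not in faulty)
        for (tester, testee), verdict in results.items()
        if tester not in faulty
    )


group_a, group_b = {0, 1}, {2, 3}
procs = group_a | group_b

# Each processor praises its own group and condemns the other one.
results = {
    (u, v): (u in group_a) == (v in group_a)
    for u in procs for v in procs if u != v
}

a_is_good = consistent(results, faulty=group_b)
b_is_good = consistent(results, faulty=group_a)
print(a_is_good, b_is_good)
```

Both candidate fault sets explain the same test results, so no algorithm can choose between them when exactly half the processors may be faulty.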
Serial diagnosis of static faults
n processors, at most t faults, t < n/2
Nonadaptive diagnosis: n(t+1) tests are necessary and sufficient [Preparata et al 67]
Adaptive diagnosis: n+t-1 tests are necessary and sufficient [Nakajima 81]
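To get a feel for the gap between the two bounds, the counts quoted on this slide can be evaluated directly. The function names are illustrative; the formulas are taken from the slide as stated.

```python
def nonadaptive_tests(n, t):
    """Test count quoted for nonadaptive serial diagnosis: n(t+1)."""
    if not 2 * t < n:
        raise ValueError("requires t < n/2")
    return n * (t + 1)


def adaptive_tests(n, t):
    """Test count quoted for adaptive serial diagnosis: n+t-1."""
    if not 2 * t < n:
        raise ValueError("requires t < n/2")
    return n + t - 1


n, t = 101, 50
print(nonadaptive_tests(n, t), adaptive_tests(n, t))
```

With t close to n/2 the nonadaptive count grows quadratically in n, while the adaptive count stays linear.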
Distributed diagnosis of static faults In the distributed diagnosis model there is no central controller, and all good processors must learn the status of the other processors. Distributed diagnosis is reducible to the “cooperative collect” problem, and can be solved with tests [Aspnes-Hurwood 96]
INTERMITTENT FAULTS AND ERRORS Work in progress by Beigel and Fu
Intermittent faults An “intermittent” fault may appear faulty in some tests and good in others We cannot hope to diagnose intermittent faults as such because they might exhibit consistent behavior in all tests Goal: correctly diagnose all other processors
Errors An error is a misdiagnosis by a good processor. Note the similarity to an intermittent fault.
Results In rounds, we can perform static diagnosis assuming that a majority of the processors are good and at most t of them are intermittently faulty. In rounds, we can perform static diagnosis in the presence of errors. Assuming at most t errors per round, the results will be within of a correct diagnosis.
PARALLEL DIAGNOSIS OF STATIC FAULTS Perform many tests simultaneously
Parallel diagnosis of static faults
84 Hakimi & Schmeichel: O(n/log n)
90 S & H & Otsuka & Sullivan: O(log n)
89 Beigel & Kosaraju & Sullivan: O(1)
93 Beigel & Margulis & Spielman: 32
94 Beigel & Hurwood & Kahale: 10
best lower bound = 5
Digraphs tester testee testing round = directed matching
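A small check for whether a set of tests forms a legal testing round, assuming "directed matching" means that no processor takes part in more than one test, either as tester or as testee. The name `is_testing_round` is hypothetical.

```python
def is_testing_round(tests):
    """True if the (tester, testee) pairs form a directed matching."""
    busy = set()
    for tester, testee in tests:
        if tester == testee or tester in busy or testee in busy:
            return False
        busy.update((tester, testee))
    return True


print(is_testing_round([(0, 1), (2, 3)]))   # disjoint pairs
print(is_testing_round([(0, 1), (1, 2)]))   # processor 1 is used twice
```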
SHOS 90 generates a large mutual admiration society
MAS = strongly connected component with all good edges
Either all nodes good, or all nodes faulty
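One way to extract mutual admiration societies from recorded test results is to keep only the "good" verdicts and compute strongly connected components. The sketch below uses Kosaraju's algorithm, which is a choice made here and not something the talk specifies; `mutual_admiration_societies` is an illustrative name.

```python
from collections import defaultdict


def mutual_admiration_societies(results):
    """Strongly connected components of the graph of "good" verdicts.

    `results` maps (tester, testee) to True when the tester said "good".
    Recursive for brevity, so only suitable for small examples.
    """
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for (u, v), said_good in results.items():
        nodes.update((u, v))
        if said_good:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion.
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    # Pass 2: DFS on the reversed graph in reverse completion order.
    comps, assigned = [], set()

    def collect(u, comp):
        assigned.add(u)
        comp.add(u)
        for v in rev[u]:
            if v not in assigned:
                collect(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comp = set()
            collect(u, comp)
            comps.append(comp)
    return comps


results = {(0, 1): True, (1, 0): True, (1, 2): False, (2, 3): True, (3, 2): True}
societies = mutual_admiration_societies(results)
print(sorted(sorted(c) for c in societies))
```

Within each component, a single good member forces every other member to be good, since good testers do not vouch for faulty processors. That is the "all good or all faulty" property the slide relies on.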
SHOS 90 O(log n) “pairing” algorithm
Pair up processors
Pair up pairs
Pair up fours
Obtain MAS of size (which must be all good)
Test rest in 1 round
What about processors that don’t like each other?
Build one chain for each good processor we found (4 rounds)
Most chains must have a good processor in each level (count!)
Total: 4 + 1 rounds
Beigel-Margulis-Spielman 94
Constructive (84 rounds): Find several MAS’s of size including at least one good MAS. Large MAS’s test each other and all remaining processors in 6 rounds.
Nonconstructive (32 rounds): Find several MAS’s of size including at least one good MAS. Large MAS’s test each other and all remaining processors in 4 rounds.
Expander graphs guarantee a good big MAS
In the Cayley graphs of Margulis and LPS with p=37, every n/2-node induced subgraph contains a strong component of size (cf. Alon & Chung 88, who find long paths)
degree of undirected graph = 38
78 directed matchings cover graph
78 + 6 = 84 rounds
Random graphs guarantee a good big MAS
If G consists of 14 directed Hamiltonian paths on n vertices then, whp, every n/2-node induced subgraph contains a strong component of size
28 directed matchings cover graph
28 + 4 = 32 rounds
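One plausible reading of where the 28 comes from: the edges of a directed path split into two directed matchings by alternating, so 14 paths give 2 × 14 = 28 matchings. The sketch below illustrates that split; `path_to_matchings` is a hypothetical helper, and the round arithmetic simply restates the slide.

```python
def path_to_matchings(path):
    """Split the edges of a directed path into two directed matchings."""
    edges = list(zip(path, path[1:]))
    # Alternate edges share no endpoints, so each half is a matching.
    return edges[0::2], edges[1::2]


even, odd = path_to_matchings([0, 1, 2, 3, 4])
print(even, odd)

paths = 14
matching_rounds = 2 * paths          # two matchings per Hamiltonian path
total_rounds = matching_rounds + 4   # plus the 4 final rounds from the slide
print(matching_rounds, total_rounds)
```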
Beigel-Hurwood-Kahale 95 speeds up BMS 94
In k+1 rounds build MAS’s of size
also build one chain of don’t-likes
each MAS can be in simultaneous tests
Perform G’s directed matchings in 1 round
Process chain in 2 or 3 more rounds
Constructive: 13 rounds. Nonconstructive: 10 rounds.
Lower bound Upper bound for smaller t n processors, at most t faults If 5 rounds are necessary If 4 rounds suffice algorithm uses lower-degree expanders
DIAGNOSIS AND REPAIR OF DYNAMIC FAULTS Processors fail each round, but algorithm may order repairs
Ongoing diagnosis and repair of dynamic faults
Processors may fail each round, but algorithm may order repairs
In each round
1. perform tests
2. direct that up to t processors are repaired
3. at most t processors fail
Goal: bound number of faults at all times
Results for n processors, at most t failures per round
When t > 70 and n > 376 t log t + 50t, we can maintain n - 64 t log t - 10t good processors at all times
This works even if the number of faults exceeds n/2
When n = 640 and t = 1, we can maintain 520 good processors at all times.
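The general bound can be evaluated for concrete values. The slide does not state the base of the logarithm; the sketch below assumes base 2, and `guaranteed_good` is an illustrative name.

```python
import math


def guaranteed_good(n, t):
    """Evaluate n - 64 t log t - 10t under the slide's preconditions.

    Assumption: log is base 2 (the slide does not say).
    """
    log_t = math.log2(t)
    if not (t > 70 and n > 376 * t * log_t + 50 * t):
        raise ValueError("bound applies only when t > 70 and n > 376 t log t + 50t")
    return n - 64 * t * log_t - 10 * t


print(guaranteed_good(400_000, 128))
```

The n = 640, t = 1 result on the slide is a separate special case and falls outside the preconditions of this formula.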
Why’s this hard? We can’t determine the status of a chosen processor because its testers might fail right before we choose them Mutual admiration societies don’t work either
SIFT and WINNOW SIFT finds a large set G consisting of processors that were good when SIFT started running, and a small set F containing some faulty processors WINNOW uses G to diagnose most of the faulty processors in F Algorithm: SIFT, WINNOW, repair, repeat
SIFT algorithm Let r = 2logt In 2r rounds form undirected hypercubes of size Put MAS’s into G, others into F MAS’s must have been entirely good at start of SIFT, and are still mostly good
WINNOW algorithm Choose a processor P in F For 2logt rounds, test P and every processor that has tested P so far, using testers in G If the tests always call P faulty but don’t call any of the others faulty then we can be sure that P really is faulty Most old faults are diagnosed, but 4tlogt new ones could accumulate.
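The acceptance rule at the end of WINNOW can be sketched as a predicate over the recorded tests. This covers only the final decision, not the scheduling of testers from G, and the data layout and the name `winnow_says_faulty` are assumptions made for illustration.

```python
def winnow_says_faulty(p, rounds):
    """Apply the WINNOW acceptance rule to recorded test results.

    `rounds` is a list of rounds; each round is a list of
    (tester, testee, said_good) triples with testers drawn from G.
    P is declared faulty only if every test of P said "faulty" and
    no other tested processor was ever called faulty.
    """
    p_was_tested = False
    for tests in rounds:
        for _tester, testee, said_good in tests:
            if testee == p:
                if said_good:
                    return False      # somebody vouched for P
                p_was_tested = True
            elif not said_good:
                return False          # one of P's testers was called faulty
    return p_was_tested


clean = [[("g1", "P", False)], [("g2", "P", False), ("g3", "g1", True)]]
tainted = [[("g1", "P", False)], [("g2", "P", False), ("g3", "g1", False)]]
print(winnow_says_faulty("P", clean), winnow_says_faulty("P", tainted))
```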
Summary
We have efficient algorithms for
diagnosis in the presence of a small number of intermittent faults
diagnosis with a small number of diagnosis errors
parallel fault diagnosis
ongoing diagnosis of dynamic faults