DS -V - FDT - 1
HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
Dependable Systems for Web and E-Business (Zuverlässige Systeme für Web und E-Business)
Lecture 5
FAULT DIAGNOSIS TECHNIQUES
Winter Semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek

FAULT DIAGNOSIS TECHNIQUES
OBJECTIVES:
– TO INTRODUCE THE MAIN FAULT DETECTION AND FAULT LOCATION TECHNIQUES
CONTENTS:
– FAULT DETECTION TECHNIQUES
– FAULT LOCATION (ISOLATION) METHODS

FAULT DIAGNOSIS TECHNIQUES
FAULT DETECTION + FAULT LOCATION = FAULT DIAGNOSIS
FAULT DETECTION BY
– REPLICATION CHECKS
– TIMING CHECKS
– REVERSAL CHECKS
– CODING CHECKS
– REASONABLENESS CHECKS
– STRUCTURAL CHECKS
– DIAGNOSTIC CHECKS
– ALGORITHMIC CHECKS

REPLICATION CHECKS
POWERFUL, COMPLETE, EXPENSIVE
TESTS EXECUTION AGAINST AN ALTERNATE IMPLEMENTATION
EXAMPLES
– EXECUTE IDENTICAL COPIES ON SEPARATE HARDWARE
    ASSUMES THE DESIGN IS CORRECT AND THAT ONLY COMPONENT FAILURES OCCUR, INDEPENDENTLY OF EACH OTHER
– EXECUTE SEPARATE AND DIFFERENT VERSIONS WITH DIFFERENT DESIGNS
    ASSUMES THE DESIGN MAY BE INCORRECT AND THAT DESIGN FAULTS OCCUR INDEPENDENTLY
    A VERSION MAY PROVIDE CHECKING INFORMATION BUT BE UNEXECUTABLE
– EXECUTE THE SAME COPY MULTIPLE TIMES
    ASSUMES THE FAULT IS TRANSIENT
– REPLICATE ONLY A PORTION OF A SYSTEM
    ASSUMES THE REQUESTED RESPONSE IS CORRECT IF A CORRECT FIXED RESPONSE IS ALSO GENERATED
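
The replication check described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the names `replicated_call`, `good_adder` and `faulty_adder` are hypothetical, the replicas are plain functions rather than separate hardware, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter


def replicated_call(replicas, *args):
    """Run every replica on the same inputs and vote on the results."""
    results = [replica(*args) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        # Detection succeeded, but no majority exists to mask the fault.
        raise RuntimeError("no majority: replicas disagree")
    # Replicas whose result differs from the majority are fault suspects.
    suspects = [i for i, r in enumerate(results) if r != value]
    return value, suspects


def good_adder(a, b):
    return a + b


def faulty_adder(a, b):
    return a + b + 1  # hypothetical injected fault


value, suspects = replicated_call([good_adder, faulty_adder, good_adder], 2, 3)
```

With three replicas the majority value masks the single faulty result, and the list of suspects gives a first hint for fault location. With only two replicas the same comparison would detect the disagreement but could not say which copy is wrong.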

EXAMPLE: 3B20 FROM AT&T

TIMING CHECKS
A LIMITED REPLICATION CHECK
TESTS EXECUTION AGAINST TIMING CONSTRAINTS
EXAMPLES
– WATCHDOG TIMER
    PROCESS RESETS THE TIMER, INDICATING SATISFACTORY OPERATION
    IF THE TIME EXPIRES, ASSUME A FAILED PROCESS
– MESSAGE-BROADCASTING
    PROCESS BROADCASTS A MESSAGE TO OTHER PROCESSES; RECIPIENTS CHECK FOR THE MESSAGE
    IF THE MESSAGE IS NOT RECEIVED, ASSUME A FAILED SENDER
– MESSAGE-REQUESTING
    PROCESS SENDS A REQUEST TO ANOTHER PROCESS
    IF THE RETURN MESSAGE IS NOT RECEIVED, ASSUME A FAILED RECIPIENT PROCESS
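
The watchdog-timer example above might look roughly as follows. The `Watchdog` class and its `kick`/`expired` methods are hypothetical names chosen for this sketch, and the clock is injectable so that the behaviour can be shown without real waiting; a production watchdog would normally run in hardware or in a separate supervisor process.

```python
import time


class Watchdog:
    """Software sketch of a watchdog timer (illustrative only)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        # Called by the monitored process to signal satisfactory operation.
        self.last_kick = self.clock()

    def expired(self):
        # True if the process has not reset the timer within the timeout.
        return self.clock() - self.last_kick > self.timeout_s


# Simulated clock (an assumption of this sketch) instead of real time.
now = [0.0]
wd = Watchdog(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
alive_before = wd.expired()  # still within the timeout
wd.kick()                    # process reports progress at t = 0.5

now[0] = 2.0                 # no further kick for 1.5 time units
failed = wd.expired()        # supervisor would now assume a failed process
```

The same timeout logic carries over to the message-based variants: a missing broadcast or a missing reply within the deadline plays the role of the missing kick.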

EXAMPLE: TANDEM SYSTEM

REVERSAL CHECKS
INPUTS AND OUTPUTS ARE ONE-TO-ONE
CALCULATES INPUTS FROM OUTPUTS AND TESTS AGAINST ACTUAL INPUTS
EXAMPLES
– REREAD DATA AFTER A WRITE
– MATHEMATICAL FUNCTIONS
    $(\sqrt{X})^2 = X$ ?
    $A \cdot A^{-1} = I$ ?
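
Both reversal examples can be illustrated with a short sketch. The function names `checked_sqrt` and `checked_write` are hypothetical, a Python dictionary stands in for the storage device, and the tolerances are assumptions needed because floating-point squaring does not reproduce the input exactly.

```python
import math


def checked_sqrt(x, rel_tol=1e-9):
    """Compute sqrt(x), then reverse the operation and compare with x."""
    y = math.sqrt(x)
    if not math.isclose(y * y, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise ArithmeticError("reversal check failed for sqrt")
    return y


def checked_write(store, key, value):
    """Write a value, then reread it and compare with what was written."""
    store[key] = value
    if store[key] != value:
        raise IOError("read-after-write check failed")


root = checked_sqrt(9.0)

disk = {}  # stand-in for a storage device
checked_write(disk, "block0", b"payload")
```

The check is cheap here because the inverse operation (squaring, rereading) costs no more than the original one; the technique is less attractive when the inverse is expensive or not unique.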

CODING CHECKS
REDUNDANT REPRESENTATIONS OF OBJECTS
EXAMPLES
– PARITY BIT
    DETECTS AN ODD NUMBER OF ERRORS
– HAMMING CODE
    CORRECTS SINGLE ERRORS
– CYCLIC REDUNDANCY CODE
    DETECTS ERRORS IN BLOCKS OF DATA
– ARITHMETIC CODE
    BASED ON REMAINDER THEOREMS FOR RESIDUE ARITHMETIC
– CHECKSUM
    DETECTS ERRORS IN BLOCKS OF DATA
– BERGER CODE
    NUMBER OF 1'S OR 0'S
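
As a small illustration of the simplest code on this slide, the following sketch appends an even-parity bit to a word and shows why parity detects only an odd number of bit errors. The helper names `add_even_parity` and `parity_ok` are hypothetical, and bits are modelled as a list of 0/1 integers.

```python
def add_even_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A code word is valid if it contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])

single_error = word.copy()
single_error[0] ^= 1          # one flipped bit: detected

double_error = single_error.copy()
double_error[1] ^= 1          # second flipped bit: parity is even again
```

Here `parity_ok(single_error)` is false while `parity_ok(double_error)` is true, so the double error goes unnoticed. Stronger codes from the list, such as Hamming or cyclic redundancy codes, trade more check bits for better coverage.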

REASONABLENESS CHECKS
KNOWING THE SYSTEM INTERNAL DESIGN AND CONSTRUCTION
TESTS STATES OF OBJECTS AGAINST INTENDED USE AND PURPOSE
EXAMPLES
– RANGE CHECKING
    ANGLE IN DEGREES IS WITHIN [0, 360] ?
– BOUNDS CHECKING
    ARRAY INDEX IS WITHIN BOUNDS ?
– CONSISTENCY CHECKING
    ON-GROUND AIRCRAFT HAS UNRETRACTED WHEELS ?
– TYPE CHECKING
    I.NUM IS INTEGER ?
    MODULO 2 (EVEN_NUMBER) = 0 ?
– CAPABILITY CHECKING
    READ_ACCESS IS YES ?
– RELIABILITY CALCULATION
    IS IT WITHIN [0, 1] ?
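
Several of these checks translate directly into predicates. The sketch below covers the range, bounds, consistency and reliability examples; all function and parameter names are hypothetical and chosen only to mirror the slide.

```python
def angle_in_range(degrees):
    """Range check: an angle in degrees must lie within [0, 360]."""
    return 0 <= degrees <= 360


def index_in_bounds(index, array):
    """Bounds check: the index must address an existing element."""
    return 0 <= index < len(array)


def aircraft_state_consistent(on_ground, wheels_extended):
    """Consistency check: an aircraft on the ground must have its wheels out."""
    return (not on_ground) or wheels_extended


def reliability_plausible(r):
    """A computed reliability is a probability and must lie within [0, 1]."""
    return 0.0 <= r <= 1.0


checks = [
    angle_in_range(270),
    index_in_bounds(3, [10, 20, 30]),         # index 3 is out of bounds
    aircraft_state_consistent(True, False),   # on ground with wheels retracted
    reliability_plausible(0.999),
]
```

Such checks cannot prove a value correct; they only flag states that are impossible under the intended use, which is why they depend on knowledge of the system design.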

STRUCTURAL CHECKS
CONSISTENT STRUCTURE OF DATA
EXAMPLES
– COUNT OF NUMBER OF ELEMENTS IN STRUCTURE
– REDUNDANT POINTERS
– STATUS INFORMATION
    CHECK SYSTEM CONFIGURATION
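
The first two examples, an element count and redundant pointers, can be combined in one data structure. The following is a minimal sketch with hypothetical class names: a doubly linked list stores a redundant back pointer in every node plus a stored element count, and `is_consistent` verifies both.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None   # redundant pointer, used only for checking


class CheckedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.count = 0     # redundant element count

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
            node.prev = self.tail
        self.tail = node
        self.count += 1

    def is_consistent(self):
        seen, prev, node = 0, None, self.head
        while node is not None:
            if node.prev is not prev:
                return False       # back pointer disagrees with forward chain
            seen += 1
            if seen > self.count:
                return False       # more nodes than recorded, or a cycle
            prev, node = node, node.next
        return seen == self.count and prev is self.tail


lst = CheckedList()
for v in (1, 2, 3):
    lst.append(v)
ok_before = lst.is_consistent()

lst.head.next.prev = None          # simulate a corrupted pointer
ok_after = lst.is_consistent()
```

The redundancy costs one extra pointer per node and one counter per list, and in exchange a single corrupted pointer or a lost node becomes detectable by a linear traversal.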

DIAGNOSTIC CHECKS
TEST COMPONENTS USING A SET OF INPUTS FOR WHICH THE OUTPUTS ARE KNOWN
PROGRAMS WHICH TEST FOR HARDWARE FAULTS
EXAMPLES
– MEMORY TESTS
    WRITE AND READ TEST PATTERNS
– ENVIRONMENTAL TESTS
    RUN AT ABNORMAL VOLTAGES
– LOAD TESTS
    RUN AT SATURATION LEVELS
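
A pattern-based memory test of the kind named above can be sketched as follows. The pattern set and the `StuckBitMemory` fault model are assumptions made for illustration; real memory diagnostics typically use more elaborate sequences to cover coupling and addressing faults as well.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def memory_test(mem):
    """Write each pattern to every cell, read it back, report bad addresses."""
    bad = set()
    for pattern in PATTERNS:
        for addr in range(len(mem)):
            mem[addr] = pattern
        for addr in range(len(mem)):
            if mem[addr] != pattern:
                bad.add(addr)
    return sorted(bad)


class StuckBitMemory:
    """Hypothetical faulty memory: bit 0 of one cell is stuck at 1."""

    def __init__(self, size, bad_addr):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, addr, value):
        self.cells[addr] = value

    def __getitem__(self, addr):
        value = self.cells[addr]
        return value | 0x01 if addr == self.bad_addr else value


healthy = memory_test(bytearray(16))
faulty = memory_test(StuckBitMemory(16, bad_addr=5))
```

Because the expected read-back value is known for every cell, the test not only detects the fault but also locates it: the returned address list identifies the failing cell.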

ALGORITHMIC CHECKS
CHECKING INVARIANTS OF AN ALGORITHM
– EXAMPLE: SORTING
    NUMBER OF ENTRIES
    CHECKSUM
INVARIANT CODES
– EXAMPLE: MATRIX MULTIPLICATION (Abraham), A x B = C
    (A WITH COLUMN CHECKSUM) x (B WITH ROW CHECKSUM) = (C WITH ROW/COLUMN CHECKSUMS)
    COMPARE THE CHECKSUMS OBTAINED FROM C WITH THOSE OBTAINED BY A x B
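
The matrix multiplication example can be made concrete with a small sketch of the checksum scheme. It assumes integer matrices stored as lists of lists, so that checksums can be compared exactly (floating-point data would need a tolerance), and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append a row that holds the sum of each column of A."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column that holds the sum of each row of B."""
    return [row + [sum(row)] for row in b]


def checksums_consistent(c_full):
    """Recompute row and column sums of C and compare with the stored ones."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c_full)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c_full))
    return rows_ok and cols_ok


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# The product of the two checksum matrices is a full checksum matrix for C.
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
ok = checksums_consistent(C_full)

C_full[0][0] += 1                  # simulate a single corrupted element
corrupted_ok = checksums_consistent(C_full)
```

A single wrong element violates exactly one row checksum and one column checksum, so the intersection of the failing row and column also locates the error.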

FAULT DIAGNOSIS TECHNIQUES IN MULTIPROCESSORS
THE FAULT DIAGNOSIS SHOULD LOCATE A FIELD REPLACEABLE UNIT, WHICH COULD BE
– PROCESSOR(S) BOARD(S)
– MEMORY(IES) BOARD(S)
– SWITCHING ELEMENT(S) BOARD
– INTERFACE BOARD
– I/O BOARD(S)
– SUPPORT PROCESSOR BOARD
– SOFTWARE MODULES
PACKAGING DETERMINES THE REQUIRED LEVEL OF FAULT DIAGNOSIS LOCATABILITY
PACKAGING, TESTABILITY, DIAGNOSABILITY AND PERFORMANCE INSTRUMENTATION ARE USUALLY AFTERTHOUGHTS IN THE DESIGN PROCESS
CONCURRENT ERROR DETECTION IS INDISPENSABLE IN A MULTIPROCESSOR ENVIRONMENT DUE TO HIGH SYSTEM COMPLEXITY AND RAPID SYSTEM CONTAMINATION
PLACE STRONG EMPHASIS ON APPLICATION (ALGORITHMIC) LEVEL DIAGNOSIS BY EMPLOYING SOPHISTICATED ACCEPTANCE TESTS
CONCURRENT DIAGNOSIS SHOULD COVER ALL SYSTEM LEVELS

DIAGNOSIS TECHNIQUES
– PARALLEL VS. LOCALIZED DIAGNOSIS
– CENTRALIZED VS. DISTRIBUTED DIAGNOSIS
DEPENDING ON THE FUNCTIONALITY STATEMENT AND THE MODEL ASSUMPTIONS, THE PROBLEM MAY VARY FROM RELATIVELY EASY TO EXTREMELY COMPLEX.

CONCURRENT ERROR DETECTION
NUMEROUS ERROR DETECTION CODES ARE USED, BUT FOR MULTIPROCESSORS THE FOLLOWING SEEM TO BE MOST EFFECTIVE:
PROCESSORS
– SIGNATURE ANALYSIS (AN EFFECTIVE ARITHMETIC CODE IS YET TO BE FOUND)
MEMORIES
– ERROR-CORRECTING CODES OR PARITY
NETWORKS
– PARITY OR BERGER CODE WITH RETRY; CRC
METRICS INCLUDE:
– FAULT CLASSES AND THEIR COVERAGE
– COST
– TIME (PERFORMANCE)
– RELIABILITY
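
For the network case, error detection combined with retry can be sketched as below. This is an illustration under several assumptions: it uses CRC-32 from Python's standard `zlib` module as the detecting code, the channel is a hypothetical function that corrupts one bit of the first transmission only, and the helper names `frame`, `unframe` and `send_with_retry` are invented for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, received = data[:-4], int.from_bytes(data[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_with_retry(payload, channel, max_attempts=3):
    """Retransmit until the receiver's CRC check passes."""
    for attempt in range(1, max_attempts + 1):
        result = unframe(channel(frame(payload)))
        if result is not None:
            return result, attempt
    raise IOError("CRC check failed on every attempt")


def make_flaky_channel():
    """Hypothetical channel: flips one bit in the first transmission only."""
    calls = {"n": 0}

    def channel(data):
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


received, attempts = send_with_retry(b"hello", make_flaky_channel())
```

Retry is effective only against transient faults; a permanent fault in the link exhausts the attempts and has to be handled by fault location and reconfiguration instead.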