Published by Gary Doyle. Modified over 9 years ago.
1
IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective
by L. Spainhower & T.A. Gregg
Presented by Mahmut Yilmaz
2
Some Terms
Concurrent error detection & repair: The system finds errors and repairs itself while still running
In-line error checking: Error detection codes (EDC) and error correction codes (ECC)
On-line error correction: Errors are corrected while the system continues to operate
Transient (soft) faults: Temporary faults or bit flips, such as single-event upsets
Hard faults: Persistent faults that remain active for a significant period of time (forever?)
3
Background
S/390 failure modes
  – Permanent, intermittent and transient faults
  – If an error recurs often enough to reach a threshold, it is treated as permanent
Thermal Conduction Module (TCM)
  – TCM: A liquid-cooled multichip module introduced by IBM
  – A series of spring-loaded cylinders conducts the heat from the chips to the cooling chamber
  – Circuit growth rates exceeded reliability gains
  – Parity check and ECC were used
  – Circuits were encapsulated
  – System repair required all system resources
  – Most repairs were concurrent
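The threshold rule for classifying faults can be illustrated with a small sketch. This is only an illustration, not IBM's mechanism: the class name `FaultClassifier`, the per-component counter, and the threshold value of 3 are all assumptions made for the example.

```python
from collections import Counter


class FaultClassifier:
    """Counts detected errors per component (hypothetical policy).

    A component is reported as permanently faulty once its error count
    reaches the threshold; before that, each error is treated as transient.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value, not taken from the paper
        self.counts = Counter()

    def report_error(self, component):
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            return "permanent"
        return "transient"


clf = FaultClassifier(threshold=3)
history = [clf.report_error("cpu0") for _ in range(3)]
print(history)
```

A real system would also age out old errors so that rare, unrelated transients do not accumulate into a false "permanent" verdict; that is omitted here for brevity.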
4
Background (cont.)
CMOS
  – G1 (1994) to G5
  – G1: Less reliable than 9020
    – System failures are more probable
  – G2: Dynamic memory sparing
  – G3: More robust ECC & CPU sparing (manual replacement)
  – G4: Concurrent CPU sparing & CPU instruction-level retry
  – G5: Most reliable
    – Greatly exceeds any TCM
    – Well protected against soft faults (hard faults?)
5
Microprocessor Fault Tolerant Design
Duplication is used by several systems
  – Intel, Himalaya systems
  – Duplication requires more than 100% hardware overhead
  – Error detection only!
Fetch-decode (I-unit) and execute (E-unit) are generally not protected
  – S/390 protects them
Transient fault rates are increasing with decreased feature sizes
6
Microprocessor Fault Tolerant Design (cont.)
G5 Fault Tolerant Design Point
  – 9X2: Main goal is to keep CPI low
  – G5: Main goal is to keep clock period short
  – In-line error protection is not suitable for G5:
    – High fan-out/fan-in
    – Increased chip area
    – Longer wires
    – Increased path length
  – Result: Duplicated I-unit and E-unit
  – A checker like the DIVA checker: R-unit
  – Total hardware overhead: 35%
  – No performance penalty (?)
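The duplicate-and-compare idea behind the duplicated I-unit and E-unit can be sketched as follows. This is a toy software model, not the G5 hardware: the functions `execute` and `checked_execute`, the two supported operations, and the `fault_mask` used to simulate a bit flip are all assumptions made for the example.

```python
def execute(op, a, b, fault_mask=0):
    """One copy of a toy 32-bit execution unit.

    fault_mask is XORed into the result to simulate a transient bit flip.
    """
    if op == "add":
        result = (a + b) & 0xFFFFFFFF
    elif op == "and":
        result = a & b
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result ^ fault_mask


def checked_execute(op, a, b, fault_in_second_copy=0):
    """Run the operation on two copies and compare the results.

    Returns (result, error_detected). On a mismatch no result is
    committed, because comparison alone cannot tell which copy is wrong.
    """
    first = execute(op, a, b)
    second = execute(op, a, b, fault_mask=fault_in_second_copy)
    if first != second:
        return None, True
    return first, False


print(checked_execute("add", 2, 3))
print(checked_execute("add", 2, 3, fault_in_second_copy=1 << 4))
```

The mismatch case shows why duplication gives detection only: recovery needs a separate, trusted copy of the architected state to restart from.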
7
Microprocessor Fault Tolerant Design (cont.)
G5 Fault Tolerant Design Point (cont.)
  – Recovery and on-line repair: R-unit
  – L1: Store-through cache
  – L2: Shared memory
    – Line sparing
  – Upon error detection: If the retry is not successful, the CPU is stopped
  – Dynamic CPU sparing (DCS)
  – Faulty CPU R-unit → Spare CPU R-unit
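The retry-then-spare flow can be sketched in a few lines. This is a simplified model under stated assumptions: the `Cpu` class, the single retry, the integer "state", and the way the checkpoint is handed to the spare are all hypothetical and stand in for the checkpointed state that the R-unit holds.

```python
class Cpu:
    """Toy CPU whose next `failing_runs` executions report a detected error."""

    def __init__(self, name, failing_runs=0):
        self.name = name
        self.failing_runs = failing_runs  # simulated fault duration
        self.stopped = False

    def run(self, instruction, state):
        if self.failing_runs > 0:
            self.failing_runs -= 1
            raise RuntimeError(f"{self.name}: error detected")
        return state + instruction  # stand-in for real execution


def execute_with_recovery(cpu, spare, instruction, checkpoint, retries=1):
    """Try the instruction, retry from the checkpoint, then fall back to a spare.

    Returns (cpu_that_completed, new_state).
    """
    for _ in range(1 + retries):
        try:
            return cpu, cpu.run(instruction, checkpoint)
        except RuntimeError:
            continue  # restart from the same checkpointed state
    cpu.stopped = True  # retry failed: treat the fault as hard
    if spare is None:
        raise RuntimeError("no spare CPU available")
    # The spare resumes from the faulty CPU's last good checkpoint.
    return spare, spare.run(instruction, checkpoint)


soft = Cpu("cpu0", failing_runs=1)
print(execute_with_recovery(soft, Cpu("spare0"), 5, 10)[1])
```

A transient fault disappears on the retry, so the original CPU keeps running; only a fault that survives the retry causes the CPU to be stopped and its work moved to the spare.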
8
Memory Fault Tolerance
ECC
Permanent fault in L1: Cache line or quarter-cache delete
Permanent fault in L2: Cache delete
  – Data array or address directory marked as invalid
  – Spare lines
L3: Main memory
  – Background scrubbing
  – On-line repair: Built-in spare chips
  – Word line or chip kill
    – After reaching threshold, replace module
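ECC and background scrubbing can be illustrated with a textbook Hamming(7,4) single-error-correcting code. This is a swapped-in teaching example, not the ECC used in the G5 memory hierarchy, which protects much wider words; the function names and the list-based "memory" are assumptions made for the sketch.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword.

    The codeword is a list indexed 1..7 (index 0 is unused);
    positions 1, 2 and 4 hold parity bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def correct(code):
    """Fix a single flipped bit in place; return its position (0 if none)."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    if syndrome:
        code[syndrome] ^= 1
    return syndrome


def decode(code):
    """Extract the 4 data bits from a (corrected) codeword."""
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


def scrub(memory):
    """One scrub pass: check every word, repair single-bit errors in place."""
    return sum(1 for word in memory if correct(word))


memory = [encode(n) for n in range(16)]
memory[5][6] ^= 1      # inject a soft error into one stored word
print(scrub(memory))   # number of words repaired in this pass
```

Scrubbing periodically matters because a single-error-correcting code cannot repair a word once a second soft error lands in it; repairing the first error early keeps errors from accumulating.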
9
I/O & Power/Cooling Subsystem Fault Tolerance
Multiple paths
Path redundancy
Power/Cooling subsystems
10
Questions
Is duplication the optimal choice?
  – No protection against hard faults!
How to protect a CPU against intermittent faults? (Delay faults)
  – Generally, they are the beginning phase of a hard fault
How to protect an ALU by parity check? An adder? (page 868, 1st paragraph)
If the retry is unsuccessful, the CPU is stopped.
  – Would it not be better to use a counter to account for transient faults?
  – What if a transient fault occurs while retrying?