(C) 2005 Daniel Sorin, Duke Computer Engineering
Autonomic Computing via Dynamic Self-Repair
Daniel J. Sorin, Department of Electrical & Computer Engineering

Daniel Sorin slide 2
A Computing Challenge for NASA
NASA relies on computers
NASA is much more demanding than most users
–Must operate in harsh environments that cause hard faults
–Must operate correctly for years
–Must not require human to repair problems
Our goal
–Designing autonomic computer systems
–Permanent faults will occur and computer will handle them

Daniel Sorin slide 3
But Isn’t This a Solved Problem?
We could just use TMR (triple modular redundancy)
[Figure: TMR diagram with replicated CPUs feeding a voter, which drives the output]
But too much power usage to be feasible
Especially for modern microprocessors
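As a rough illustration of what the voter in a TMR arrangement does, the sketch below computes a bitwise majority over three replica outputs. The function name `majority_vote` and the example bit patterns are assumptions made for illustration; the slides do not describe a specific voter design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs (hypothetical helper).

    Each result bit is 1 exactly when at least two of the three inputs
    have a 1 in that position.
    """
    return (a & b) | (a & c) | (b & c)


# Illustrative values only: two replicas agree, one has a flipped bit.
correct = 0b10110010
faulty = correct ^ 0b00001000
voted = majority_vote(correct, faulty, correct)
```

With a single faulty replica the voted value equals the correct one, which is the protection TMR buys at the cost of running three copies of the hardware.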

Daniel Sorin slide 4
Key Observation
Computer hardware is already modular
–Improves performance
–Simplifies design and verification
Modularity exists at many levels
–Multiple processors per chip (CMP)
–Multiple thread contexts per processor
–Multiple functional units (e.g., adders) per processor
–Multiple 4-bit adders in 64-bit adder
–Multiple 1-bit adders in 4-bit adder
–Etc.
We can leverage this modularity!

Daniel Sorin slide 5
Modular Redundancy
If computer has N widgets, add extra widget(s)
Then provide:
1.Ability to detect errors
2.Ability to diagnose hard faults (that cause errors)
3.Ability to reconfigure and map in spare widget
Cost: 1 (or 2) extra widgets, i.e., 1/N (or 2/N) overhead, instead of 2*N extra widgets for TMR
Benefit: can sometimes even be better than TMR!
Simplistic example:
–For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)
–Replicating entire processor 3 times (TMR) can only tolerate one hard fault (in an adder)
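The cost and benefit comparison on this slide can be worked through with a small sketch. It assumes a deliberately simplified model: overhead is counted as extra copies relative to the unit being replicated, spares tolerate as many hard faults as there are spares, and TMR is treated as safe only while at most one replica is faulty. All function names here are hypothetical.

```python
from fractions import Fraction


def spare_overhead(n_widgets: int, n_spares: int) -> Fraction:
    """Extra hardware as a fraction of the N original widgets."""
    return Fraction(n_spares, n_widgets)


def tmr_overhead() -> Fraction:
    """TMR adds two full extra copies, i.e. 2x the original hardware."""
    return Fraction(2)


def spares_tolerate(n_spares: int, n_hard_faults: int) -> bool:
    """Simplified model: each spare can replace one faulty widget."""
    return n_hard_faults <= n_spares


def tmr_tolerates(n_faulty_replicas: int) -> bool:
    """Simplified model: the voter can out-vote one faulty replica only."""
    return n_faulty_replicas <= 1


# The slide's example: 8 adders plus 2 spare adders.
adder_overhead = spare_overhead(8, 2)
print(adder_overhead)          # 1/4
print(spares_tolerate(2, 2))   # True
print(tmr_tolerates(2))        # False
```

Note that the two overheads are measured against different bases (the adders alone versus the whole processor), so the numbers are indicative rather than a like-for-like area comparison.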

Daniel Sorin slide 6
HMR: Hierarchical Modular Redundancy
Provide modular redundancy at many levels
–Processors, adders, multipliers, etc.
Engineering issues involved in HMR
–Allocating resources
–Managing costs

Daniel Sorin slide 7
Allocating Resources
For a given hardware budget, how should it be allocated?
At which level to allocate spares?
–Better to have extra processor?
–Or extra adders in each processor?
–Or some combination of both?
How many spares at each level?
Can a spare be mapped in anywhere in system?

Daniel Sorin slide 8
Managing Costs
Costs: extra modules, wires, and multiplexers
Example: 3-bit addition, with module = 1-bit adder
[Figure: 1-bit adder modules with operand inputs A1, B1, A2, B2, A3, B3, multiplexers (mux), and outputs C1, C2, C3]
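The 3-bit example can be sketched in software as follows. This is only a behavioral model under stated assumptions: the class `SparedAdder`, the stuck-at-0 fault model, and the "skip the faulty module" mapping policy are all hypothetical stand-ins for the multiplexers in the figure, not the design from the slides.

```python
def full_adder(a: int, b: int, cin: int):
    """A healthy 1-bit adder module: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def stuck_at_zero_sum(a: int, b: int, cin: int):
    """Hypothetical hard fault: the sum output is stuck at 0."""
    _, cout = full_adder(a, b, cin)
    return 0, cout


class SparedAdder:
    """Ripple adder built from 1-bit modules plus spare modules."""

    def __init__(self, width: int = 3, spares: int = 1):
        self.width = width
        self.modules = [full_adder] * (width + spares)  # physical modules
        self.faulty = set()  # physical indices diagnosed as faulty

    def mapping(self):
        """Plays the role of the muxes: bit position -> physical module."""
        healthy = [i for i in range(len(self.modules)) if i not in self.faulty]
        if len(healthy) < self.width:
            raise RuntimeError("not enough healthy modules left")
        return healthy[: self.width]

    def add(self, a: int, b: int) -> int:
        carry, result = 0, 0
        for bit, phys in enumerate(self.mapping()):
            s, carry = self.modules[phys]((a >> bit) & 1, (b >> bit) & 1, carry)
            result |= s << bit
        return result | (carry << self.width)


adder = SparedAdder()
adder.modules[1] = stuck_at_zero_sum   # inject a hard fault in module 1
print(adder.add(3, 0))                 # 1 (wrong: bit 1 is lost)
adder.faulty.add(1)                    # reconfigure: map module 1 out
print(adder.add(3, 0))                 # 3
```

The extra module and the remapping logic correspond to the costs the slide lists: one more adder, plus the wires and multiplexers needed to steer operands around the faulty module.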

Daniel Sorin slide 9
Current Research Thrust #1
Explore modular redundancy within microprocessor
Add extra array entries
–In reorder buffer (ROB), branch history table (BHT), etc.
Add extra functional units
–Adders, multipliers, etc.
1.For error detection
–Use “DIVA” or redundant threads
2.For hard fault diagnosis
–Use threshold error counters
3.For reconfiguration
–Use extra wires and multiplexers
Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
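One way to picture the threshold error counters used for hard fault diagnosis is the sketch below. The class name `ThresholdDiagnoser`, the threshold value, and the unit names are assumptions for illustration; the slides do not give the counter width, the threshold, or any reset policy.

```python
from collections import Counter


class ThresholdDiagnoser:
    """Per-unit error counters; a unit that reaches the threshold is
    treated as having a hard fault and is mapped out (hypothetical model)."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold
        self.error_counts = Counter()
        self.mapped_out = set()

    def report_error(self, unit: str) -> bool:
        """Record one detected error attributed to `unit`.

        Returns True once the unit is diagnosed as having a hard fault.
        """
        if unit in self.mapped_out:
            return True
        self.error_counts[unit] += 1
        if self.error_counts[unit] >= self.threshold:
            self.mapped_out.add(unit)
            return True
        return False


diagnoser = ThresholdDiagnoser(threshold=3)
diagnoser.report_error("adder0")          # a single transient error
for _ in range(3):
    diagnoser.report_error("adder3")      # errors keep recurring
print(sorted(diagnoser.mapped_out))       # ['adder3']
```

The intent of the threshold is that an isolated transient error does not trigger reconfiguration, while a unit that keeps producing errors is eventually classified as permanently faulty and replaced by a spare.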

Daniel Sorin slide 10
Current Research Thrust #2
Explore modular redundancy within 64-bit adder
Start with 64-bit carry lookahead adder (CLA)
–Hierarchy of 4-bit CLA modules
Add 2 extra modules
1.Detect errors as before
2.Diagnose with counters and pattern matching
–Based on error counter values, can diagnose fault!
3.Reconfigure with clever multiplexing scheme

Daniel Sorin slide 11
Conclusions and Future Work
Hierarchical Modular Redundancy can provide high reliability at relatively low cost
Future directions
–Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.)
–Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)
–High-level: HMR for chip multiprocessors

Daniel Sorin slide 12
Acknowledgments
Several collaborators on this work
–Co-Investigator Prof. Sule Ozev (Duke ECE)
–Fred Bower (Duke CS grad and IBM)
–Mahmut Yilmaz (Duke ECE grad)
–Derek Hower (Duke ECE undergrad)
