(C) 2005 Daniel Sorin, Duke Computer Engineering. Autonomic Computing via Dynamic Self-Repair. Daniel J. Sorin, Department of Electrical & Computer Engineering.

Presentation transcript:

Slide 1: Autonomic Computing via Dynamic Self-Repair
Daniel J. Sorin
Department of Electrical & Computer Engineering
Duke University
(C) 2005 Daniel Sorin, Duke Computer Engineering

Slide 2: A Computing Challenge for NASA
NASA relies on computers
NASA is much more demanding than most users
– Must operate in harsh environments that cause hard faults
– Must operate correctly for years
– Must not require a human to repair problems
Our goal
– Designing autonomic computer systems
– Permanent faults will occur and the computer will handle them

Slide 3: But Isn’t This a Solved Problem?
We could just use TMR (triple modular redundancy)
[Figure: replicated CPUs feed a voter, which produces the output]
But too much power usage to be feasible
Especially for modern microprocessors
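As a rough illustration of the voter in a TMR arrangement, the sketch below takes a bitwise majority over three replica outputs. The function name and the example values are illustrative assumptions, not taken from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs.

    Each result bit is 1 exactly when at least two of the three inputs
    have that bit set, so a single faulty replica is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical example: one replica returns a corrupted value.
good = 0b1011
faulty = 0b0011
voted = majority_vote(good, faulty, good)
```

The voter masks the fault but does not remove it: all three replicas keep running, which is the source of the power cost the slide objects to.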

Slide 4: Key Observation
Computer hardware is already modular
– Improves performance
– Simplifies design and verification
Modularity exists at many levels
– Multiple processors per chip (CMP)
– Multiple thread contexts per processor
– Multiple functional units (e.g., adders) per processor
– Multiple 4-bit adders in 64-bit adder
– Multiple 1-bit adders in 4-bit adder
– Etc.
We can leverage this modularity!

Slide 5: Modular Redundancy
If computer has N widgets, add extra widget(s)
Then provide:
1. Ability to detect errors
2. Ability to diagnose hard faults (that cause errors)
3. Ability to reconfigure and map in spare widget
Cost: 1 or 2 extra widgets (1/N or 2/N overhead) instead of 2*N extra widgets (200% overhead) for TMR
Benefit: can sometimes even be better than TMR!
Simplistic example:
– For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)
– Replicating entire processor 3 times (TMR) can only tolerate one hard fault (in an adder)
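The cost comparison on this slide can be checked with a few lines of arithmetic. The sketch below counts only replicated modules and ignores the wires, multiplexers, voters, and detection hardware that a real design also pays for. The function names are illustrative assumptions.

```python
def spare_overhead(n_modules: int, n_spares: int) -> float:
    """Fraction of extra modules when n_spares are added to n_modules."""
    return n_spares / n_modules


def tmr_overhead() -> float:
    """TMR adds two full extra copies of every module: 200% overhead."""
    return 2.0


# The slide's example: 8 adders plus 2 spare adders.
adder_overhead = spare_overhead(8, 2)  # fraction of extra adders
```

With 8 adders and 2 spares, the extra adder hardware is a quarter of the original, compared with twice the original for TMR. The spares also cover two adder faults rather than one.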

Slide 6: HMR: Hierarchical Modular Redundancy
Provide modular redundancy at many levels
– Processors, adders, multipliers, etc.
Engineering issues involved in HMR
– Allocating resources
– Managing costs

Slide 7: Allocating Resources
For given hardware budget, how to allocate it
Which level to allocate spares?
– Better to have extra processor?
– Or extra adders in each processor?
– Or some combination of both?
How many spares at each level?
Can a spare be mapped in anywhere in system?

Slide 8: Managing Costs
Costs: extra modules, wires, and multiplexers
Example: 3-bit addition, with module = 1-bit adder
[Figure: 1-bit adder modules with inputs A1/B1, A2/B2, A3/B3 and outputs C1, C2, C3, connected through multiplexers]
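To make the multiplexing cost concrete, here is a minimal behavioral sketch of a 3-bit adder built from four 1-bit adder modules (three plus one spare). It assumes a stuck-at-0 fault model and a shift-past-the-faulty-module mapping. Both are illustrative choices, and the slides do not specify the actual multiplexer scheme. The `mapping` list plays the role of the multiplexer select signals.

```python
def full_adder(a, b, cin):
    """Healthy 1-bit adder module: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def make_module(faulty=False):
    """Return a module; a faulty one is modeled as stuck-at-0 (assumption)."""
    if faulty:
        return lambda a, b, cin: (0, 0)
    return full_adder


def build_mapping(n_bits, faulty_ids, n_modules):
    """Assign each logical bit position to a healthy physical module."""
    healthy = [m for m in range(n_modules) if m not in faulty_ids]
    if len(healthy) < n_bits:
        raise ValueError("not enough healthy modules")
    return healthy[:n_bits]


def add(a, b, modules, mapping):
    """Ripple-carry add, routing bit i through modules[mapping[i]]."""
    carry = 0
    result = 0
    for bit, module_id in enumerate(mapping):
        s, carry = modules[module_id]((a >> bit) & 1, (b >> bit) & 1, carry)
        result |= s << bit
    return result | (carry << len(mapping))


# Four physical modules; module 1 has a hard fault.
modules = [make_module(), make_module(faulty=True), make_module(), make_module()]
repaired = build_mapping(3, faulty_ids={1}, n_modules=4)  # [0, 2, 3]
```

Calling `add(5, 3, modules, repaired)` routes around module 1 and gives the correct sum, while the unrepaired mapping `[0, 1, 2]` does not. In hardware, every entry in `mapping` corresponds to multiplexers on the module's inputs, outputs, and carry chain, which is the overhead the slide is pointing at.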

Slide 9: Current Research Thrust #1
Explore modular redundancy within microprocessor
Add extra array entries
– In reorder buffer (ROB), branch history table (BHT), etc.
Add extra functional units
– Adders, multipliers, etc.
1. For error detection
– Use “DIVA” or redundant threads
2. For hard fault diagnosis
– Use threshold error counters
3. For reconfiguration
– Use extra wires and multiplexers
Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
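The threshold error counters mentioned for hard fault diagnosis can be sketched as follows. This is a simplified model under stated assumptions: each detected error is charged to every unit the erroneous instruction used, and a unit is declared hard-faulty once its count reaches a fixed threshold. The class name, unit names, and threshold value are hypothetical, and the sketch omits any counter reset or decay policy.

```python
from collections import Counter


class FaultDiagnoser:
    """Per-unit error counters with a fixed diagnosis threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.disabled = set()

    def record_error(self, units_used):
        """Charge one detected error to each unit the instruction used.

        Returns the units newly diagnosed as hard-faulty by this error.
        """
        newly_diagnosed = []
        for unit in units_used:
            if unit in self.disabled:
                continue
            self.counts[unit] += 1
            if self.counts[unit] >= self.threshold:
                self.disabled.add(unit)
                newly_diagnosed.append(unit)
        return newly_diagnosed


# Hypothetical error stream: adder0 is involved in every error.
diag = FaultDiagnoser(threshold=3)
first = diag.record_error(["adder0", "rob5"])
second = diag.record_error(["adder0", "rob2"])
third = diag.record_error(["adder0", "rob7"])
```

A transient error bumps several counters once, but only a unit with a hard fault keeps appearing in erroneous instructions, so only its counter crosses the threshold. Here the third error diagnoses `adder0`, which would then be mapped out in favor of a spare.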

Slide 10: Current Research Thrust #2
Explore modular redundancy within 64-bit adder
Start with 64-bit carry lookahead adder (CLA)
– Hierarchy of 4-bit CLA modules
Add 2 extra modules
1. Detect errors as before
2. Diagnose with counters and pattern matching
– Based on error counter values, can diagnose fault!
3. Reconfigure with clever multiplexing scheme
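For readers unfamiliar with the starting point, the sketch below models a 64-bit adder assembled from sixteen 4-bit carry-lookahead modules. It is a behavioral model only: block carries are combined one after another in software, whereas a hardware CLA hierarchy computes them in parallel with further lookahead levels. It does not model the two spare modules or the multiplexing scheme, which the slides do not describe in detail. The function names are illustrative.

```python
def cla4(a, b, cin):
    """4-bit carry-lookahead module.

    Returns (sum, group_generate, group_propagate) for 4-bit inputs a and b.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # bit propagate
    c0 = cin
    # Each carry is expressed directly in terms of g, p and cin (lookahead).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[0] & p[1] & p[2] & p[3]
    carries = (c0, c1, c2, c3)
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i
    return total, group_g, group_p


def add64(a, b, cin=0):
    """64-bit add from sixteen cla4 modules; returns (sum, carry_out)."""
    result = 0
    carry = cin
    for k in range(16):
        nibble_a = (a >> (4 * k)) & 0xF
        nibble_b = (b >> (4 * k)) & 0xF
        s, gg, gp = cla4(nibble_a, nibble_b, carry)
        result |= s << (4 * k)
        carry = gg | (gp & carry)  # block carry-out from group signals
    return result, carry
```

Because all sixteen modules are identical, any one of them could in principle be replaced by a spare of the same design. That uniformity is what makes this level of the hierarchy a candidate for modular redundancy.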

Slide 11: Conclusions and Future Work
Hierarchical Modular Redundancy can provide high reliability at relatively low cost
Future directions
– Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.)
– Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)
– High-level: HMR for chip multiprocessors

Slide 12: Acknowledgments
Several collaborators on this work
– Co-Investigator Prof. Sule Ozev (Duke ECE)
– Fred Bower (Duke CS grad and IBM)
– Mahmut Yilmaz (Duke ECE grad)
– Derek Hower (Duke ECE undergrad)