Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Ayse K. Coskun
CSE 237A - Project
06.12.2004
Project Outline
- Motivation
- Hardware Fault Tolerance Techniques
- Fault Tolerance Design Flow
- Example Circuitry for Implementation
- Redundancy
  - Fault Masking, Error Detection, Diagnosis
- Reconfiguration
  - Eliminating faulty blocks
- Results & Discussion
Motivation
- Fault resilience is required at certain levels in each circuit
- High fault rates in VDSM & nanoscale devices
- Fault masking
  - Transient (temporary) errors, single event upsets
  - High clock rates, fast propagation of faults
- Further solutions needed for:
  - Eliminating defects to increase manufacturing yield
  - Eliminating permanent faults (run-time)
  - Increasing device life-time
HW Assisted Fault Tolerance
Pros
- Fast detection and recovery
- Transparent to the user
- No other way currently available to build circuits with high reliability
Cons
- Area overhead
- Possible timing overhead, a concern for hard real-time applications
- Considerable design time & effort
Goals
- Study fault tolerance techniques and design flow
- Implement a redundancy-based circuit
  - Fault masking
  - Error detection and diagnosis
  - Recovery
- Practice reconfiguration on the circuit
  - Mark off faulty blocks and reconfigure (off-line)
  - Dynamic reconfiguration (canceled)
Fault Tolerance Design Flow
Example Circuitry for Implementation (VHDL)
Redundancy
- Triple Modular Redundancy (TMR)
  - Fast fault masking & recovery for hard real-time and safety-critical applications
  - Place voters at the outputs of every clocked block (no voting scheme for combinational circuits)
- Masking ability: faults are masked when
  - Only one copy is faulty
  - More than one copy is faulty but errors are at different register locations
- Duplication can detect errors but cannot mask them
- Diagnosis and recovery
  - Additional circuitry added for diagnosis and recovery
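The masking conditions above can be illustrated with a bitwise 2-of-3 majority voter. This is a minimal behavioral sketch in Python, not the VHDL used in the project; the function name `majority_vote`, the 8-bit register value, and the flipped bit positions are assumptions chosen for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical fault-free register value.
golden = 0b1011_0010

# Case 1: only one copy is faulty (bit 2 flipped) -> masked.
copy_a = golden ^ 0b0000_0100
assert majority_vote(copy_a, golden, golden) == golden

# Case 2: two copies are faulty, but at different bit positions -> masked.
copy_b = golden ^ 0b0100_0000
assert majority_vote(copy_a, copy_b, golden) == golden

# Case 3: two copies are faulty at the same bit position -> not masked.
copy_c = golden ^ 0b0000_0100
assert majority_vote(copy_a, copy_c, golden) != golden
```

Because the vote is taken per bit, two faulty copies are still outvoted as long as their errors do not fall on the same register location, which matches the second masking condition.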
TMR with Roll-Forward Recovery
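The figure for this slide is not available, so the following is only a behavioral sketch of one common reading of roll-forward recovery: each cycle the three register copies are voted, and every copy is reloaded from the voted value, so a corrupted copy is repaired without rolling back to a checkpoint. The class `TmrRegister`, the counter used as the next-state function, and the injected bit flip are all assumptions, not the project's actual circuit.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Three copies of one register, voted and refreshed every clock cycle."""

    def __init__(self, reset_value: int = 0):
        self.copies = [reset_value] * 3

    def clock(self, next_state):
        """Advance one cycle and return the voted (masked) output."""
        voted = majority_vote(*self.copies)
        # Roll-forward: all copies continue from the voted value,
        # which overwrites whatever a faulty copy was holding.
        self.copies = [next_state(voted)] * 3
        return voted


def increment(x: int) -> int:
    """Hypothetical next-state logic: an 8-bit counter."""
    return (x + 1) & 0xFF


reg = TmrRegister(0)
for _ in range(3):
    reg.clock(increment)          # copies are now [3, 3, 3]

reg.copies[1] ^= 0x10             # transient fault in the second copy
out = reg.clock(increment)        # fault is masked and the copy is repaired

assert out == 3
assert reg.copies == [4, 4, 4]
```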
Fault Insertion & Diagnosis
- Adding MUXes at several points to force lines to faulty values
- ModelSim verification
- Diagnosis:
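MUX-based fault insertion can be sketched behaviorally as follows. The slide does not spell out the diagnosis logic, so this sketch assumes the simplest scheme: a copy is flagged as faulty when it disagrees with the voted value. The names `fault_mux` and `diagnose` and the test values are hypothetical.

```python
def fault_mux(value: int, inject: bool, faulty_value: int) -> int:
    """2-to-1 MUX: pass the real line value, or force a faulty one."""
    return faulty_value if inject else value


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


def diagnose(copies):
    """Return the indices of copies that disagree with the voted value."""
    voted = majority_vote(*copies)
    return [i for i, copy in enumerate(copies) if copy != voted]


good = 0x5A
# Force the second copy to 0x00 through its fault-insertion MUX.
copies = [good, fault_mux(good, True, 0x00), good]

assert diagnose(copies) == [1]         # second copy is identified as faulty
assert diagnose([good] * 3) == []      # no fault inserted, nothing flagged
```

With the `inject` select line held low the MUX is transparent, so the same design can be simulated with and without faults.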
Xilinx RTL Schematic - Top level
Controller - RTL schematic
Reconfiguration
- Dynamic reconfiguration:
  - Needs an interface to load different configurations onto the chip online
  - Canceled because of complexity
- Off-line reconfiguration:
  - Xilinx Area Constraints Editor
  - Edit the *.ucf file
  - AREA_GROUP: includes selected instances of the circuit
      INST "instance" AREA_GROUP="GROUP1";
      ...
      AREA_GROUP "GROUP1" RANGE=SLICE_X0Y0:SLICE_X7Y35;
      AREA_GROUP "GROUP1" GROUP=CLOSED;
      AREA_GROUP "GROUP1" PLACE=OPEN;
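When a faulty block has to be excluded, the area constraints must be rewritten with a new slice range. A small helper can generate them instead of editing by hand. This is a sketch only: it emits just the `INST ... AREA_GROUP` and `RANGE` lines in the form shown above, and the function name `area_group_ucf` and the instance names are hypothetical.

```python
def area_group_ucf(group, instances, x0, y0, x1, y1):
    """Build UCF lines that bind instances to a slice range."""
    lines = [f'INST "{inst}" AREA_GROUP="{group}";' for inst in instances]
    lines.append(
        f'AREA_GROUP "{group}" RANGE=SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1};'
    )
    return "\n".join(lines)


# Hypothetical instance names; the range matches the example on the slide.
ucf = area_group_ucf("GROUP1", ["tmr_copy_a", "voter_a"], 0, 0, 7, 35)
print(ucf)
```

Calling the helper again with a different range (for example columns X8 to X15, an assumed fault-free region) gives the constraints for the relocated block.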
Reconfiguration cont’d
Reconfiguration cont’d
- Common reconfiguration approaches:
  - Tile based
  - Column based
  - Hierarchical (column & row) based
- Xilinx design flow for reconfiguration:
  VHDL -> Synthesis -> Translate -> Map -> Place & Route -> Edit floorplan / *.ucf file -> back to Translate ...
Before & After Reconfiguration
Evaluation and Discussion
- TMR with roll-forward recovery provides effective fault masking and recovery
- Column-based reconfiguration does not add significant area overhead to the TMR circuit
- Fault-tolerant design takes considerable design time and effort
  - Calls for development of an automated FT design flow
- Fault diagnosis is also a bottleneck for large-scale circuits
Summary
- Fault tolerance design flow
- Redundancy methods: fault masking, fault detection, recovery
  - TMR roll-forward implementation
- Reconfiguration: dynamic / off-line
  - Off-line column-based implementation