Fault Tolerance in Reconfigurable Computing / FPGAs Bayram Kurumahmut CMPE 516 MS Computer Engineering Bogazici University
Outline Introduction Modify Configurable Logic Block (CLB) Dynamic Serial Testing Built-In Self Healing (BISH) Hardware Voter Configurable Fault Tolerant Processor (CFTP) Self-Checking Logic Design (SCLD) CLB Functional Testing
Introduction Configurable Logic Block (CLB) Interconnect Wires Interconnect Switches Configured by SRAM contents Configuration SRAM
Modify CLB [4] Consider faults only in CLB Shift configuration data –Load only one configuration for test, since loading is a very slow process –Shift this configuration for the next tests Do not change physical design of running application No intervention at hardware level –Faster –Better results in test diagnosis and defect/fault tolerance
Modify CLB [4] (Cont’d) SRAM –Assume this is fault-free –Holds configuration data –Modify this to enable shifting of the configuration Adding a multiplexer –Decides shifting direction Shifting to east/west/north/south
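A minimal sketch of the shifting idea, assuming the configuration SRAM is viewed as a 2D grid with one entry per CLB and that the added multiplexer selects one of four neighbours as the source. The function name `shift_configuration`, the grid model, the row/column direction convention and the `None` fill for vacated cells are illustrative assumptions, not details from [4].

```python
# Hypothetical model: grid[r][c] holds the configuration of the CLB at row r, column c.
# Assumed convention: "north" decreases the row index, "east" increases the column index.
DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def shift_configuration(grid, direction, fill=None):
    """Return a new grid with every CLB configuration moved one cell in `direction`.

    Configurations shifted past the array edge are dropped; vacated cells get `fill`.
    """
    rows, cols = len(grid), len(grid[0])
    d_row, d_col = DIRECTIONS[direction]
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            new_r, new_c = r + d_row, c + d_col
            if 0 <= new_r < rows and 0 <= new_c < cols:
                shifted[new_r][new_c] = grid[r][c]
    return shifted

# One test configuration "T" is loaded once, then moved to a neighbouring CLB.
loaded = [["T", None], [None, None]]
after_east = shift_configuration(loaded, "east")
```

The point of the sketch is that the test configuration is loaded once and then reaches the other CLBs by shifting rather than by further downloads.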
Modify CLB [4] (Cont’d) Hardware overhead –Calculate additional transistor count –Calculate device transistor count –Compare them
Dynamic Serial vs Parallel [5] Reduces test configuration time Requires fewer I/O pins Faster and easier
Dynamic Serial vs Parallel [5] (Cont’d) Consider unprogrammed FPGAs to test –No specific user-designed application configuration –Consider all configurations Generate and download configurations –Time consuming Reduce the number of configurations Find test patterns
Dynamic Serial Test [5] (Cont’d) Function unit –Multiplexers and one D-type flip-flop –Test pattern requirements for multiplexers Detect their stuck-on/off faults Detect stuck-at faults on all their I/O nets Detect bridging faults between data inputs
Dynamic Serial Test [5] (Cont’d) 11 test configurations (TCs) per function unit Provide an efficient way to test many function units in a short time –11 TCs × 4096 = 45,056 TCs for the XC6216 –Apply parallel testing after this step
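The product on this slide can be checked with a few lines. This sketch reads the slide as counting one set of 11 TCs for each of the 4096 function units, and assumes the XC6216 array is 64 × 64 cells; the function name `one_by_one_tc_count` is illustrative.

```python
TCS_PER_FUNCTION_UNIT = 11   # from the slide
ROWS, COLS = 64, 64          # assumed XC6216 array size, 64 * 64 = 4096 cells

def one_by_one_tc_count(function_units, tcs_per_unit=TCS_PER_FUNCTION_UNIT):
    """Test configurations needed if every function unit gets its own set of TCs."""
    return function_units * tcs_per_unit

print(one_by_one_tc_count(ROWS * COLS))  # 45056
```

A count of this size is what motivates testing many function units under a shared configuration.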
Dynamic Serial Test [5] (Cont’d) Direct Parallel Testing –Test row or column cells at the same time –TC count increases with FPGA size, 11 TCs per test unit –Not so efficient Two-Phase Parallel Testing –Reed-Muller Propagation Chain (RMPC) –22 TCs per test unit, constant –Locates a single faulty function unit with 4 TCs
Dynamic Serial Test [5] (Cont’d) Proposed Method –Link all function units into a chain –Test chain integrity in bypass mode –Test a function unit with its 11 TCs and corresponding test patterns (TPs) –Return to bypass mode –Repeat for the next function unit
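The proposed procedure can be sketched as a small simulation. Everything here is a simplification for illustration: the `FunctionUnit` class, the single-bit data path, the two stand-in configurations (instead of the 11 real TCs) and the fault model (a faulty unit inverts its output when not bypassed) are assumptions, not the design from [5].

```python
from dataclasses import dataclass

@dataclass
class FunctionUnit:
    faulty: bool = False   # hypothetical fault model: output inverted when under test
    bypass: bool = True    # in bypass mode the unit forwards its input unchanged

    def step(self, bit, config):
        if self.bypass:
            return bit
        out = config(bit)
        return out ^ 1 if self.faulty else out

def identity(bit):
    return bit

def invert(bit):
    return bit ^ 1

CONFIGS = [identity, invert]   # stand-ins for the 11 test configurations
PATTERNS = [0, 1]              # stand-ins for the test patterns

def run_chain(chain, bit, config):
    """Push one bit through the whole chain and return what leaves the last unit."""
    for unit in chain:
        bit = unit.step(bit, config)
    return bit

def serial_test(chain, configs=CONFIGS, patterns=PATTERNS):
    """Return the index of the first faulty unit, or None if all units pass."""
    # Step 1: check chain integrity with every unit in bypass mode.
    for unit in chain:
        unit.bypass = True
    for pattern in patterns:
        if run_chain(chain, pattern, identity) != pattern:
            raise RuntimeError("chain integrity check failed in bypass mode")
    # Steps 2 to 5: test one unit at a time while all the others stay bypassed.
    for index, unit in enumerate(chain):
        unit.bypass = False
        failed = any(
            run_chain(chain, pattern, config) != config(pattern)
            for config in configs
            for pattern in patterns
        )
        unit.bypass = True   # return to bypass mode before moving on
        if failed:
            return index
    return None

chain = [FunctionUnit() for _ in range(8)]
chain[5].faulty = True
print(serial_test(chain))  # 5
```

Because only one unit is out of bypass mode at any time, a failing response identifies the faulty unit directly, without extra configurations for fault location.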
Dynamic Serial Test [5] (Cont’d) Compare with parallel testing –Requires fewer TCs 13 TCs, not 22 TCs –Locates faults without additional TCs –Uses fewer I/O pins Simplifies test observation
Dynamic Serial Test [5] (Cont’d) Disadvantage –Propagation path length Depends on array size –Integrate with parallel approach for large arrays Additional i/o pins
Built-In Self Healing (BISH) [8] Run time self configuration Implement a soft-processor –Manage and execute all procedures Fault detection/location/repair Modular redundancy for assurance of working correctly
BISH - Submicron technology problems [8] Single event upsets (SEU) –Radiation-induced transient errors caused by neutrons from cosmic rays –Alpha particles from packaging material –Do not physically damage the chip –Change memory cell values Incorrect data Improper instructions for the processor Increased threat of electromigration –Physical damage to chip
BISH - Tasks [8] Detection –Scan chain Regularly capture net values Analyze them in the soft-processor Diagnosis, Repair –Also controlled by the soft-processor Applied only for SEUs
BISH - Fault Causes [8] SEU changing a circuit register value –Possibly a transient error –Invalid in next capture after register update SEU changing a configuration memory cell –Wrong functionality assignment on FPGA –Read back configuration –CRC check –Partial reconfiguration if an incoherency exists Permanent physical defect on FPGA –Mark this defective area
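The readback, CRC check and partial reconfiguration sequence can be sketched as follows. This is only an illustration: the frame-based memory model, the function names and the use of CRC-32 from Python's `zlib` are assumptions, and real devices expose readback and partial reconfiguration through vendor-specific interfaces that are not modelled here.

```python
import zlib

def find_corrupt_frames(readback_frames, golden_crcs):
    """Compare the CRC of every frame read back against its stored reference CRC."""
    return [
        index
        for index, frame in enumerate(readback_frames)
        if zlib.crc32(frame) != golden_crcs[index]
    ]

def scrub(config_memory, golden_frames):
    """Rewrite only the frames whose CRC does not match (stand-in for partial reconfiguration)."""
    golden_crcs = [zlib.crc32(frame) for frame in golden_frames]
    corrupt = find_corrupt_frames(config_memory, golden_crcs)
    for index in corrupt:
        config_memory[index] = golden_frames[index]
    return corrupt

# Hypothetical 4-frame configuration with one bit flipped in frame 2 by an SEU.
golden = [bytes([i] * 4) for i in range(4)]
memory = list(golden)
memory[2] = bytes([2, 2, 2, 3])
print(scrub(memory, golden))  # [2]
```

Only the frame that fails the check is rewritten, which is why the rest of the design can keep running during repair.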
Hardware Voter [6] Detect and correct single errors on inputs Bypass double errors in X1, X2, X3 by substituting erroneous data with the spare input, X4 Congruency level of accepted SEs Unrecoverable error signal
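A behavioural sketch of a voter with a spare input, written from the bullets above rather than from the circuit in [6]. The decision rule (majority among X1 to X3 first, then fall back to X4 when no two primary inputs agree) and the function name `vote_with_spare` are assumptions for illustration.

```python
def vote_with_spare(x1, x2, x3, x4):
    """Return (value, unrecoverable).

    Single error: two of the three primary inputs still agree, so their value wins.
    Double error: no two primary inputs agree, so the spare x4 is compared
    against them and accepted if it matches one of them.
    """
    if x1 == x2 or x1 == x3:
        return x1, False
    if x2 == x3:
        return x2, False
    if x4 in (x1, x2, x3):
        return x4, False
    return None, True   # raise the unrecoverable error signal

print(vote_with_spare(5, 9, 5, 0))  # (5, False): single error corrected
print(vote_with_spare(5, 9, 7, 5))  # (5, False): double error bypassed via the spare
print(vote_with_spare(1, 2, 3, 4))  # (None, True): unrecoverable
```

Under this rule, two erroneous inputs that happen to carry the same wrong value would win the majority, so the sketch only covers double errors that differ from each other.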
Configurable Fault Tolerant Processor (CFTP) [2] Applied for spacecraft onboard processing Triple Modular Redundancy (TMR) for soft processor on FPGA –Mitigate bit errors in computation by detecting and correcting them using voting logic –On orbit updates, reconfigurations, modifications Detect SEU-induced configuration faults
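The voting logic behind TMR can be shown with a bitwise majority function. This is a generic sketch of majority voting, not the CFTP voter itself; the function names and the 4-bit example words are illustrative.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)

def disagreeing_modules(a, b, c):
    """Indices (0, 1, 2) of modules whose output differs from the voted result."""
    voted = tmr_vote(a, b, c)
    return [index for index, value in enumerate((a, b, c)) if value != voted]

# Module 2 suffers a bit flip in bit 2; the other two modules outvote it.
print(bin(tmr_vote(0b1010, 0b1010, 0b1110)))          # 0b1010
print(disagreeing_modules(0b1010, 0b1010, 0b1110))    # [2]
```

Reporting which module disagreed is one way to flag a module for reconfiguration when the mismatch persists, for example after an SEU in its configuration memory.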
Self-Checking Logic Design (SCLD) [3] Map boolean functions into FPGA Functional cell Generate complementary outputs Checker cell –Verify correctness of final outputs Fault: same value at outputs Increase number of CLBs used but incorporate self-checking or testability features
SCLD – Fault Types [3] Single stuck-at faults in RAM cells Single stuck-at faults on any line of a CLB Functional faults in any multiplexer within a single CLB Functional faults in any D-Type Flip Flop within a single CLB Single stuck-at faults in any pass transistor connecting CLBs
SCLD [3] k-feasible –4 inputs for functional cells 4-feasible Boolean functions required If not, decompose the Boolean function before mapping it onto the FPGA
SCLD – Algorithm [3] Decompose a sum-of-products expression into 4-feasible expressions. Choose the expression with the minimum number of nodes. Map each expression directly into a 4-input function cell. Connect the outputs of a pair of intermediate function cells to the inputs of a checker cell, and generate the equations for each output of the checker cell. Cascade the checker cells to form a checker tree. The outputs of the function cell at the last stage are the circuit outputs.
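The checker cell and checker tree steps can be illustrated with the standard two-rail checker. The checker equations of [3] are not reproduced on these slides, so the textbook two-rail equations are used here as a stand-in; the function names are illustrative. Each function cell is assumed to deliver a pair (f, f') that is complementary when fault-free.

```python
from functools import reduce
from itertools import product

def two_rail_checker(x, y):
    """Combine two output pairs; the result is complementary only if both inputs are."""
    x0, x1 = x
    y0, y1 = y
    return ((x0 & y0) | (x1 & y1), (x0 & y1) | (x1 & y0))

def is_valid(pair):
    """A pair signals 'no fault' when its two rails differ."""
    return pair[0] != pair[1]

def checker_tree(pairs):
    """Cascade checker cells over all function-cell output pairs."""
    return reduce(two_rail_checker, pairs)

# Exhaustive check of the checker property over all 16 input combinations.
rails = list(product((0, 1), repeat=2))
assert all(
    is_valid(two_rail_checker(x, y)) == (is_valid(x) and is_valid(y))
    for x in rails
    for y in rails
)

print(is_valid(checker_tree([(0, 1), (1, 0), (0, 1)])))  # True
print(is_valid(checker_tree([(0, 1), (1, 1), (0, 1)])))  # False: one cell shows equal outputs
```

A single pair with equal values anywhere in the cascade forces equal values at the final checker output, which matches the fault indication described for the checker cell.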
SCLD – Example [3]
SCLD – Implementation [3]
CLB Functional Testing [1] Gate-level testing not required Use CLB functional property –AND, OR gate or any Boolean expression Additional hardware to apply test –Multiplexer –Example for a 2-input CLB
CLB Functional Testing - Redundant Faults [1] CLB function = AND gate –Sa0 on first data input of a multiplexer –Sa0 on second data input of a multiplexer –Sa0 on third data input of a multiplexer –Sa1 on fourth data input of a multiplexer CLB function = OR gate –Sa0 on first data input of a multiplexer –Sa1 on second data input of a multiplexer –Sa1 on third data input of a multiplexer –Sa1 on fourth data input of a multiplexer
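The redundant faults listed above can be reproduced by enumeration. The sketch models a 2-input CLB as a 4:1 multiplexer whose data inputs hold the truth table and whose select lines are the two CLB inputs; this model, the data-input ordering (first input selected by 00, fourth by 11) and the function names are assumptions for illustration.

```python
from itertools import product

def clb_output(truth_table, a, b, fault=None):
    """4:1 multiplexer model: (a, b) select one of the four data inputs.

    fault is an optional (data_input_index, stuck_value) pair.
    """
    data = list(truth_table)
    if fault is not None:
        index, stuck_value = fault
        data[index] = stuck_value
    return data[(a << 1) | b]

def redundant_faults(truth_table):
    """Stuck-at faults on data inputs that no input combination can expose."""
    inputs = list(product((0, 1), repeat=2))
    return [
        (index, stuck)
        for index, stuck in product(range(4), (0, 1))
        if all(
            clb_output(truth_table, a, b) == clb_output(truth_table, a, b, (index, stuck))
            for a, b in inputs
        )
    ]

AND_GATE = (0, 0, 0, 1)
OR_GATE = (0, 1, 1, 1)
print(redundant_faults(AND_GATE))  # [(0, 0), (1, 0), (2, 0), (3, 1)]
print(redundant_faults(OR_GATE))   # [(0, 0), (1, 1), (2, 1), (3, 1)]
```

In this model a stuck-at fault is redundant exactly when the stuck value equals the configured truth-table bit, so each configured function masks four of the eight data-input faults, and testing under a single function cannot cover them.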
CLB Functional Testing [1] Exhaustive testing applied Long test length but high fault coverage –99.81%, compared with 87.90% for gate-level testing
Conclusion Dynamic reconfigurable environments –Use flexible testing of circuits –Repair errors by partial reconfiguration Do not disturb normal operation when part of the hardware is defective –Design your processor on them to provide self-test of the circuit
References
[1] Testing of FPGA Logic Cells, E. Bareisa, V. Jusas, K. Motiejunas, R. Seinauskas, Elektronika ir Elektrotechnika, 2004
[2] Configurable Fault-Tolerant Processor (CFTP) for Spacecraft Onboard Processing, Charles A. Hulme, Herschel H. Loomis, Alan A. Ross, Rong Yuan, IEEE Aerospace Conference Proceedings, 2004
[3] Self-Checking Logic Design for FPGA Implementation, Parag K. Lala, Alfred L. Burress, IEEE Transactions on Instrumentation and Measurement, 2003
[4] FPGAs and Fault Tolerance, Abderrahim Doumar, Hideo Ito, The 13th International Conference on Microelectronics, 2001
[5] Fault Detection and Location of Dynamic Reconfigurable FPGAs, Chi-Feng Wu, Cheng-Wen Wu
[6] FPGA Implementation of Hardware Voter, Milos D. Krstic, Mile K. Stojcev, TELSIKS 2001, IEEE
[7] Testing the Configurability of Dynamic FPGAs, N. Park, S. J. Ruiwale, F. Lombardi, IEEE, 2000
[8] A Self-Healing Real-Time System Based on Run-Time Self-Reconfiguration, Manuel G. Gericota, Gustavo R. Alves, Jose M. Ferreira, IEEE, 2005
[9] Testing Approach within FPGA-based Fault Tolerant Systems, Abderrahim Doumar, Hideo Ito, IEEE, 2000