Using Software Rules To Enhance FPGA Reliability Chandru Mirchandani Lockheed-Martin September 7-9, 2005 P226-W/MAPLD2005 MIRCHANDANI 1
FPGA Fault Tolerance
- Historically realized through triple redundancy, error-correcting codes and replicated elements
- The fault tolerance process is only as good as the tests run to validate its performance, e.g.
  - Invalid data is not ignored because of an inherent fault in the lookup-and-compare sequence
  - The testing was not rigorous enough
  - The testing was not complete
- Lack of real estate and logic on the device precludes the ideal solution
  - Make educated judgment calls on how much is acceptable and for how long
Reconfiguring FPGAs
- Replicated circuitry or triple redundancy can be implemented on different devices or on the same device
- Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance because of routing
- Same-device replication could be used to one's advantage if only subsets of the circuit were replicated
- Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by the design
Types of Errors
- Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
  - Transient error: the system recovers from corrupt data and resumes normal operation
  - Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
- For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
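The first-detection rule above can be illustrated with a minimal Python sketch. It is not Yu and McCluskey's implementation: the class name `CedErrorPolicy`, the idea of tracking errors per resource identifier, and the string labels are all assumptions made for illustration.

```python
class CedErrorPolicy:
    """Hypothetical bookkeeping for the transient/permanent rule."""

    def __init__(self):
        # Resources for which an error has already been detected once (assumed granularity).
        self._seen = set()

    def classify(self, resource_id):
        """Return 'transient' on the first detection, 'permanent' on any repeat."""
        if resource_id in self._seen:
            # Repeat detection: fault diagnosis and reconfiguration would start here.
            return "permanent"
        self._seen.add(resource_id)
        # First detection: recover from the corrupt data and resume normal operation.
        return "transient"


policy = CedErrorPolicy()
first = policy.classify("clb_12")   # hypothetical resource name
second = policy.classify("clb_12")
```

In this sketch the first call is classified as transient and the second as permanent; a real design would make this decision in the FPGA logic itself.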
Software Reliability
- Develop Criteria for Design Objective Acceptance
- Prioritize tasks or functions in order of criticality
- Develop metrics to measure performance of tasks with respect to constraints
- Evaluate design options based on measured reliability metrics
Typical Software Options
- Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure
[Figure: Processor 1 with Application A1 (I-ary); Processor 2 with Application A1 (II-ary)]
Redundant Instances of Software
- Initially, detect, contain and recover from faults as soon as possible
- If this is not possible, allow control to be passed to the redundant instance within the reliability and availability requirements levied on the system
- Finally, include language-defined mechanisms to detect and prevent the propagation of errors
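The hand-over to a redundant instance might look like the following Python sketch. It only illustrates the idea: `run_with_failover`, `primary`, `secondary` and `is_valid` are hypothetical names, and Python exceptions stand in for whatever language-defined mechanism the actual design language provides.

```python
def run_with_failover(primary, secondary, request, is_valid=lambda result: True):
    """Try the primary instance; pass control to the redundant one on any fault."""
    try:
        result = primary(request)
        # Detect an invalid result early so that the error does not propagate.
        if not is_valid(result):
            raise ValueError("primary instance produced an invalid result")
        return result
    except Exception:
        # Contain the fault and hand control to the redundant instance.
        return secondary(request)
```

A caller would supply two implementations of the same function, for example `run_with_failover(app_a1_primary, app_a1_secondary, data)`, where both names are placeholders.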
Methodology
- Estimate the reliability based on instruction set and operational usage
- Re-design critical elements to decrease risk
- Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
- Re-evaluate the reliability based on failure rate
- Factor in the uncertainty in evaluation
Task Times

Task             | Class | Steps | Step Time (s_task) | Task Time   | Total Tasks Time (t_task)
-----------------|-------|-------|--------------------|-------------|---------------------------
Reading          | r     | x_ri  | s_r                | s_r · x_ri   | (s_r · x_ri) · n_r = t_r
Parsing          | p     | x_pi  | s_p                | s_p · x_pi   | (s_p · x_pi) · n_p = t_p
Pre-processing   | p1    | x_p1i | s_p1               | s_p1 · x_p1i | (s_p1 · x_p1i) · n_p1 = t_p1
Monitoring       | M     | x_Mi  | s_M                | s_M · x_Mi   | (s_M · x_Mi) · n_M = t_M
Sorting          | s     | x_si  | s_s                | s_s · x_si   | (s_s · x_si) · n_s = t_s
Processing       | P     | x_Pi  | s_P                | s_P · x_Pi   | (s_P · x_Pi) · n_P = t_P
Post-processing  | p2    | x_p2i | s_p2               | s_p2 · x_p2i | (s_p2 · x_p2i) · n_p2 = t_p2
Status-gathering | S     | x_Si  | s_S                | s_S · x_Si   | (s_S · x_Si) · n_S = t_S
Writing          | w     | x_wi  | s_w                | s_w · x_wi   | (s_w · x_wi) · n_w = t_w
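As a worked illustration of the table's last two columns, the Python sketch below computes a task time and a total tasks time. The slide does not define n; it is read here as the number of tasks of that class, which is an assumption, and the numeric values are invented purely for the example.

```python
def task_time(step_time, steps):
    """Task time: step time multiplied by the number of steps (s · x)."""
    return step_time * steps


def total_tasks_time(step_time, steps, count):
    """Total tasks time: (s · x) · n, with n assumed to be the number of tasks."""
    return task_time(step_time, steps) * count


# Hypothetical Reading task: 4 steps of 0.5 s each, 3 tasks.
t_r = total_tasks_time(step_time=0.5, steps=4, count=3)
```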
FPGA System - Conceptual
- Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks
- Each Task is a subsystem
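Task Reliability Block Diagram
[Figure: two redundant blocks combined by OR, joined by AND with the common-cause block]

$$R_{SSi} = \left[1 - \left\{1 - e^{-(1-\gamma_h)\,\lambda_{shwi}\,t}\; e^{-(1-\gamma_s)\,\lambda_{sswi}\,t}\right\}^{2}\right] \cdot e^{-\gamma_h\,u_h\,\lambda_{hwi}\,t}\; e^{-\gamma_s\,u_s\,\lambda_{swi}\,t}$$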
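The block-diagram expression can be evaluated numerically with a short Python sketch. It transcribes the slide's expression term by term and takes every symbol as an input; the function and argument names are assumptions, and no parameter values from the paper are implied.

```python
import math


def task_reliability(lam_shw, lam_ssw, lam_hw, lam_sw,
                     gamma_h, gamma_s, u_h, u_s, t):
    """Reliability of one duplex task, following the slide's expression."""
    # One redundant instance: hardware and software terms multiplied together.
    r_instance = (math.exp(-(1 - gamma_h) * lam_shw * t)
                  * math.exp(-(1 - gamma_s) * lam_ssw * t))
    # Two instances in parallel (OR): the pair fails only if both instances fail.
    r_pair = 1 - (1 - r_instance) ** 2
    # Common-cause block in series (AND) with the redundant pair.
    r_common = (math.exp(-gamma_h * u_h * lam_hw * t)
                * math.exp(-gamma_s * u_s * lam_sw * t))
    return r_pair * r_common
```

Because the common-cause term multiplies the whole redundant pair, a larger common-cause fraction limits how much the duplication can help.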
Definitions
- Calendar Time $\tau$: Mission Time used to calculate the Reliability
- Execution $e_i$: Percentage of Mission Time used by the Task (or Subsystem)
- Execution Time $t$: $e_i \cdot \tau$
- Usage for SW: Percentage of the total software used by the Task
- Usage for HW: Percentage of the area of the active portion of the device used by the Task
- $\lambda_{shwi}$: Failure Intensity of Task i hardware with respect to Execution time
- $\lambda_{sswi}$: Failure Intensity of Task i software with respect to Execution time
- $\gamma_{hi}$: Fraction of Task i hardware failures that are common cause failures
- $\gamma_{si}$: Fraction of Task i software failures that are common cause failures
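Parameters & Derivations
- Failure Intensity: $\lambda_{shwi} = \lambda_{hwi} \cdot u_h \cdot (1 - \gamma_h)$
- Failure Intensity: $\lambda_{sswi} = \lambda_{swi} \cdot u_s \cdot (1 - \gamma_s)$
- Common Cause: $\lambda_{hwi} \cdot u_h \cdot \gamma_h$ and $\lambda_{swi} \cdot u_s \cdot \gamma_s$
- Execution Time: $t = e_i \cdot \tau$
- Subsystem Reliability: $R_{SSi}$
- System Reliability: $R_S = R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$

Parameter           | Reading | Parsing | Pre-Processing
--------------------|---------|---------|---------------
Usage SW - u_s      |         |         |
Usage HW - u_h      |         |         |
λ_hwi               |         |         |
λ_swi               |         |         |
Execution - e_i     |         |         |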
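The derivations on this slide reduce to a few products, sketched below in Python. The function names are assumptions and the numbers in the usage lines are invented for illustration; they are not the parameter values used in the paper.

```python
import math


def independent_intensity(lam, usage, gamma):
    """Independent failure intensity: lambda · u · (1 - gamma)."""
    return lam * usage * (1 - gamma)


def common_cause_intensity(lam, usage, gamma):
    """Common-cause failure intensity: lambda · u · gamma."""
    return lam * usage * gamma


def execution_time(e_i, tau):
    """Execution time t = e_i · tau, with e_i given as a fraction of mission time."""
    return e_i * tau


def system_reliability(subsystem_reliabilities):
    """Series system: the product of the subsystem reliabilities."""
    return math.prod(subsystem_reliabilities)


# Hypothetical values: three subsystems (Reading, Parsing, Pre-Processing).
lam_shw = independent_intensity(lam=2.0, usage=0.5, gamma=0.25)
r_s = system_reliability([0.5, 0.5, 0.5])
```

Note that the independent and common-cause intensities add up to `lam * usage`, so the common-cause fraction only splits the intensity between the redundant and the shared part of the model.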
Extending the Rules
- The programmed design, be it the original duplex design, duplicated or diverse, or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance
- For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design
- Yu and McCluskey state that the designs of CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm performs, but the more efficiently (or optimally) the design can be re-configured in the event of a permanent failure
Further Extension
- Greater area usage carries a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, that design and its associated code have a greater propensity for failures
- The common cause fractions used in the paper are relative numbers to illustrate the model
  - With a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
Assertions
- The common cause fractions used in the paper are relative numbers to illustrate the model
  - With a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
  - Implemented on different devices, this fraction drops to ¼% because the physical defects are now almost negligible and the only common effects are environmental, i.e. temperature, power and external stresses.
More Assertions
- The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. It changes very little from the same device to different devices, since the implemented design is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼% variation.
- In the diverse design paradigm, the hardware dependence remains in relatively the same ratio, but the software fractions vary drastically: on the same device the common cause fraction is 50%, and it drops to 10% for diverse designs on different devices.
System Configuration Options

Configuration            | HW Common Cause Fraction (γ_h) | SW Common Cause Fraction (γ_s)
-------------------------|--------------------------------|-------------------------------
Same Code & Device       |                                |
Same Code & Diff Devices |                                |
Diff Code & Same Device  |                                |
Diff Code & Devices      |                                |
Results

Option | Configuration            | FPGA-based System Reliability
-------|--------------------------|------------------------------
1      | Same Code, Same Devices  |
2      | Same Code, Diff Devices  |
3      | Diff Code, Same Devices  |
4      | Diff Code, Diff Devices  |
Conclusions
- Cost and Schedule Slips
- Development Delays and Costs
- Adaptive Model Optimization and Design Constraints
Contact Address: