Using Software Rules To Enhance FPGA Reliability Chandru Mirchandani Lockheed-Martin September 7-9, 2005 P226-W/MAPLD2005 MIRCHANDANI 1.

Presentation transcript:

Slide 2: FPGA Fault Tolerance
• Historically realized through triple redundancy, error correcting codes and replicated elements
• The fault tolerance process is as good as the tests run to validate its performance, e.g. when:
  ◦ Invalid data is not ignored due to an inherent fault in the lookup and compare sequence
  ◦ The testing was not rigorous enough
  ◦ The testing was not complete
• Lack of real estate and logic on the device precludes the ideal solution:
  ◦ Make educated judgment calls on how much is acceptable and for how long

Slide 3: Reconfiguring FPGAs
• Replicated circuitry or triple redundancy can be achieved on different devices or on the same device
• Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing
• Replication could be used to one's advantage if subsets of the circuit were replicated
• Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design

Slide 4: Types of Errors
• Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
  ◦ Transient error: the system recovers from corrupt data and resumes normal operation
  ◦ Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
• For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
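The transient/permanent rule on this slide can be sketched in a few lines of Python. This is only an illustration: tracking detections per resource location, the class name `ErrorClassifier` and the location strings are assumptions made for the example, not part of the presentation or of Yu and McCluskey's scheme.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ErrorClassifier:
    """First CED detection at a location is transient; a repeat is permanent.

    Keying detections by location is an assumption of this sketch.
    """

    seen: Set[str] = field(default_factory=set)

    def classify(self, location: str) -> str:
        if location not in self.seen:
            # First detection: treat as transient.
            self.seen.add(location)
            return "transient"
        # Detected again at the same location: treat as permanent.
        return "permanent"


def handle(classifier: ErrorClassifier, location: str) -> str:
    """Map the classification to the response described on the slide."""
    if classifier.classify(location) == "transient":
        return "recover from corrupt data and resume normal operation"
    return "diagnose the damaged resource and choose a suitable configuration"
```

A caller would keep one classifier per device and pass it every CED detection; the second report for the same (hypothetical) location such as `"CLB_3_7"` switches the response from recovery to diagnosis and reconfiguration.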

Slide 5: Software Reliability
• Develop Criteria for Design Objective Acceptance
• Prioritize tasks or functions in order of criticality
• Develop metrics to measure performance of tasks with respect to constraints
• Evaluate design options based on measured reliability metrics

Slide 6: Typical Software Options
• Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure
[Figure: Processor 1 and Processor 2, with Application A1 (primary) and Application A1 (secondary) instances]

Slide 7: Redundant Instances of Software
• Initially detect, contain and recover from faults as soon as possible, and in the event this is not possible
• Allow the control to be passed on to the redundant instance within the reliability and availability requirements levied on the system
• Finally, include language-defined mechanisms to detect and prevent the propagation of errors
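A minimal sketch of the hand-over described above, assuming the redundant instances can be modelled as plain callables and that a fault surfaces as a Python exception (one example of a language-defined mechanism). The function name `run_with_failover` is hypothetical.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_failover(primary: Callable[[], T], secondary: Callable[[], T]) -> Tuple[str, T]:
    """Run the primary instance; if it faults, pass control to the secondary."""
    try:
        # Detect and contain: an exception raised by the primary is caught
        # here and does not propagate to the caller.
        return "primary", primary()
    except Exception:
        # The primary could not complete, so control passes to the
        # redundant instance.
        return "secondary", secondary()
```

The returned label records which instance produced the result, which is the kind of information a status-gathering task could log. A real system would also bound the hand-over time against its availability requirement, which this sketch does not attempt.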

Slide 8: Methodology
• Estimate the reliability based on instruction set and operational usage
• Re-design critical elements to decrease risk
• Re-evaluate the risk of failure based on a change in critical task design based on performance and requirements
• Re-evaluate the reliability based on failure rate
• Factor in the Uncertainty in Evaluation

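Slide 9: Task Times
Task | Class | Steps | Step Time ($s_{task}$) | Task Time | Total Tasks Time ($t_{task}$)
Reading | $r$ | $\sum x_{ri}$ | $s_r$ | $s_r \cdot \sum x_{ri}$ | $(s_r \cdot \sum x_{ri}) \cdot n_r = t_r$
Parsing | $p$ | $\sum x_{pi}$ | $s_p$ | $s_p \cdot \sum x_{pi}$ | $(s_p \cdot \sum x_{pi}) \cdot n_p = t_p$
Pre-processing | $p_1$ | $\sum x_{p_1 i}$ | $s_{p_1}$ | $s_{p_1} \cdot \sum x_{p_1 i}$ | $(s_{p_1} \cdot \sum x_{p_1 i}) \cdot n_{p_1} = t_{p_1}$
Monitoring | $M$ | $\sum x_{Mi}$ | $s_M$ | $s_M \cdot \sum x_{Mi}$ | $(s_M \cdot \sum x_{Mi}) \cdot n_M = t_M$
Sorting | $s$ | $\sum x_{si}$ | $s_s$ | $s_s \cdot \sum x_{si}$ | $(s_s \cdot \sum x_{si}) \cdot n_s = t_s$
Processing | $P$ | $\sum x_{Pi}$ | $s_P$ | $s_P \cdot \sum x_{Pi}$ | $(s_P \cdot \sum x_{Pi}) \cdot n_P = t_P$
Post-processing | $p_2$ | $\sum x_{p_2 i}$ | $s_{p_2}$ | $s_{p_2} \cdot \sum x_{p_2 i}$ | $(s_{p_2} \cdot \sum x_{p_2 i}) \cdot n_{p_2} = t_{p_2}$
Status-gathering | $S$ | $\sum x_{Si}$ | $s_S$ | $s_S \cdot \sum x_{Si}$ | $(s_S \cdot \sum x_{Si}) \cdot n_S = t_S$
Writing | $w$ | $\sum x_{wi}$ | $s_w$ | $s_w \cdot \sum x_{wi}$ | $(s_w \cdot \sum x_{wi}) \cdot n_w = t_w$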
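The task-time arithmetic in the table can be illustrated with a short Python sketch. It assumes that the $x$ values are step counts summed over $i$ and that $n$ is the number of tasks of that class; the function names and the example numbers are hypothetical and do not come from the presentation.

```python
from typing import Sequence


def task_time(step_time: float, step_counts: Sequence[float]) -> float:
    """Task time: step time multiplied by the summed step counts."""
    return step_time * sum(step_counts)


def total_tasks_time(step_time: float, step_counts: Sequence[float], n_tasks: int) -> float:
    """Total tasks time: task time multiplied by the number of tasks."""
    return task_time(step_time, step_counts) * n_tasks


# Hypothetical Reading task: step time 0.5, step counts 2 and 3, n_r = 4.
t_r = total_tasks_time(0.5, [2, 3], 4)
print(t_r)  # 10.0
```

The same two functions apply unchanged to every row of the table, since each task class uses the same formula with its own step time, step counts and task count.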

Slide 10: FPGA System - Conceptual
• Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks, where each Task is a subsystem

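Slide 11: Task Reliability Block Diagram
[Figure: reliability block diagram with blocks labelled AND and OR]
\[
\left[1 - \left\{1 - e^{-(1-\gamma_h)\,\lambda_{shwi}\,t}\; e^{-(1-\gamma_s)\,\lambda_{sswi}\,t}\right\}^{2}\right] \cdot \left(e^{-\gamma_h\,u_h\,\lambda_{hwi}\,t}\; e^{-\gamma_s\,u_s\,\lambda_{swi}\,t}\right)
\]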
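The structure of this expression, a duplicated block in series with a common cause block, can be sketched in Python. The sketch collapses the hardware and software exponents into two totals, which is valid because a product of exponentials is the exponential of the summed rates. The parameter names `lam_ind` (total rate of the independent terms) and `lam_cc` (total rate of the common cause terms) are assumptions of this example, and the sketch deliberately leaves open how those totals are assembled from the slide's symbols.

```python
import math


def duplex_reliability(lam_ind: float, lam_cc: float, t: float) -> float:
    """Reliability of two redundant copies in series with a common cause term.

    lam_ind: total failure rate of the independent part of one copy
    lam_cc:  total failure rate attributed to common cause failures
    t:       execution time
    """
    r_copy = math.exp(-lam_ind * t)          # one copy survives independently
    r_redundant = 1.0 - (1.0 - r_copy) ** 2  # at least one of the two copies survives
    r_common = math.exp(-lam_cc * t)         # no common cause failure occurs
    return r_redundant * r_common
```

With `lam_cc` set to zero the function reduces to plain one-out-of-two redundancy, and with `lam_ind` set to zero only the common cause term remains, which is a quick way to check the two factors separately.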

Slide 12: Definitions
• Calendar Time – $\tau$: Mission Time to Calculate the Reliability
• Execution – $e_i$: Percentage of Mission Time used by the Task (or Subsystem)
• Execution Time – $t$: $e_i \cdot \tau$
• Usage for SW: Percentage of the Total software used by the Task
• Usage for HW: Percentage of Area of the Active portion of the Device used by the Task
• $\lambda_{shwi}$: Failure Intensity of Task i hardware with respect to Execution time
• $\lambda_{sswi}$: Failure Intensity of Task i software with respect to Execution time
• $\gamma_{hi}$: Fraction of Task i hardware failures that are common cause failures
• $\gamma_{si}$: Fraction of Task i software failures that are common cause failures

Slide 13: Parameters & Derivations
• Failure Intensity: $\lambda_{shwi} = \lambda_{hwi} \cdot u_h \cdot (1-\gamma_h)$
• Failure Intensity: $\lambda_{sswi} = \lambda_{swi} \cdot u_s \cdot (1-\gamma_s)$
• Common Cause: $\lambda_{hwi} \cdot u_h \cdot \gamma_h$ and $\lambda_{swi} \cdot u_s \cdot \gamma_s$
• Execution Time $t$: $e_i \cdot \tau$
• Subsystem Reliability: $R_{SSi}$
• System Reliability $R_S$: $R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$

Parameter | Reading | Parsing | Pre-Processing
Usage SW - $u_s$ | | |
Usage HW - $u_h$ | | |
$\lambda_{hwi}$ | | |
$\lambda_{swi}$ | | |
Execution - $e_i$ | | |
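The derivations on this slide translate directly into code. The following Python sketch computes the independent and common cause failure intensities, the execution time, and the series product for system reliability. The class name `TaskParams`, the function names and the dictionary keys are hypothetical, and any numbers passed in are placeholders rather than values from the presentation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TaskParams:
    lam_hw: float  # hardware failure intensity of the task
    lam_sw: float  # software failure intensity of the task
    u_h: float     # HW usage (fraction of active device area)
    u_s: float     # SW usage (fraction of total software)
    e: float       # execution (fraction of mission time)


def derived_rates(p: TaskParams, gamma_h: float, gamma_s: float) -> Dict[str, float]:
    """Split each usage-weighted intensity into independent and common cause parts."""
    return {
        "lam_shw": p.lam_hw * p.u_h * (1.0 - gamma_h),
        "lam_ssw": p.lam_sw * p.u_s * (1.0 - gamma_s),
        "cc_hw": p.lam_hw * p.u_h * gamma_h,
        "cc_sw": p.lam_sw * p.u_s * gamma_s,
    }


def execution_time(p: TaskParams, tau: float) -> float:
    """Execution time: execution fraction multiplied by mission time."""
    return p.e * tau


def system_reliability(subsystem_reliabilities: Iterable[float]) -> float:
    """Series system: the product of the subsystem reliabilities."""
    r_s = 1.0
    for r_ss in subsystem_reliabilities:
        r_s *= r_ss
    return r_s
```

Note that the independent and common cause parts of each intensity always add back up to the usage-weighted total, so changing a common cause fraction only moves failure intensity between the redundant block and the common cause block.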

Slide 14: Extending the Rules
• The programmed design, be it the original duplex design (duplicated or diverse) or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance
• For example, in the Reading Task, the area usage and operational profile are shown to affect the predicted overall reliability of the FPGA-based design
• Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm performs, but the more efficiently (or optimally) the re-configured design performs in the event of a permanent failure

Slide 15: Further Extension
• Greater area usage carries a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, that design and its associated code have a greater propensity for failures
• The common cause fractions used in the paper are relative numbers to illustrate the model
  ◦ With a redundancy of one, the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.

Slide 16: Assertions
• The common cause fractions used in the paper are relative numbers to illustrate the model
  ◦ With a redundancy of one, the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
  ◦ When implemented on different devices, this fraction drops to ¼ % because the physical defects are now almost negligible, and the only common effects are environmental, i.e. temperature, power and external stresses.

Slide 17: More Assertions
• The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. It changes very little from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼ % variation
• In the diverse design paradigm, the hardware dependence remains in relatively the same ratio, but the software fractions vary drastically: on the same device the common cause fraction is 50 %, and it drops to 10 % in the case of diverse designs on different devices

Slide 18: System Configuration Options
Configuration | HW Common Cause Fraction ($\gamma_h$) | SW Common Cause Fraction ($\gamma_s$)
Same Code & Device | |
Same Code & Diff Devices | |
Diff Code & Same Device | |
Diff Code & Devices | |

Slide 19: Results
Option | Configuration | FPGA-based System Reliability
1 | Same Code, Same Devices |
 | Same Code, Diff Devices |
 | Diff Code, Same Devices |
 | Diff Code, Diff Devices |

Slide 20: Conclusions
• Cost and Schedule Slips
• Development Delays and Costs
• Adaptive Model
• Optimization and Design Constraints
Contact Address: