FAULT TOLERANCE IN FPGA BASED SPACE-BORNE COMPUTING SYSTEMS Niharika Chatla Vibhav Kundalia 9043-1959 3156-1935.

Slides:



Advertisements
Similar presentations
Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs
Advertisements

Baloch 1MAPLD 2005/1024-L Design of a ‘Single Event Effect’ Mitigation Technique for Reconfigurable Architectures SAJID BALOCH Prof. Dr. T. Arslan 1,2.
Distributed Systems Major Design Issues Presented by: Christopher Hector CS8320 – Advanced Operating Systems Spring 2007 – Section 2.6 Presentation Dr.
Sana Rezgui 1, Jeffrey George 2, Gary Swift 3, Kevin Somervill 4, Carl Carmichael 1 and Gregory Allen 3, SEU Mitigation of a Soft Embedded Processor in.
Scrubbing Approaches for Kintex-7 FPGAs
Fault-Tolerant Systems Design Part 1.
1 SECURE-PARTIAL RECONFIGURATION OF FPGAs MSc.Fisnik KRAJA Computer Engineering Department, Faculty Of Information Technology, Polytechnic University of.
1 Fault Tolerant FPGA Co-processing Toolkit Oral defense in partial fulfillment of the requirements for the degree of Master of Science 2006 Oral defense.
ICAP CONTROLLER FOR HIGH-RELIABLE INTERNAL SCRUBBING Quinn Martin Steven Fingulin.
Reconfigurable Computers in Space: Problems, Solutions and Future Directions Neil W. Bergmann, Anwar S. Dawood CRC for Satellite Systems Queensland University.
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.
Fault Detection in a HW/SW CoDesign Environment Prepared by A. Gaye Soykök.
Making Services Fault Tolerant
Students:Guy Derry Gil Wiechman Instructor:Isaschar Walter In cooperation with MOD Winter-Spring 2003 Students:Guy Derry Gil Wiechman Instructor:Isaschar.
1 Presenter: Chien-Chih Chen. 2 An Assertion Library for On- Chip White-Box Verification at Run-Time On-Chip Verification of NoCs Using Assertion Processors.
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
2. Introduction to Redundancy Techniques Redundancy Implies the use of hardware, software, information, or time beyond what is needed for normal system.
Configurable System-on-Chip: Xilinx EDK
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
1 Making Services Fault Tolerant Pat Chan, Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Miroslaw Malek.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
Bitstream Relocation with Local Clock Domains for Partially Reconfigurable FPGAs Adam Flynn, Ann Gordon-Ross, Alan D. George NSF Center for High-Performance.
1 A survey on Reconfigurable Computing for Signal Processing Applications Anne Pratoomtong Spring2002.
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
Benefits of Partial Reconfiguration Reducing the size of the FPGA device required to implement a given function, with consequent reductions in cost and.
S1.6 Requirements: KnightSat C&DH RequirementSourceVerification Source Document Test/Analysis Number S1.6-1Provide reliable, real-time access and control.
Uniform Reconfigurable Processing Module for Design and Manufacturing Integration V. Kirischian, S. Zhelnokov, P.W. Chun, L. Kirischian and V. Geurkov.
Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.
A comprehensive method for the evaluation of the sensitivity to SEUs of FPGA-based applications A comprehensive method for the evaluation of the sensitivity.
FPGA IRRADIATION and TESTING PLANS (Update) Ray Mountain, Marina Artuso, Bin Gui Syracuse University OUTLINE: 1.Core 2.Peripheral 3.Testing Procedures.
Presenter: Hong-Wei Zhuang On-Chip SOC Test Platform Design Based on IEEE 1500 Standard Very Large Scale Integration (VLSI) Systems, IEEE Transactions.
Reconfiguration Based Fault-Tolerant Systems Design - Survey of Approaches Jan Balach, Jan Balach, Ondřej Novák FIT, CTU in Prague MEMICS 2010.
POLITECNICO DI MILANO Reconfiguration 4 Reliability design methodology for reliability assessment and enhancement of FPGA-based systems Dynamic Reconfigurability.
Heng Tan Ronald Demara A Device-Controlled Dynamic Configuration Framework Supporting Heterogeneous Resource Management.
Hardware Support for Trustworthy Systems Ted Huffmire ACACES 2012 Fiuggi, Italy.
Architectural Optimizations Ed Carlisle. DARA: A LOW-COST RELIABLE ARCHITECTURE BASED ON UNHARDENED DEVICES AND ITS CASE STUDY OF RADIATION STRESS TEST.
Fault-Tolerant Systems Design Part 1.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation Mrs. Shazia Maqbool and Dr. Craig I Underwood Maqbool 1 MAPLD 2005/P181.
MAPLD 2005/254C. Papachristou 1 Reconfigurable and Evolvable Hardware Fabric Chris Papachristou, Frank Wolff Robert Ewing Electrical Engineering & Computer.
CprE 458/558: Real-Time Systems
5 May CmpE 516 Fault Tolerant Scheduling in Multiprocessor Systems Betül Demiröz.
Task Graph Scheduling for RTR Paper Review By Gregor Scott.
FTC (DS) - V - TT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Vorlesung 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM.
Fault-Tolerant Systems Design Part 1.
Axel Jantsch 1 Networks on Chip Axel Jantsch 1 Shashi Kumar 1, Juha-Pekka Soininen 2, Martti Forsell 2, Mikael Millberg 1, Johnny Öberg 1, Kari Tiensurjä.
R ECONFIGURABLE SECURITY SUPPORT FOR EMBEDDED SYSTEMS 1 AKSHATA VARDHARAJ.
1/14 Merging BIST and Configurable Computing Technology to Improve Availability in Space Applications Eduardo Bezerra 1, Fabian Vargas 2, Michael Paul.
Mixed Criticality Systems: Beyond Transient Faults Abhilash Thekkilakattil, Alan Burns, Radu Dobrin and Sasikumar Punnekkat.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
1 CzajkowskiMAPLD 2005/138 Radiation Hardened, Ultra Low Power, High Performance Space Computer Leveraging COTS Microelectronics With SEE Mitigation D.
1 Advanced Digital Design Reconfigurable Logic by A. Steininger and M. Delvai Vienna University of Technology.
بسم الله الرحمن الرحيم MEMORY AND I/O.
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
1 Device Controller I/O units typically consist of A mechanical component: the device itself An electronic component: the device controller or adapter.
MAPLD 2005/213Kakarla & Katkoori Partial Evaluation Based Redundancy for SEU Mitigation in Combinational Circuits MAPLD 2005 Sujana Kakarla Srinivas Katkoori.
Programmable Logic Devices
CFTP ( Configurable Fault Tolerant Processor )
SEU Mitigation Techniques for Virtex FPGAs in Space Applications
FPGA: Real needs and limits
Radiation Tolerance of an Used in a Large Tracking Detector
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
Autonomously Adaptive Computing: Coping with Scalability, Reliability, and Dynamism in Future Generations of Computing Roman Lysecky Department of Electrical.
Fault Tolerance Distributed Web-based Systems
Mattan Erez The University of Texas at Austin July 2015
Design of a ‘Single Event Effect’ Mitigation Technique for Reconfigurable Architectures SAJID BALOCH Prof. Dr. T. Arslan1,2 Dr.Adrian Stoica3.
Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Xilinx Kintex7 SRAM-based FPGA
Seminar on Enterprise Software
Presentation transcript:

FAULT TOLERANCE IN FPGA BASED SPACE-BORNE COMPUTING SYSTEMS Niharika Chatla Vibhav Kundalia

INTRODUCTION  Long mission times and harsh environment are a challenge for electronic circuits in space applications.  Hence it is important to develop efficient error mitigation techniques in order to cope with the transient and permanent faults to improve the reliability of space applications.

FAULT TOLERANCE  The fault may occur in either hardware or software.  The major hardware faults in space applications are radiation induced faults which can be transient or permanent.  Transient faults are caused by single event effects which are induced by high-energy subatomic particles.  Permanent faults are caused by radiation effects, hidden manufacturing defects and aging of the die.  Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.

Reconfigurable architecture  Triple modular redundancy is well known to prevent the transient faults from causing functional errors.  But the TMR modules cannot heal the permanent errors themselves.  Hence a dynamic reconfigurable computing FPGA architecture based on a TMR system has been proposed to deal with the transient and permanent errors.  Reconfiguration provides more possibilities for SRAM- based FPGAs into space. Additionally, reconfiguration is an improvement in terms of resource utilization and costs.

Triple modular redundant system  The TMR system mainly includes three TMR modules and the TMR voter.  The control and data signals form each module are voted against each other by a TMR voter.  The function of a voter is to decide the real output of the system.  It uses the classic majority voting scheme.

Fig. The architecture of TMR System 6

SYSTEM ARCHITECTURE  This system implements TMR system which has FPGA incorporated into it for reconfiguration.  Each module has a processor, FPGA, memories and communication portion.  The base processor is 32-bit RISC processor based on SPARC V8 architecture.  The processor implements both the controller and scheduler of the given system.

8

 Memories are used to store all the partial reconfiguration bitstream data information and store the software program.  There are three memories, 8M Flash, 1M SRAM and 16M SDRAM.  Hardware units mapped into the FPGA can be interfaced to the system bus through a peripheral bus.  I/O of the processor is used to configure the FPGA.  Download of the FPGA bitstream is performed at the beginning from the FLASH.  To allow validation of the FPGA configuration, the bitstream may be read back by the processor.

DYNAMIC PARTIAL RECONFIGURATION  Each TMR module can be divided into two modules: the fixed module and the reconfigurable module.  Fixed module including the peripheral bus and the IP cores will not be reconfigured at run-time.  The reconfigurable modules placed in reconfigurable area are used for different algorithms.

Fig. FPGA resource partitioning and module placement

Module Design  The design methodology is based on modular design concept  Three main modules in whole system : System module, ICAP module, Reconfigurable module.  All connections across different areas connect each other through the Bus Macro, through which the reconfigurable modules connect to the static part of the design and other modules.

SYSTEM MODULE  System module : Data exchange module, synchronization module and required peripherals to run the system are present in it.  PB2RM and BM2ICAP are the interfaces needed between the peripheral bus and the bus macro to write back or read back.

ICAP Module  ICAP module is used to read/write a configuration from/to the BRAM to/ from a specific reconfigurable module  It can finish the self-reconfiguration of the FPGA.  Whenever a new reconfigurable module has been mapped and it starts its computation, the ICAP informs the processor that the reconfiguration action ended.  Then the controller enables all the communications interrupted by the reconfiguration. 14

Reconfigurable module  The implantation of the RM modules is checked in FGPA compiler before implementation.  A CAN receiver is realized in RM module.  Finally all the bitstreams needed to implement the system description onto FPGA is created and verified. 15

SELF-REPAIR  Self-repairing is done in a sequence of three steps : Error detection and localization, Reconfiguration and Recovery. 16

Error detection and localization  It detects faults and raises the countermeasures.  Localization is done by halting the faulty units and all the buffers associated with it.  Transient fault detects accumulated to detect the permanent faults. 17

Reconfiguration  Repair process is started after an error is detected and its location known.  In case of transient fault reprogramming the FPGA solves the error.  In hardware faults, the circuit has to be moved from the defective area and mapped into unused resources in the FPGA. 18

Recovery  After reconfiguration, the newly placed device is reset and synchronized with the rest of the device. 19

Comparison of the original TMR system with the reconfigurable TMR system architecture 20 Original TMR System Reconfigurable TMR system Computing Performance ARM7 60MIPS SPARC V8 150MIPS 50FLOPS Weight7.6 kg750g Power Consumption 27W3.6W Peripheral Device76 chips13 chips

Decentralized Run-Time Recovery Mechanism for Transient and Permanent Hardware Faults for Space-borne FPGA-based Computing Systems

Concept of Collaborative Macro-Function Units (CMFUs)  Growing demand for commercial space-borne computing systems  Space-grade radiations hardened FPGA vs Commercially available FPGA  Built-In-Self-Recovery (BISR) based on CMFUs  CMFUs can collaborate to perform specific mode of operation  Decentralized  Increased reliability, decreased recovery time

Distributed Architecture for Fault Tolerance and Mitigation

CMFU  Macro-function-level processors based on customizable cores  Cores are augmented with additional functionality  Implemented in form of a Co-Op Unit  A) Scheduling of Processing and BIST Activities  B) Scheduling of Connections and Disconnections  C) Mode Monitoring, Connectivity Requests

Co-op Unit FSM Behavior

Distributed Control and Communication Infrastructure  Must provide communication capabilities to components in system  Must be able to implement the connectivity requirements and respond to connectivity change requests correctly  Must distribute health information for the system components (including its own)

Communication and Control Infrastructure Architecture

Connector Unit FSM Behaviour

Example System Architecture

Process of Mitigation  1) Download new bit-stream for the component, targeting a different slot  2) Change operation mode, such that the relocated component is used  3) Load a test bit-stream into the faulty slot or connector unit  4) Run BIST procedures determine if fault is permanent

Resource Costs and Overheads

 Resource overhead comparison

Conclusion and Shortfall  Conclusion  Allows for mitigation for both transient and permanent faults by dynamic component re- location  Recovery from single point faults take the same time and therefore predictable  Reduced recovery time  Hardware overhead per slot is not much compared to duplex or TMR solutions  Shortfall  The configuration engines are not discussed

Reference papers  Paper 1 -  Implementation of a reconfigurable computing system for spaceapplications Implementation of a reconfigurable computing system for spaceapplications  Yimao Cai ; Yuanfu Zhao ; Lidong Lan System Science, Engineering Design and Manufacturing Informatization (ICSEM), 2011 International Conference on Volume: 2 DOI: /ICSSEM Publication Year: 2011, Page(s): Yimao CaiYuanfu ZhaoLidong Lan System Science, Engineering Design and Manufacturing Informatization (ICSEM), 2011 International Conference on /ICSSEM  Paper 2 -  Decentralized run-time recovery mechanism for transient and permanent hardware faults for space-borne FPGA- based computingsystems Decentralized run-time recovery mechanism for transient and permanent hardware faults for space-borne FPGA- based computingsystems  Dumitriu, V. ; Kirischian, L. ; Kirischian, V. Adaptive Hardware and Systems (AHS), 2014 NASA/ESA Conference on DOI: /AHS Publication Year: 2014, Page(s): Dumitriu, V.Kirischian, L.Kirischian, V. Adaptive Hardware and Systems (AHS), 2014 NASA/ESA Conference on /AHS