With Scott Arnold & Ryan Nuzzaci: An Adaptive Fault-Tolerant Memory System for FPGA-based Architectures in the Space Environment. Dan Fay, Alex Shye, Sayantan Bhattacharya, and Daniel A. Connors.

Presentation transcript:

An Adaptive Fault-Tolerant Memory System for FPGA-based Architectures in the Space Environment
Dan Fay, Alex Shye, Sayantan Bhattacharya, and Daniel A. Connors
with Scott Arnold & Ryan Nuzzaci

• Reconfigurability
  • Rapidly adapt to changing mission conditions and requirements
  • Multiple applications
• Speed
  • High-performance, application-specific computing power
  • Accomplish more data collection and experimentation in short-life satellites
• Cost and availability
  • Commercial off-the-shelf (COTS) FPGAs can be used
  • Affordable, since non-rad-hard components can be used

• Radiation
  • Short-term damage
    ▪ Single Event Upsets (SEUs) – occur when an energetic particle leaves behind a charge in the silicon lattice
    ▪ May cause faults that affect application execution or result data
  • Permanent damage
    ▪ Extensive radiation exposure can render all or part of a device unusable
    ▪ May severely limit the lifetime of a device in certain orbits
• SRAM vs. EEPROM
  • Modern FPGAs use SRAM-based memory to store the configuration
  • EEPROM memory is less susceptible to radiation upsets, but is no longer used in FPGAs for the configuration space

• Adaptable fault tolerance
  • Fault tolerance schemes incur significant penalties in logic utilization, memory utilization, power consumption, and heat dissipation
  • Adapt to varying radiation conditions
    ▪ High radiation – remove non-essential logic and increase fault-tolerance logic for more critical logic
    ▪ Low radiation – decrease fault-tolerance logic and increase processing logic
• Partial reconfiguration (PR)
  • Allows part of an FPGA to be reconfigured without interrupting the rest of the logic
  • Benefits
    ▪ Reconfigure only the logic where errors have been detected
    ▪ Relocate functionality away from logic with permanent radiation damage
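The adaptation rule on this slide can be sketched as a small policy function. This is an illustrative sketch only, not the presenters' implementation: the `Mode` names, the region names, the upset-rate units, the `low`/`high` thresholds, and the behaviour in the middle band are all assumptions made for the example.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical per-region configurations selectable via PR."""
    REMOVED = "removed"  # region freed so its area can host redundancy
    PLAIN = "plain"      # no fault-tolerance logic, all area used for processing
    PARITY = "parity"    # error detection only
    TMR = "tmr"          # triplicated logic with voting


def plan_regions(upset_rate, critical, non_essential, low=1.0, high=10.0):
    """Choose a mode for each region from an observed upset rate.

    upset_rate, low and high share one arbitrary unit (say, upsets per hour);
    the default thresholds are placeholders, not measured values.
    """
    if upset_rate >= high:
        # High radiation: drop non-essential logic, protect critical logic more.
        critical_mode, other_mode = Mode.TMR, Mode.REMOVED
    elif upset_rate <= low:
        # Low radiation: less fault-tolerance logic, more processing logic.
        critical_mode, other_mode = Mode.PARITY, Mode.PLAIN
    else:
        # Middle band (assumed): keep critical logic protected, detect elsewhere.
        critical_mode, other_mode = Mode.TMR, Mode.PARITY
    plan = {name: critical_mode for name in critical}
    plan.update({name: other_mode for name in non_essential})
    return plan
```

A controller could call `plan_regions` whenever the measured upset rate changes and trigger partial reconfiguration only for regions whose mode differs from the current one.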

Triple3 Redundant Spacecraft Systems (T3RSS)
• Provides whole-system redundancy
• Requires three FPGAs, each with its own local memory
• FPGAs are interconnected using dedicated, point-to-point links
• Adapts system to different failure modes
  ▪ Partial failure of one or more FPGAs
  ▪ Complete failure of one or more FPGAs
  ▪ Complete failure of one or more memories
• Triple Modular Redundancy (TMR) is used to triplicate all logic
• PR is used to relocate functionality around hard errors and scrub areas where soft SEU errors occur
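The TMR voting step mentioned on this slide can be illustrated with a bitwise majority voter. This is a generic software sketch of the technique, not the T3RSS hardware voter; the function names are made up for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by
    at least two of the three replicas."""
    return (a & b) | (a & c) | (b & c)


def vote_and_locate(a: int, b: int, c: int):
    """Return the voted value and the indices of replicas that disagree with it.

    The list of disagreeing replicas is the kind of information a controller
    could use to decide which region to scrub or relocate with PR.
    """
    voted = majority_vote(a, b, c)
    disagreeing = [i for i, v in enumerate((a, b, c)) if v != voted]
    return voted, disagreeing
```

With a single upset in one replica, the voted value equals the two matching replicas and the third is reported as disagreeing. If two replicas are corrupted in the same bit position, the voter returns the wrong value, which is why scrubbing upsets before they accumulate matters.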

T3RSS System Design

• Challenges
  • Remote redundant memory requires high off-chip bandwidth
  • Must increase memory width or FPGA interconnect clock speed
    ▪ Difficult due to the FPGA’s resource limitations
    ▪ Increasing memory width will dramatically increase I/O pin use
    ▪ Faster interconnect technologies (e.g. PCI-X, PCI Express, RapidIO and HyperTransport) require too much extra logic
• Possible solution
  • Bandwidth reduction with strategies like distributed error checking, posted writes, caching, and shadow fault detection

• Implementing fault tolerance
  • Error detection/correction
    ▪ Single-bit error detection can be accomplished with simple parity checking
    ▪ CRC or MD5 checksumming techniques can be used for more sophisticated error detection
    ▪ ECC can be used for error correction
  • Redundancy
    ▪ Redundant Array of Independent Disks (RAID) techniques can be applied to external memory or FPGA internal BRAMs
  • Both redundancy and error detection/correction can be used simultaneously
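The techniques listed on this slide can be sketched in a few lines each. The code below is a software illustration under stated assumptions, not the presenters' hardware: the ECC shown is a textbook Hamming(7,4) code, the CRC is the CRC-32 from Python's standard `zlib` module, and the RAID-style scheme is a single XOR parity block over equal-length memory blocks. All function names are hypothetical.

```python
import zlib
from functools import reduce


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def crc_matches(data: bytes, stored_crc: int) -> bool:
    """Detect corruption by comparing a stored CRC-32 with a fresh one."""
    return zlib.crc32(data) == stored_crc


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7 = bits 0..6)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code: int) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        bits[syndrome - 1] ^= 1  # syndrome is the position of the flipped bit
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


def xor_parity(blocks):
    """RAID-style parity block over equal-length byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_block(blocks, parity, lost_index):
    """Reconstruct one lost block from the surviving blocks and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_parity(survivors + [parity])
```

Parity and CRC only detect errors, Hamming(7,4) corrects a single flipped bit per codeword, and the XOR parity block recovers one entire lost memory, which is why the slide notes that the approaches can be combined.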

• Applying memory system fault tolerance
  • Configure fault tolerance based on the application’s requirements
  • Parts of the memory system may be more critical than others
• Fault effects
  • Benign Fault – A transient fault which does not propagate to affect the correctness of an application
  • Silent Data Corruption (SDC) – A transient fault which goes undetected and propagates to corrupt program output
  • Detected Unrecoverable Error (DUE) – A transient fault which is detected without possibility of recovery

• Four different campaigns for injection of SEUs
  • Registers – Source and destination operands of instructions
  • BSS segment – Area for uninitialized global and static variables
  • DATA segment – Area for initialized global and static variables
  • STACK segment – Where the stack is stored
• 1000 iterations for each benchmark
• Intel Pin dynamic binary instrumentation tool used for fault injection
• Fault-injection results categorized as:
  • Correct – Correct output data and valid return code; a benign fault
  • Failed – Illegal operation performed; results in DUE
  • Abort – Invalid return code; results in DUE
  • Timeout – Program hangs and time-out circuitry resets it; results in DUE
  • Incorrect – Valid return code but incorrect output data; results in SDC
• An incorrect result is the worst possible outcome
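The outcome categories above map onto the fault effects defined on the previous slide. The sketch below shows that mapping together with a toy single-bit-flip campaign. It does not use Intel Pin and is not the presenters' harness: the `max`-based workload, the 8-bit word width, and the convention that a return code of 0 is the valid one are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    CORRECT = "Correct"
    FAILED = "Failed"
    ABORT = "Abort"
    TIMEOUT = "Timeout"
    INCORRECT = "Incorrect"


# Mapping from injection outcome to fault effect, as listed on the slides.
EFFECT = {
    Outcome.CORRECT: "Benign",
    Outcome.FAILED: "DUE",
    Outcome.ABORT: "DUE",
    Outcome.TIMEOUT: "DUE",
    Outcome.INCORRECT: "SDC",
}


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Model an SEU as a single bit flip in a width-bit word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def classify(crashed, timed_out, return_code, output, golden_output):
    """Classify one run; return code 0 is assumed to be the valid one."""
    if timed_out:
        return Outcome.TIMEOUT
    if crashed:
        return Outcome.FAILED
    if return_code != 0:
        return Outcome.ABORT
    if output != golden_output:
        return Outcome.INCORRECT
    return Outcome.CORRECT


def toy_campaign(data, width=8):
    """Flip every bit of every word once; the toy workload is max(data)."""
    golden = max(data)
    counts = Counter()
    for i in range(len(data)):
        for bit in range(width):
            faulty = list(data)
            faulty[i] = flip_bit(faulty[i], bit, width)
            counts[classify(False, False, 0, max(faulty), golden)] += 1
    return counts
```

Even this toy workload shows that the same bit flip can be benign in one word and silent data corruption in another, which is the kind of variation the injection campaigns measure per register and per memory segment.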

• OPB – On-chip Peripheral Bus
  • Implemented on a Virtex-II Pro
• OPB-OPB bridge
  • Snoops traffic for the monitor
  • Other side connects to memory and UART
• OPB monitor
  • Logs OPB bridge traffic
  • Counts accesses to a memory range
• MicroBlaze processors
  • Shared memory
  • Between 2 and 3 used
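The monitor's two jobs, logging bridge traffic and counting accesses that fall inside a memory range, can be modelled in software as follows. This is a behavioural sketch only: the class name, its fields, and the address values in the usage note are hypothetical and do not describe the actual OPB monitor logic.

```python
from dataclasses import dataclass, field


@dataclass
class BusMonitor:
    """Behavioural model of a bus monitor watching one address range."""
    base: int               # first address of the watched range
    size: int               # length of the watched range in bytes
    reads: int = 0
    writes: int = 0
    log: list = field(default_factory=list)

    def observe(self, addr: int, is_write: bool) -> None:
        """Record one snooped transaction and update the range counters."""
        self.log.append((addr, is_write))
        if self.base <= addr < self.base + self.size:
            if is_write:
                self.writes += 1
            else:
                self.reads += 1
```

For example, a monitor created as `BusMonitor(base=0x1000, size=0x100)` logs every transaction it observes but only increments `reads` or `writes` for addresses from `0x1000` up to `0x10FF`.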

• Register vulnerability
  • Particularly high compared to memory
  • Frequent usage
  • Use in multiple computations
• BSS errors
  • Faults seldom propagate to errors
  • Notable exception in mm due to the large data structures

• Data memory section has an almost uniform distribution
• Stack memory shows that selected applications have higher vulnerability
• What does this all mean?
  • Motivates the use of an adaptive memory system
  • Customizable to the native characteristics and diverse workload

• Large variations
  • In read and write traffic
  • Over time for each benchmark
• Shows the problem with providing both
  • Low-latency memory
  • Fault-tolerant redundancy
• Real-time constraints may not be met while providing FT

• Effects of 4KB I-cache
  • Extremely effective in reducing BRAM read traffic
  • Increased write traffic
  • FIR filter shows a significant speed increase
• 4KB D-cache
  • Positive effect on FIR
  • Increases the number of memory accesses
• Both
  • Increases throughput of generated data
• Application of third MicroBlaze
  • Increases reads by 25%
  • Decrease in overall system performance

• Conclusions
  • Presented the T3RSS space hardware system
  • Motivated the need for an adaptive distributed-memory FT strategy
  • Emphasized the importance of reducing off-chip traffic
  • Porting fault-susceptible segments off chip reduces the off-chip traffic
• Future Work
  • Implementing and testing new FT memory systems
  • Overall performance of off-chip and on-chip FT techniques
  • Study changes in the wake of modified environmental conditions
• Review
  • Scott: Not a great paper. More explanation is needed in the results to back the conclusions, and terminology is poorly defined throughout.