Fault Tolerant Techniques for System on a Chip Devices Matthew French and J.P. Walters University of Southern California’s Information Sciences Institute.

Presentation transcript:

Fault Tolerant Techniques for System on a Chip Devices
Matthew French and J.P. Walters
University of Southern California’s Information Sciences Institute
MAPLD, September 3rd, 2009

Outline
- Motivation
- Limitation of existing approaches
- Requirements analysis
- SoC Fault Tolerant Architecture Overview
- SpaceCube
- Summary and Next Steps

Motivation
- General trend well established: COTS well outpacing RHBP technology while able to provide some of the radiation tolerance
  - Trend well established for FPGAs
  - Starting to see in RISC processors as well
- Already flying FPGAs; FX family provides PowerPC for ‘free’

Processor | Dhrystone MIPs
Mongoose V | 8
RAD |
RAD |
Virtex 4 PowerPCs (2) | only 900
Virtex 5 FX (2) PowerPC 440 | 2,200

Processing Performance

PowerPC (hard core)
- 1,100 (V5)
- 5-stage pipeline
- 16 KB instruction and data caches
- 2 per chip
- Strong name recognition, legacy tools, libraries, etc.
- Does not consume many configurable logic resources

MicroBlaze (soft core)
- 280 (V5)
- 5-stage pipeline
- 2-64 KB cache
- Variable number per chip
  - Debug module (MDM) limited to 8
- Over 70 configuration options
  - Configurable logic consumption highly variable

Chip Level Fault Tolerance
- V4FX and V5FX are heterogeneous System on a Chip architectures
  - Embedded PowerPCs
  - Multi-gigabit transceivers
  - Tri-mode Ethernet MACs
- State of new components not fully visible from the bitstream
  - Configurable attributes of primitives present
  - Much internal circuitry not visible
- Configuration scrubbing
  - Minimal protection
- Triple Modular Redundancy
  - Not feasible, < 3 of these components
- Bitstream Fault Injection
  - Can’t reach large numbers of important registers

Existing Embedded PPC Fault Tolerance Approaches

Quadruple Modular Redundancy (figure label: Voter)
- 2 devices = 4 PowerPCs
- Vote on result every clock cycle
- Fault detection and correction
- ~300% overhead

Dual Processor Lock Step (figure label: Checkpoint and Rollback Controller)
- Single device solution
- Error detection only
- Checkpointing and rollback to return to last known safe state
- 100% overhead
- Downtime while both processors are rolling back

Can we do better than traditional redundancy techniques? New fault tolerance techniques and error insertion methods must be researched.

Mission Analysis
- Recent trend at MAPLD in looking at application level fault tolerance
  - Brian Pratt, “Practical Analysis of SEU-induced Errors in an FPGA-based Digital Communications System”
- Met with NASA GSFC to analyze application and mission needs
- In depth analysis of 2 representative applications
  - Synthetic Aperture Radar
  - Hyperspectral Imaging
- Focus on scientific applications, not mission critical
- Assumptions
  - LEO: ~10 Poisson distributed upsets per day
  - Communication downlink largest bottleneck
  - Ground corrections can mitigate obvious, small errors

Upset Rate vs Overhead
- Redundancy schemes add tremendous amount of overhead for non-critical applications
  - Assume 10 errors per day (high)
  - PowerPC 405: x 10^13 clock cycles per day
  - % of clock cycles with no faults
  - TMR: x 10^13 wasted cycles
  - QMR: x 10^14 wasted cycles
  - DMR: x 10^13 wasted cycles
- Processor scaling is leading to TeraMIPs of overhead for traditional fault tolerance methods

Is there a more flexible way to invoke fault tolerance ‘on-demand’?
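The arithmetic behind this slide can be sketched as follows. The slide's own coefficients did not survive in the transcript, so the 400 MHz clock used here is an assumption chosen only for illustration, as is the simplification that each upset affects at most one clock cycle. "Wasted" cycles are taken to mean the cycles spent by the redundant copies of the processor.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: a 400 MHz PowerPC 405 clock. The transcript does not state the clock.
ASSUMED_CLOCK_HZ = 400_000_000


def cycles_per_day(clock_hz):
    """Clock cycles executed by one processor in one day."""
    return clock_hz * SECONDS_PER_DAY


def redundant_cycles_per_day(clock_hz, copies):
    """Cycles spent by the extra copies in an N-modular redundancy scheme.

    copies = 2 for DMR, 3 for TMR, 4 for QMR. One copy does the useful work,
    the remaining (copies - 1) repeat it.
    """
    return (copies - 1) * cycles_per_day(clock_hz)


def fault_free_fraction(clock_hz, upsets_per_day):
    """Fraction of cycles with no fault, assuming one affected cycle per upset."""
    return 1.0 - upsets_per_day / cycles_per_day(clock_hz)


if __name__ == "__main__":
    for name, copies in (("DMR", 2), ("TMR", 3), ("QMR", 4)):
        wasted = redundant_cycles_per_day(ASSUMED_CLOCK_HZ, copies)
        print(f"{name}: {wasted:.3e} redundant cycles per day")
    print(f"fault-free fraction: {fault_free_fraction(ASSUMED_CLOCK_HZ, 10):.15f}")
```

Under the assumed clock, one processor runs roughly 3.5 x 10^13 cycles per day, so ten upsets touch a vanishingly small share of them, which is the point the slide makes about always-on redundancy.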

Communication Links
- Downlink bottlenecks
  - Collection stations not in continuous view throughout orbit
  - Imager / radar collection rates >> downlink bandwidth
  - New sensors (e.g. Hyperspectral Imaging) increasing trend
- Result
  - Satellites collect data until mass storage full, then beam down data
- Impact
  - Little need to maintain real time processing rates with sensor collection
  - Use mass storage as buffer
  - Allows out of order execution

Figure: NASA’s A-Train and a collection station. Image courtesy NASA.

Science Application Analysis
- Science applications tend to be streaming in nature
  - Data flows through processing stages
  - Little iteration or data feedback loops
- Impact
  - Very little ‘state’ of application needs to be protected
  - Protect constants, program control, and other feedback structures to mitigate persistent errors
  - Minimal protection of other features assuming single data errors can be corrected in ground based post processing

Figure: SAR Dataflow. Stages: Global Init, Record Init, File I/O, FFT, Multiply, IFFT, File I/O.
SAR persistent state: FFT and filter constants, dependencies, etc. ~264KB

Register Utilization
- Analyzed compiled SAR code register utilization for PowerPC405
  - Green: sensitive register
  - Blue: not sensitive
  - Grey: potentially sensitive if OS used
- Mitigation routines can be developed for some registers
  - TMR, parity, or flush
- Many registers read-only or not accessible (privileged mode)
- Cannot rely solely on register-level mitigation
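As an illustration of two of the register-level techniques named on this slide, the sketch below models a bitwise TMR majority vote and a single parity bit in software. It is a conceptual model only: the class name `TmrWord` and the helper names are hypothetical, and nothing here reflects how the authors' mitigation routines are implemented on the PowerPC405.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def parity_bit(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class TmrWord:
    """Hypothetical triplicated storage for one value, voted and repaired on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        voted = tmr_vote(*self.copies)
        # Write the voted value back so a single upset does not accumulate.
        self.copies = [voted, voted, voted]
        return voted


if __name__ == "__main__":
    word = TmrWord(0xDEADBEEF)
    word.copies[1] ^= 1 << 7           # simulate a single-bit upset in one copy
    print(hex(word.read()))            # the vote masks the upset

    stored = 0x1234
    stored_parity = parity_bit(stored)
    corrupted = stored ^ (1 << 3)      # a single-bit flip changes the parity
    print(parity_bit(corrupted) != stored_parity)
```

TMR both detects and corrects a single corrupted copy, while parity only detects an odd number of flipped bits, which is one reason the slide lists flushing as a third option.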

High Performance Computing Insights
- HPC community has similar problem
  - 100’s to 1000’s of nodes
  - Long application run times (days to weeks)
  - A node will fail over run time
- HPC community does not use TMR
  - Too many resources for already large, expensive systems
  - Power = $
- HPC relies more on periodic checkpointing and rollback
- Can we adapt these techniques for embedded computing?
  - Checkpoint frequency
  - Checkpoint data size
  - Available memory
  - Real-time requirements
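Periodic checkpointing and rollback, as referenced above, can be sketched in a few lines. This is a generic illustration rather than the scheme used by any particular HPC system or by the authors: the class `CheckpointedTask`, its in-memory checkpoint, and the fixed step interval are all assumptions. The interval parameter corresponds to the "checkpoint frequency" question on the slide, and the size of the saved state to "checkpoint data size".

```python
import copy


class CheckpointedTask:
    """Hypothetical task that snapshots its state every `checkpoint_interval` steps."""

    def __init__(self, state, checkpoint_interval):
        self.state = state
        self.checkpoint_interval = checkpoint_interval
        self.steps_done = 0
        self._saved_state = copy.deepcopy(state)
        self._saved_steps = 0

    def checkpoint(self):
        """Save a deep copy of the current state as the last known safe state."""
        self._saved_state = copy.deepcopy(self.state)
        self._saved_steps = self.steps_done

    def step(self, update):
        """Apply one unit of work, then checkpoint if the interval has elapsed."""
        update(self.state)
        self.steps_done += 1
        if self.steps_done % self.checkpoint_interval == 0:
            self.checkpoint()

    def rollback(self):
        """Restore the last checkpoint and return how many steps must be redone."""
        lost = self.steps_done - self._saved_steps
        self.state = copy.deepcopy(self._saved_state)
        self.steps_done = self._saved_steps
        return lost


if __name__ == "__main__":
    def add_one(state):
        state["total"] += 1

    task = CheckpointedTask({"total": 0}, checkpoint_interval=4)
    for _ in range(6):
        task.step(add_one)
    task.state["total"] = 999          # simulate a detected corruption
    print(task.rollback(), task.state)
```

A shorter interval reduces the work lost on rollback but increases the time and memory spent on snapshots, which is the trade-off the slide raises for embedded systems.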

Fault Tolerance System Hierarchy
- Register Level Mitigation (TMR, EDAC)
- Sub-system Level Mitigation (Checkpointing and rollback, Scheduling, Configuration Scrubbing)
- Application Level Mitigation (Instruction level TMR, Cache Flushing, BIST, Control Flow Duplication)

Figure axes: increasing reaction time; increasing fault coverage.

SpaceCube
Figure: SpaceCube Processor Slice Block Diagram (NASA Goddard Space Flight Center, Code 500 Overview).

SoC Fault Tolerant Architecture
- Implement SIMD model
- RadHard controller performs data scheduling and error handling
  - Control packets from RadHard controller to PowerPCs
  - Performs traditional bitstream scrubbing
- PowerPC node
  - Performs health status monitoring (BIST)
  - Sends health diagnosis packet ‘heartbeats’ to RadHard controller

Figure labels: FPGA 0, FPGA 2, FPGA N; Shared Memory Bus; Rad-Hard Microcontroller (Application Queue, Event Queue, Timer Interrupt, Scheduler); Control Packets; Heartbeat Packets; PowerPC 0, PowerPC 1, PowerPC 2, PowerPC N; Task Scheduler; Memory Guard; Access Table; To Flight Recorder.

SIMD: No Errors
- Utilization approaches 100%
  - Slightly less due to checking overheads

Figure labels: Radhard PIC (Packet Scheduling, Heartbeat Monitoring, Reboot / Scrub control); Virtex4 devices (PowerPC, CLB-based Accelerator); SAR Frame 1, SAR Frame 2, SAR Frame 3, SAR Frame 4.

SIMD: Failure Mode
- If a node fails, PIC scheduler sends frame data to next available processor

Figure labels: Radhard PIC (Packet Scheduling, Heartbeat Monitoring, Reboot / Scrub control); Virtex4 devices (PowerPC, CLB-based Accelerator); SAR Frame 1, SAR Frame 2, SAR Frame 3, SAR Frame 4; SAR Frame 3 resent.

System Level Fault Handling
- ‘SEFI’ events
  - Detection: Heartbeat not received during expected window of time
  - Mitigation: RadHard controller reboots PowerPC, scrubs FPGA, or reconfigures device
- Non-SEFI event
  - PowerPC tries to self-mitigate
    - Cache and register flushing, reset SPR, reinitialize constants, etc.
    - Flushing can be scheduled as in bitstream scrubbing
  - Data frame resent to another live node while bad node offline
- Scheme sufficient for detecting if PowerPC is hung and general program control flow errors and exceptions
  - Still requires redundancy of constants to avoid persistent errors
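The detection and resend behaviour described above can be sketched as a small scheduler model. This is a simplified illustration, not the RadHard controller firmware: the class `HeartbeatScheduler`, its method names, the node identifiers, and the use of explicit timestamps are all assumptions, and the reboot or scrub action is reduced to marking a node offline.

```python
from collections import deque


class HeartbeatScheduler:
    """Hypothetical controller model: tracks heartbeats and reassigns frames."""

    def __init__(self, node_ids, heartbeat_window_s):
        self.window = heartbeat_window_s
        self.last_heartbeat = {node: 0.0 for node in node_ids}
        self.assigned = {node: None for node in node_ids}   # frame in flight per node
        self.offline = set()
        self.pending = deque()

    def submit(self, frame):
        """Queue a data frame for processing."""
        self.pending.append(frame)

    def heartbeat(self, node, now):
        """Record a heartbeat packet received from a node."""
        self.last_heartbeat[node] = now

    def frame_done(self, node):
        """Mark the node's current frame as finished."""
        self.assigned[node] = None

    def recovered(self, node, now):
        """Bring a node back after it has been rebooted or scrubbed."""
        self.offline.discard(node)
        self.last_heartbeat[node] = now

    def tick(self, now):
        """Detect missed heartbeats, requeue orphaned frames, dispatch pending work."""
        for node, seen in self.last_heartbeat.items():
            if node not in self.offline and now - seen > self.window:
                self.offline.add(node)            # a real controller would reboot or scrub here
                frame = self.assigned[node]
                if frame is not None:
                    self.pending.appendleft(frame)  # resend to the next live node first
                    self.assigned[node] = None
        dispatched = []
        for node in self.assigned:
            if node in self.offline or self.assigned[node] is not None:
                continue
            if not self.pending:
                break
            frame = self.pending.popleft()
            self.assigned[node] = frame
            dispatched.append((node, frame))
        return dispatched


if __name__ == "__main__":
    sched = HeartbeatScheduler(["ppc0", "ppc1"], heartbeat_window_s=0.1)
    for frame in ("frame1", "frame2", "frame3"):
        sched.submit(frame)
    print(sched.tick(0.0))
    sched.heartbeat("ppc0", 0.1)   # ppc1 stays silent
    sched.frame_done("ppc0")
    print(sched.tick(0.15))        # ppc1 is marked offline and its frame moves to ppc0
```

Putting the orphaned frame at the front of the queue is a design choice made for this sketch. Since the slides note that mass storage allows out-of-order execution, appending it to the back would work equally well.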

Theoretical Overhead Analysis

Normal operation
- Sub-system mitigation
  - Heartbeat write time ~100 usec
  - Frequency: 50 - 100 ms
  - Theoretical overhead %
- Application level mitigation
  - Cache flushing ~320 usec
  - Frequency: 50 - 100 ms
  - Theoretical overhead 0.3 - 0.6%
- Register level mitigation
  - Selective TMR: highly application dependent
- Total overhead: goal < 5%
- Subsystem throughput
  - SFTA ~1710 MIPs
  - DMR ~850 MIPs
  - QMR ~450 MIPs

Error recovery mode
- System utilization when one node rebooting?
  - 75% - PowerPC overhead ~= 70% of application peak performance
  - Set ‘real-time’ performance to account for errors
  - Still higher system throughput than DMR, QMR
- Expected reboot time?
  - With Linux: ~2 minutes
  - Without OS: < 1 minute
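The overhead figures on this slide follow from dividing the duration of a periodic task by its period. The sketch below reproduces that arithmetic from the slide's stated times. The four-node system size and the 450 MIPS per PowerPC (half of the 900 quoted earlier for two Virtex 4 PowerPCs) are assumptions used to show how the ~1710 MIPs and ~70% figures could arise, not values confirmed by the transcript.

```python
HEARTBEAT_WRITE_S = 100e-6      # ~100 usec, from the slide
CACHE_FLUSH_S = 320e-6          # ~320 usec, from the slide
PERIODS_S = (50e-3, 100e-3)     # 50 - 100 ms, from the slide


def periodic_overhead(task_time_s, period_s):
    """Fraction of processor time consumed by a task that runs once per period."""
    return task_time_s / period_s


def overhead_range(task_time_s, periods_s=PERIODS_S):
    """(min, max) overhead fraction over the given periods."""
    values = [periodic_overhead(task_time_s, p) for p in periods_s]
    return min(values), max(values)


def throughput_mips(nodes, mips_per_node, overhead_fraction):
    """Aggregate throughput when every node does useful work minus overhead."""
    return nodes * mips_per_node * (1.0 - overhead_fraction)


def degraded_utilization(total_nodes, offline_nodes, overhead_fraction):
    """Utilization with some nodes offline, subtracting overhead as the slide does."""
    return (total_nodes - offline_nodes) / total_nodes - overhead_fraction


if __name__ == "__main__":
    lo, hi = overhead_range(HEARTBEAT_WRITE_S)
    print(f"heartbeat overhead: {lo:.1%} to {hi:.1%}")
    lo, hi = overhead_range(CACHE_FLUSH_S)
    print(f"cache flush overhead: {lo:.2%} to {hi:.2%}")
    # Assumptions: 4 nodes, 450 MIPS each, 5% total overhead (the slide's goal).
    print(f"throughput: {throughput_mips(4, 450, 0.05):.0f} MIPS")
    print(f"one node rebooting: {degraded_utilization(4, 1, 0.05):.0%}")
```

The cache flush range of 0.32% to 0.64% is consistent with the 0.3 - 0.6% quoted on the slide, and the same formula gives 0.1% to 0.2% for the heartbeat write.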

Status
- Initial application analysis complete
- Implementing architecture
- Using GDB, bad instructions to insert errors
- Stay tuned for results

Questions?