1
Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation
Mrs. Shazia Maqbool and Dr. Craig I Underwood
Maqbool 1 MAPLD 2005/P181
2
Overview
- Context
- Motivation
- Mitigation Scheme
- Top Level Description
- Protocol Gets Defined
- Test Bed
- Experimental Results
- Conclusions
3
Single Event Functional Interrupts (SEFIs)
- A type of anomaly in microcircuits caused by a single ion strike
- Occurs in a sensitive cross-section of the device
- The user does not have direct access to the fault location
Signatures
- An upset rate higher than expected
- Non-responding device
- In a communication network, a SEFI is an event that stops communication
- Variations in device current consumption
During a SEFI, the device is unavailable to the system
- The device is potentially recoverable
- Recovery involves resetting or power cycling
- System recovery requires restoring the device functionality, followed by recovery of its state
4
Mitigation Levels
Techniques
- Using Radiation Hardening Processes
- Built-in Fault Tolerance Features
- Incorporating Redundancy within Device
- Error Detection And Corrections (EDACs)
- Redundancy Techniques, e.g. Voting, Lockstep etc.
- Configuration Scrubbing
- Data Handling Networks
Levels
- Device Level
- Unit Level
- System/Architectural Level
5
Motivation
Space applications demand more:
- Computational power
- Standardization
- Reusability
but less:
- Mass, volume and power budget
- Cost
- Development time
Candidate architectures are heavily based on state-of-the-art COTS technology
- Reliability
- Availability
SEFIs and single event transients are becoming dominant radiation hazards
A unit-level approach has usually been considered for SEFI mitigation
6
System Architecture
A fast data network interlinks all units
- Scalable
- Distributed
- Reusable
System-level SEFI mitigation
- A diagnosis and recovery (DAR) packet from each unit acts as an indicator of the unit's health status
- The supervisor intervenes when a packet does not arrive or does not match expectation
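The supervisor's decision rule can be sketched as follows. This is only an illustration: the class name `UnitState`, the function `needs_intervention`, the fixed expected payload and the time-out value are all assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitState:
    expected_payload: bytes  # hypothetical: value the supervisor expects in a DAR packet
    last_seen: float         # time (s) at which the last DAR packet arrived
    timeout: float           # assumed time-out (s) before a missing packet counts as a fault


def needs_intervention(unit: UnitState, now: float, packet: Optional[bytes]) -> bool:
    """Return True when the supervisor should intervene for this unit."""
    if packet is None:
        # No packet this cycle: intervene only once the time-out has expired.
        return (now - unit.last_seen) > unit.timeout
    # A packet arrived: record the arrival, then compare with the expectation.
    unit.last_seen = now
    return packet != unit.expected_payload
```

A supervisor loop would call `needs_intervention` once per polling cycle for each unit, passing `None` when nothing was received.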
7
Why a System-Level Approach
- Cost-effective
- Adaptable
- Reusable
- The power cycling needed to recover from SEFIs demands an external entity to hold state data and to initiate the recovery procedure
- In case of a permanent failure, the failed unit can be switched off
- Supervisory functions, network management and configuration management can be combined
8
On-Board Computer
Possible sources of fault
- Processor
- Memory
- Network interface
[Figure: The OBC Subsystem. Blocks: Processor, EDAC, Memory, Interface FPGA, OPC]
OPC: Over-Current Protection Circuitry
Required underlying mitigations
- EDAC
- OPC
9
SEFI Signatures
10
Supervisory Protocol
Two types of packets
- Screech Packet
- Diagnosis And Recovery (DAR) Packet
DAR task
- Performs testing of the processor
- Collects the error count of the memory unit
- Updates state data
Current consumption of the OBC module will be monitored
Packet fields: System ID, Length, Flags, Diagnostic health data/Screech data
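A minimal sketch of how a packet with these four fields might be encoded. The slide names the fields but not their sizes, so the one-byte widths, the byte order and the names `HEADER`, `pack_packet` and `unpack_packet` are assumptions made for illustration.

```python
import struct

# Assumed layout: 1-byte System ID, 1-byte Length, 1-byte Flags, then the data.
# The slide gives only the field names; the widths here are hypothetical.
HEADER = struct.Struct("!BBB")


def pack_packet(system_id: int, flags: int, payload: bytes) -> bytes:
    """Build a packet: header followed by diagnostic health data or screech data."""
    if len(payload) > 255:
        raise ValueError("payload does not fit the assumed 1-byte Length field")
    return HEADER.pack(system_id, len(payload), flags) + payload


def unpack_packet(raw: bytes):
    """Split a packet into (system_id, flags, payload), checking the Length field."""
    system_id, length, flags = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("packet is shorter than its Length field claims")
    return system_id, flags, payload
```

Carrying an explicit Length field lets the receiver reject truncated packets before interpreting the health data.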
11
Diagnosis And Recovery (DAR) Packet Flow
[Figure: sequence diagram across OBC-Processor, OBC-Interface FPGA, Supervisor and Code Store. Labels: DAR Process Starts; Disable Interrupts; SODARP Marker; Start Sampling Current; Perform Test; Collect SEU Count; Collect Current Value; Send DAR Packet; DAR Packet Received; Compare with Stored Values; Waiting for Supervisor Response; Command to Update Program Memory; Update Memory; Enable Interrupts]
12
Recovery Method

Fault Type                               Recovery Procedure
Screech                                  Reload program memory
Packet time-out (network problem)        See next slide
Packet time-out (processor problem)      See next slide
Current consumption variations           Power cycle and reload memory
SEU count exceeding threshold            Reload memory
Test task result mismatch                Reset and reload memory
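The fault-to-recovery mapping lends itself to a lookup table. The sketch below encodes only the four rows for which the slide gives a procedure; the fault keys and action names are hypothetical labels, and the two packet time-out cases are deliberately left unmapped because this slide defers them.

```python
# Hypothetical fault keys and action names; only rows with a stated procedure are encoded.
RECOVERY = {
    "screech": ("reload_program_memory",),
    "current_variation": ("power_cycle", "reload_memory"),
    "seu_count_exceeded": ("reload_memory",),
    "test_mismatch": ("reset", "reload_memory"),
}


def recovery_for(fault: str):
    """Return the ordered recovery actions for a fault, or None if not mapped here."""
    return RECOVERY.get(fault)
```

Keeping the actions as an ordered tuple preserves the sequence stated on the slide, for example power cycling before reloading memory.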
13
Recovery Method (2)
- After a processor reset or power cycle, the OBC should be allowed sufficient time for reinitialization
- The supervisor needs to keep a record of the recoveries applied
- Consecutive recovery cycles need to be avoided
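One way to combine these three requirements is a small recovery log kept by the supervisor. This is a sketch under assumptions: the class `RecoveryLog`, the reinitialization hold-off and the limit on consecutive recoveries are illustrative choices, not values from the presentation.

```python
class RecoveryLog:
    """Record of recoveries applied to one unit (hypothetical helper)."""

    def __init__(self, reinit_time: float, max_consecutive: int):
        self.reinit_time = reinit_time          # assumed time (s) the OBC needs to reinitialize
        self.max_consecutive = max_consecutive  # assumed cap on back-to-back recoveries
        self.history = []                       # list of (time, action) tuples
        self.consecutive = 0

    def may_recover(self, now: float) -> bool:
        """Allow a recovery only after the hold-off and below the consecutive cap."""
        if self.history and now - self.history[-1][0] < self.reinit_time:
            return False  # the OBC may still be reinitializing
        return self.consecutive < self.max_consecutive

    def record(self, now: float, action: str) -> None:
        """Log a recovery that has just been applied."""
        self.history.append((now, action))
        self.consecutive += 1

    def healthy(self) -> None:
        """Call when a valid DAR packet arrives; ends the run of recoveries."""
        self.consecutive = 0
```

When `may_recover` returns False because the cap was reached, a supervisor could escalate, for instance by treating the fault as permanent.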
14
Test Bed
- Demonstration of the synchronization protocol
- PC1 executes the OBC program
- PC2 executes the supervisor program
[Figure panels: Synchronization Scheme 1; Synchronization Scheme 2]
15
Synchronization Scheme 1
Configuration 1
- UDP/IP packet from the supervisor arrives on Ethernet
- RC 203 board passes it as it is to the OBC program over the parallel port
- OBC program receives a packet and checks its source; if it is from the supervisor program, it sends a packet to the FPGA
- FPGA sends the packet to the supervisor program on Ethernet
Configuration 2
- UDP/IP packet from the supervisor arrives on Ethernet
- RC 203 passes only the data bytes
- OBC program receives the data and sends data bytes to the FPGA over the parallel port
- FPGA encodes the received data into a UDP/IP packet
- Packet is sent on Ethernet
16
Time Measurement Method
- The Ethereal graphical user interface (GUI) network protocol analyzer was used
- It displays the time at which each packet was captured
- It also displays the IP source and destination, the protocol type, and the source and destination ports for all captured packets
- Selecting a packet from the list of captured packets shows the total bytes captured on the network medium, the Ethernet source and destination addresses, and the number of data bytes in the packet
- For synchronization scheme 1, time was measured from the capture of the packet sent by the supervisor to the capture of the return packet from the OBC
- For synchronization scheme 2, time was measured between two consecutive packets from the OBC
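Both measurements reduce to differences between capture timestamps. The sketch below assumes the capture has already been exported as a list of `(timestamp_seconds, source)` tuples with the source labelled `"supervisor"` or `"obc"`; that representation and the function names are assumptions for illustration, not a feature of the analyzer.

```python
def round_trip_times(captures):
    """Scheme 1: time from each supervisor packet to the next OBC packet."""
    rtts = []
    sent = None
    for timestamp, source in captures:
        if source == "supervisor":
            sent = timestamp
        elif source == "obc" and sent is not None:
            rtts.append(timestamp - sent)
            sent = None  # each supervisor packet is paired with one response
    return rtts


def inter_packet_times(captures):
    """Scheme 2: time between consecutive packets from the OBC."""
    obc = [timestamp for timestamp, source in captures if source == "obc"]
    return [later - earlier for earlier, later in zip(obc, obc[1:])]
```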
17
Results (1)
Time per byte = (time measured in run n+1 with N+K bytes - time measured in run n with N bytes) / K

Data bytes   Average time between the two packets in a pair          Time required for 1 byte to travel
             (supervisor packet and OBC response packet) (ms)         through the system (µs)
18           101.053                                                  (101.264 - 101.053)/18 = 11.7
36           101.264                                                  (101.822 - 101.264)/36 = 15.5
72           101.822                                                  (102.229 - 101.822)/28 = 14.53
100          102.229                                                  (103.677 - 102.229)/400 = 14.85
500          103.677
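The per-byte figure is the extra time between two runs divided by the extra bytes sent. The sketch below recomputes it from the listed averages; the function name `per_byte_us` is hypothetical. For the first three pairs it gives about 11.72, 15.5 and 14.54 µs, in line with the slide. For the last pair the listed times give about 3.62 µs, whereas the slide prints 14.85, so one of the printed values in that row may not be as measured.

```python
# (data bytes, average time in ms) as listed on the slide
RUNS = [(18, 101.053), (36, 101.264), (72, 101.822), (100, 102.229), (500, 103.677)]


def per_byte_us(runs):
    """Per-byte time in microseconds between each pair of consecutive runs."""
    results = []
    for (n0, t0), (n1, t1) in zip(runs, runs[1:]):
        extra_ms = t1 - t0      # additional time for the larger payload
        extra_bytes = n1 - n0   # K, the additional bytes
        results.append(extra_ms * 1000.0 / extra_bytes)
    return results
```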
18
Results (2)

Data bytes   Experiment                                            Time between a supervisor packet and the
                                                                   response packet from the OBC (µs)
18           Configuration 1 with RC200GetBlock Stall function     101053
18           Configuration 1 with RC200GetBlock function           1201
18           Configuration 2 with RC200GetBlock Stall function     101066
18           Ping program                                          270
19
Synchronization Scheme 2
Configuration for Synchronization Scheme 2
- OBC sends data bytes to the interface FPGA over the parallel port
- Interface FPGA encodes the data into a UDP/IP packet and writes it on Ethernet

Experiment                                                                     Time measured
Synchronization scheme 2: time between two consecutive packets from the OBC    261 µs
(18 data bytes)

Time between the last OBC packet prior to the fault and the first packet after recovery

Experiment                                                                     Scheme 1                    Scheme 2
OBC program crashed and reinitialized manually                                 12 s, 299 ms and 644 µs     11 s, 316 ms and 272 µs
FPGA cleared using FTU facility and OBC program reinitialized manually         15 s, 339 ms and 501 µs     14 s, 971 ms and 559 µs
20
Conclusions
- A system-level approach has been presented to mitigate SEFIs in data handling architectures
- Upset detection is not straightforward, which limits the effectiveness of currently available mitigation techniques
- SEFI susceptibility is increasing in all major data handling device technologies
- A system-level intelligent supervisor allows monitoring of a wide range of devices with minimal overhead
- Synchronization is straightforward
- Two synchronization schemes have been demonstrated
- A few simple experiments were performed to establish a time-out period for a packet from the OBC. Once this time-out was established, the system behaved as expected and synchronized packet communication was established between the OBC and the supervisor programs
- In the event of a SEFI, the supervisor program needs to wait until the OBC program is up again. The time-out for this wait period will depend on the recovery latency
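As an illustration of how the wait time-out could follow from the recovery latency, the sketch below converts the four recovery times reported for the two synchronization schemes into seconds and scales the largest by a safety margin. The margin of 1.5 and the function names are assumptions; the presentation does not state how the time-out should be chosen.

```python
def to_seconds(s: int, ms: int, us: int) -> float:
    """Convert a time given as seconds, milliseconds and microseconds to seconds."""
    return s + ms / 1e3 + us / 1e6


# Recovery latencies reported for synchronization schemes 1 and 2
LATENCIES = [
    to_seconds(12, 299, 644),
    to_seconds(15, 339, 501),
    to_seconds(11, 316, 272),
    to_seconds(14, 971, 559),
]


def recovery_wait_timeout(latencies, margin: float = 1.5) -> float:
    """Assumed rule: wait for the worst observed recovery latency times a margin."""
    return max(latencies) * margin
```

Note that these latencies include manual reinitialization of the OBC program, so an automated recovery could need a different figure.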