1
Application Level Fault Tolerance and Detection
Principal Investigators: C. Mani Krishna, Israel Koren
Graduate Students: Diganta, Eric, Janhavi, Osman, Vijay
Architecture and Real-Time Systems (ARTS) Lab
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
2
What is ALFTD? Application Level Fault Tolerance and Detection
ALFTD complements existing system or algorithm level fault tolerance by leveraging information available only at the application level
  Using such application level semantic information significantly reduces the overall cost of providing fault tolerance
ALFTD may be used alone or to supplement other fault detection schemes
ALFTD is scalable
  Output error can be traded off against the time overhead invested in fault tolerance
3
ALFTD Overview
ALFTD allows the system to survive both data faults and system (instruction/hardware) faults
  System faults cause a process to eventually cease functioning
  Data faults cause a process to continue running with incorrect results
ALFTD has been implemented in OTIS to determine its feasibility as a fault detection and tolerance method for REE applications
  OTIS has two sets of related output data: temperature and emissivity
  Experiments have focused mostly on the temperature output
4
OTIS Structure
[Diagram: MPI, master (M), slaves (S), OUTPUT]
1. MPI starts
2. MPI starts slave and master processes
3. Master sends tasks
4. Slave calculations
5. Slave output to file
5
OTIS’ Work Distribution
OTIS’ dynamic workload distribution allows it to compensate for system faults
  Work originally partitioned for a failed processor is instead taken by the remaining processes
OTIS does not compensate for data faults
  As long as the work is completed, there is no measure of correctness
OTIS does not consider deadline repercussions
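The dynamic distribution described on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions, not OTIS code: the function name `distribute_rows`, the worker names, and the `failed_after` parameter are hypothetical, and a real implementation would exchange MPI messages rather than loop over a list.

```python
from collections import deque


def distribute_rows(num_rows, workers, failed_after):
    """Simulate a master handing out rows one at a time (hypothetical sketch).

    workers:      list of worker names.
    failed_after: dict mapping a worker to the number of rows it completes
                  before a system fault stops it; absent means it never fails.
    Returns a dict mapping each completed row to the worker that finished it.
    """
    pending = deque(range(num_rows))
    done_count = {w: 0 for w in workers}
    alive = list(workers)
    completed = {}
    while pending and alive:
        for worker in list(alive):
            if not pending:
                break
            row = pending.popleft()
            if done_count[worker] >= failed_after.get(worker, float("inf")):
                # The worker has stopped responding: drop it and return
                # its row to the queue so a remaining worker picks it up.
                alive.remove(worker)
                pending.appendleft(row)
                continue
            # Note: the result is accepted without any correctness check,
            # which is why data faults go unnoticed in this scheme.
            completed[row] = worker
            done_count[worker] += 1
    return completed
```

For example, `distribute_rows(6, ["s1", "s2"], {"s2": 1})` still completes all six rows, with `s1` absorbing the work that `s2` could not finish.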
6
OTIS Fault Cases
7
ALFTD OTIS Structure
[Diagram: MPI, master (M), primaries (P1, P2, P3), secondaries (S1, S2, S3), OUTPUT]
1. MPI starts
2. MPI starts slave and master processes, primary and secondary
3. Master sends tasks
4. Slave calculations
5. Slave output to file?
8
Secondaries in OTIS
The secondary required for ALFTD is implemented to be functionally similar to the primary
Secondary scaling occurs through resolution reduction
  OTIS’ “natural” data input exhibits spatial locality
  Points not directly calculated can be estimated by interpolating between calculated points
Secondary processes have been tested at 20%-50% of the primary calculation overhead
  While 50% affords better quality, 20% has less overhead
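A minimal sketch of resolution reduction with interpolation follows. It assumes the reduction is applied along a single row and that linear interpolation is used; the slides state neither, and `secondary_row`, `compute_point`, and `step` are hypothetical names. A `step` of 2, 3, or 4 corresponds roughly to 50%, 33%, or 25% of the primary's calculations.

```python
def secondary_row(compute_point, width, step):
    """Compute every `step`-th point of a row and interpolate the rest.

    compute_point: function mapping a column index to its (expensive) value.
    width:         number of points in the row.
    step:          sampling stride; larger means cheaper but coarser.
    """
    if width <= 0:
        return []
    sampled = list(range(0, width, step))
    if sampled[-1] != width - 1:
        sampled.append(width - 1)  # always anchor the last point of the row
    values = {x: compute_point(x) for x in sampled}

    row = []
    for left, right in zip(sampled, sampled[1:]):
        for x in range(left, right):
            # Linear interpolation between the two nearest computed points.
            t = (x - left) / (right - left)
            row.append(values[left] + t * (values[right] - values[left]))
    row.append(values[sampled[-1]])
    return row
```

Because the input exhibits spatial locality, the interpolated points are expected to stay close to what a full calculation would produce; the sketch reproduces a linearly varying row exactly and only approximates anything else.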
9
Example of Secondary Resolution
[Figure: ALFTD compensation for 10 rows in a sample dataset at 100%, 50%, 33%, and 25% secondary resolution]
10
ALFTD Benefit
11
ALFTD Benefit (cont’d)
12
Fault Detection
Output filters determine when to run the secondary and when to use the secondary output
  Output filters are created to check for application-specific trends in data
  Aberrations from normal data characteristics can be treated as the product of potentially faulty processes
OTIS relies on natural temperature characteristics to detect potentially faulty data
  Spatial Locality: temperature changes gradually over small areas
  Absolute Bounds: temperature should not exceed certain values
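The two temperature checks named on this slide could be combined into an output filter along the following lines. The function name `filter_row` and the thresholds `lower`, `upper`, and `max_step` are assumptions made for illustration; the slides do not give the actual values or the filter's exact form.

```python
def filter_row(row, lower, upper, max_step):
    """Return the sorted indices of values that look potentially faulty.

    lower, upper: absolute bounds a temperature should stay within.
    max_step:     largest plausible change between neighbouring points.
    """
    suspects = set()
    # Absolute bounds: flag any value outside [lower, upper].
    for i, value in enumerate(row):
        if not (lower <= value <= upper):
            suspects.add(i)
    # Spatial locality: flag both neighbours of an abrupt jump, since the
    # filter alone cannot tell which of the two values is the wrong one.
    for i in range(1, len(row)):
        if abs(row[i] - row[i - 1]) > max_step:
            suspects.update((i - 1, i))
    return sorted(suspects)
```

A row with no flagged indices would be accepted as is; a row with flagged indices is a candidate for checking against the secondary.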
13
Data Sets
Three data sets were chosen for their interesting characteristics
  “Blob”: broad, unchanging areas with dark spots
  “Stripe”: relatively undynamic except for one “stripe”
  “Spots”: turbulent spots may defy “spatial locality” predictions
14
Data Frequency (Values)
15
Data Frequency (Spatial Locality)
16
Validation Through Secondaries
When the primary deadline is hit, rows are re-delegated to the secondaries if (and only if) one of the following holds:
  The primary has returned results for that row that are suspected to be faulty
    The secondary results can be used to decide whether the results are indeed faulty
  A particular row was never successfully calculated
    The secondary results can be immediately used in place of the missing primary results
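The re-delegation rule can be written down as a short sketch. The names `rows_for_secondary`, `primary_results`, and `is_suspect` are hypothetical, and the output filter is passed in as a plain predicate because its real form is not given in the slides.

```python
def rows_for_secondary(num_rows, primary_results, is_suspect):
    """Decide which rows go to the secondaries once the primary deadline hits.

    primary_results: dict mapping a row index to the primary's result;
                     rows that were never completed are simply absent.
    is_suspect:      predicate standing in for the output filter.
    Returns a list of (row, reason) pairs, reason being "missing" or "suspect".
    """
    redo = []
    for row in range(num_rows):
        if row not in primary_results:
            # Never calculated: the secondary result replaces it directly.
            redo.append((row, "missing"))
        elif is_suspect(primary_results[row]):
            # Possibly faulty: the secondary result is used to verify it.
            redo.append((row, "suspect"))
    return redo
```

Rows that were completed and passed the filter are left alone, so the secondary's cost is only paid where it can change the outcome.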
17
Validation Through Secondaries (cont’d)
After the secondary has been run to verify a primary’s results, the “better” data is chosen according to the following logic grid:
[Logic grid over the primary’s verdict (Faultless, Ambiguous, Faulty) and the secondary; entries: Primary, Primary*, Secondary]
18
Fault Tolerance Results: “Spots”
[Figure: fault tolerance with injected faults in “Spots”]
19
Fault Tolerance Results: “Spots” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
20
Fault Tolerance Results: “Spots” (cont’d)
[Difference plots, faulty output versus faultless output: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]
21
Fault Tolerance Results: “Blob”
[Figure: fault tolerance with injected faults in “Blob”]
22
Fault Tolerance Results: “Blob” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
23
Fault Tolerance Results: “Blob” (cont’d)
[Difference plots, faulty output versus faultless output: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]
24
Fault Tolerance Results: “Stripe”
[Figure: fault tolerance with injected faults in “Stripe”]
25
Fault Tolerance Results: “Stripe” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
26
Fault Tolerance Results: “Stripe” (cont’d)
[Difference plots, faulty output versus faultless output: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]
27
Emissivity Data
Emissivity is loosely proportional to temperature data
Emissivity exhibits spatial locality
Emissivity has natural bounds of expected data
  <0.5: Faulty
  Natural Metal: ~0.5
  Rock: ~0.8 to ~0.95
  Vegetation, Water: ~1.0
  >1.0: Faulty
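The natural bounds listed on this slide translate directly into a range check. The sketch below assumes the bounds are inclusive at 0.5 and 1.0 (the slide only marks values below 0.5 and above 1.0 as faulty); the constant and function names are hypothetical.

```python
EMISSIVITY_MIN = 0.5  # below this the value is treated as faulty
EMISSIVITY_MAX = 1.0  # above this the value is treated as faulty


def out_of_bounds_emissivity(row):
    """Return the indices of emissivity values outside the natural bounds."""
    return [
        i for i, e in enumerate(row)
        if not (EMISSIVITY_MIN <= e <= EMISSIVITY_MAX)
    ]
```

For a row such as `[0.82, 0.5, 1.0, 0.3, 1.2, 0.95]`, only the entries 0.3 and 1.2 fall outside the bounds.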
28
Emissivity Data (cont’d)
Emissivity does not exhibit the same data “closeness” as temperature output
  This makes it very difficult to distinguish faulty from non-faulty data
  Luckily, faults present in temperature output are easily detected, and reflect faults in emissivity output
Emissivity does not have per-pixel independence of calculation
  Dependence on the correctness of neighboring pixels makes resolution reduction a viable, but not the best, method for secondary reduction
29
Data Frequency (Emissivity Values)
30
Conclusion
ALFTD has already been shown to be a worthwhile alternative to full redundancy
Improvements to the scheme will increase fault coverage and decrease secondary calculation overhead in both the emissivity and temperature outputs
OTIS, as a general matrix-based master/slave program, is a springboard to other, similar programs (e.g., NGST)
ALFTD as a fault-detection scheme will continue to be effective in programs which exhibit “natural” output
31
Thank You!
32
Relative Error Calculation
Error in OTIS output is calculated relative to a faultless “template”
The average relative error is the average of all relative errors of the entire output
  Faulty value = f(x,y)
  Faultless value = F(x,y)
  Error = [formula not preserved]
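The error expression itself is missing from the slide text. Assuming the conventional definition of relative error, |f(x,y) - F(x,y)| / |F(x,y)|, the average over the whole output could be computed as in the sketch below. Both the formula and the function name `average_relative_error` are assumptions, not taken from the presentation.

```python
def average_relative_error(faulty, faultless):
    """Average of |f - F| / |F| over every point of the output.

    faulty:    2-D list of values f(x, y) from the run under test.
    faultless: 2-D list of template values F(x, y), same shape.
    Assumes no template value is zero and the output is not empty.
    """
    total = 0.0
    count = 0
    for f_row, F_row in zip(faulty, faultless):
        for f, F in zip(f_row, F_row):
            total += abs(f - F) / abs(F)
            count += 1
    return total / count
```

With a template of `[[100.0, 200.0], [300.0, 400.0]]` and an output of `[[110.0, 200.0], [300.0, 360.0]]`, two of the four points are off by 10% each, so the average relative error under this definition is 5%.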