Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna, Israel Koren
Graduate Students: Diganta, Eric, Janhavi, Osman, Vijay
Architecture and Real-Time Systems (ARTS) Lab
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003

What is ALFTD?
ALFTD stands for Application Level Fault Tolerance and Detection.
ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level.
Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance.
ALFTD may be used alone or to supplement other fault detection schemes.
ALFTD is scalable: output error can be traded off against the time overhead invested in fault tolerance.

ALFTD Overview
ALFTD allows the system to survive both data faults and system (instruction/hardware) faults.
System faults cause a process to eventually cease functioning.
Data faults cause a process to continue running with incorrect results.
ALFTD has been implemented in OTIS to determine its feasibility as a fault detection and tolerance method for REE applications.
OTIS has two sets of related output data: temperature and emissivity.
Experiments have focused mostly on the temperature output.

OTIS Structure
[Figure: MPI, the master process (M), the slave processes (S), and the output file]
1. MPI starts.
2. MPI starts slave and master processes.
3. Master sends tasks.
4. Slave calculations.
5. Slave output to file.

OTIS’ Work Distribution
OTIS’ dynamic workload distribution allows it to compensate for system faults: work originally partitioned for a failed processor is instead taken by the remaining processes.
OTIS does not compensate for data faults: as long as the work is completed, there is no measure of correctness.
OTIS does not consider deadline repercussions.
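The dynamic redistribution described on this slide can be sketched as a small simulation. This is only an illustration, not OTIS code: the function name `distribute_rows`, the worker names, and the fault model (a worker silently stops after a fixed number of rows) are all assumptions made for the example, and no MPI is involved.

```python
from collections import deque


def distribute_rows(num_rows, workers, failed_at):
    """Simulate master/slave dynamic work distribution (hypothetical model).

    workers:   list of worker names.
    failed_at: dict mapping a worker name to the number of rows it
               completes before it stops responding.
    Returns a dict mapping each completed row index to the worker that
    completed it.
    """
    pending = deque(range(num_rows))
    done_count = {w: 0 for w in workers}
    alive = list(workers)
    completed = {}
    while pending and alive:
        for w in list(alive):
            if not pending:
                break
            row = pending.popleft()
            if done_count[w] >= failed_at.get(w, float("inf")):
                # The worker died while holding this row: the master puts
                # the row back in the queue and stops using the worker.
                alive.remove(w)
                pending.append(row)
                continue
            completed[row] = w
            done_count[w] += 1
    return completed


# Worker "s2" fails after one row; the other workers absorb its share.
result = distribute_rows(8, ["s1", "s2", "s3"], failed_at={"s2": 1})
```

Every row is still completed after the failure, but nothing in this loop checks whether a completed row is correct, which is the gap ALFTD is meant to fill.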

OTIS Fault Cases

ALFTD OTIS Structure
[Figure: MPI, the master process (M), primary processes P1 to P3, secondary processes S1 to S3, and the output file]
1. MPI starts.
2. MPI starts slave and master processes, primary and secondary.
3. Master sends tasks.
4. Slave calculations.
5. Slave output to file?

Secondaries in OTIS
The secondary required for ALFTD is implemented to be functionally similar to the primary.
Secondary scaling occurs through resolution reduction.
OTIS’ “natural” input data exhibits spatial locality, so points not directly calculated can be estimated by interpolating between calculated points.
Secondary processes have been tested at 20%-50% of the primary calculation overhead.
While 50% affords better quality, 20% has less overhead.
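Resolution reduction with interpolation might look roughly like the sketch below. It assumes row-wise sampling and linear interpolation between the nearest computed rows; the slides do not state the interpolation scheme, and `secondary_rows` and `compute_row` are hypothetical names standing in for the real per-row calculation.

```python
def secondary_rows(compute_row, num_rows, step):
    """Compute every `step`-th row exactly and interpolate the others.

    compute_row: function mapping a row index to a list of floats
                 (stand-in for the real calculation; hypothetical).
    step:        2 costs roughly 50% of the primary work, 4 roughly 25%.
    """
    # Always include the last row so every skipped row has two neighbors.
    sampled = sorted(set(range(0, num_rows, step)) | {num_rows - 1})
    exact = {r: compute_row(r) for r in sampled}
    out = []
    for r in range(num_rows):
        if r in exact:
            out.append(exact[r])
            continue
        lo = max(s for s in sampled if s < r)
        hi = min(s for s in sampled if s > r)
        t = (r - lo) / (hi - lo)
        # Linear blend of the two nearest computed rows.
        out.append([(1 - t) * a + t * b for a, b in zip(exact[lo], exact[hi])])
    return out


# Data that varies linearly across rows is recovered exactly.
smooth = lambda r: [float(r), 2.0 * r]
approx = secondary_rows(smooth, 10, 4)
```

The quality of the estimate depends on how well the data obeys spatial locality: the smoother the field, the larger the step that can be tolerated.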

Example of Secondary Resolution
[Figure panels: 100%, 50%, 33%, and 25% secondary resolution (ALFTD compensation for 10 rows in a sample dataset)]

ALFTD Benefit

ALFTD Benefit (cont’d)

Fault Detection
Output filters determine when to run the secondary and when to use the secondary output.
Output filters are created to check for application-specific trends in data.
Aberrations from normal data characteristics can be considered the product of potentially faulty processes.
OTIS relies on natural temperature characteristics to detect potentially faulty data:
Spatial locality: temperature changes gradually over small areas.
Absolute bounds: temperature should not exceed certain values.
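The two temperature filters could be expressed as simple predicates over a row of output values. The sketch below is an assumption-laden illustration: the function names are invented, and the thresholds (`low`, `high`, `max_step`) are placeholder numbers, since the slides do not give the values OTIS uses.

```python
def absolute_bounds_ok(row, low, high):
    """True if every value lies within [low, high]."""
    return all(low <= v <= high for v in row)


def spatial_locality_ok(row, max_step):
    """True if neighboring values never differ by more than max_step."""
    return all(abs(b - a) <= max_step for a, b in zip(row, row[1:]))


def row_is_suspect(row, low=200.0, high=350.0, max_step=5.0):
    """Flag a row that violates either filter.

    The default thresholds are placeholders, not values from the source.
    """
    return not (absolute_bounds_ok(row, low, high)
                and spatial_locality_ok(row, max_step))


print(row_is_suspect([290.0, 291.0, 292.5]))   # smooth and in range
print(row_is_suspect([290.0, 291.0, 900.0]))   # out of bounds
print(row_is_suspect([290.0, 310.0, 291.0]))   # abrupt jump
```

A flagged row is only potentially faulty. Turbulent but legitimate data can also trip the locality filter, which is why a suspect row is checked against a secondary instead of being discarded outright.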

Data Sets
Three data sets were chosen for their interesting characteristics:
“Blob”: broad, unchanging areas with dark spots.
“Stripe”: relatively static except for one “stripe”.
“Spots”: turbulent spots that may defy “spatial locality” predictions.

Data Frequency (Values)

Data Frequency (Spatial Locality)

Validation Through Secondaries
When the primary deadline is hit, a row is re-delegated to the secondaries if (and only if) one of the following holds:
The primary has returned results for that row that are suspected to be faulty. The secondary results can then be used to decide whether the results are indeed faulty.
The row was never successfully calculated. The secondary results can then be used immediately in place of the missing primary results.
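The re-delegation rule can be sketched as a selection step run once the primary deadline passes. The names `rows_for_secondary`, `primary_results`, and `is_suspect` are hypothetical, and the filter passed in is a toy stand-in for the real output filters.

```python
def rows_for_secondary(primary_results, num_rows, is_suspect):
    """Pick the rows to hand to the secondaries at the primary deadline.

    primary_results: dict mapping a row index to the values the primary
                     returned before the deadline (rows may be absent).
    is_suspect:      output filter, returns True for a doubtful row.
    Returns a list of (row, reason) pairs.
    """
    redo = []
    for r in range(num_rows):
        if r not in primary_results:
            # Never calculated: the secondary result replaces it directly.
            redo.append((r, "missing"))
        elif is_suspect(primary_results[r]):
            # Calculated but doubtful: the secondary acts as a check.
            redo.append((r, "suspect"))
    return redo


primary = {0: [1.0, 1.1], 1: [1.0, 99.0], 3: [1.2, 1.3]}
todo = rows_for_secondary(primary, 4, is_suspect=lambda row: max(row) > 10)
print(todo)
```

Rows that arrived on time and pass the filters are never recomputed, which keeps the secondary overhead proportional to the number of problem rows.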

Validation Through Secondaries (cont’d)
After the secondary has been run to verify a primary’s results, the “better” data is chosen according to the following logic grid:
[Logic grid; recoverable labels: Primary, Faultless, Ambiguous, Faulty, Secondary, Primary*, Secondary]

Fault Tolerance Results: “Spots”
[Figure: fault tolerance with injected faults in “Spots”]

Fault Tolerance Results: “Spots” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

Fault Tolerance Results: “Spots” (cont’d)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]

Fault Tolerance Results: “Blob”
[Figure: fault tolerance with injected faults in “Blob”]

Fault Tolerance Results: “Blob” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

Fault Tolerance Results: “Blob” (cont’d)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]

Fault Tolerance Results: “Stripe”
[Figure: fault tolerance with injected faults in “Stripe”]

Fault Tolerance Results: “Stripe” (cont’d)
[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

Fault Tolerance Results: “Stripe” (cont’d)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: No Error to Max Error]

Emissivity Data
Emissivity is loosely proportional to temperature data.
Emissivity exhibits spatial locality.
Emissivity has natural bounds on expected data:
Below 0.5: faulty.
Natural range: metal ~0.5; rock ~0.8 to ~0.95; vegetation, water ~1.0.
Above 1.0: faulty.
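The emissivity bounds translate directly into an absolute-bounds check. In this minimal sketch the limits 0.5 and 1.0 come from the slide, while the function and constant names are made up for illustration.

```python
EMISSIVITY_MIN = 0.5  # below this the slide labels the value faulty
EMISSIVITY_MAX = 1.0  # above this the slide labels the value faulty


def faulty_emissivity_indices(values):
    """Return the positions whose emissivity falls outside [0.5, 1.0]."""
    return [i for i, v in enumerate(values)
            if not (EMISSIVITY_MIN <= v <= EMISSIVITY_MAX)]


print(faulty_emissivity_indices([0.5, 0.82, 1.0, 1.3, 0.1]))
```

Because the test is written as a negated range comparison, a NaN value is also reported as faulty.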

Emissivity Data (cont’d)
Emissivity does not exhibit the same data “closeness” as the temperature output, which makes it very difficult to distinguish faulty from non-faulty data.
Luckily, faults present in the temperature output are easily detected and reflect faults in the emissivity output.
Emissivity does not have per-pixel independence of calculation.
Dependence on the correctness of neighboring pixels makes resolution reduction a viable, but not the best, method for secondary reduction.

Data Frequency (Emissivity Values)

Conclusion
ALFTD has already been shown to be a worthwhile alternative to full redundancy.
Improvements to the scheme will increase fault coverage and decrease secondary calculation overhead in both the emissivity and temperature outputs.
OTIS, as a general matrix-based master/slave program, is a springboard to other, similar programs (e.g., NGST).
ALFTD as a fault-detection scheme will continue to be effective in programs that exhibit “natural” output.

Thank You!

Relative Error Calculation
Error in OTIS output is calculated relative to a faultless “template”.
The average relative error is the average of the relative errors over the entire output.
Faulty value = f(x,y)
Faultless value = F(x,y)
Error = $|f(x,y) - F(x,y)| / |F(x,y)|$
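The average relative error might be computed as in the sketch below. It assumes the usual definition of relative error, |f(x,y) - F(x,y)| / |F(x,y)|, and a faultless template with no zero entries; the function name and the sample values are illustrative only.

```python
def average_relative_error(faulty, faultless):
    """Mean of |f - F| / |F| over all points of two equally sized grids.

    faulty:    rows of output values f(x, y) from the run under test.
    faultless: rows of template values F(x, y), assumed nonzero.
    """
    total = 0.0
    count = 0
    for f_row, F_row in zip(faulty, faultless):
        for f, F in zip(f_row, F_row):
            total += abs(f - F) / abs(F)
            count += 1
    return total / count


template = [[100.0, 200.0], [300.0, 400.0]]
observed = [[100.0, 220.0], [300.0, 400.0]]
err = average_relative_error(observed, template)
```

Here one of the four points is off by 10% and the rest match the template, so the average works out to about 2.5%.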