Probabilistic Model-Driven Recovery in Distributed Systems Kaustubh R. Joshi, Matti A. Hiltunen, William H. Sanders, and Richard D. Schlichting May 2,

Presentation transcript:

Probabilistic Model-Driven Recovery in Distributed Systems
Kaustubh R. Joshi, Matti A. Hiltunen, William H. Sanders, and Richard D. Schlichting
May 2, 2012
Presented by Weiwei Qiu

Background
High-availability approaches typically combine redundancy with fault detection and repair by human operators. Automating recovery is challenging in practice because of:
▫ inaccurate fault diagnosis
▫ poor fault localization
▫ false positives
▫ action selection

Objective
Present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems by:
▫ using a theoretically well-founded model-based mechanism for automated failure detection, diagnosis, and recovery
▫ combining recovery actions with diagnosis
▫ detecting when a problem is beyond the approach's diagnosis and recovery capabilities

Approach Overview
Diagnose system problems using the output of any existing monitors, and choose the recovery actions that are most likely to restore the system to a proper state at minimum cost.
▫ determine which combinations of component faults can occur in the system in a short window of time (fault hypotheses)

Approach Overview (cont.)
▫ specify the coverage of each monitor m in the system with regard to each fault hypothesis
▫ specify the effects of each recovery action according to how it modifies the system state and fault hypothesis

Motivating Example
Enterprise Messaging Network (EMN)

Simplified EMN configuration: implements a Company Object Lookup (COL) system

System Model
Fault hypothesis
▫ A fault hypothesis is a Boolean expression that, when true, indicates the presence of one or more faults in the system.
Example fault hypotheses: Down, Crash, Value
▫ Down(r): host r has crashed
▫ Crash(c): component c has crashed
▫ Value(c): component c is alive but does not provide correct service

Monitors
Each monitor returns true if it suspects a fault, and false otherwise. A system may include a variety of monitoring techniques, including:
▫ Heartbeat-based monitors
▫ Test-based monitors
▫ End-to-end monitors
▫ Error logs
▫ Statistical monitors

Monitor Coverage
Monitor coverage, P[m|h], is the probability that monitor m returns true given that fault hypothesis h is true.
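As an illustration of how a coverage model might be stored and queried, the following Python sketch keeps P[m|h] in a lookup table and derives the likelihood of a set of observed monitor outputs. The monitor names, the numeric coverage values, the "no_fault" hypothesis, and the assumption that monitors are conditionally independent given the hypothesis are all hypothetical choices made for this example, not values from the slides.

```python
from typing import Dict

# Hypothetical coverage table: COVERAGE[m][h] = P[m | h], the probability
# that monitor m returns true when fault hypothesis h holds.
# Names and numbers are illustrative assumptions, not data from the slides.
COVERAGE: Dict[str, Dict[str, float]] = {
    "ping_S1": {"Crash(S1)": 1.0, "Value(S1)": 0.0, "no_fault": 0.0},
    "end_to_end": {"Crash(S1)": 0.5, "Value(S1)": 0.5, "no_fault": 0.05},
}


def output_likelihood(monitor: str, hypothesis: str, output: bool) -> float:
    """Probability that `monitor` reports `output` given `hypothesis`."""
    p_true = COVERAGE[monitor][hypothesis]
    return p_true if output else 1.0 - p_true


def joint_likelihood(outputs: Dict[str, bool], hypothesis: str) -> float:
    """Likelihood of all observed outputs given `hypothesis`.

    Assumes (for this sketch) that monitor outputs are conditionally
    independent given the hypothesis, so the likelihoods multiply.
    """
    result = 1.0
    for monitor, output in outputs.items():
        result *= output_likelihood(monitor, hypothesis, output)
    return result


# The ping monitor is quiet but the end-to-end monitor raises an alarm.
observed = {"ping_S1": False, "end_to_end": True}
likelihoods = {h: joint_likelihood(observed, h)
               for h in ("Crash(S1)", "Value(S1)", "no_fault")}
```

With these made-up numbers, a crash is ruled out because the ping monitor would have fired, while a value fault remains far more likely than a false alarm.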

Recovery Actions
The application-specific recovery actions A provide the only way for the controller to change the truth value of fault hypotheses. An action a is specified in terms of:
▫ its "fault hypothesis effect" function
▫ its mean duration a.t(h)
▫ the monitors it invokes, a.M
Examples: (figure not included in the transcript)
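One possible way to encode such an action specification is sketched below in Python. The class name `RecoveryAction`, the `restart` helper, the "no_fault" label, the 30-second duration, and the monitor name are hypothetical and chosen only for illustration. The sketch also assumes that an action's effect is deterministic, so applying it simply moves probability mass from each hypothesis to the hypothesis it is mapped to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

NO_FAULT = "no_fault"  # assumed label for "no fault is present"


@dataclass(frozen=True)
class RecoveryAction:
    name: str
    effect: Callable[[str], str]      # fault hypothesis effect: h -> h'
    duration: Callable[[str], float]  # mean duration a.t(h), in seconds
    monitors: FrozenSet[str]          # monitors invoked by the action, a.M


def restart(component: str) -> RecoveryAction:
    """Hypothetical action: restarting a component clears its own faults."""
    fixed = {f"Crash({component})", f"Value({component})"}
    return RecoveryAction(
        name=f"restart({component})",
        effect=lambda h: NO_FAULT if h in fixed else h,
        duration=lambda h: 30.0,  # illustrative constant
        monitors=frozenset({"end_to_end"}),
    )


def apply_action(belief: Dict[str, float],
                 action: RecoveryAction) -> Dict[str, float]:
    """Move each hypothesis's probability to the hypothesis it maps to."""
    new_belief: Dict[str, float] = {}
    for hypothesis, p in belief.items():
        target = action.effect(hypothesis)
        new_belief[target] = new_belief.get(target, 0.0) + p
    return new_belief


belief = {"Value(S1)": 0.25, "Value(HG)": 0.75}
after_restart = apply_action(belief, restart("S1"))
```

Here restarting S1 converts the 0.25 probability of Value(S1) into "no fault" and leaves the belief in Value(HG) untouched.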

Bayesian Diagnosis
▫ Let M be the subset of monitors invoked in the current round.
▫ Let o_m denote the current output of monitor m, and o_M the current set of all monitor outputs.
▫ The vector of fault hypothesis probabilities is the diagnosis vector.

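Example: fault hypotheses {Value(HG), Value(S1), Value(S2)}
Prior: P[h] = 1/3 for each hypothesis h
Likelihood of the observed monitor outputs o_M:
▫ P[o_M | Value(HG)] = 1
▫ P[o_M | Value(S1)] = 1/4
▫ P[o_M | Value(S2)] = 1/4
Posterior after the Bayesian update:
▫ P[Value(HG) | o_M] = 2/3
▫ P[Value(S1) | o_M] = 1/6
▫ P[Value(S2) | o_M] = 1/6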
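The numbers in this example follow from Bayes' rule: each posterior is the prior times the likelihood of the observed outputs, divided by the sum of those products over all hypotheses. The Python sketch below reproduces the arithmetic with exact fractions. The function name `bayes_update` is a hypothetical helper, not something named in the slides.

```python
from fractions import Fraction
from typing import Dict


def bayes_update(prior: Dict[str, Fraction],
                 likelihood: Dict[str, Fraction]) -> Dict[str, Fraction]:
    """Return the posterior over hypotheses given the observation likelihoods."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # probability of the observed outputs
    if evidence == 0:
        raise ValueError("observed outputs are impossible under every hypothesis")
    return {h: joint[h] / evidence for h in joint}


hypotheses = ("Value(HG)", "Value(S1)", "Value(S2)")
prior = {h: Fraction(1, 3) for h in hypotheses}
likelihood = {
    "Value(HG)": Fraction(1),
    "Value(S1)": Fraction(1, 4),
    "Value(S2)": Fraction(1, 4),
}
posterior = bayes_update(prior, likelihood)
```

The evidence term is 1/3 + 1/12 + 1/12 = 1/2, so the posterior for Value(HG) is (1/3) / (1/2) = 2/3 and each of the other two hypotheses gets (1/12) / (1/2) = 1/6.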

Recovery Algorithm

Recovery Action Selection
Single-Step Lookahead (SSLRecover)
▫ SSLRecover accepts a cost metric a.cost as input for each action
▫ it greedily makes its choice by "looking" only one recovery action ahead
▫ SSLRecover cannot use actions whose outcomes depend on the order in which they are applied
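To give a feel for greedy, one-action-ahead selection, the Python sketch below picks the action with the lowest cost per unit of fault probability it would repair. This scoring rule is a simplified stand-in chosen for illustration and is not the cost formula used by SSLRecover. The `Action` type, the action names, and the costs are likewise hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Action(NamedTuple):
    name: str
    cost: float        # plays the role of a.cost; values here are made up
    repairs: Set[str]  # hypotheses the action is assumed to fix


def choose_action(belief: Dict[str, float],
                  actions: List[Action]) -> Optional[Action]:
    """Greedy one-step choice: minimize cost / probability of repair.

    Illustrative heuristic only; it looks a single action ahead and
    ignores what happens after that action completes.
    """
    best, best_score = None, float("inf")
    for action in actions:
        p_fix = sum(p for h, p in belief.items() if h in action.repairs)
        if p_fix <= 0:
            continue  # the action cannot help under the current belief
        score = action.cost / p_fix
        if score < best_score:
            best, best_score = action, score
    return best


belief = {"Value(HG)": 2 / 3, "Value(S1)": 1 / 6, "Value(S2)": 1 / 6}
actions = [
    Action("restart(HG)", 60.0, {"Value(HG)"}),
    Action("restart(S1)", 20.0, {"Value(S1)"}),
    Action("reboot_all", 300.0, {"Value(HG)", "Value(S1)", "Value(S2)"}),
]
chosen = choose_action(belief, actions)
```

Under this belief the scores are roughly 90 for restart(HG), 120 for restart(S1), and 300 for reboot_all, so the sketch selects restart(HG) even though restart(S1) is cheaper in absolute terms.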

Recovery Action Selection (cont.)
Multistep Lookahead (MSLRecover)
▫ Extended system model:
  ▫ state model
  ▫ a recovery action a is represented by a precondition and a state effect, in addition to the fault hypothesis effect
▫ Optimal action selection:
  ▫ transform the system model into a partially observable Markov decision process (POMDP) with a cost criterion

Automatic Recovery Architecture

Experimental Results (1) Availability under Fault Injection

Experimental Results (2) Recovery Benchmarks

Related Work
System diagnosis
▫ sequential diagnosis
▫ error propagation analysis
▫ Bayesian models / hidden Markov models
Automatic recovery
▫ microreboots
▫ Markov decision theory
▫ learning repair strategies

Future Work
Modeling limitations
▫ Does not allow for transient faults
▫ Considers one fault hypothesis at a time
System extensions
▫ integrate additional monitoring and recovery mechanisms into the framework
▫ automatically construct the coverage, action, and cost models
▫ capture operator domain knowledge regarding the effects of recovery actions

Thank You!