Probabilistic Model-Driven Recovery in Distributed Systems Kaustubh R. Joshi, Matti A. Hiltunen, William H. Sanders, and Richard D. Schlichting May 2, 2012 Presented by Weiwei Qiu
Background High availability typically relies on a combination of redundancy and human operators who detect and repair failures. Automating recovery is challenging in practice: ▫inaccurate fault diagnosis ▫poor fault localization ▫false positives ▫action selection
Objective Present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems: ▫using a theoretically well-founded model-based mechanism for automated failure detection, diagnosis, and recovery ▫combining recovery actions with diagnosis ▫detecting when a problem is beyond its diagnosis and recovery capabilities
Approach Overview Diagnose system problems using the output of any existing monitors and choose the recovery actions that are most likely to restore the system to a proper state at minimum cost. ▫determine which combinations of component faults can occur in the system in a short window of time (fault hypothesis);
Approach Overview (cont.) ▫specify the coverage of each monitor m in the system with regard to each fault hypothesis; ▫specify the effects of each recovery action in terms of how it modifies the system state and fault hypothesis
Motivating Example: Enterprise Messaging Network (EMN)
Simplified EMN configuration: implements a Company Object Lookup (COL) system
System Model Fault hypothesis ▫A fault hypothesis is a Boolean expression that, when true, indicates the presence of one or more faults in the system. Examples: Down, Crash, Value ▫Down(r): host r has crashed ▫Crash(c): component c has crashed ▫Value(c): component c is alive but does not provide correct service
Monitors Each monitor returns true if it suspects a fault, and false otherwise. A system may include a variety of monitoring techniques, including: ▫Heartbeat-based monitors ▫Test-based monitors ▫End-to-end monitors ▫Error logs ▫Statistical monitors
Monitor Coverage Monitor coverage, P[m|h], is the probability that monitor m returns true given that fault hypothesis h is true.
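A coverage table can be turned into the likelihood of an observed set of monitor outputs. The sketch below assumes, as a simplification not stated on the slide, that monitors are conditionally independent given the fault hypothesis. The monitor names and coverage values are made up for illustration.

```python
from fractions import Fraction

# Hypothetical coverage table: coverage[m][h] = P[m | h], the probability
# that monitor m returns true when fault hypothesis h holds.
coverage = {
    "ping_monitor": {"Crash(S1)": Fraction(1), "Value(S1)": Fraction(0)},
    "path_monitor": {"Crash(S1)": Fraction(1, 2), "Value(S1)": Fraction(1, 2)},
}


def likelihood(outputs, h):
    """Return P[outputs | h], assuming monitors are independent given h."""
    p = Fraction(1)
    for m, fired in outputs.items():
        c = coverage[m][h]
        # A monitor that fired contributes P[m | h]; a silent one, 1 - P[m | h].
        p *= c if fired else 1 - c
    return p


# Observed round: the ping monitor stayed silent, the path monitor fired.
outputs = {"ping_monitor": False, "path_monitor": True}
print(likelihood(outputs, "Crash(S1)"))  # 0
print(likelihood(outputs, "Value(S1)"))  # 1/2
```

With these made-up numbers, a silent ping monitor rules out Crash(S1) entirely, because its coverage for that hypothesis is 1.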
Recovery Actions The application-specific recovery actions A provide the only way for the controller to change the truth value of fault hypotheses. An action a is specified in terms of its “fault hypothesis effect” function, its mean duration a.t(h), and the monitors it invokes, a.M.
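One way to picture an action specification is as a small record holding the three parts named above. The sketch below is an illustration only: the `Action` class, the `NO_FAULT` label, the `apply_effect` helper, and the claim that restarting S1 clears both Crash(S1) and Value(S1) are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, FrozenSet

NO_FAULT = "NoFault"  # hypothetical label for "no fault is present"


@dataclass(frozen=True)
class Action:
    name: str
    fh_effect: Callable[[str], str]   # hypothesis before -> hypothesis after
    duration: Callable[[str], float]  # a.t(h): mean duration if h is true
    monitors: FrozenSet[str]          # a.M: monitors invoked by the action


def apply_effect(action: Action, belief: Dict[str, Fraction]) -> Dict[str, Fraction]:
    """Push a probability distribution over hypotheses through the action."""
    new_belief: Dict[str, Fraction] = {}
    for h, p in belief.items():
        h_after = action.fh_effect(h)
        # Several hypotheses may map to the same outcome, so accumulate.
        new_belief[h_after] = new_belief.get(h_after, Fraction(0)) + p
    return new_belief


# Assumed behaviour: restarting S1 repairs any fault located in S1.
restart_s1 = Action(
    name="Restart(S1)",
    fh_effect=lambda h: NO_FAULT if h in ("Crash(S1)", "Value(S1)") else h,
    duration=lambda h: 60.0,  # made-up mean duration in seconds
    monitors=frozenset({"path_monitor"}),
)

belief = {
    "Value(HG)": Fraction(2, 3),
    "Value(S1)": Fraction(1, 6),
    "Value(S2)": Fraction(1, 6),
}
after = apply_effect(restart_s1, belief)
print(after[NO_FAULT])  # 1/6
```

Representing the effect as a function on hypotheses keeps the controller's view consistent: the total probability is preserved, and only the labels it is attached to change.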
Bayesian Diagnosis In each round, the controller invokes a subset of the monitors. Let o_m denote the current output of monitor m; together, these outputs form the current set of all monitor outputs. The vector of fault hypothesis probabilities conditioned on these outputs is the diagnosis vector.
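Example: fault hypotheses Value(HG), Value(S1), Value(S2), each with prior P[h] = 1/3. Likelihood of the observed monitor output o_m: P[o_m|Value(HG)] = 1, P[o_m|Value(S1)] = 1/4, P[o_m|Value(S2)] = 1/4. Posterior by Bayes' rule: P[Value(HG)|o_m] = 2/3, P[Value(S1)|o_m] = 1/6, P[Value(S2)|o_m] = 1/6.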
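The numbers in this example can be checked with a few lines of code. The sketch below applies Bayes' rule to the priors and likelihoods given on the slide; the function name `bayes_diagnosis` is hypothetical, and exact fractions are used so the result can be compared directly with the slide.

```python
from fractions import Fraction


def bayes_diagnosis(prior, likelihood):
    """Return the posterior P[h | o_m] for every fault hypothesis h."""
    # Numerator of Bayes' rule: P[o_m | h] * P[h].
    joint = {h: prior[h] * likelihood[h] for h in prior}
    # Denominator: total probability of the observed output.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("observed output is impossible under every hypothesis")
    return {h: p / evidence for h, p in joint.items()}


hypotheses = ("Value(HG)", "Value(S1)", "Value(S2)")
prior = {h: Fraction(1, 3) for h in hypotheses}
likelihood = {
    "Value(HG)": Fraction(1),
    "Value(S1)": Fraction(1, 4),
    "Value(S2)": Fraction(1, 4),
}

posterior = bayes_diagnosis(prior, likelihood)
for h in hypotheses:
    print(h, posterior[h])
# Value(HG) 2/3
# Value(S1) 1/6
# Value(S2) 1/6
```

The evidence term here is 1/3 · (1 + 1/4 + 1/4) = 1/2, which is why the gateway hypothesis ends up four times as likely as either server hypothesis.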
Recovery Algorithm
Recovery Action Selection Single-Step Lookahead (SSLRecover) ▫accepts a cost metric a.cost for each action as input ▫greedily makes its choice by looking only one recovery action ahead ▫cannot use actions whose outcomes depend on the order in which they are applied
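The slide gives only the outline of SSLRecover, so the following is a minimal greedy sketch of one-step lookahead, not the authors' algorithm. It assumes each action has a per-hypothesis cost and a set of hypotheses it repairs, and it scores an action by its expected cost plus a hypothetical penalty on the probability mass left unrepaired. All names and numbers are illustrative.

```python
def expected_score(action, belief, residual_penalty):
    """Expected cost of one action plus a penalty for faults it leaves behind."""
    exp_cost = sum(p * action["cost"][h] for h, p in belief.items())
    p_unfixed = sum(p for h, p in belief.items() if h not in action["fixes"])
    return exp_cost + residual_penalty * p_unfixed


def choose_action(actions, belief, residual_penalty=100.0):
    """Greedy choice: look exactly one action ahead and take the cheapest."""
    return min(actions, key=lambda a: expected_score(a, belief, residual_penalty))


hypotheses = ("Value(HG)", "Value(S1)", "Value(S2)")
# Diagnosis vector after the monitor output in the Bayesian example.
belief = {"Value(HG)": 2 / 3, "Value(S1)": 1 / 6, "Value(S2)": 1 / 6}

# Made-up action costs; each restart is assumed to repair one component.
actions = [
    {"name": "Restart(HG)", "cost": {h: 10.0 for h in hypotheses}, "fixes": {"Value(HG)"}},
    {"name": "Restart(S1)", "cost": {h: 20.0 for h in hypotheses}, "fixes": {"Value(S1)"}},
    {"name": "Restart(S2)", "cost": {h: 20.0 for h in hypotheses}, "fixes": {"Value(S2)"}},
]

best = choose_action(actions, belief)
print(best["name"])  # Restart(HG)
```

Because the score looks only one action ahead, it cannot account for actions whose result depends on what was applied before them, which matches the limitation listed for SSLRecover.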
Recovery Action Selection (cont.) Multistep Lookahead (MSLRecover) ▫Extended system model: a state model is added, and each recovery action a is represented by a precondition and a state effect in addition to its fault hypothesis effect ▫Optimal action selection: transform the system model into a Partially Observable Markov Decision Process (POMDP) with a cost criterion
Automatic Recovery Architecture
Experimental Results (1) Availability under Fault Injection
Experimental Results (2) Recovery Benchmarks
Related Work System diagnosis: ▫sequential diagnosis ▫error propagation analysis ▫Bayesian models / Hidden Markov Models. Automatic recovery: ▫microreboots ▫Markov decision theory ▫learning repair strategies
Future Work Modeling limitations: ▫does not allow for transient faults ▫considers only one fault hypothesis at a time. Systems extensions: ▫additional monitoring and recovery mechanisms can be integrated into the framework ▫automatically construct the coverage, action, and cost models ▫capture operator domain knowledge regarding the effects of recovery actions
Thank You!