Achieving self-healing in service delivery software systems by means of case-based reasoning
Stefania Montani, Cosimo Anglano
Presented by Tony Schneider Pr
Introduction Background CBR Implementation Experiment / Cavy Results
Autonomic Systems Overview
Background | CBR Implementation | Experiment / Cavy | Results
Goal is a self-managing system
System needs to exhibit
‣ Self-Configuration
‣ Self-Optimization
‣ Self-Protection
‣ Self-Healing
Self-Healing
“Service Delivery Systems” (SDS)
‣ Aimed at delivering 24/7 services
These services are prone to breakage
‣ Service failures
‣ Software, Hardware, Network
‣ Can’t be handled manually
‣ Need to repair the system autonomously
Self-Healing
Internalization
‣ The Self-Healing Engine is integrated with the software
‣ Not extendable
‣ Depends on specific applications
Externalization
‣ Great for retrofitting current systems
‣ Allows a general method for SDS self-healing
Self-Healing
Problems with current approach
‣ MAPE model assumes prior knowledge of the system
‣ Building the knowledge base is problematic
‣ Large, time-consuming, and laborious
‣ Needs to be kept up-to-date
Build the knowledge base automatically
‣ How?
Case-based Reasoning
Case-Based Reasoning (CBR)
‣ Uses previous experience for problem solving
‣ Retrieves cases similar to the current problem
‣ Reuses past successful solutions
‣ Revises the retrieved solution if necessary
‣ Retains the current case
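The four steps listed above can be sketched as a single function. This is only an illustration, not the authors' implementation: the `Case` fields, the toy `similarity` measure, and the `apply_fix` / `revise` callbacks are all hypothetical names introduced here, and the case base is assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class Case:
    problem: dict    # observed failure features (hypothetical names)
    solution: str    # repair action that was applied
    worked: bool     # outcome of applying the solution


def similarity(a: dict, b: dict) -> float:
    """Toy measure (an assumption): fraction of features on which two problems agree."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    agree = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return agree / len(keys)


def cbr_step(problem, case_base, apply_fix, revise=lambda problem, solution: solution):
    """One pass through retrieve, reuse, revise, retain."""
    # Retrieve: the stored case most similar to the current problem.
    best = max(case_base, key=lambda case: similarity(problem, case.problem))
    # Reuse: start from the solution that was stored with that case.
    proposed = best.solution
    # Revise: adapt the solution if needed (identity by default), then try it.
    adapted = revise(problem, proposed)
    worked = apply_fix(adapted)
    # Retain: store the new experience, whether or not the repair worked.
    new_case = Case(dict(problem), adapted, worked)
    case_base.append(new_case)
    return new_case
```

Each call adds one case to the case base, which is how the knowledge grows with experience.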
Case-based Reasoning
Case base represents the “knowledge” in the MAPE model
‣ Each case represents a previous problem and its solution
‣ Implicit versus Explicit knowledge
‣ Explicit: Rules & models
‣ Implicit: Unstructured & based on experience
‣ Implicit knowledge tends to be easier to collect and better suited to limited human interaction
Case-based Reasoning
Cases are stored by identifying application features
‣ The problem
‣ The applied solution
‣ The outcome of the solution
Avoids the bottleneck present in other learning methods
‣ E.g., online reinforcement learning
Case-based Reasoning
CBR relies on a large number of past cases
Pros:
‣ Methods improve with time and experience
‣ Large systems are hosts to recurrent problems
Cons:
‣ Need to store the data
‣ Need to populate the knowledge base
Case-based Reasoning
To reiterate: CBR is a methodology designed to assist in the repair of failed systems
Questions so far?
System Overview
SDS is treated as a black box
‣ Self-healing CBR is entirely external to the SDS
‣ Controls the health of the SDS
‣ Components of CBR reflected in MAPE
‣ Analysis → Retrieval
‣ Planning → Revise
‣ Knowledge → Case base
System Overview: MAPE Revised
[Figure: the old MAPE model beside the model revised for CBR]
System Overview: MAPE Revised
Four Additions
‣ Monitoring
‣ Case Preparation
‣ Service Restoration
‣ Repair Module
System Overview: MAPE Revised
Application Agnostic Portion
‣ Doesn’t rely on specific environment variables
Application Specific Portion
‣ Relies on the data from the application
Both
‣ Interface between the two layers
The managed element is completely external to the healing system
System Overview
Assumptions
‣ Bad solutions have no effect on the SDS state. Likewise, good solutions don’t produce faults.
‣ Deadlines for producing case solutions aren’t fixed
‣ Every stored case has a unique solution
‣ No transient faults (occur only once)
‣ No intermittent faults (appear, disappear, then reappear again)
CBR Cycle: Retrieve - Reuse/Revise - Retain
Every stored case is representative of some past failure
Need to find the case that best approximates the current failure
Find the average distance between features, where the per-feature distance d_f(x, y) is
‣ 1 if x or y is missing
‣ overlap(x, y) if f is a symbolic feature
‣ [expression not recoverable] if f is a linear feature
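The per-feature rule above can be sketched in a few lines. Two parts are assumptions rather than content from the slide: `overlap` is taken to be the usual 0-if-equal, 1-otherwise measure, and the linear case, whose expression is missing here, is filled in with a range-normalized absolute difference. The feature names and the `ranges` argument are hypothetical.

```python
def overlap(x, y):
    """Assumed overlap measure: 0 when the symbolic values match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def feature_distance(x, y, value_range=None):
    """Distance for one feature; value_range is given only for linear features."""
    if x is None or y is None:
        return 1.0                      # missing value: maximal distance
    if value_range is None:
        return overlap(x, y)            # symbolic feature
    if value_range == 0:
        return 0.0                      # constant linear feature
    # ASSUMPTION: normalized absolute difference stands in for the lost formula.
    return abs(x - y) / value_range


def case_distance(a, b, ranges):
    """Average per-feature distance between two failure descriptions."""
    features = sorted(set(a) | set(b))
    if not features:
        return 0.0
    total = sum(feature_distance(a.get(f), b.get(f), ranges.get(f)) for f in features)
    return total / len(features)
```

With this sketch, a smaller value means a closer match, so retrieval would pick the stored case with the lowest `case_distance` to the current failure.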
CBR Cycle: Retrieve - Reuse/Revise - Retain
Apply retrieved case solutions in order of best average distance
‣ Repeat for all found cases until the problem is solved
‣ Also covers cases with multiple solutions (just use the best choice)
What if no solution works?
‣ Ask a human
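The repair loop described above might look like the following sketch. The dictionary layout of a case and the `distance`, `apply_solution`, and `ask_human` callbacks are hypothetical; skipping a solution that has already failed is an added convenience, not something stated on the slide.

```python
def repair(failure, case_base, distance, apply_solution, ask_human):
    """Try stored solutions from the closest case outward; fall back to a human.

    Returns (solution, needed_human).
    """
    # Rank stored cases by distance to the current failure, closest first.
    ranked = sorted(case_base, key=lambda case: distance(failure, case["problem"]))
    tried = set()
    for case in ranked:
        solution = case["solution"]
        if solution in tried:
            continue                    # do not repeat a solution that already failed
        tried.add(solution)
        if apply_solution(solution):
            return solution, False      # repaired without human help
    # No stored solution worked: ask a human operator.
    return ask_human(failure), True
```

The boolean in the result makes it easy to count how often a human had to step in.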
CBR Cycle: Retrieve - Reuse/Revise - Retain
Retain simply saves the case to the knowledge base
‣ The problem
‣ The solution
‣ The outcome
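As a rough illustration of the retain step, the sketch below stores the three parts of a case in a relational table. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in; the table layout, column names, and function names are assumptions made for this example.

```python
import json
import sqlite3


def open_case_base(path=":memory:"):
    """Open (or create) a case base with one row per retained case."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases ("
        "id INTEGER PRIMARY KEY, "
        "problem TEXT NOT NULL, "      # failure features, stored as JSON
        "solution TEXT NOT NULL, "     # name of the repair action
        "outcome INTEGER NOT NULL)"    # 1 if the repair worked, else 0
    )
    return conn


def retain(conn, problem, solution, worked):
    """Save the problem, the solution, and the outcome; return the new row id."""
    cursor = conn.execute(
        "INSERT INTO cases (problem, solution, outcome) VALUES (?, ?, ?)",
        (json.dumps(problem, sort_keys=True), solution, int(worked)),
    )
    conn.commit()
    return cursor.lastrowid


def load_cases(conn):
    """Return all retained cases as (problem, solution, worked) tuples."""
    rows = conn.execute("SELECT problem, solution, outcome FROM cases ORDER BY id")
    return [(json.loads(p), s, bool(o)) for p, s, o in rows]
```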
Odds and Ends
System initialization
‣ Bootstrap phase
Prototyping
‣ Makes a general case out of several similar cases in the case base
‣ Solves the storage space problem
‣ Takes the implicit knowledge and creates explicit knowledge
‣ Used after the case base has grown
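One simple way to picture prototyping is shown below. The grouping rule is an assumption chosen for illustration: cases that share a solution are merged, and the prototype keeps only the feature values on which every merged case agrees. The talk does not specify how prototypes are actually built, and `build_prototypes` is a hypothetical name.

```python
from collections import defaultdict


def build_prototypes(cases):
    """Merge cases that share a solution into one generalized case each.

    `cases` is a list of (problem_dict, solution) pairs.
    Returns {solution: features shared by every case with that solution}.
    """
    groups = defaultdict(list)
    for problem, solution in cases:
        groups[solution].append(problem)

    prototypes = {}
    for solution, problems in groups.items():
        shared = {}
        for feature in problems[0]:
            # Keep a feature only if every case in the group has the same value.
            values = {p.get(feature) for p in problems}
            if len(values) == 1:
                shared[feature] = problems[0][feature]
        prototypes[solution] = shared
    return prototypes
```

The result holds one entry per solution instead of one per case, which is the storage saving, and each entry reads like an explicit rule ("these symptoms, this repair").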
CBR questions?
That wraps up the CBR portion. Any questions?
Experimental Setup
Implemented the CBR-based system in Java
‣ MySQL for case base storage
Used with an SDS testbed framework, “Cavy”
Cavy
‣ Configures, deploys, and operates SDS testbeds
‣ Framework that surrounds the healing engine
‣ Injects faults into testbed components
Cavy Components
‣ Fault managers
‣ Diagnoser
‣ Service Monitor
‣ Integrator
‣ Repairer
‣ Injector
Cavy Components
Basically...
‣ The injector breaks the system
‣ The service monitor sees the fault
‣ The diagnoser finds a similar FS pair
‣ The interrogator receives the solution
‣ The repairer tries each solution until one works
Cavy Components
Cavy implements pieces of the self-healing architecture
‣ Interrogator: Application agnostic pieces
‣ Fault repairer: Application specific pieces
‣ Service monitor: Monitor
‣ Fault managers: Repair
The Experiment
Rubis
‣ Mimics eBay
‣ Two tiers
‣ Customers interact with the web server on the first
‣ Database stored on the second
‣ Several services are tested
‣ Register, Browse, Sell, Home
The Experiment
Potential Rubis Failures (each can apply to either tier)
‣ Network problems
‣ Configuration problems
‣ System restart
10 failure descriptors
‣ Boolean values
‣ Represent failed pieces of the system
Initial Base Case (constructed by a human)
[Figure: automatically generated case]
Initial Base Case (constructed by a human)
[Figure: distances between current failure and base case]
Second Case
Results
Continued like this for 3 days
‣ Of 1016 cases, fewer than 11 needed human intervention (under 1.1%)
Prototypes functioned correctly
‣ Reduced size of database
‣ Handled new faults without human intervention
‣ Narrowed down the possible failures to 9 prototype cases
‣ Showed “complex” problems were just simultaneous simple problems
Future Work
Use in real-world applications
Working around the given assumptions
Use of prototyping/generalization
Combine CBR with other knowledge sources
‣ Combine CBR with some other methodology
Conclusion
‣ CBR is a good solution to self-healing
‣ Repair procedure is triggered by service failures
‣ No structured knowledge needed
‣ Worked well even with novel faults