A Planning Based Approach to Failure Recovery in Distributed Systems Naveed Arshad Dennis Hiembigner, Alexander L. Wolf University of Colorado at Boulder.

Slides:



Advertisements
Similar presentations
High level QA strategy for SQL Server enforcer
Advertisements

Performance Testing - Kanwalpreet Singh.
Architecture Representation
Chapter 13 Managing Computer and Data Resources. Introduction A disciplined, systematic approach is needed for management success Problem Management,
Availability in Globally Distributed Storage Systems
Critical Software Security Through Replication and Virtualization A Research Proposal Dennis Edwards Sharon Simmons Arangamanikkannan Manickam.
Prioritizing User-session-based Test Cases for Web Applications Testing Sreedevi Sampath, Renne C. Bryce, Gokulanand Viswanath, Vani Kandimalla, A.Gunes.
Objektorienteret Middleware Presentation 2: Distributed Systems – A brush up, and relations to Middleware, Heterogeneity & Transparency.
Making Services Fault Tolerant
Distributed components
Lecturer: Sebastian Coope Ashton Building, Room G.18 COMP 201 web-page: Lecture.
Adaptive Systems – Graceful Degrading System Paul Li
The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe Flavio Junqueira, Ranjita Bhagwan, Keith Marzullo, Stefan Savage, and.
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 30 Slide 1 Security Engineering.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 11 Slide 1 Architectural Design.
2/23/2009CS50901 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat.
SWE Introduction to Software Engineering
8. Fault Tolerance in Software
Establishing the overall structure of a software system
1 Software Testing and Quality Assurance Lecture 30 – Testing Systems.
MS CLOUD DB - AZURE SQL DB Fault Tolerance by Subha Vasudevan Christina Burnett.
The middleware that makes real time integration a reality.
1 Making Services Fault Tolerant Pat Chan, Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Miroslaw Malek.
Architectural Design, Distributed Systems Architectures
Course Instructor: Aisha Azeem
Architectural Design Establishing the overall structure of a software system Objectives To introduce architectural design and to discuss its importance.
Emergency Database Failover: Impacts & Recovery Plan
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 11 Slide 1 Architectural Design.
Chapter 7: Architecture Design Omar Meqdadi SE 273 Lecture 7 Department of Computer Science and Software Engineering University of Wisconsin-Platteville.
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
Architectural Design. Recap Introduction to design Design models Characteristics of good design Design Concepts.
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Chapter 10 Architectural Design.
Architectural Design, Distributed Systems Architectures
Chapter Fourteen Windows XP Professional Fault Tolerance.
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
Distributed Systems: Concepts and Design Chapter 1 Pages
A performance evaluation approach openModeller: A Framework for species distribution Modelling.
Architectural Design lecture 10. Topics covered Architectural design decisions System organisation Control styles Reference architectures.
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.
By: Anjaneya Datla. What are TDMA Systems? Time division multiple access (TDMA) is a channel access method for shared medium networks. Several users can.
© Drexel University Software Engineering Research Group (SERG) 1 A. E. Hassan and R. C. Holt A Reference Architecture for Web.
Yuhui Chen; Romanovsky, A.; IT Professional Volume 10, Issue 3, May-June 2008 Page(s): Digital Object Identifier /MITP Improving.
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 11 Slide 1 Architectural Design.
CSC480 Software Engineering Lecture 10 September 25, 2002.
Architecture View Models A model is a complete, simplified description of a system from a particular perspective or viewpoint. There is no single view.
Slide 1 Security Engineering. Slide 2 Objectives l To introduce issues that must be considered in the specification and design of secure software l To.
Performance Testing Test Complete. Performance testing and its sub categories Performance testing is performed, to determine how fast some aspect of a.
Clever Framework Name MARCH 27, Meeting Agenda  Framework Overview  Prototype 1 Design Goals  Prototype 1 Demo  Prototype 2 Design Goals  Timeline.
HPC HPC-5 Systems Integration High Performance Computing 1 Application Resilience: Making Progress in Spite of Failure Nathan A. DeBardeleben and John.
The Willow System Implementation John C. Knight University of Virginia Dennis Heimbigner University of Colorado Intrusion Tolerance Through Secure System.
Control-Theoretic Approaches for Dynamic Information Assurance George Vachtsevanos Georgia Tech Working Meeting U. C. Berkeley February 5, 2003.
Slide 1 Chapter 8 Architectural Design. Slide 2 Topics covered l System structuring l Control models l Modular decomposition l Domain-specific architectures.
1 Creating Situational Awareness with Data Trending and Monitoring Zhenping Li, J.P. Douglas, and Ken. Mitchell Arctic Slope Technical Services.
SRA 2016 – Strategic Research Challenges Design Methods, Tools, Virtual Engineering Jürgen Niehaus, SafeTRANS.
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Chapter 5:Architectural Design l Establishing the overall structure of a software.
Dr D. Greer, Queens University Belfast ) Software Engineering Chapter 7 Software Architectural Design Learning Outcomes Understand.
1 Distributed Systems Architectures Distributed object architectures Reference: ©Ian Sommerville 2000 Software Engineering, 6th edition.
A statistical anomaly-based algorithm for on-line fault detection in complex software critical systems A. Bovenzi – F. Brancati Università degli Studi.
Consulting Services JobScheduler Architecture Decision Template
Maximum Availability Architecture Enterprise Technology Centre.
Software Design and Architecture
Security Engineering.
Maintaining software solutions
Class project by Piyush Ranjan Satapathy & Van Lepham
A Real-time Intrusion Detection System for UNIX
Evaluating Transaction System Performance
Middleware for Fault Tolerant Applications
Software Engineering with Reusable Components
Realizing Closed-loop, Online Tuning and Control for Configurable-Cache Embedded Systems: Progress and Challenges Islam S. Badreldin*, Ann Gordon-Ross*,
Presentation transcript:

A Planning Based Approach to Failure Recovery in Distributed Systems Naveed Arshad Dennis Hiembigner, Alexander L. Wolf University of Colorado at Boulder Workshop on Self Managed Systems (WOSS’04) Oct 31 st, 2004

Introduction Automated failure recovery in systems using dynamic reconfiguration and AI planning –Recover in minimum time (but not real-time) Target: component based heterogeneous distributed systems –Application level reconfiguration –Not OS or network level (yet)

Goals for Failure Recovery Automated process Minimize downtime Handle complex failures –Ripple effects of failures –Hard to anticipate the failed state Large number of possible failed states –Large number of recovered states

Approach (Sense-Plan-Act) Sensing –Determining if a failure has occurred Planning –Calculating the ripple effects –Devising a plan for failure recovery Acting –Executing the plan on the actual system

Planning Domain (Static) –Semantics of the System Initial State –Configuration of the system at the start (i.e. the failed state) Goal State –Configuration of the system at the end (i.e. the recovered state) Plan –Set of actions to get from the initial state to the goal state

An Example Normal Failed Affected Web Server Database Servlet Engine 2 Application Server Machine 1 Machine 2 Machine 3 Machine 4 Machine 5 Clients Servlet Engine 1 Application Server 1

A Failure Scenario Web Server Database Servlet Engine 2 Application Server Machine 1 Machine 2 Machine 3 Machine 4 Machine 5 Normal Failed Affected Clients Servlet Engine 1 Application Server 1

Calculating Ripple Effects Dependency model is used to dynamically calculate effects of component failure on other components Components are classified into three different kinds –Failed Components –Affected Components –Normal Components

Styles for Recovered States Explicit Recovered State –Stating a recovered state for the planner servletEngineWorking(servletengine1 machine2) Implicit Recovered State –Asking the planner to find a recovered state servletEngineWorking(servletengine1) All goal state specifications have significant amounts of implicit specification If not, then planner is not needed

Domain Specification Objects applicationserver machine webserver servletengine Predicates ServletEngineInstalled (servletEngine, machinename) ServletEngineStarted (servletEngine) ServletEngineWorking (servletEngine) machineFailed (machinename) ApplicationServerWorking (applicationServer) WebServerWorking (webserver)... Functions MachineRAM (machinename) MachineStartTime (machinename) ServletEngineInstallTime (servletEngine) ServletEngineConnectTimeWithWS (servletEngine)...

Domain Specification (cont.) Actions Start-Machine (machinename) Duration (= (MachineStartTime (machinename)) Preconditions (not (machineFailed machinename)) effects machineStarted (machinename) Install-Servlet-Engine (servletEngine machinename) Connect-ServletEngine-AS (servletEngine, applicationserver)…) Connect-ServletEngine-WS (servletEngine, webServer)...) …

Initial State Objects applicationserver1 – applicationserver... servletengine2 – servletengine... webserver – webserver database – database machine1 – machine... Initial State machineStarted (machine1) machineFailed (machine2) machineStarted (machine3).. = (machineRAM (machine1) 512) = (machineRAM (machine3) 1024) = (machineRAM (machine4) 1024).. Web Server Servlet Engine 1 Application Server 1 Database Servlet Engine 2 Application Server Machine 1 Machine 2 Machine 3 Machine 4 Machine 5 Clients Normal FailedAffected

Initial State (cont.) Initial State (cont’d) = (machineJDK (machine1) 1.4.2) = machineJDK (machine3) 1.3).. = (machinePlatform (machine1) Unix) = (machinePlatform (machine3) win2k).. servletEngineWorking (servletengine2, machine5) applicationServerWorking (applicationserver2) databaseWorking (database).. Web Server Servlet Engine 1 Application Server 1 Database Servlet Engine 2 Application Server Machine 1 Machine 2 Machine 3 Machine 4 Machine 5 Clients Normal FailedAffected

Goal State servletEngineWorking (servletengine1) applicationServerWorking (applicationserver1, machine3) Metric Minimize Total-time Web Server Servlet Engine 1 Application Server 1 Database Servlet Engine 2 Application Server Machine 1 Machine 3 Machine 4 Machine 5 Clients

Plan 1.Install-Servlet-Engine (servletEngine1, machine1) 2.Connect-ServletEngine-AS (servletEngine1, applicationserver1) 3.Connect-ServletEngine-WS (servletEngine1, webServer)) 4.Connect-Client … 5.… Web Server Servlet Engine 1 Application Server 1 Database Servlet Engine 2 Application Server Machine 1 Machine 3 Machine 4 Machine 5 Clients Normal FailedAffected

Present Work Prototype (Planit) is Under Development –Sensing Java based sensing framework using Siena –Planning using planner named LPG-TD (Universit‘a degli Studi di Brescia) –Currently, using applications developed on Prism middleware (USC/UCI) as our target applications

Open Questions Dependency Modeling –How and when the dependencies should be updated? Static vs. Dynamic ? –Which dependency model to be used? System Learning –How the system learns over time? Case Based Reasoning ?

Summary Our initial results show promising prospects for using planning in failure recovery The next step is to use this technique in highly distributed systems and in other areas like –Performance Improvement –Distributed System Management –Fault Tolerance

Data Flow Diagram

Experimental Setup Experiment NoComponentsConnectorsMachines

Explicit Configurations ExperimentNo of Plans Found (in 30 sec) Time to Find the Best Plan (in sec) Duration of the Best Plan (in sec) Duration of the worst Plan (in sec) N/A

Implicit Configurations ExperimentNo of Plans Found (in 60 sec) Time to Find the Best Plan (in sec) Duration of the Best Plan (in sec) Duration of the Worst Plan (in sec) N/A 50

Sensing Getting the information –Inserting sensors in the components and machines to detect failures using heartbeats and explicit pinging –A monitor receives the raw information and makes decision about a failure –Monitors can also be stacked in subsystems to form a hierarchy –Monitors can change various parameters to reduce the impact on the network

Other Potential Areas Fault Tolerance –To prevent faults from developing that lead to a failure System Management –Automated management of the systems Performance Improvement –Improve the performance of the system using planning May need some modifications in our approach to accommodate these areas

Acting The plan is converted into a executable script The script is executed on the system for recovery A feedback loop is established to find if the recovery process is carried out successfully