RM3G: Next Generation Recovery Manager


1 RM3G: Next Generation Recovery Manager
Steve Zhang and Armando Fox Stanford University

2 Design Goals
- Overall goal: manage the detection of and recovery from system failures
- New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
  - The previous generation used End-2-End and Exception monitors
- Do not tie ourselves to any particular algorithm, and make new algorithms easy to plug in
- Standardize the APIs for observation, analysis, and control of system components
- Provide common services and abstractions to SLT algorithms
- RM itself must also be resilient to failures
[Diagram labels: SLTs; RM3G; Comp]

3 RADS Architecture
[Diagram labels: User; Operator; Distributed Middleware Client; Distributed Middleware Server; SLT Services (RM3G); Application-Specific Overlay Network; PNE (x2); Edge Network (x2); Router (x2); Commodity Internet & IP networks]

4 Design Diagram
[Diagram labels: Comp A; Comp B; Comp C; Observation Points; Control Points; SLT Processes (spawned by SLT Proc Srv); Ctrl/Obsrv point descriptors; Control policies; SLT Plug-ins; Data Store Srv; SLT Select Srv; Ctrl Srv; RM; Proc Srv; RMDB; Name & Reg Srv]

5 Collaboration with ACME
- ACME: infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  - Sensors = Observation Points
  - Actuators = Control Points
- RM potentially benefits from two ACME features
  - An in-network aggregator that combines data from sensors as it is routed through an overlay network
  - A configuration language that specifies under what conditions to trigger actuators
- ACME could benefit from more powerful sensor data analysis using SLTs

6 Observation Points
- We want to avoid requiring every component to be individually instrumented
- Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
- Several types of observation data can be collected in an application-generic way
  - The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports in use)
  - Middleware can provide intra-application data (e.g. interactions between different components of an application)
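As a rough illustration of application-generic, OS-provided observation data, the sketch below samples resource usage without any instrumentation in the application code itself. It is an assumption-laden stand-in: the function name `sample_process` and the metric names are invented, it samples the calling process rather than an external component, and it relies on the Unix-only `resource` module and, where present, the Linux `/proc` filesystem.

```python
import os
import resource
import time
from typing import Dict


def sample_process() -> Dict[str, float]:
    """Collect one observation sample for the current process (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    sample = {
        "timestamp": time.time(),
        # Peak resident set size: kilobytes on Linux, bytes on macOS.
        "max_rss": float(usage.ru_maxrss),
        "cpu_user_s": usage.ru_utime,
        "cpu_system_s": usage.ru_stime,
    }
    # Number of open file descriptors, available only where /proc exists.
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        sample["open_fds"] = float(len(os.listdir(fd_dir)))
    return sample
```

An observation point built this way would call the sampler periodically and forward each dictionary to the data service.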

7 SLT Data Services
- Abstracts information from observation points
- SLT algorithms are spawned for each component in the system as the component is instantiated
- Observation data is stored by the SLT Data Server, possibly in a streaming database
- Listens for feedback from SLT algorithms to adjust the data stream as necessary
  - Increase the data sampling rate if an anomaly is suspected
  - Stop reporting certain data if it is deemed irrelevant
- Provides persistent data storage for SLT algorithms
  - Remember properties learned from previous analysis of observation data
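The feedback loop between SLT algorithms and the data server might look roughly like the following. This is a minimal sketch under stated assumptions: the `DataServer` and `StreamConfig` names, the `"suspect"` and `"irrelevant"` feedback kinds, and the halve-the-interval policy are all invented here; the slides only say that the sampling rate increases on a suspected anomaly and that irrelevant data stops being reported.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StreamConfig:
    """Per-metric stream settings (hypothetical)."""
    interval_s: float = 10.0      # seconds between samples
    min_interval_s: float = 1.0   # fastest sampling allowed
    enabled: bool = True


class DataServer:
    """Toy data server that adjusts its streams based on algorithm feedback."""

    def __init__(self) -> None:
        self.streams: Dict[str, StreamConfig] = {}

    def add_stream(self, metric: str, interval_s: float = 10.0) -> None:
        self.streams[metric] = StreamConfig(interval_s=interval_s)

    def on_feedback(self, metric: str, kind: str) -> None:
        cfg = self.streams[metric]
        if kind == "suspect":
            # Anomaly suspected: sample twice as often, down to the floor.
            cfg.interval_s = max(cfg.min_interval_s, cfg.interval_s / 2)
        elif kind == "irrelevant":
            # The algorithm no longer needs this metric: stop reporting it.
            cfg.enabled = False
        else:
            raise ValueError(f"unknown feedback kind: {kind}")
```

Halving the interval is just one possible policy; a real server could equally let the algorithm request an explicit rate.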

8 Control Points
- Assumes crash-only components
  - Components can be reliably restarted through external means (can't rely on components restarting themselves cleanly)
- Initially, only restart control points are supported
  - Instrument the application server (JBoss) to restart applications and application components
  - The OS can restart application servers
  - IP-addressable power strips can restart entire nodes
- Components can specify a custom control policy
  - Leverage ACME's configuration language
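The three restart control points cover progressively larger scopes (application component, application server, whole node). One possible control policy, which is an assumption of this sketch and not something the slides specify, is to try the narrowest restart first and escalate only if the component is still unhealthy. The `recover` function and its arguments are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A restart control point: a label plus a callable that performs the restart.
RestartAction = Tuple[str, Callable[[], None]]


def recover(actions: List[RestartAction],
            is_healthy: Callable[[], bool]) -> Optional[str]:
    """Try restart actions in order, from narrowest to widest scope.

    Returns the label of the first action after which the health check
    passes, or None if every action was tried without success.
    """
    for name, restart in actions:
        restart()
        if is_healthy():
            return name
    return None
```

In use, `actions` would be ordered as component restart, server restart, then node power cycle, and `is_healthy` would be backed by whatever verdict the analysis side produces.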

9 Future Work
- “Master” SLT
  - Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
- Support additional types of control points
  - Multi-level settings that tune component parameters (e.g. filter level)
- Support additional types of observation points
  - Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
- Online SLT algorithms for anomaly detection are not mature
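One well-known way to combine several detectors is the weighted majority algorithm, shown here purely as an illustration of what a "master" SLT could do; the slides do not name a technique. The sketch assumes each detector casts a boolean vote per component and that ground truth (for example, an operator's later confirmation) eventually becomes available, which is itself an assumption.

```python
from typing import Dict, Iterable


class WeightedMajority:
    """Weighted majority vote over a fixed set of named detectors."""

    def __init__(self, names: Iterable[str], beta: float = 0.5) -> None:
        # Every detector starts with equal trust.
        self.weights: Dict[str, float] = {n: 1.0 for n in names}
        self.beta = beta  # multiplicative penalty for a wrong vote, 0 < beta < 1

    def predict(self, votes: Dict[str, bool]) -> bool:
        """Return True (anomaly) if the weighted 'yes' votes outweigh the 'no' votes."""
        yes = sum(w for n, w in self.weights.items() if votes[n])
        no = sum(w for n, w in self.weights.items() if not votes[n])
        return yes > no

    def update(self, votes: Dict[str, bool], truth: bool) -> None:
        """Shrink the weight of every detector whose vote disagreed with the truth."""
        for n in self.weights:
            if votes[n] != truth:
                self.weights[n] *= self.beta
```

Detectors that are repeatedly wrong lose influence quickly, so a single reliable detector can eventually outvote several unreliable ones.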

