RM3G: Next Generation Recovery Manager

Steve Zhang and Armando Fox, Stanford University

Design Goals
- Overall goal: manage the detection of and recovery from system failures
- New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
  - The previous generation used End-2-End and Exception monitors
- Do not tie ourselves to any particular algorithm, and make new algorithms easy to plug in
- Standardize the APIs for observation, analysis, and control of system components
- Provide common services and abstractions to SLT algorithms
- RM itself must also be resilient to failures
[Diagram labels: SLTs, RM3G, Comp]
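The plug-in goal above can be illustrated with a minimal sketch. Everything named here (`SLTPlugin`, `register`, `PLUGINS`, `ThresholdDetector`, the `memory_mb` metric) is hypothetical and not taken from RM3G; the fixed-threshold detector is only a stand-in for a real SLT algorithm.

```python
from abc import ABC, abstractmethod
from typing import Dict, Mapping, Type


class SLTPlugin(ABC):
    """Hypothetical interface an algorithm implements to be pluggable."""

    @abstractmethod
    def observe(self, sample: Mapping[str, float]) -> None:
        """Consume one observation sample (metric name -> value)."""

    @abstractmethod
    def anomaly_score(self) -> float:
        """Return a score in [0, 1]; higher means more likely failed."""


# Registry mapping a plug-in name to its class (illustrative only).
PLUGINS: Dict[str, Type[SLTPlugin]] = {}


def register(name: str):
    """Class decorator that adds a plug-in class to the registry."""
    def wrap(cls: Type[SLTPlugin]) -> Type[SLTPlugin]:
        PLUGINS[name] = cls
        return cls
    return wrap


@register("threshold")
class ThresholdDetector(SLTPlugin):
    """Toy stand-in for an SLT algorithm: flags a metric above a fixed limit."""

    def __init__(self, metric: str = "memory_mb", limit: float = 512.0):
        self.metric = metric
        self.limit = limit
        self.last = 0.0

    def observe(self, sample: Mapping[str, float]) -> None:
        # Keep the previous value if this sample lacks the metric.
        self.last = sample.get(self.metric, self.last)

    def anomaly_score(self) -> float:
        return 1.0 if self.last > self.limit else 0.0


# The manager looks algorithms up by name, so new ones need no manager changes.
detector = PLUGINS["threshold"]()
detector.observe({"memory_mb": 700.0})
print(detector.anomaly_score())
```

Looking algorithms up by name is one way to keep the manager independent of any particular algorithm; other designs (entry points, dynamic loading) would serve the same goal.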

RADS Architecture
[Diagram labels: User, Operator, Distributed Middleware Client, Distributed Middleware Server, SLT Services (RM3G), Application-Specific Overlay Network, PNE (x2), Edge Network (x2), Router (x2), Commodity Internet & IP networks]

Design Diagram
[Diagram labels: RM; SLT Processes (spawned by SLT Proc Srv); SLT Plug-ins; RMDB; Comp A, Comp B, Comp C; Observation Points; Control Points; Ctrl/Obsrv point descriptors; Control policies; Data Store Srv; SLT Select Srv; Ctrl Srv; RM Proc Srv; Name & Reg Srv]

Collaboration with ACME
- ACME: infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  - Sensors = Observation Points
  - Actuators = Control Points
- RM potentially benefits from two ACME features
  - An in-network aggregator that combines data from sensors as it is routed through an overlay network
  - A configuration language that specifies under what conditions to trigger actuators
- ACME could benefit from more powerful sensor data analysis using SLTs

Observation Points
- We want to avoid requiring every component to be individually instrumented
- Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
- Several types of observation data can be collected in an application-generic way
  - The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports in use)
  - Middleware can provide intra-application data (e.g. interactions between different components of an application)
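As a rough illustration of application-generic observation, the sketch below reads process-level data from the OS without instrumenting the component itself. The `ObservationPoint` class and `process_metrics` function are hypothetical names, not part of RM3G. For simplicity it samples the calling process; a real observation point would target the monitored component's process.

```python
import os
import time
from typing import Callable, Dict


def process_metrics() -> Dict[str, float]:
    """Collect OS-provided data about the current process (illustrative)."""
    t = os.times()
    metrics = {"cpu_user_s": t.user, "cpu_system_s": t.system}
    fd_dir = "/proc/self/fd"  # Linux-specific; skipped on other platforms
    if os.path.isdir(fd_dir):
        metrics["open_files"] = float(len(os.listdir(fd_dir)))
    return metrics


class ObservationPoint:
    """Hypothetical wrapper that tags raw readings with a component name."""

    def __init__(self, component: str, reader: Callable[[], Dict[str, float]]):
        self.component = component
        self.reader = reader

    def sample(self) -> Dict[str, object]:
        # One timestamped reading, ready to hand to a data service.
        return {
            "component": self.component,
            "timestamp": time.time(),
            "metrics": self.reader(),
        }


point = ObservationPoint("comp-a", process_metrics)
print(point.sample())
```

Because the reader is just a callable, a component that already exports its own data could supply its own function in place of `process_metrics`.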

SLT Data Services
- Abstracts information from observation points
- SLT algorithms are spawned for each component in the system, as the components are instantiated
- Observation data is stored by the SLT Data Server, possibly in a streaming database
- Listens for feedback from SLT algorithms to adjust the data stream as necessary
  - Increase the data sampling rate if an anomaly is suspected
  - Stop reporting certain data if it is deemed irrelevant
- Provides persistent data storage for SLT algorithms
  - Remember properties learned from previous analysis of observation data
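The feedback loop described above might look like the following sketch. The `DataStream` class, its method names, the halving rule, and the example metrics are all assumptions made for illustration; the slides do not specify how the sampling rate is adjusted.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DataStream:
    """Hypothetical per-component stream that SLT feedback can tune."""

    interval_s: float = 10.0      # time between samples
    min_interval_s: float = 1.0   # fastest sampling allowed
    muted: Set[str] = field(default_factory=set)

    def on_anomaly_suspected(self) -> None:
        # Sample twice as often, but never faster than the floor.
        self.interval_s = max(self.min_interval_s, self.interval_s / 2)

    def on_irrelevant(self, metric: str) -> None:
        # Stop reporting a metric the algorithm considers useless.
        self.muted.add(metric)

    def filter(self, sample: Dict[str, float]) -> Dict[str, float]:
        # Drop muted metrics before the sample is forwarded.
        return {k: v for k, v in sample.items() if k not in self.muted}


stream = DataStream()
stream.on_anomaly_suspected()
stream.on_irrelevant("swap_mb")
print(stream.interval_s)                                    # 5.0
print(stream.filter({"memory_mb": 300.0, "swap_mb": 0.0}))  # {'memory_mb': 300.0}
```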

Control Points
- Assumes crash-only components
  - Components can be reliably restarted through external means (can’t rely on components restarting themselves cleanly)
- Initially, only restart control points are supported
  - Instrument application server (JBoss) to restart applications and application components
  - OS can restart application servers
  - IP-addressable power strips can restart entire nodes
- Components can specify custom control policy
  - Leverage ACME’s configuration language
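The three restart control points above act at increasing scope (component, application server, node). One possible control policy, sketched below, tries the narrowest restart first and escalates only if the component stays unhealthy. This escalation order is an assumption, not something the slides prescribe, and the restart functions are stubs standing in for real JBoss, OS, and power-strip actions.

```python
from typing import Callable, Dict, List, Optional, Tuple

RestartAction = Callable[[], bool]  # returns True if the restart was issued


def recover(
    levels: List[Tuple[str, RestartAction]],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Try each restart level in order; return the one that fixed the fault."""
    for name, restart in levels:
        if restart() and is_healthy():
            return name
    return None  # every level was tried and the fault persists


# Stubbed environment: the fault here only clears at server scope or wider.
state: Dict[str, bool] = {"healthy": False}


def restart_component() -> bool:
    return True  # issued, but does not clear this particular fault


def restart_app_server() -> bool:
    state["healthy"] = True
    return True


def power_cycle_node() -> bool:
    state["healthy"] = True
    return True


levels = [
    ("component", restart_component),
    ("app_server", restart_app_server),
    ("node", power_cycle_node),
]
fixed_by = recover(levels, lambda: state["healthy"])
print(fixed_by)  # app_server
```

Because the restart is always triggered from outside the component, this policy is consistent with the crash-only assumption: it never relies on the component shutting itself down cleanly.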

Future Work
- “Master” SLT
  - Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
- Support additional types of control points
  - Multiple level settings that tune component parameters (e.g. filter level)
- Support additional types of observation points
  - Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
- Online SLT algorithms for anomaly detection are not mature
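To make the "Master" SLT idea concrete, here is a sketch using the classic weighted majority scheme from online learning: each SLT casts a failed/not-failed vote, and SLTs that turn out wrong lose weight. The slides do not name this algorithm; it is swapped in purely as an example, and it assumes ground truth eventually becomes available (for instance, an operator confirming a failure). The SLT names are hypothetical.

```python
from typing import Dict, Iterable


class WeightedMajority:
    """Combine binary votes from several SLTs, down-weighting wrong ones."""

    def __init__(self, experts: Iterable[str], beta: float = 0.5):
        self.weights: Dict[str, float] = {name: 1.0 for name in experts}
        self.beta = beta  # penalty factor applied to a wrong expert

    def predict(self, votes: Dict[str, bool]) -> bool:
        # Weighted vote; ties are resolved in favour of reporting a failure.
        yes = sum(w for n, w in self.weights.items() if votes[n])
        no = sum(w for n, w in self.weights.items() if not votes[n])
        return yes >= no

    def update(self, votes: Dict[str, bool], truth: bool) -> None:
        # Shrink the weight of every expert whose vote was wrong.
        for name in self.weights:
            if votes[name] != truth:
                self.weights[name] *= self.beta


master = WeightedMajority(["slt_a", "slt_b", "slt_c"])
votes = {"slt_a": True, "slt_b": False, "slt_c": False}

first = master.predict(votes)    # 1.0 for vs 2.0 against
master.update(votes, truth=True)  # slt_b and slt_c were wrong
second = master.predict(votes)   # 1.0 for vs 1.0 against, tie reports failure
print(first, second)             # False True
```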