Fault Management with OpenStack Congress and Vitrage, Based on OPNFV Doctor Framework Barcelona 2016 Ryota Mibu NEC Ohad Shamir Nokia Masahito Muroi.

Slides:



Advertisements
Similar presentations
High Availability Project Qiao Fu Project Progress Project details: – Weekly meeting: – Mailing list – Participants: Hui Deng
Advertisements

Seamless migration from Nova-network to Neutron in eBay production Chengyuan Li, Han Zhou.
Doctor Implementation Plan (Discussion) Feb. 6, 2015 Ryota Mibu, Tomi Juvonen, Gerald Kunzmann, Carlos Goncalves.
Profit from the cloud TM Parallels Dynamic Infrastructure AndOpenStack.
TOSCA Monitoring Straw-man for Initial Minimal Monitoring Use Case Roger Dev CA Technologies April 24, 2015.
Virtualized Infrastructure Deployment Policies (Copper) 19 February 2015 Bryan Sullivan, AT&T.
Network Management Overview IACT 918 July 2004 Gene Awyzio SITACS University of Wollongong.
Keith Wiles DPACC vNF Overview and Proposed methods Keith Wiles – v0.5.
24 February 2015 Ryota Mibu, NEC
(OpenStack Ceilometer)
1 Doctor Fault Management 18 May 2015 Ryota Mibu, NEC.
**DRAFT** Doctor Southbound API 14 April 2015 Ryota Mibu, NEC.
1 Doctor Fault Management - Updates - 30 July 2015 Ryota Mibu, NEC.
UI and Data Entry UI and Data Entry Front-End Business Logic Mid-Tier Data Store Back-End.
TOSCA Monitoring Straw-man for Initial Minimal Monitoring Use Case Roger Dev CA Technologies Revision 3 May 21, 2015.
**DRAFT** Blueprints Alignment (OpenStack Ceilometer) 4 March 2015 Ryota Mibu, NEC.
Fault Localization (Pinpoint) Project Proposal for OPNFV
Gerald Kunzmann, DOCOMO Carlos Goncalves, NEC Ryota Mibu, NEC
Extending OVN Forwarding Pipeline Topology-based Service Injection
Gerald Kunzmann, DOCOMO Carlos Goncalves, NEC Ryota Mibu, NEC
Cloud Computing – UNIT - II. VIRTUALIZATION Virtualization Hiding the reality The mantra of smart computing is to intelligently hide the reality Binary->
1 OPNFV Summit 2015 Doctor Fault Management Gerald Kunzmann, DOCOMO Carlos Goncalves, NEC Ryota Mibu, NEC.
TOSCA Monitoring Straw-man for Initial Minimal Monitoring Use Case Roger Dev CA Technologies Revision 2 May 1, 2015.
Service Charging Platform. EMS (Entity Management System) 0 Logging Agent Provides detailed activity logs and reports all raw facts as they happen to.
**DRAFT** Doctor Host Maintenance Maintenance changes to OpenStack Nova 17 May 2016 Tomi Juvonen Nokia.
© 2015 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. 1 VF (Virtual Functions) Event.
Ashiq Khan NTT DOCOMO Congress in NFV-based Mobile Cellular Network Fault Recovery Ryota Mibu NEC Masahito Muroi NTT Tomi Juvonen Nokia 28 April 2016OpenStack.
Ashiq Khan NTT DOCOMO Congress in NFV-based Mobile Cellular Network Fault Recovery Ryota Mibu NEC Masahito Muroi NTT Tomi Juvonen Nokia 28 April 2016OpenStack.
What is OPNFV? Frank Brockners, Cisco. June 20–23, 2016 | Berlin, Germany.
Failure Inspection in Doctor utilizing Vitrage and Congress
Doctor Host Maintenance Maintenance changes to OpenStack Nova 13 Jun 2016 Tomi Juvonen Nokia.
Congress Blueprint --policy abstraction
**DRAFT** Doctor+Congress OPNFV Summit June 2016 Doctor+Congress PoC team.
Doctor Tech Deep Dive Tomi Juvonen, Nokia Ryota Mibu, NEC.
NFV Infrastructure Maintenance Automation by OPNFV Doctor
Keeping My (Telco) Cloud Afloat
Master Service Orchestrator (MSO)
X V Consumer C1 Consumer C2 Consumer C3
Doctor + OPenStack Congress
Ashiq Khan, NTT DOCOMO Ryota Mibu, NEC
Maintenance changes to OpenStack Nova 21 Jun 2016 Tomi Juvonen Nokia
Principles of Computer Security
OPNFV Doctor - How OPNFV project works -
Doctor PoC Booth Vitrage Demo
Multi-VIM/Cloud High Level Architecture
ONAP – Centralised Parser Distribution Atul Purohit - Vodafone
Tomi Juvonen SW Architect, Nokia
Storage Virtualization
Tomi Juvonen Software Architect, Nokia
OpenStack Ani Bicaku 18/04/ © (SG)² Konsortium.
Infrastructure Maintenance & Upgrade: Zero VNF Downtime with OPNFV Doctor on OCP Hardware AirFrame Open Rack V1 and V2 compatible Demo video:
Proactive RCA with Vitrage, Kubernetes, Zabbix and Prometheus
Northbound API Dan Shmidt | January 2017
Vitrage Project Update, OpenStack Summit Vancouver
Chapter 2 Database Environment Pearson Education © 2009.
Vitrage hands-on lab Muhamad Najjar, Eyal Bar-Ilan CloudBand, Nokia
Mix & Match: Resource Federation
On the Way to Cloud Native:
Managing Services with VMM and App Controller
Technical Capabilities
Vitrage hands-on lab Muhamad Najjar, Marina Koushnir CloudBand, Nokia
OpenStack Ceilometer Blueprints for Liberty
Chapter 2 Database Environment Pearson Education © 2009.
Vitrage Project Update, OpenStack Summit Berlin
Chapter 2 Database Environment Pearson Education © 2009.
Doctor OpenStack Controller changes Tomi Juvonen Nokia
**DRAFT** NOVA Blueprint 03/10/2015
**DRAFT** Doctor Southbound API 23 Feb 2016 Ryota Mibu, NEC.
Doctor Host Maintenance
Presentation transcript:

Fault Management with OpenStack Congress and Vitrage, Based on OPNFV Doctor Framework Barcelona 2016 Ryota Mibu NEC Ohad Shamir Nokia Masahito Muroi NTT

Contents Failure Inspection in OPNFV Doctor - Ryota OpenStack Vitrage - Ohad OpenStack Congress - Masa

Failure Inspection in OPNFV Doctor

Virtualized Platform VM VM VM Net Net PM PM H/W Switch H/W Switch Volume Port Port Port Volume Net Net PM PM H/W Switch H/W Switch

Virtualized Platform X X VM VM VM Net Net PM PM H/W Switch H/W Switch Volume X Port Port Port Volume Net Net PM PM H/W Switch H/W Switch

What is “failure”? Depends on So, “failure” has to be configurable Applications (VNFs) Back-end technologies used in the deployment Redundancy of the equipment/components Operator Policy Regulation Topologies of Network / Power-supply So, “failure” has to be configurable

Fault Management Architecture Designed by Doctor Project Applications Manager 0. Set Alarm 6-. Action 5. Notify Error Virtualized Infrastructure 4. Notify all Controller Controller Notifier Nova Controller Resource Map Alarm Conf. Neutron Ceilometer+Aodh Cinder 3. Update State 2. Find Affected 4. (alt) Notify Monitor Monitor Inspector Monitor Failure Policy Congress 1. Raw Fault Vitrage

State Correction APIs Nova Neutron Cinder Reset Server State POST /servers/{server_id}/action { "os-resetState": { "state": “error" } } Update Forced Down PUT /os-services/force-down { "host": "host1", "binary": "nova-compute", "forced_down": true } Neutron Update port data plane status WIP https://review.openstack.org/351675/ PUT /v2.0/ports/<port-uuid> { "port": { "dp_down": true } } Cinder Reset a volume's statuses POST /volumes/{volume_id}/action { "os-reset_status": { "status": "available", "attach_status": "detached", "migration_status": "migrating" } }

Inspector Module Options OpenStack Vitrage Various data source (OpenStack, Nagios, etc.) Ability to store and refer physical topologies for correlation Holistic & complete view of the system OpenStack Congress Dynamic data collection from OpenStack services Flexible policy definition for correlation (Datalog) Well integrated with other OpenStack projects

Stained glass art produced through the combination of brilliantly colored glass in varying degrees of transparency, creating a dynamic art form which is transformed with every variation in light Stained glass window Taken alone they don’t make sense Together they create a pattern

Vitrage in a nutshell Official OpenStack project for Root Cause Analysis Vitrage Functions Root Cause Analysis – understand why faults occurred Deduced alarms and states – raising alarms and modifying states based on system insights Holistic & complete view of the system

Vitrage – Architecture Highlights Easily extendible to add new data sources Multiple Data sources Reflects how entities relate to one another Entity Topology Graph Template-based behavior Configurable business logic

Vitrage High Level Architecture Horizon Plug-in: Hierarchical view Vitrage alarm list RCA diagram per alarm Entity graph view Templates list Expose Vitrage alarms and state changes to other projects or external systems Multiple Data Sources (extendible): External monitoring tools: Nagios, Zabbix OpenStack projects Physical topology Templates for deduced alarms and RCA: Each template can contain one or more scenarios (scenario = condition + action) Human readable Configurable

Vitrage Template Template contains three sections: Metadata – name, description of template Definitions – entities & relationships Scenarios – conditions & actions Each template can contain one or more scenarios (condition + action/s) YAML format, human readable Configurable

Template Example – Host high CPU load Scenario 1 – Raise Alarm When high CPU load on host and host contains instance: Raise deduced alarm on instance named cpu performance degradation Set state to suboptimal on instance Scenario 2 – RCA When high CPU load on host and host contains instance and cpu performance degradation alarm on instance, add causal relationship Scenario 3 – Set host state When high CPU load on host, set host state to suboptimal

Template Example – Host high CPU load Raise alarm on VM (deduced alarm and state) 1 Add causal relationship 2 Set host state 3

How does it work? Entity Graph + Evaluator Vitrage Evaluator listens to change events in the entity graph and upon event: Retrieve templates (scenarios) relevant to event Evaluate condition against the state of the Entity Graph Execute actions for each matched condition E B A C C A A B B B C Sub-graph Matching F A C F C G G G

Let’s see Vitrage Stained glass window Taken alone they don’t make sense Together they create a pattern

Vitrage as Doctor Inspector Push and pull interfaces to various monitoring tools (e.g. Nagios, Zabbix) and to OpenStack projects -> fast failure notification Mapping between physical and logical failures Expose more faults and changes to resources (deduced alarms and states) Provide Root Cause Analysis indicators to the application manager Can be configured differently for different systems

OpenStack Congress

What is Congress? Goal of Congress Governance as a Service Policy Define and enforce policy for Cloud Services Policy No single definition Law/Regulations Business Rules Security Requirements Application Requirements Datalog style policy Goal of Congress Any Service, any Policy

Congress Architecture Policy Data Policy Monitor in Doctor Policy Enforcement Congress API Policy Engine Nova DataSourceDriver Neutron DataSourceDriver Doctor DataSourceDriver Nova Neutron

Congress Doctor Driver Data Policy 1. Policy Monitor Policy Enforcement Doctor data flow Congress API 2. Doctor DataSourceDriver 4. Policy Engine 3. Nova DataSourceDriver Monitor notifies hardware failure event to Congress Doctor Driver receives failure event, insert it to event list of Doctor Data Policy Engine receives the failure event, then evaluate registered policy and enforce state correction Policy Engine instruct Nova Driver to perform host service force down and reset state of VM(s) Neutron DataSourceDriver Keystone DataSourceDriver Security System DataSourceDriver Nova Neutron Keystone Security System

Congress Doctor Driver (Detail) Driver Schema(HW failure example) +--------+-----------------------------------------------------+ | table | columns | +--------+-----------------------------------------------------+ | events | {'name': 'id', 'description': 'None'}, | | | {'name': 'time', 'description': 'None'}, | | | {'name': 'type', 'description': 'None'}, | | | {'name': 'hostname', 'description': 'None'}, | | | {'name': 'status', 'description': 'None'}, | | | {'name': 'monitor', 'description': 'None'}, | | | {'name': 'monitor_event_id', 'description': 'None'} | +--------+-----------------------------------------------------+ Event List of Doctor Data +----------------+-------------------------------+----------------+---------------+--------+--------------+------------------+ | id | time | type | hostname | status | monitor | monitor_event_id | +----------------+-------------------------------+----------------+---------------+--------+--------------+------------------+ | 0123-4567-89ab | 2016-03-09T07:39:27.230277464 | host.nic1.down | demo-compute0 | down | demo_monitor | 111 | +----------------+-------------------------------+----------------+---------------+--------+--------------+------------------+

Policy for Inspector A hypervisor and instances on the hypervisor must be down state/error status if some errors are reported by Monitor.

Policy of Congress List hypervisors that violate the policy Still in “up” state even though it’s reported error happens List instances that violate the policy Still in “Active” status even though it’ reported error happens binary_list("nova-compute") active_hypervisor(hypervisor, binary):- nova:services(host=hypervisor, binary=binary, status="up"), binary_list(binary), error_reported_hosts(hypervisor) active_instance_in_host(vmid, hypervisor):- nova:servers(id=vmid, host_name=hypervisor, status="ACTIVE"), error_reported_hosts(hypervisor)

Policy of Congress How to fix the violation mark host down the hypervisor Set the vm status to “error” execute[nova:services.force_down(host, binary, "True")] :- active_compute_hypervisor(host, binary) execute[nova:servers.reset_state(vmid, "error")] :- active_instance_in_host(vmid, host)

Scenario 1: sensitive app/operator Defines both non-broken events and broken events as a failure mark_down_events(’host.nic1.down') mark_down_events(’host.nic2.down') mark_down_events(‘host.cpu.high-load’) error_reported_hosts(hosts):- doctor:events(hostname=hosts, type=event_t, status=“down”), mark_down_events(event_t)

Scenario 2: insensitive app/operator Defines only broken events as a failure mark_down_events(’host.nic1.down') mark_down_events(’host.nic2.down’) mark_down_events(‘host.cpu.high-load’) error_reported_hosts(hosts):- doctor:events(hostname=hosts, type=event_t, status=“down”), mark_down_events(event_t)

Thank You!