Doctor PoC Booth Vitrage Demo
Vitrage in a nutshell
  Official OpenStack project for Root Cause Analysis
  Vitrage Functions
  - Root Cause Analysis: understand what caused faults to occur
  - Deduced alarms and states: raising alarms and modifying states based on system insights
  - Holistic & complete view of the system
Vitrage User Interface (Horizon plug-in)
  - Hierarchical View
  - Root Cause Analysis
Vitrage – Under the Hood
  Architecture Highlights
  - Resource topology graph
      Reflect how different entities relate to one another
      RCA is all about relationships!
  - Multiple data sources
  - Extendible
      Easy to add new data sources
  - Configurable business logic
      Template-based behavior
  - Clear visualization of Vitrage insights
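To illustrate why the resource topology graph matters for RCA, here is a minimal sketch in plain Python. It does not use Vitrage's actual data model or API; the `TOPOLOGY` dictionary, the entity names, and the `affected_by` helper are all hypothetical and exist only to show how "who depends on whom" can be read off a graph.

```python
from collections import deque

# Hypothetical topology (not Vitrage's real data model):
# each entity maps to the entities it hosts or contains.
TOPOLOGY = {
    "zone-1": ["host-1", "host-2"],
    "host-1": ["vm-1"],
    "host-2": ["vm-2"],
    "vm-1": [],
    "vm-2": [],
}


def affected_by(graph, failed):
    """Return every entity reachable from `failed`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(failed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen
```

For example, `affected_by(TOPOLOGY, "host-1")` returns `["vm-1"]`: a fault on the host propagates only to the VM it hosts, which is the kind of relationship the graph is meant to capture.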
Vitrage as Doctor Inspector
  - Push and pull interfaces to various monitoring tools (e.g. Nagios, Zabbix) and to OpenStack projects -> fast failure notification
  - Mapping between physical and logical failures
  - Expose more faults and changes to resources (deduced alarms and states)
  - Provide Root Cause Analysis indicators to the application manager
  - Can be configured differently for different systems
Vitrage + Doctor – Demo Flow
  - Vitrage receives an alarm from Nagios once the host NIC fails.
  - Vitrage finds the affected resources based on its cloud topology graph and templates.
  - Vitrage calls Nova to mark the host down and update the states of the affected VM(s).
  - Nova notifies Aodh, which raises an alarm. The Application Manager gets the alarm notification and transfers control to VM2 (STBY).
  Diagram components: Monitor, Aodh, Ceilometer, Application Manager, Alarm Conf., Controller, Nova, Resource Map, Vitrage, Templates, Nagios, Virtualized Infrastructure (Resource Pool) with HOST1, VM1, HOST2, VM2 and the host NIC
  Diagram steps:
    1. Configure Aodh event alarm
    2. Monitor
    3. Notify Raw Failure
    4. Find Affected
    5. Update State
    6. Notify all
    7. Notify Error
    8. Switch to STBY
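The demo flow can be sketched as a small state simulation. This is only an illustration under stated assumptions: it does not call Nagios, Vitrage, Nova, or Aodh, and the function `handle_nic_failure`, its parameters, and the `UP`/`DOWN` host states are hypothetical. The `ACT`, `STBY`, and `ERROR` labels are taken from the states view in this deck.

```python
def handle_nic_failure(host, hosted_vms, states, standby_of):
    """Simulate the demo flow for a NIC failure on `host`.

    hosted_vms: host name -> list of VM names (stand-in for the topology graph)
    states:     entity name -> state label; the input dict is not modified
    standby_of: active VM name -> its standby VM name (hypothetical pairing)
    Returns (affected_vms, new_states).
    """
    new_states = dict(states)

    # Find the resources affected by the failed host.
    affected = list(hosted_vms.get(host, []))

    # Mark the host down and update the states of the affected VMs.
    new_states[host] = "DOWN"
    for vm in affected:
        was_active = states.get(vm) == "ACT"
        new_states[vm] = "ERROR"

        # Application-manager step: switch to the standby VM if there is one.
        backup = standby_of.get(vm)
        if was_active and backup is not None and new_states.get(backup) == "STBY":
            new_states[backup] = "ACT"

    return affected, new_states
```

With `hosted_vms = {"HOST1": ["VM1"], "HOST2": ["VM2"]}`, initial states of `ACT` for VM1 and `STBY` for VM2, and `standby_of = {"VM1": "VM2"}`, a failure on `HOST1` marks VM1 as `ERROR` and promotes VM2 to `ACT`. In the real flow these steps are spread across separate components, and the switch-over is decided by the Application Manager rather than by the inspector.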
Vitrage Demo – Root Cause Analysis
  Alarms shown: NIC Down, Host Down, VM 1 unreachable, Service unavailable, VM unreachable
  Legend: Nagios Alarm, Vitrage/AODH Alarm, Vitrage Alarm
  Edge label: Causes
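A root cause lookup over "causes" links can be sketched as a walk back along the chain. The mapping below is an assumption: it takes the causal order to be the order in which the slide lists the alarms (NIC Down, Host Down, VM 1 unreachable, Service unavailable). The names `CAUSED_BY` and `root_cause` are hypothetical and are not part of Vitrage.

```python
# Assumed causal links, keyed by effect: effect -> the alarm that caused it.
CAUSED_BY = {
    "Host Down": "NIC Down",
    "VM 1 unreachable": "Host Down",
    "Service unavailable": "VM 1 unreachable",
}


def root_cause(alarm, caused_by):
    """Follow 'caused by' links from `alarm` until an alarm with no cause is found."""
    seen = {alarm}
    while alarm in caused_by:
        alarm = caused_by[alarm]
        if alarm in seen:
            # Guard against malformed input that loops back on itself.
            raise ValueError("cycle in causal links")
        seen.add(alarm)
    return alarm
```

Under that assumed chain, `root_cause("Service unavailable", CAUSED_BY)` walks back through the VM and host alarms and ends at `"NIC Down"`.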
Vitrage Demo – States View
  Before failure
    Entities: Availability Zone, Host 1, Host 2, VM1, VM2, My Service 1, My Service 2, APP
    State labels: ACT, STBY
  After NIC failure
    Entities: Availability Zone, Host 1, Host 2, VM1, VM2, My Service 1, My Service 2, APP
    State labels: ACT, ERROR, STBY