DAM-Alarming Data Analytics from Monitoring, for alarming Summer Student Project 2015 A. Martin, C. Cristovao, G. Domenico thanks to Luca Magnoni IT-SDC-MI.

Slides:

Advertisements

Similar presentations

26/05/2004HEPIX, Edinburgh, May Lemon Web Monitoring Miroslav Šiket CERN IT/FIO

Advertisements

Evaluation of NoSQL databases for DIRAC monitoring and beyond

ATSN 2009 Towards an Extensible Agent-based Middleware for Sensor Networks and RFID Systems Dirk Bade University of Hamburg, Germany.

HOL9396: Oracle Event Processing 12c

Polaris Financial Technologies Welcomes the members of Hyderabad chapter for the 2nd event on 4 th July 14 held by PACE (The Testing Practice)

OpStor - A multi vendor storage resource management and capacity forecasting software.

Advanced Monitoring With Complex Stream Processing Magnoni Luca, Cons Lionel IT-SDC Reguero Ignacio IT-PES IT Technical Forum.

TM Herding Penguins with Performance Co-Pilot Ken McDonell Performance Tools Group SGI, Melbourne.

Graph-RAT Overview By Daniel McEnnis. 2/32 What is Graph-RAT  Relational Analysis Toolkit  Database abstraction layer  Evaluation platform  Robustly.

CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.

Elasticsearch in Dashboard Data Management Applications David Tuckett IT/SDC 30 August 2013 (Appendix 11 November 2013)

ATLAS Off-Grid sites (Tier-3) monitoring A. Petrosyan on behalf of the ATLAS collaboration GRID’2012, , JINR, Dubna.

Test Of Distributed Data Quality Monitoring Of CMS Tracker Dataset H->ZZ->2e2mu with PileUp - 10,000 events ( ~ 50,000 hits for events) The monitoring.

Module 7: Fundamentals of Administering Windows Server 2008.

Applications of Advanced Data Analysis and Expert System Technologies in the ATLAS Trigger-DAQ Controls Framework G. Avolio – University of California,

Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.

Real Time Monitor of Grid Job Executions Janusz Martyniak Imperial College London.

The Network Performance Advisor J. W. Ferguson NLANR/DAST & NCSA.

Technische Universität München Application Performance Monitoring of a scalable Java web-application in a cloud infrastructure Final Presentation August.

May PEM status report. O.Bärring 1 PEM status report Large-Scale Cluster Computing Workshop FNAL, May Olof Bärring, CERN.

Nikos Kefalakis, John Soldatos, Efstathios Mertikas, Neeli R. Prasad Generating Business Events in an RFID Network.

What’s New in WatchGuard XCS v9.1 Update 1. WatchGuard XCS v9.1 Update 1  Enhancements that improve ease of use New Dashboard items  Mail Summary >

Distributed monitoring system. Why Monitor? Solve them! Identify Problems Ensure conduct Requirements Manage many computers Spot trends in the system.

And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks WMSMonitor: a tool to monitor gLite WMS/LB.

 Apache Airavata Architecture Overview Shameera Rathnayaka Graduate Assistant Science Gateways Group Indiana University 07/27/2015.

1 Oracle Enterprise Manager Slides from Dominic Gélinas CIS

INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.

Information Services Andrew Brown Jon Ludwig Elvis Montero grid:seminar1:lectures:seminar-grid-1-information-services.ppt.

CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.

LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.

ClearQuest XML Server with ClearCase Integration Northwest Rational User’s Group February 22, 2007 Frank Scholz Casey Stewart

INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.

Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉教授 : 許毅然作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.

XROOTD AND FEDERATED STORAGE MONITORING CURRENT STATUS AND ISSUES A.Petrosyan, D.Oleynik, J.Andreeva Creating federated data stores for the LHC CC-IN2P3,

Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF Agile Infrastructure Monitoring HEPiX Spring th April.

WebWatcher A Lightweight Tool for Analyzing Web Server Logs Hervé DEBAR IBM Zurich Research Laboratory Global Security Analysis Laboratory

Steven Perry Dave Vieglais. W a s a b i Web Applications for the Semantic Architecture of Biodiversity Informatics Overview WASABI is a framework for.

ECHO A System Monitoring and Management Tool Yitao Duan and Dawey Huang.

CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.

MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.

- GMA Athena (24mar03 - CHEP La Jolla, CA) GMA Instrumentation of the Athena Framework using NetLogger Dan Gunter, Wim Lavrijsen,

Global ADC Job Monitoring Laura Sargsyan (YerPhI).

WLCG Transfers Dashboard A unified monitoring tool for heterogeneous data transfers. Alexandre Beche.

ATLAS Off-Grid sites (Tier-3) monitoring A. Petrosyan on behalf of the ATLAS collaboration GRID’2012, , JINR, Dubna.

A Validation System for the Complex Event Processing Directives of the ATLAS Shifter Assistant Tool G. Anders (CERN), G. Avolio (CERN), A. Kazarov (PNPI),

CERN IT Department CH-1211 Genève 23 Switzerland t CERN Agile Infrastructure Monitoring Pedro Andrade CERN – IT/GT HEPiX Spring 2012.

The ATLAS Run 2 Expert System Giuseppe Avolio - CERN.

Data Quality Processes in MMEA platform

TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,

Design of a Notification Engine for Grid Monitoring Events and Prototype Implementation Natascia De Bortoli INFNGRID Technical Board Bologna Feb.

CERN IT Department CH-1211 Genève 23 Switzerland t Monitoring: Present and Future Pedro Andrade (CERN IT) 31 st August.

Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.

CERN IT Department CH-1211 Genève 23 Switzerland t Load testing & benchmarks on Oracle RAC Romain Basset – IT PSS DP.

WLCG Transfers monitoring EGI Technical Forum Madrid, 17 September 2013 Pablo Saiz on behalf of the Dashboard Team CERN IT/SDC.

Detecting Web Attacks Using Multi-Stage Log Analysis

Distributed Cache Technology in Cloud Computing and its Application in the GIS Software Wang Qi Zhu Yitong Peng Cheng

WinCC-OA Log Analysis SCADA Application Service - Reporting

Open Source distributed document DB for an enterprise

Spark Presentation.

Robert Szuman – Poznań Supercomputing and Networking Center, Poland

Monitoring HTCondor with Ganglia

Remote Monitoring solution

Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.

Monitoring of the infrastructure from the VO perspective

Proactive RCA with Vitrage, Kubernetes, Zabbix and Prometheus

20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.

Serverless Architecture in the Cloud

AIMS Equipment & Automation monitoring solution

The ELK stack - get to know logs

Presentation transcript:

DAM-Alarming Data Analytics from Monitoring, for alarming Summer Student Project 2015 A. Martin, C. Cristovao, G. Domenico thanks to Luca Magnoni IT-SDC-MI 31/08/2015

Introduction Evaluation and implementation of CEP mechanisms to act upon infrastructure metrics monitored by Ganglia WLCG is integrating cloud technologies providing an additional approach for delivering computing capacity Ganglia is being adopted as monitoring system for clouds Re-purposing raw monitoring data in order to detect anomalies DAM-Alarming2

Resource profiling Accounting Alarming DAM-Alarming3 DAM Overview

DAM-Alarming4 architecture UDP TCP DAM-Alarming XMLXML HH H : host event. Contains last metrics’ values per host ESPER SS S : status event. Contains current host status     : intermediate events Event Processing statements

Scalable distributed monitoring system “Designed for high-performance computing systems such as clusters and Grids” Open source project Collecting a standard set of monitoring metrics (cpu, memory, disk, …) Providing web interface for visualizing host performance DAM-Alarming5 Ganglia introducing the distributed monitoring system

Multiple gmonds on server, one per cluster DAM-Alarming6 Expected deployment use case GMOND GMETAD UDP GMOND VM

Contacting each gmond on a server to retrieve monitoring data DAM-Alarming7 Fitting DAM-Alarming in deployment GMOND VM TCP UDP DAM-Alarming

Set of technologies to process events and discover complex patterns among their streams Goal: identify meaningful events and promptly respond to them DAM-Alarming8 Complex Event Processing (CEP) Events Continuous processing Query Notifications

Open source CEP solution Strong performance, in memory computation Strong community and support In use at IT-SDC: Metis monitoring system Simple Java API Event types compliant with several input formats (including POJO and Map objects) Event Processing language (EPL): SQL-like statements DAM-Alarming9 ESPER introducing the event processing engine

Filtering events Aggregation functions Pattern matching Examples of EPL statements select * from pattern [ every s=Status(state=State.OK) -> ( timer:interval(5 minutes) and not s_update=Status(hostname=s.hostname, cluster=s.cluster)) ]; select * from pattern [ every s=Status(state=State.OK) -> ( timer:interval(5 minutes) and not s_update=Status(hostname=s.hostname, cluster=s.cluster)) ]; select avg(cpu_idle) from ReportEvent.win:time(20 minutes); select * from ReportEvent where cpu_idle > 85; DAM-Alarming10

Raw data parsed into GangliaReport events DAM-Alarming11 Monitoring model from events describing raw data GANGLIA GangliaReport All parsed metrics

Raw data parsed into GangliaReport events First EPL statement determines the host state based on cpu_idle Possibility to correlate metrics DAM-Alarming12 select hostname, VO, cluster, monitor, reported, avg(cpu_idle) as ave, cpu_idle as lastVal, 'cpu_idle' as metricName, 'high cpu_idle' as problem case when avg(cpu_idle) > 85 then State.ERROR when avg(cpu_idle) > 70 then State.WARNING when cpu_idle is null then State.UNKNOWN else State.OK end as state from GangliaReport.win:time(var_timeWindowLength min) group by hostname,cluster; select hostname, VO, cluster, monitor, reported, avg(cpu_idle) as ave, cpu_idle as lastVal, 'cpu_idle' as metricName, 'high cpu_idle' as problem case when avg(cpu_idle) > 85 then State.ERROR when avg(cpu_idle) > 70 then State.WARNING when cpu_idle is null then State.UNKNOWN else State.OK end as state from GangliaReport.win:time(var_timeWindowLength min) group by hostname,cluster; Monitoring model first layer of EPL statements GANGLIA GangliaReport All parsed metrics EPL

DAM-Alarming13 Monitoring model events describing host state GANGLIA GangliaReport All parsed metrics Check State, relevant metric, average EPL Raw data parsed into GangliaReport events First EPL statement determines the host state based on cpu_idle Check Events describe the hosts state at current time

Raw data parsed into GangliaReport events First EPL statement determines the host state based on cpu_idle Check Events describe the hosts state at current time Second EPL keeps only Check events indicating state transitions: new event Status DAM-Alarming14 select * from Check.std:groupwin(hostname, cluster).win:length(2) where state is not prev(1, state) and prev(1, state) is not null; select * from Check.std:groupwin(hostname, cluster).win:length(2) where state is not prev(1, state) and prev(1, state) is not null; Monitoring model second level of EPL statements GANGLIA GangliaReport All parsed metrics Check State, relevant metric, average Status State, relevant metric, average EPL

Raw data parsed into GangliaReport events First EPL statement determines the host state based on cpu_idle Check Events describe the hosts state at current time Second EPL keeps only Check events indicating state transitions: new event Status State transitions trigger alarms DAM-Alarming15 GANGLIA GangliaReport All parsed metrics Check State, relevant metric, average Status State, relevant metric, average ALARM EPL Monitoring model alarming on status change

DAM-Alarming16 Monitoring model v1.1 lessons learnt GANGLIA GangliaReport All parsed metrics Check State, relevant metric, average Status State, relevant metric, average ALARM EPL Filtering notifications became crucial for alarming Third statement was added All status events still inserted into ES

Filtering notifications became crucial for alarming Third statement was added All status events inserted into ES DAM-Alarming17 Upgrading monitoring model filtering notifications GANGLIA GangliaReport All parsed metrics Check State, relevant metric, average Status State, relevant metric, average EPL Notification State, relevant metric, average EPL ElasticSearch

DAM-Alarming 18 Current deployment details Querying all 5 production Ganglia servers Used by ATLAS, CMS and LHCb Poll interval: 15 seconds 1.5 seconds timeout for retrieving from one server Ganglia Information System ? JSON XML

DAM-Alarming19 Parsing the Retrieved Data understanding the monitoring data......# total of 29 metrics per host (default) snippet

Parsing into a Map object using SAX library Metrics describing gmond and report (timestamp, TN, hostname, cluster,...) Metrics describing host status (cpu_idle, mem_free, cpu_num, proc_total) DAM-Alarming20 Parsing the Retrieved Data from XML to ESPER Event Map event = new HashMap (); event.put("hostname", "lcg-wn02-02.hep.ucl.ac.uk"); event.put(“tn", 257); event.put(“reported", ); event.put("cpu_idle", 85.0f); event.put("cpu_num", 5); event.put("proc_total", 4); event.put("mem_free", ); event.put(“gmondStarted", ); event.put(“location", "unspecified");

First run 17 hours, produced s, more than s/hour Averaged over 2 minutes, Second run 16 hours, 500 /hour Averaged over 15 minutes, tweaked first statement DAM-Alarming21 First statistics from live deployment analyzing output of first two runs connected to production servers Notifications per host in first run

Need to get rid of false positives 6 mails in 40 minutes DAM-Alarming22 Filtering notifications hosts oscillating around fixed threshold ERROR WARNING OK 85 70

Filtering out the notifications Combining states WARNING and ERROR into one DAM-Alarming23 Filtering notifications hosts oscillating around fixed threshold ERROR WARNING OK

Filtering out the notifications Combining states WARNING and ERROR into one Reporting only long lasting OK state DAM-Alarming24 Filtering notifications hosts oscillating around fixed threshold 85 ERROR WARNING OK 70

DAM-Alarming25 Filtering notifications successful example

Connected to ElasticSearch Sending messages via log4j to Logstash Visualizing using Kibana 3 and Kibana 4 Improved statistics evaluation efficiency DAM-Alarming26 Evaluating statistics during development

Simplifying aggregations DAM-Alarming27 Kibana 4 more examples

Real run statistics: Time: :00 – : machines 8737 Status updates in ES 106 s DAM-Alarming28 Concept proven! Performance example after introducing mail notification filtering

DAM-Alarming29 Challenges in alarming hosts flapping at different rates Flapping Oscillating between states Difficult to detect Multiple kinds Variable period

DAM-Alarming30 Challenges in alarming hosts flapping at different rates Flapping Oscillating between states Difficult to detect Multiple kinds Variable period Detection based on fixed values Number of status changes in a fixed time window

Detecting Unreachable status by checking the freshness of the report DAM-Alarming31 Challenges in alarming unreachable hosts

DAM-Alarming32 Future Work Improve classification Flapping detection Revision of alarming model (aggregate alarms per cluster)

DAM-Alarming33 Project documentation GitBook documentation with link to JavaDoc

DAM-Alarming34 Project documentation GitBook documentation with link to JavaDoc

Testing all components Using JUnit testing suite to implement Unit test DAM-Alarming35 Testing applications components

Using JUnit testing suite Help when developing statements Could control time to test time windows and time based patterns DAM-Alarming36 Testing EPL statements // initializing ESPER engine and load EPL modules EPRuntime cepRT = initEsper("test"); // creating test event Map eventOK = new HashMap (); eventOK.put("hostname", "lhcb-cloud.cern.ch"); eventOK.put("reported", ); eventOK.put("state", State.OK); // sending first event cepRT.sendEvent(eventOK, EVENT_TYPE); // shifting time long timeInMillis = System.currentTimeMillis(); timeInMillis += 15000; timeEvent = new CurrentTimeEvent(timeInMillis); cepRT.sendEvent(timeEvent); // sending second event cepRT.sendEvent(eventOK, EVENT_TYPE); // initializing ESPER engine and load EPL modules EPRuntime cepRT = initEsper("test"); // creating test event Map eventOK = new HashMap (); eventOK.put("hostname", "lhcb-cloud.cern.ch"); eventOK.put("reported", ); eventOK.put("state", State.OK); // sending first event cepRT.sendEvent(eventOK, EVENT_TYPE); // shifting time long timeInMillis = System.currentTimeMillis(); timeInMillis += 15000; timeEvent = new CurrentTimeEvent(timeInMillis); cepRT.sendEvent(timeEvent); // sending second event cepRT.sendEvent(eventOK, EVENT_TYPE);

DAM-Alarming37 Conclusion Successfully implemented an anomaly detection and alarming system based on raw monitoring data coming from Ganglia Tested on 5 production cloud monitoring Ganglia servers Detecting anomalies based on cpu_idle and reporting them without spamming Injecting all status transitions into ES Processing more than events/hour and can scale up

DAM-Alarming39 Challenges in alarming host flapping and spamming

DAM-Alarming40 First statistics from live deployment analyzing output of first two runs connected to production servers Notifications per host in first runNotifications per host in second run

DAM-Alarming41 First statistics from live deployment analyzing output of first two runs connected to production servers Notifications per cluster in first run Notifications per cluster in second run

DAM-Alarming42 First statistics from live deployment analyzing output of first two runs connected to production servers