IT Monitoring Service Status and Progress
Alberto AIMAR, IT-CM-MM

Outline
- Monitoring
- Data Centres
- Experiments Dashboards
- Architecture and Technologies
- Status and Plans

Monitoring
- Data Centre Monitoring
  - Monitoring of DC at CERN and Wigner
  - Hardware, operating system, and services
  - Data Centres equipment (PDUs, temperature sensors, etc.)
  - Used by service providers in IT, experiments
- Experiment Dashboards
  - Sites availability, data transfers, job information, reports
  - Used by WLCG, experiments, sites and users
- Both hosted by CERN IT, in different teams

Mandate
- Focus for 2016
  - Regroup monitoring activities hosted by CERN/IT (Data Centres, Experiment Dashboards, ETF, HammerCloud, etc.)
  - Continue existing services
  - Uniform with CERN IT practices: management of services, communication, tools (e.g. GGUS and SNOW tickets)
- Starting with
  - Merge Data Centres and Experiment Dashboards monitoring technologies
  - Review existing monitoring usage and needs (DC, WLCG, etc.)
  - Investigate new technologies
  - Unchanged support while collecting feedback and working

Data Centres Monitoring

Experiment Dashboards
- Job Monitoring: Analysis + Production; Real time and Accounting views
- Data Management Monitoring: Data transfer; Data access
- Infrastructure Monitoring: Site Status Board; SAM3
- Outreach: Google Earth Dashboard
- Audience: Sites, Operation Teams, Users, General Public
- [Plot: Dashboard users per day]

Experiment Dashboards
- Job monitoring, sites availability, data management and transfers
- Used by experiments operation teams, sites, users, WLCG

WLCG Transfer Dashboard

ATLAS Distributed Computing

Higgs Seminar

Architecture and Technologies

Unified Monitoring Architecture
[Diagram labels: Data Sources (Data Centres, WLCG), Transport, Kafka, Processing, Storage/Search, Data Access]

Current Monitoring
[Diagram labels, grouped by layer]
- Areas: Data mgmt and transfers, Job Monitoring, Infrastructure Monitoring, Data Centres Monitoring
- Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS
- Transport: Flume, AMQ, Kafka, GLED, HTTP Collector, SQL Collector, MonaLISA Collector, HTTP GET, HTTP PUT
- Storage & Search: HDFS, Oracle, ElasticSearch
- Processing & Aggregation: Spark, Hadoop Jobs, GNI, ES Queries, Oracle PL/SQL, ESPER
- Display and Access: Kibana, Jupyter, Zeppelin, Dashboards (ED), Real Time (ED), Accounting (ED), API (ED), SSB (ED), SAM3 (ED)

Unified Monitoring
[Diagram labels, grouped by layer]
- Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS
- Transport: Flume, AMQ, Kafka
- Storage & Search: Hadoop HDFS, ElasticSearch
- Processing & Aggregation: Spark, Hadoop Jobs, GNI
- Data Access: Kibana, Grafana, Jupyter, Zeppelin, Other

Unified Data Sources (21/07/2016, ASDF meeting)
[Diagram labels]
- Data Sources: FTS, Rucio, XRootD, Jobs, …, Lemon, syslog, app log
- Feeds: DB, HTTP feed, AMQ, Logs, Lemon metrics
- Transport: Flume AMQ, Flume DB, Flume HTTP, Flume Log GW, Flume Metric GW, Flume Kafka sink
Notes
- Data is channeled via Flume, validated and modified if necessary
- Adding new Data Sources is documented and fairly simple
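As a rough illustration of the "validated and modified if necessary" step, the Python sketch below checks an incoming record for mandatory fields and normalises two of them before the record would be passed on to the transport. The field names (`producer`, `type`, `timestamp`), the seconds-versus-milliseconds convention and the function `validate_record` are all assumptions made for this example; the slides do not describe the actual schema or the Flume components that perform this step.

```python
# Hypothetical mandatory fields; the real schema is not given in the slides.
REQUIRED_FIELDS = ("producer", "type", "timestamp")


def validate_record(record):
    """Return a cleaned copy of the record, or None if it must be rejected."""
    # Validation: every mandatory field has to be present.
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    ts = record["timestamp"]
    # Validation: the timestamp has to be numeric (bool is excluded explicitly).
    if isinstance(ts, bool) or not isinstance(ts, (int, float)):
        return None
    cleaned = dict(record)
    # Modification (assumed convention): values below 1e11 are epoch seconds
    # and are converted to epoch milliseconds.
    cleaned["timestamp"] = int(ts * 1000) if ts < 1e11 else int(ts)
    # Modification: normalise the producer name to lower case.
    cleaned["producer"] = str(record["producer"]).lower()
    return cleaned
```

Rejected records return `None` here for brevity; a real pipeline might instead route them to a separate error channel.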

Unified Processing
[Diagram labels]
- Transport: Flume Kafka sink, Flume sinks
- Kafka cluster (buffering) *
- Processing (e.g. enrich FTS transfer metrics with WLCG topology from AGIS/GOCDB)
Notes
- Data now 100 GB/day, at scale 500 GB/day
- Current retention period 12 h, at scale 24 h
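The volume and retention figures imply a rough lower bound on how much data the Kafka buffer holds at any moment: the daily volume multiplied by the retention period expressed in days. The sketch below works this out under the assumption of a uniform ingest rate; it ignores replication, compression and index overhead, none of which are quantified in the slides, and `buffered_volume_gb` is a helper name introduced here.

```python
def buffered_volume_gb(daily_volume_gb, retention_hours):
    """Approximate data held in the buffer, assuming a uniform ingest rate."""
    return daily_volume_gb * retention_hours / 24.0


current = buffered_volume_gb(100, 12)   # 100 GB/day kept for 12 h
at_scale = buffered_volume_gb(500, 24)  # 500 GB/day kept for 24 h
print(f"current: {current:.0f} GB, at scale: {at_scale:.0f} GB")
```

Under these assumptions the buffer grows from about 50 GB today to about 500 GB at scale, a tenfold increase for a fivefold increase in daily volume.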

Data Processing
- Stream processing
  - Data enrichment: join information from several sources (e.g. WLCG topology)
  - Data aggregation
    - Over time (e.g. summary statistics for a time bin)
    - Over other dimensions (e.g. compute a cumulative metric for a set of machines hosting the same service)
  - Data correlation: advanced alarming that detects anomalies and failures by correlating data from multiple sources (e.g. data centre topology-aware alarms)
- Batch processing
  - Reprocessing, data compression, reports
- Technologies
  - Reliable and scalable job execution (Spark)
  - Job orchestration and scheduling (Marathon/Chronos)
  - Lightweight and isolated deployment (Docker)
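The enrichment and time-bin aggregation steps can be sketched in a few lines of plain Python. This is only an illustration of the two operations, not the Spark jobs the slides refer to: the record fields (`host`, `timestamp`, `value`), the `TOPOLOGY` lookup table and the host names are hypothetical stand-ins for real transfer metrics and AGIS/GOCDB topology data.

```python
from collections import defaultdict

# Hypothetical topology lookup (host -> site), standing in for AGIS/GOCDB data.
TOPOLOGY = {"fts1.example.org": "SITE_A", "fts2.example.org": "SITE_B"}


def enrich(record, topology):
    """Data enrichment: join a record with topology information."""
    enriched = dict(record)
    enriched["site"] = topology.get(record["host"], "UNKNOWN")
    return enriched


def aggregate(records, bin_seconds):
    """Data aggregation: count and mean value per (time bin, site)."""
    sums = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
    for rec in records:
        # Start of the time bin this record falls into.
        bin_start = rec["timestamp"] - rec["timestamp"] % bin_seconds
        acc = sums[(bin_start, rec["site"])]
        acc[0] += 1
        acc[1] += rec["value"]
    return {key: {"count": c, "mean": t / c} for key, (c, t) in sums.items()}


records = [
    {"host": "fts1.example.org", "timestamp": 0, "value": 10.0},
    {"host": "fts1.example.org", "timestamp": 30, "value": 20.0},
    {"host": "fts2.example.org", "timestamp": 70, "value": 5.0},
]
summary = aggregate([enrich(r, TOPOLOGY) for r in records], bin_seconds=60)
```

With 60-second bins, the two `SITE_A` records land in the bin starting at 0 and the `SITE_B` record in the bin starting at 60. Grouping by site rather than by host is one instance of aggregating "over other dimensions".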

Unified Access
[Diagram labels]
- Storage & Search: HDFS, Elastic Search, …, Others (fed by Flume sinks)
- Data Access: Scripts, CLI, API, Reports, Plots
Notes
- Multiple data access methods (dashboards, notebooks, CLI)
- Mainstream and evolving technology

Status and Plans

WLCG Monitoring
- Data Sources and Transport
  - Moving all data via new transport (Flume, AMQ, Kafka)
- Storage and Search
  - Data in ES and Hadoop
- Processing
  - Doing aggregation and processing via Spark
- Display and reports
  - Using only standard features of ES, Kibana, Spark, Hadoop
  - Introduce notebooks (e.g. Zeppelin) and data discovery
- General
  - Selecting technologies, learning on the job, looking for expertise
  - Evolve interfaces (dashboards for users, shifters, experts, managers)

WLCG Monitoring

Data Centres Monitoring (metrics)
- Replacement of the Lemon Agents
  - Mainstream technologies (e.g. collectd)
  - Support legacy sensors
  - Starting in 2016 Q4
- Update meter.cern.ch
  - Kibana and Grafana dashboards
  - Move to new central ES service
  - Move to the Unified Monitoring

Data Centres Monitoring (logs)
- Currently collecting syslog plus several application logs (e.g. EOS, Squid, HC, etc.)
- More requests coming for storing logs (e.g. Castor, Tapes, FTS)
- Update the Logs Service to the Unified Monitoring
  - For archive in HDFS
  - For processing in Spark
  - For visualization in ES and Kibana

Conclusions / Services Proposed
- Monitor, collect, visualize, process, aggregate, alarm
  - Metrics and Logs
- Infrastructure operations and scale
- Helping and supporting
  - Interfacing new data sources
  - Developing custom processing, aggregations, alarms
  - Building dashboards and reports

monit.cern.ch

Reference and Contact
- Dashboard Prototypes: monit.cern.ch
- Feedback/Requests (SNOW): cern.ch/monit-support
- Early-Stage Documentation: cern.ch/monitdocs
