IT Monitoring Service Status and Progress Alberto AIMAR, IT-CM-MM
Outline Monitoring Data Centres Experiments Dashboards Architecture and Technologies Status and Plans
Monitoring
Data Centre Monitoring
- Monitoring of the Data Centres at CERN and Wigner
- Hardware, operating system, and services
- Data Centre equipment (PDUs, temperature sensors, etc.)
- Used by service providers in IT and by the experiments
Experiment Dashboards
- Site availability, data transfers, job information, reports
- Used by WLCG, experiments, sites and users
Both hosted by CERN IT, in different teams
Mandate
Focus for 2016
- Regroup monitoring activities hosted by CERN IT (Data Centres, Experiment Dashboards, ETF, HammerCloud, etc.)
- Continue existing services
- Align with CERN IT practices: management of services, communication, tools (e.g. GGUS and SNOW tickets)
Starting with
- Merging the Data Centres and Experiment Dashboards monitoring technologies
- Reviewing existing monitoring usage and needs (DC, WLCG, etc.)
- Investigating new technologies
- Support unchanged while collecting feedback and working
Data Centres Monitoring
Experiment Dashboards
[Diagram: Experiment Dashboard applications and their users]
- Covers the full range of the experiments' computing activities: job monitoring (analysis and production, real-time and accounting views), data management monitoring (data transfer, data access), infrastructure monitoring (Site Status Board, SAM3), outreach (Google Earth Dashboard)
- Provides information to different categories of users: operation teams, sites, general public, users
- 300-500 users per day
Experiment Dashboards
- Job monitoring, site availability, data management and transfers
- Used by experiment operation teams, sites, users, WLCG
WLCG Transfer Dashboard
ATLAS Distributed Computing
Higgs Seminar
Architecture and Technologies
Unified Monitoring Architecture
[Diagram: single pipeline for Data Centres and WLCG monitoring, from Data Sources through Transport (Kafka) and Processing to Storage/Search and Data Access]
Current Monitoring
[Diagram: current monitoring data flows, with a separate pipeline per activity]
- Data Sources: Data Centres monitoring (Metrics Manager, Lemon Agent, XSLS), data management and transfers (ATLAS Rucio, FTS servers, DPM servers, XROOTD servers), job monitoring (CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys), infrastructure monitoring (Nagios, GNI, VOFeed, OIM, GOCDB, REBUS)
- Transport: Flume, AMQ, Kafka, GLED, HTTP/SQL/MonALISA collectors, HTTP GET/PUT, ES queries
- Processing & Aggregation: Spark, Hadoop jobs, Oracle PL/SQL, ESPER, Experiment Dashboard components (Real Time, Accounting, API, SSB, SAM3)
- Storage & Search: HDFS, ElasticSearch, Oracle
- Display & Access: Kibana, Jupyter, Zeppelin, Dashboards (ED)
Unified Monitoring
[Diagram: the same data sources feeding a single unified pipeline]
- Data Sources: unchanged (Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS servers, DPM servers, XROOTD servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, GNI, VOFeed, OIM, GOCDB, REBUS, other)
- Transport: Flume, AMQ, Kafka
- Processing & Aggregation: Spark, Hadoop jobs, other
- Storage & Search: Hadoop HDFS, ElasticSearch, other
- Data Access: Kibana, Grafana, Jupyter, Zeppelin
Unified Data Sources
[Diagram: Flume gateways (DB, HTTP, Log, Metric) collecting from FTS, Rucio, XRootD, job feeds, Lemon metrics, syslog and application logs, and forwarding through a Flume Kafka sink]
- Decouples producers/consumers and jobs/stores
- Data is channelled via Flume, validated and modified if necessary
- Adding new data sources is documented and fairly simple
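As an illustration of how a new data source might inject events into such a Flume-based pipeline, here is a minimal sketch of a producer posting a JSON event to a Flume HTTP source with the default JSON handler. The gateway endpoint, producer name and metric fields are hypothetical, not taken from the presentation.

```python
# Minimal sketch: publish a monitoring event to a Flume HTTP source
# (default JSONHandler). Endpoint and field names are assumptions.
import json
import time
import requests

FLUME_HTTP_ENDPOINT = "http://flume-gw.example.ch:5140"  # hypothetical gateway

def send_metric(producer, payload):
    # Flume's JSONHandler expects a JSON array of events,
    # each with string-valued "headers" and a string "body".
    event = {
        "headers": {"producer": producer, "timestamp": str(int(time.time() * 1000))},
        "body": json.dumps(payload),
    }
    resp = requests.post(FLUME_HTTP_ENDPOINT, json=[event], timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    send_metric("fts", {"src_site": "CERN-PROD", "dst_site": "WIGNER", "bytes": 123456})
```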
Unified Transport
[Diagram: Flume Kafka sinks feeding a Kafka cluster used for buffering, with processing jobs and Flume sinks consuming from it; processing enriches the data (e.g. FTS transfer metrics with the WLCG topology from AGIS/GOCDB)]
- Decouples producers/consumers and jobs/stores
- Data volume: 100 GB/day now, 500 GB/day at scale
- Retention period: 12 h now, 24 h at scale
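To make the role of the Kafka buffer concrete, below is a minimal sketch of a processing job consuming buffered monitoring events, using the kafka-python client. The broker addresses and the topic name ("monit.fts.raw") are hypothetical.

```python
# Minimal sketch: consume buffered monitoring events from Kafka.
# Brokers and topic name are assumptions, not the actual MONIT setup.
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "monit.fts.raw",                               # hypothetical topic
    bootstrap_servers=["kafka01.example.ch:9092"],
    group_id="fts-enrichment",                     # consumer group for this job
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # ... enrich/aggregate the event here, then write to HDFS/ElasticSearch
    print(event.get("src_site"), event.get("bytes"))
```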
Data Processing
Stream processing
- Data enrichment: join information from several sources (e.g. WLCG topology)
- Data aggregation: over time (e.g. summary statistics for a time bin) and over other dimensions (e.g. a cumulative metric for a set of machines hosting the same service)
- Data correlation and advanced alarming: detect anomalies and failures by correlating data from multiple sources (e.g. data centre topology-aware alarms)
Batch processing
- Reprocessing, data compression, reports
Technologies
- Reliable and scalable job execution (Spark)
- Job orchestration and scheduling (Marathon/Chronos)
- Lightweight and isolated deployment (Docker)
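The following PySpark sketch illustrates the enrichment and aggregation steps described above: joining FTS transfer events with a WLCG topology table, then computing summary statistics per 10-minute time bin. The HDFS paths, column names and schema are hypothetical and are not taken from the actual MONIT jobs.

```python
# Minimal PySpark sketch of enrichment (join with topology) and
# time-binned aggregation; all paths and field names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fts-enrichment-sketch").getOrCreate()

transfers = (
    spark.read.json("hdfs:///monit/fts/raw/2016/07/21/")            # assumed path
    .withColumn("timestamp", F.col("timestamp").cast("timestamp"))
)
topology = spark.read.json("hdfs:///monit/topology/wlcg.json")      # assumed path

# Enrichment: attach the WLCG site information of the destination endpoint.
enriched = transfers.join(topology, transfers.dst_site == topology.site, "left")

# Aggregation: bytes transferred and failure counts per site and 10-minute bin.
summary = (
    enriched
    .groupBy(F.window("timestamp", "10 minutes").alias("time_bin"), "dst_site")
    .agg(
        F.sum("bytes").alias("bytes_transferred"),
        F.count(F.when(F.col("state") == "FAILED", True)).alias("failures"),
    )
)

summary.write.mode("overwrite").parquet("hdfs:///monit/fts/summary/")
```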
Unified Access
[Diagram: Flume sinks writing to HDFS, ElasticSearch and other stores, accessed for plots, reports and scripts]
- Decouples producers/consumers and jobs/stores
- Multiple data access methods (dashboards, notebooks, CLI, API)
- Mainstream and evolving technology
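As an example of the scripted access path, here is a minimal sketch of querying the ElasticSearch store directly over its REST API: failures per destination site over the last 24 hours. The endpoint, index pattern and field names are hypothetical.

```python
# Minimal sketch: aggregate failures per site from ElasticSearch via REST.
# Endpoint, index and fields are assumptions, not the actual MONIT indices.
import requests

ES_URL = "http://es-monit.example.ch:9200"   # hypothetical ElasticSearch endpoint
INDEX = "monit-fts-summary-*"                # hypothetical index pattern

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"state": "FAILED"}},
                {"range": {"timestamp": {"gte": "now-24h"}}},
            ]
        }
    },
    "aggs": {"by_site": {"terms": {"field": "dst_site", "size": 20}}},
}

resp = requests.post(ES_URL + "/" + INDEX + "/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_site"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```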
Status and Plans
WLCG Monitoring
Data Sources and Transport
- Moving all data via the new transport (Flume, AMQ, Kafka)
Storage and Search
- Data in ElasticSearch and Hadoop
Processing
- Aggregation and processing via Spark
Display and reports
- Using only standard features of ES, Kibana, Spark, Hadoop
- Introduce notebooks (e.g. Zeppelin) and data discovery
General
- Selecting technologies, learning on the job, looking for expertise
- Evolve interfaces (dashboards for users, shifters, experts, managers)
WLCG Monitoring
Data Centres Monitoring (metrics)
Replacement of the Lemon agents
- Mainstream technologies (e.g. collectd)
- Support legacy sensors
- Starting in 2016Q4
Update meter.cern.ch
- Kibana and Grafana dashboards
- Move to the new central ES service
- Move to the Unified Monitoring
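To illustrate how a legacy sensor could be kept alive under a collectd-based agent, here is a minimal sketch using collectd's Python plugin interface. The sensor file path and the plugin/type-instance names are hypothetical; a real deployment would map them onto the existing Lemon sensors.

```python
# Minimal sketch: wrap a legacy sensor as a collectd read plugin
# (collectd Python plugin). Sensor path and names are assumptions.
import collectd

SENSOR_FILE = "/var/run/legacy-sensor/temperature"  # hypothetical sensor output

def read_sensor():
    """Read the legacy sensor value and dispatch it as a collectd gauge."""
    try:
        with open(SENSOR_FILE) as f:
            temperature = float(f.read().strip())
    except (IOError, ValueError):
        collectd.warning("legacy_sensor: could not read %s" % SENSOR_FILE)
        return
    values = collectd.Values(
        plugin="legacy_sensor",
        type="temperature",        # standard collectd type from types.db
        type_instance="rack_A1",   # hypothetical instance name
    )
    values.dispatch(values=[temperature])

collectd.register_read(read_sensor)
```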
Data Centres Monitoring (logs)
- Currently collecting syslog plus several application logs (e.g. EOS, Squid, HC)
- More requests coming for storing logs (e.g. Castor, Tapes, FTS)
Update the Logs Service to the Unified Monitoring
- Archiving in HDFS
- Processing in Spark
- Visualization in ES and Kibana
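A minimal sketch of the Spark processing step for archived logs follows: read syslog lines from HDFS and count messages per host and severity. The HDFS path and the extraction regexes are hypothetical placeholders.

```python
# Minimal sketch: batch-process archived syslog from HDFS with Spark.
# Path and regexes are crude assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("syslog-report-sketch").getOrCreate()

logs = spark.read.text("hdfs:///monit/logs/syslog/2016/07/21/")  # assumed path

# Crude extraction of host and severity from each raw line.
parsed = logs.select(
    F.regexp_extract("value", r"^\S+\s+\d+\s+\S+\s+(\S+)", 1).alias("host"),
    F.regexp_extract("value", r"\b(error|warning|info)\b", 1).alias("severity"),
)

report = parsed.groupBy("host", "severity").count().orderBy(F.desc("count"))
report.show(20, truncate=False)
```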
Participating in ES Benchmarking
- The ES service is separate from the Monitoring service
- Looking to use "standard" CERN solutions (VMs, remote storage, etc.)
- Data flow started a few days ago, will take some weeks
Identical ES settings and version
- 3 master nodes, m2_medium sized VMs
- 1 search node, m2_large size
- 3 data nodes, with the same amount of data
bench01 data nodes
- 3 full-node m2 VMs (each VM takes a full hypervisor)
- 600 GB io2 volumes attached as storage backend (Ceph)
- Loopback device in /var as fast storage (SSD)
bench02 data nodes
- 3 full-node VMs running on bcache-enabled hypervisors
Conclusions / Services Proposed
Monitor, collect, visualize, process, aggregate, alarm
- Metrics and logs
- Infrastructure operations and scale
Helping and supporting
- Interfacing new data sources
- Developing custom processing, aggregations, alarms
- Building dashboards and reports
monit.cern.ch
Reference and Contact
- Dashboard prototypes: monit.cern.ch
- Feedback/requests (SNOW): cern.ch/monit-support
- Early-stage documentation: cern.ch/monitdocs
Examples
Data Centre Overview in Grafana (Ildar Nurgaliev, CERN openlab, 11/08/2016)
FTS Monitoring in Grafana: Transfer Sites Dashboard (Ildar Nurgaliev, CERN openlab, 11/08/2016)
Sample Plots - Zeppelin (Ildar Nurgaliev, CERN openlab, 11/08/2016)
- PyPlot for report-quality plots
- Interactive built-in plots for discovery
[Example plots: 0-access bins, old and new datasets, most accessed datasets]
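For context, a minimal sketch of how such a report-quality plot might be produced with matplotlib (PyPlot) inside a Zeppelin paragraph; the dataset names and access counts are invented for illustration.

```python
# Minimal sketch: a report-quality bar plot of dataset accesses.
# All data values and names below are invented for illustration.
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for notebook/report output
import matplotlib.pyplot as plt

datasets = ["data15_13TeV.A", "data16_13TeV.B", "mc15_13TeV.C", "user.D"]  # hypothetical
accesses = [1240, 980, 310, 45]                                            # hypothetical

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(datasets, accesses, color="steelblue")
ax.set_xlabel("Number of accesses (last 30 days)")
ax.set_title("Most accessed datasets")
fig.tight_layout()
fig.savefig("most_accessed_datasets.png", dpi=150)
```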
Zeppelin Notebooks: developer's view and user's view (Ildar Nurgaliev, CERN openlab, 11/08/2016)