Computing Facilities, CERN IT Department, CH-1211 Geneva 23, Switzerland (www.cern.ch/it). Lemon for Quattor. I. Fedorko, CERN CF/IT, 16 March 2011.

Presentation transcript:


Overview
Lemon Monitoring System
Lemon in use
 – focus on CERN Computer Centre
Current activities:
 – Federation
 – Monitoring of Virtualization
 – New Lemon Web
Future perspective
Monitoring Questionnaire

Lemon Monitoring
Agent-based monitoring:
 – the agent collects data on the node
 – data is stored in the central repository
 – data visualization
[Architecture diagram, three areas]
Node Monitoring: Sensor, Monitoring Agent, Local Cache
Measurement Repository: Repository Backend (flat-file / Oracle database), Application Server
User Interfaces: Lemon CLI (command-line tool to access data), Lemon-host-check (command-line tool for node exceptions), Web Browser (RRD tool / Python, Apache / PHP)
Connection labels: TCP/UDP, SQL, HTTP

Lemon agent & sensor
[Diagram: the Monitoring Agent connected to Sensors; each Sensor provides Metric Classes, and each Metric Class has Class Instances]
MSA (Monitoring Sensor Agent):
 – forks sensors and communicates with them using a custom protocol over bi-directional pipes
 – checks on the status of sensors
 – caches data locally (e.g. /var/spool/lemon-agent/)
Sensor:
 – a process or script connected to the lemon-agent via a bi-directional pipe
 – collects information on behalf of the agent
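The fork-and-pipe arrangement described on this slide can be sketched roughly as follows. This is only an illustration, not Lemon code: the one-line "request a metric, get a SAMPLE line back" exchange, the `SENSOR_SOURCE` stand-in sensor and the `poll_sensor` helper are all invented here, since the slide says only that the real protocol is custom.

```python
import subprocess
import sys

# Hypothetical stand-in for a sensor: it reads one metric id per line on
# stdin and answers with one sample line on stdout. The "SAMPLE <id> <value>"
# wording is made up for this sketch; it is not the Lemon protocol.
SENSOR_SOURCE = r'''
import sys
for line in sys.stdin:
    metric = line.strip()
    if metric == "QUIT":
        break
    print(f"SAMPLE {metric} 42", flush=True)
'''


def poll_sensor(metric_ids):
    """Fork a sensor process and request one sample per metric over pipes."""
    sensor = subprocess.Popen(
        [sys.executable, "-c", SENSOR_SOURCE],
        stdin=subprocess.PIPE,    # agent -> sensor direction
        stdout=subprocess.PIPE,   # sensor -> agent direction
        text=True,
    )
    samples = []
    try:
        for metric in metric_ids:
            sensor.stdin.write(f"{metric}\n")
            sensor.stdin.flush()
            samples.append(sensor.stdout.readline().strip())
        sensor.stdin.write("QUIT\n")
        sensor.stdin.flush()
    finally:
        sensor.stdin.close()
        sensor.stdout.close()
        sensor.wait()
    return samples
```

Calling `poll_sensor(["9104", "4109"])` starts one child process, sends two requests down its stdin pipe and collects the two replies from its stdout pipe, which is the shape of the agent/sensor relationship the slide describes.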

Lemon sensors
Example of the linux sensor:
    /* replace spaces in the CPU model string with underscores;
       i <= strlen(model) also copies the terminating NUL */
    for (i = 0; i <= strlen(model); i++) {
        r_model[i] = (model[i] == ' ') ? '_' : model[i];
    }
    bogomips = bogomips * cpus;
    /* store sample */
    snprintf(value, sizeof(value), "%s %.0f %d", vendor, clockspeed, logical);
    storeSample01(value);
Lemon API: Perl, C++; a wrapper for Python

Exception sensor
 – an officially supported Lemon sensor coded in C++
Full documentation at: –
[Diagram: sensors on the node write Metric aa1, Metric bb2 and Metric cc3 to the local cache; the exception sensor evaluates a correlation such as if (aa1 > 0 && bb2 = 'string' || cc3 != false), with outputs labelled Metric ee1 and Actuator: command/script]

Exception Sensor
Objective:
 – to run a corrective action when the occupancy of the /tmp partition is greater than 80%
Involved metrics:
 – ID 9104 (system.partitionInfo)
 – field 1 = mountname, field 5 = percentage occupancy
Correlation:
    ((9104:1 eq '/tmp') && (9104:5 > 80))
Actuator:
    /usr/local/sbin/clean-tmp-partition -o 75
Querying the metric:
    lemon-cli -m 9104
    [INFO] lemon-cli version started by ifedorko at Thu Dec 16 09:06: on lxadm11.cern.ch
    [VERB] Nodes: lxadm11
    [VERB] Metrics: 9104
    [VERB] Method: Latest
    [VERB]
    [INFO] local: lxadm Thu Dec 16 09:02: / ext3 rw
    [INFO] local: lxadm Thu Dec 16 09:02: /boot ext3 rw
    [INFO] local: lxadm Thu Dec 16 09:02: /tmp ext3 rw
    [INFO] local: lxadm Thu Dec 16 09:02: /usr/vice/cache ext3 rw
    [INFO] local: lxadm Thu Dec 16 09:02: /var ext3 rw
    [VERB]
    [VERB] Total: 5 results
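To make the correlation on this slide concrete, here is a hand-written Python equivalent of `((9104:1 eq '/tmp') && (9104:5 > 80))`. It is a sketch under assumptions: a sample of metric 9104 is modelled as a plain list of fields, only fields 1 and 5 are known from the slide (the others are left as `None`), and the occupancy numbers in the usage example are made up. The real exception sensor parses the correlation string itself; nothing here reflects its implementation.

```python
def field(sample, n):
    """Return field n of a sample; the slide numbers fields from 1."""
    return sample[n - 1]


def tmp_full(sample, mount="/tmp", limit=80):
    """Hand-written equivalent of ((9104:1 eq '/tmp') && (9104:5 > 80))."""
    return field(sample, 1) == mount and float(field(sample, 5)) > limit


def exceptions(samples):
    """Return the samples of metric 9104 that would raise the exception."""
    return [sample for sample in samples if tmp_full(sample)]
```

For example, with invented samples `[["/", None, None, None, 95], ["/tmp", None, None, None, 91], ["/var", None, None, None, 40]]`, only the `/tmp` entry matches: `/` is fuller than 80% but fails the mount-name test, which is why both halves of the correlation are needed.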

Lemon instances
 – CNAF, Morgan Stanley (Oracle backend)
 – CERN: Alice DAQ, LHC control (flat-file backend); Computer Center (Oracle backend)
    – Lemon Alarm System
    – Service Level status
Missing MySQL? Current development follows a DB-flavor-independent policy → implementing the functionality is up to the external user

CC
~11k monitored entities (~8k nodes)
 – performance monitoring: CPU load, partitions information
 – application monitoring: file, log parsing
 – power, temperature
 – remote (ping, http, snmp)
5 core sensors covering ~60% of performance and application monitoring
~30 misc sensors
 – hw_scan, snmp, castor
>5000 nodes with metrics
~1.7 M monitored 'services' across CERN CC
Oracle backend
 – throughput 300 GB/month
[Plot: # of sensors vs. # of nodes]

Quattor managed Lemon
[Diagram: Administrator and User edit templates in CDB; configuration reaches the node via ncm-fmonagent]
Node template:
    add castor-sensor rpm
    include castor-sensor.tpl
    include castor-metrics.tpl
Lemon system template:
    include xyz-sensor.tpl ...
    include xyz-metrics.tpl ...
CDB: castor-sensor.tpl, castor-metrics.tpl
Node: ncm-fmonagent
Server:
 – meta-data to check data before commit
 – admin of DB tables
Sindes:
 – confidential config files
 – node certificate

[Diagram labels: Exception, Metrics, Event Management, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL), Operator, Administrator]

Service Level
[Diagram labels: SLS-web, USER, Scripts, XML, Service DB, RRD, test/probes, Lemon DB, SLS XML]
Used by RAL and MS

Current activities
 – Federation of instances
 – Monitoring
 – New Lemon-web

Current activities: Federation
Developed with MS
 – search over all instances
 – federated cluster from all measurement repositories
[Diagram: a Federated Lemon-web above two Measurement Repositories, each with its own Lemon-web (search over entities, grouping entities, rrd file/entity)]

Current activities: Virtualization
Quattor-managed nodes are authenticated in Lemon by Sindes certificates
 – Lemon fetches certificates from Sindes
Virtualized batch nodes are generated from a golden node
 → the golden node is Quattor managed
 → the 'golden' certificate is shared by the generated nodes
 – addressed by the lemon admin tool → no new code needed
Monitoring of hypervisors
 – addressed by customized sensors
More details: Don't hesitate to contact

Current activities: New Lemon Web
[Screenshot annotations]
 – list of all metrics on host and latest values stored in central DB
 – metrics raw data
 – graphing
 – CDB templates
 – host exceptions & alarms
 – alarms RSS

Current activities: New Lemon Web
It has two parts:
 – Lemonmrd
 – Lemon-web
Retrieve, Store, Display information
[Diagram labels: Lemon-web, lemonmrd, Configuration, RRD files: file (with all metrics) / node, DB: table / metric, CDB]

Current activities: New Lemon Web
Lemon-mrd
 – enhanced configuration
 – new math operations (-, *, /) for dynamic cluster data aggregation; the current version supports only sum and average
 – data aggregation from multiple DB backends
Lemon-web
 – flexible menu structure
 – personalized views
 – URLs pointing at graphs, which can be embedded in any page
 – connecting to multiple DB sources
Demo available on demand
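As a rough illustration of the cluster data aggregation mentioned for Lemon-mrd, the sketch below computes the two operations the slide names for the current version, sum and average, over the latest value of one metric per node. The `aggregate` function, its dictionary input and the node names are assumptions made for this example; the slides do not show how lemonmrd represents or combines the data.

```python
from statistics import fmean


def aggregate(values_by_node, operation):
    """Combine one metric value per node into a single cluster-level value.

    values_by_node maps a node name to its latest value (hypothetical layout).
    operation is "sum" or "average", the two operations named on the slide.
    """
    values = list(values_by_node.values())
    if not values:
        raise ValueError("no node reported a value")
    if operation == "sum":
        return sum(values)
    if operation == "average":
        return fmean(values)
    raise ValueError(f"unsupported operation: {operation}")
```

With made-up CPU load values `{"node1": 0.5, "node2": 1.5, "node3": 1.0}`, `aggregate(..., "sum")` gives the cluster total and `aggregate(..., "average")` the per-node mean; the planned -, * and / operations would be further branches of the same kind.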

Current activities: New Lemon Web
Graph reliability: how many of the expected entities are actually reporting?
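One plausible reading of graph reliability is the fraction of expected entities that actually contributed data to a graph. The sketch below assumes that reading; the slide does not define the measure, and `graph_reliability` and the node names are hypothetical.

```python
def graph_reliability(expected, reported):
    """Fraction of expected entities that reported (assumed definition)."""
    expected = set(expected)
    if not expected:
        raise ValueError("no entities expected")
    # Entities that reported but were not expected are ignored.
    return len(expected & set(reported)) / len(expected)
```

For a cluster expected to contain `a`, `b`, `c` and `d` where only `a` and `c` reported (plus an unexpected `x`), the result is 0.5, meaning an aggregated graph for that cluster rests on half of its members.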

Near future
Consolidation of monitoring in the CERN IT Department
 → grid, Windows, computer centre, experiment support, etc.
Lemon enhancements under consideration:
 – new Lemon DB schema
 – Lemon repository data export
    – export to data files
 – service monitoring with alarming and classification
 – remote instances
 – interfacing with Nagios → any experience to share is welcome
 – interfacing with Windows monitoring (both sides)

Quattor
3 answers:
 – instance size: nodes
 – preferred solution: Ganglia + Nagios
 – effort 0.2 to 0.5 FTE
    – additional effort for sensor/probe enhancement
 – alarming by Nagios
 – up to 15 GB of data (per year?)
 – no history beyond rrd
 – Lemon used for RAID controllers (Belgium grid)

Backup
Backup slides from here on

Exception sensor
 – an officially supported Lemon sensor coded in C++
Main characteristics
 – accesses the agent's local cache
 – the correlation engine evaluates one or more metric values to detect a problem on a node
 – supports reporting on behalf of other monitored entities
 – allows corrective actions → actuators
 – output is an exception
Full documentation at: –

Services in Lemon
[Diagram: Host 1 with Application A and HW scan; Host 2 with Application B and HW scan. Metrics on Host 1: CPU load, partitions occupancy, is app running, log parsing X, log parsing Y, SMART, IPMI. Metrics on Host 2: CPU load, partitions occupancy, is app running, log parsing, SMART, IPMI. Exceptions: Except. 1 to Except. 5. Labels: Current LAS view, Service alarms, Sys-admin alarms]
 – Resource utilization by service: Castor load by experiment
 – Run-time data aggregation: CPU load over cluster

Services in Lemon
Number of registered metrics during last year
[Diagram: Host 1 with Application A and HW scan; Host 2 with Application B and HW scan. Metrics on Host 1: CPU load, partitions occupancy, is app running, log parsing X, log parsing Y, SMART, IPMI. Metrics on Host 2: CPU load, partitions occupancy, is app running, log parsing, SMART, IPMI. Exceptions: Except. 1 to Except. 5. Labels: Current LAS view, Service alarms, Sys-admin alarms]

Available sensors
Supported Lemon Sensors
 – MSA: checks the health of the Lemon sensor agent (built in)
 – Linux: provides standard performance monitoring of the system
 – file: provides various file-related utilities
 – parse-log: provides log parsing metrics
 – exception: exception handling with support for correlation between metrics
Misc Lemon Sensors
 – remote (ping, http), SURE, DIP, PLC
 – Hwprobe (memory error)
 – Hwscan (scans hardware and compares it with the CDB configuration)
 – ipmi (measured locally)
 – Snmp-get (no trap support, because asynchronous alarms are not supported)

Exception sensor
 – an officially supported Lemon sensor coded in C++
 – developed in collaboration between CERN and BARC
 – implements the Lemon alarm protocol
Main characteristics
 – accesses the agent's local cache
 – the correlation engine evaluates one or more metric values to detect a problem on a node
 – supports reporting on behalf of other monitored entities
 – allows corrective actions → actuators
 – output is an exception
Full documentation at: –

Actuator
    "/system/monitoring/exception/_30010" = nlist(
        "name", "tmp_full",
        "descr", "tmp utilization exceeds limit",
        "active", true,
        "latestonly", false,
        "importance", 2,
        "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
        "actuator", nlist("execve", "/usr/local/sbin/clean-tmp-partition -o 75",
                          "maxruns", 3,
                          "timeout", 300,
                          "window", 900,
                          "active", true)
    );
Information:
 – Actuators run as forked processes.
 – They are connected to the sensor via a pipe.
 – All information written to stdout or stderr by the actuator is caught and recorded in the agent's log file.
Running shell-style actuators:
 – The system call used to run the actuator doesn't provide shell-style conveniences.
 – To use shell-style syntax like *, &&, | etc. you must define your actuator like this:
    /bin/sh -c \\"/bin/echo 'Died lemonmrd daemon $act_value_01 ' | /bin/mail -s 'Lemon RRD Daemon problem'
The actual values of the correlation objects are accessible to the actuator:
    /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
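The reason shell-style actuators need the `/bin/sh -c` wrapper can be shown with a small sketch. It uses Python's standard `subprocess` module on a POSIX system rather than the sensor's own execve-based code, so it demonstrates only the general behaviour of exec-style versus shell-style invocation; `run_exec_style` and `run_shell_style` are names invented for this example.

```python
import subprocess


def run_exec_style(argv):
    """Run argv directly, without a shell, and return what it printed.

    Characters such as | or * are passed to the program as plain arguments.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def run_shell_style(command_line):
    """Hand the whole command line to /bin/sh -c so |, && and * are interpreted."""
    return run_exec_style(["/bin/sh", "-c", command_line])
```

Running `run_exec_style(["/bin/echo", "tmp", "|", "tr", "a-z", "A-Z"])` simply echoes the pipe character and everything after it, whereas `run_shell_style("echo tmp | tr a-z A-Z")` builds a real pipeline. Capturing stdout here loosely mirrors the slide's remark that actuator output is caught and written to the agent's log.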

Correlation
1. Use basic objects for correlation: :
2. Combine exceptions:
    10004:1 > 600 && (10004:7 > 10 || (10004:8 > && 4109:3 eq 'i386') || (10004:8 > && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
3. You can collect information on behalf of other entities and define exceptions on their behalf: [entity_name]]: :
    e.g. (*:9501:5 != 200) && (*:9501:5 != 301)
4. Join the metrics:
    (9200:1 == 9208:1)
5. Use a configuration value in the correlation:
    "correlation", "4109:2 ne 'symlink('/system/kernel/version')'"
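The on-behalf example `(*:9501:5 != 200) && (*:9501:5 != 301)` can be read as "flag every entity whose field 5 of metric 9501 is neither 200 nor 301". The sketch below implements that reading in plain Python. Several things are assumptions: that `*` means the test is applied to each entity in turn, that field 5 holds something like an HTTP status code, and the entity names and field values in the usage example, which are invented. `on_behalf_exceptions` is a hypothetical helper, not part of Lemon.

```python
def on_behalf_exceptions(latest_9501):
    """Return the entities that fail (*:9501:5 != 200) && (*:9501:5 != 301).

    latest_9501 maps an entity name to the fields of its latest sample of
    metric 9501 (hypothetical layout; fields are numbered from 1 on the slide).
    """
    flagged = []
    for entity, fields in sorted(latest_9501.items()):
        status = int(fields[5 - 1])  # field 5
        if status != 200 and status != 301:
            flagged.append(entity)
    return flagged
```

Given samples for three invented entities whose field 5 is 200, 503 and 301, only the entity reporting 503 is returned, so a single node running the remote sensor could raise exceptions for all the entities it probes.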