Lemon Computer Monitoring at CERN
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
Presented by Harry Renshall, CERN-IT/FIO-FS

Hepix 9-13/05/2005, Karlsruhe (9/05/2005)

Outline
- Lemon Structure
- Deployment at CERN
- Use cases
- Alarms
- Web visualization
- Summary

Lemon: LHC Era Monitoring
Lemon is a software package containing tools for monitoring the status and performance of computers:
- Distributed monitoring system scalable to ~10k nodes
- Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
- Facilitates early error detection and problem prevention
- Provides persistent storage of the monitoring data
- Executes corrective actions and sends notifications
- Offers a framework for creating further monitoring sensors
- Most of the functionality is site-independent
It is used at CERN by:
- System administrators, service managers, people responsible for clusters
- Developers and service/data challenges
- Managers and general users
Link:

Lemon - schema
[Architecture diagram. Labels: Correlation Engines, Web browser, Lemon CLI, User, Monitoring Repository, TCP/UDP, SOAP, Repository backend, SQL, Nodes, Monitoring Agent, Sensor, RRDTool / PHP, apache, HTTP]

Components
MSA: Monitoring Sensor Agent
- Spawns multiple Monitoring Sensors (MS) to measure data at defined intervals and sends the data to the Monitoring Repository
MS: Monitoring Sensor
- Uses a standard C++ or perl API, so it is easy to write your own sensor
- Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security, alarms (260 metrics in total)
MR: Monitoring Repository
- Stores data in an Oracle database (the full history), backed up to tape in Castor
- Flat file version available as well (with most functionality preserved)
- We run two of them on two independent machines with two databases with failover (aiming for High Availability with Oracle Real Application Cluster)
LRF: Lemon RRD Framework
- Is used to cache the data in an easily accessible way (rrd files) for web graphics
- In connection with the Quattor Configuration DB, provides service and cluster overview
- RRD stands for Round Robin Database (time-aging data with predefined binning), developed by Tobias Oetiker at ETH Zurich
LAG: Lemon Alarm Gateway
- Generic gateway for alarms
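The agent/sensor split described on this slide can be illustrated with a minimal Python sketch. This is not the Lemon API (which the slide says is C++ and perl): the `Sensor` class, `poll_once`, and the metric names are hypothetical, and the sketch polls in a single process instead of spawning sensor processes. It only shows the idea of measuring at defined intervals and forwarding samples to a repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sensor:
    """Hypothetical sensor: a name, a sampling interval and a measure function."""
    name: str
    interval: float                 # seconds between measurements
    measure: Callable[[], float]
    next_due: float = 0.0           # time at which the next sample is due


def poll_once(sensors: List[Sensor], now: float,
              send: Callable[[str, float, float], None]) -> int:
    """Measure every sensor whose interval has elapsed; return how many fired."""
    fired = 0
    for sensor in sensors:
        if now >= sensor.next_due:
            send(sensor.name, now, sensor.measure())
            sensor.next_due = now + sensor.interval
            fired += 1
    return fired


# Stand-in for the Monitoring Repository: a list of (metric, timestamp, value).
repository: List[Tuple[str, float, float]] = []
sensors = [
    Sensor("load", 30.0, lambda: 0.42),        # made-up constant readings
    Sensor("uptime", 3600.0, lambda: 12345.0),
]
for now in (0.0, 30.0, 60.0):                  # simulated clock ticks
    poll_once(sensors, now, lambda n, ts, v: repository.append((n, ts, v)))
```

With the simulated clock above, the short-interval sensor is sampled at every tick and the hourly one only once.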

Lemon at CERN
- Lemon monitors about 2200 computers in ~100 clusters
- On average it collects about 70 metrics from each host
- Part of the ELFms tools
- Integrated with the Sure alarm system
- Collecting about 1.5 GB/day
- Integrated with CDB for configuration
- Leaf (LHC-Era Automated Fabric) for scheduling of interventions
[Diagram labels: Node Configuration Management, Node Management]
Configuration
- Derived from the Configuration Database (CDB)
- Individual configuration per cluster/host
- Hierarchical structure
- Monitoring state is derived from CDB
- Leaf tools allow scheduled downtimes, interventions, on-demand changes
Alarm system
- Sure: legacy system receiving alarms from Lemon
- Integration with the new LASER system (LHC alarm system) is ongoing
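To put the deployment figures in proportion, the following short Python calculation derives a few per-host and per-metric rates from the numbers on this slide. It assumes 1 GB = 10^9 bytes and that the load is spread evenly over hosts and metrics, neither of which the slide states.

```python
HOSTS = 2200                 # "about 2200 computers"
METRICS_PER_HOST = 70        # "about 70 metrics from each host"
BYTES_PER_DAY = 1.5e9        # "about 1.5 GB/day", assuming 1 GB = 10**9 bytes
SECONDS_PER_DAY = 86_400

metric_streams = HOSTS * METRICS_PER_HOST
bytes_per_host_per_day = BYTES_PER_DAY / HOSTS
bytes_per_stream_per_day = BYTES_PER_DAY / metric_streams
average_bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY

print(f"metric streams:        {metric_streams}")
print(f"per host per day:      {bytes_per_host_per_day / 1e3:.0f} kB")
print(f"per stream per day:    {bytes_per_stream_per_day / 1e3:.1f} kB")
print(f"average ingest rate:   {average_bytes_per_second / 1e3:.1f} kB/s")
```

Under these assumptions there are about 154,000 metric streams, each contributing on the order of 10 kB per day.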

Computer Center Overview
- Entry page displays a status overview of the key services
- Allows choosing an individual cluster, rack, host or other category

Use(ful) cases (I)
Kernel upgrade
- The kernel version is "measured" when the machine boots
- Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule the reboot of a machine based on this information
- The web interface allows monitoring of the progress
[Figure: Reboot occurrence history graph]
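The kernel-upgrade use case can be sketched roughly as follows. The host names, kernel version strings, and both helper functions are hypothetical, and rebooting in fixed-size batches is an assumption made for illustration; the slide does not say how the CERN tools order the reboots.

```python
from typing import Dict, List


def hosts_needing_reboot(measured: Dict[str, str], target: str) -> List[str]:
    """Return hosts whose last measured kernel differs from the target version."""
    return sorted(host for host, kernel in measured.items() if kernel != target)


def schedule_in_batches(hosts: List[str], batch_size: int) -> List[List[str]]:
    """Split hosts into reboot batches so that only part of a cluster is down at once."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Made-up monitoring data: kernel version recorded at each host's last boot.
measured_kernels = {
    "node01": "2.4.21-27",
    "node02": "2.4.21-20",
    "node03": "2.4.21-20",
    "node04": "2.4.21-27",
    "node05": "2.4.21-20",
}
pending = hosts_needing_reboot(measured_kernels, target="2.4.21-27")
batches = schedule_in_batches(pending, batch_size=2)
```

Re-running `hosts_needing_reboot` after each batch gives the kind of progress view that the web interface is said to provide.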

Use(ful) case (II)
Searching for a host
- High load, network usage, ...
- Metric distributions allow identification of hosts with problematic performance
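One way to turn a metric distribution into a list of suspicious hosts is a robust outlier test. The sketch below uses the median and the median absolute deviation (MAD); this particular rule, the threshold `k`, and the sample data are assumptions for illustration, since the slide does not say which criterion the Lemon web interface uses.

```python
import statistics
from typing import Dict, List


def flag_outliers(values: Dict[str, float], k: float = 3.0) -> List[str]:
    """Return hosts whose value lies more than k MADs away from the cluster median."""
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return []  # all hosts (nearly) identical: nothing stands out
    return sorted(h for h, v in values.items() if abs(v - median) > k * mad)


# Made-up load averages for one cluster.
loads = {"node01": 1.0, "node02": 1.2, "node03": 0.9, "node04": 1.1, "node05": 9.5}
suspects = flag_outliers(loads)
```

The median-based rule is used here because a single overloaded host barely shifts the median, whereas it would inflate a mean and standard deviation.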

Integration of Web interface
The web interface has been adapted, through various plug-ins, to accommodate additional information and links that help in managing the computer center.
Examples:
- Configuration database browser (browses external XML config files)
- ITCM (Remedy) ticket: external error tracking database
- CC tracker (synoptic view of the computer center): XML-defined geometry
- Alarm display
- Metric information display
- Raw data grapher (JPgraph)
External functionalities are customizable.

Computer Center display
- The Lemon Web Interface is interfaced with the Computer Center database of objects
- Provides search of objects as well as listing
- Interfaced through the XML-defined geometry of the computer center
- Generic design

Automatic recovery actions
Alarm Sensor
- For defined values of measured metrics, an actuator is called with a predefined action
- An example: ssh daemon dead, action /sbin/service sshd start
- Definition: metric X, field Y != reference value Z => call actuator
  - If success, log only
  - Else call the action up to max times
- Each occurrence is logged in the Monitoring Repository
- Already about 70 predefined alarms with automatic recovery actions
- After the first month of deployment, it had reduced the number of problem tickets by half
Correlation engine
- Allows wide definition of alarms and recovery actions (in development)
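The alarm definition on this slide can be read as the following control flow, shown here as a Python sketch. The `AlarmRule` class and `handle_sample` function are hypothetical, the recovery action is a stub instead of a real call to `/sbin/service sshd start`, and the interpretation (retry the action until it succeeds or `max_attempts` is reached, then raise the alarm) is an assumption about what "up to max times" means.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlarmRule:
    """Hypothetical rule: compare a metric field to a reference, recover on mismatch."""
    metric: str
    reference: str
    action: Callable[[], bool]   # returns True when the recovery action succeeded
    max_attempts: int = 3


def handle_sample(rule: AlarmRule, value: str, log: List[str]) -> str:
    """Apply the rule to one measured value and return 'ok', 'recovered' or 'alarm'."""
    if value == rule.reference:
        return "ok"
    for attempt in range(1, rule.max_attempts + 1):
        if rule.action():
            # Success: the occurrence is only logged, no alarm is raised.
            log.append(f"{rule.metric}: recovered after {attempt} attempt(s)")
            return "recovered"
    log.append(f"{rule.metric}: recovery failed after {rule.max_attempts} attempt(s)")
    return "alarm"


# Stubbed action that fails once and then succeeds, standing in for a service restart.
outcomes = iter([False, True])
rule = AlarmRule("sshd_status", "running", action=lambda: next(outcomes))
log: List[str] = []
result = handle_sample(rule, "dead", log)
```

Keeping the action behind a callable makes the rule testable without touching a real service.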

Remedy Ticket tracking
Error trending metric with values on the number of interventions/occurrences of problems
- Several categories created by: Hardware, Software
- Clustered by contract type/cluster
- Reporting problems, whether scheduled or not and whether the system was rebooted
- Allows tracking of interventions per type of problem
- Web interface to show the trend
[Figure: ITCM (Remedy) tickets occurrence]

Database (Oracle) Monitoring
- In cooperation with the ADC group at CERN, we have developed a sensor for measuring performance entities in an Oracle database:
  - Number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, ...
- Allows identification of bottlenecks and gives an overview of the stability of the system
- Works with both the 9i and 10g versions of Oracle
- Integration into services/RAC
- Configuration of the service integrated with Oracle Enterprise Repository

Service challenges, GRID VOs
Lemon allows:
- Virtual clusters
  - Clusters defined on request by service managers
  - Or defined by scripts, updated dynamically on demand
  - Or defined for a specific purpose
  - An example: Atlas DC04 challenge, network challenges, ...
- Clusters defined dynamically
  - An example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
Provides hooks in Lemon for defining any dynamic grouping of hosts
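A dynamically defined cluster of the kind described here amounts to regrouping hosts by some attribute of what they are currently running. The Python sketch below groups batch hosts by the Virtual Organization of their jobs; the function name, the `(host, vo)` record format, and the sample data are hypothetical and do not reflect the actual Lemon hooks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_hosts_by_vo(jobs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build one virtual cluster per VO from (host, vo) records of running jobs."""
    clusters = defaultdict(set)
    for host, vo in jobs:
        clusters[vo].add(host)       # a host running several jobs is listed once
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Made-up snapshot of jobs currently running on a batch cluster.
running_jobs = [
    ("batch001", "atlas"),
    ("batch002", "cms"),
    ("batch001", "atlas"),
    ("batch003", "atlas"),
    ("batch002", "lhcb"),
]
virtual_clusters = group_hosts_by_vo(running_jobs)
```

Because the grouping is recomputed from the current job list, a host can belong to several virtual clusters at once and its membership changes as jobs start and finish.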

Summary
- Lemon provides monitoring information about the computers in the Computer Center at CERN
- Thanks to its integration with Sure (alarm system), it allows fast and easy identification and repair of problems. We will convert to a new accelerator alarm system this year (LASER). Lemon provides LAG (Lemon Alarm Gateway) to feed alarms into arbitrary alarm systems.
- In connection with CDB, it allows an easier overview of services and visualisation of their performance
- In connection with Remedy (ITCM, problem tracking), it allows an overview of the problems for a given service
- It has been a useful tool for general monitoring of performance and also for system administrators in debugging problems
- Lemon is also used and developed elsewhere: the BARC institute in India, the Accelerator department at CERN, and CMS, which is adopting it for its online farm monitoring, ...
- Lemon is used for GridICE and can provide data to MonALISA