Published by Colleen Reeves. Modified over 8 years ago.
1
Lemon Computer Monitoring at CERN
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
Presented by Harry Renshall
CERN-IT/FIO-FS
2
9/05/2005 Hepix 9-13/05/2005 Karlsruhe
Outline
Lemon
Structure
Deployment at CERN
Use cases
Alarms
Web visualization
Summary
3
Lemon – LHC Era Monitoring
Lemon is a software package containing tools for monitoring the status and performance of computers:
–Distributed monitoring system scalable to ~10k nodes
–Provides active monitoring of software and hardware on centrally managed clusters in the Computer Center
–Facilitates early error detection and problem prevention
–Provides persistent storage of the monitoring data
–Executes corrective actions and sends notifications
–Offers a framework for creating further monitoring sensors
–Most of the functionality is site independent
It is used at CERN by:
–System administrators, service managers, people responsible for clusters
–Developers and service/data challenges
–Managers and general users
Link: http://cern.ch/lemon
4
Lemon - schema
[Diagram. Labels: Nodes (Sensor, Monitoring Agent); TCP/UDP; Monitoring Repository; Repository backend (SQL); SOAP; Correlation Engines; Lemon CLI; RRDTool / PHP; apache; HTTP; Web browser; User]
5
Components
MSA – Monitoring Sensor Agent
–Spawns multiple Monitoring Sensors (MS) that measure data at defined intervals, and sends the data to the Monitoring Repository
MS – Monitoring Sensor
–Uses a standard C++ or Perl API, so it is easy to write your own sensor
–Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security, and alarms (260 metrics in total)
MR – Monitoring Repository
–Stores the full history in an Oracle database, backed up to tape in Castor
–A flat-file version is available as well (with most functionality preserved)
–We run two repositories on two independent machines with two databases and failover (aiming for High Availability with Oracle Real Application Cluster)
LRF – Lemon RRD Framework
–Caches the data in an easily accessible form (rrd files) for web graphics
–In connection with the Quattor Configuration DB, provides service and cluster overviews
–RRD stands for Round Robin Database (time-aging data with predefined binning), developed by Tobias Oetiker at ETH Zurich (http://www.rrdtool.org)
LAG – Lemon Alarm Gateway
–Generic gateway for alarms
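The "time-aging data with predefined binning" idea behind RRD can be illustrated with a small sketch. This is a toy model only, not the rrdtool file format or the LRF code: the class name `RoundRobinArchive`, its parameters `step` and `rows`, and the sample load values are all assumptions made for illustration. Samples are averaged into fixed-width time bins, and only the newest bins are kept, so storage never grows.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive (hypothetical, not the rrdtool format).

    Samples are averaged into bins of `step` seconds and only the
    newest `rows` closed bins are kept.
    """

    def __init__(self, step: int, rows: int):
        self.step = step
        self.bins = deque(maxlen=rows)  # (bin_start, average); oldest drop out
        self._current_start = None
        self._sum = 0.0
        self._count = 0

    def update(self, timestamp: int, value: float) -> None:
        """Add one sample; timestamps are assumed to arrive in order."""
        start = timestamp - timestamp % self.step
        if self._current_start is not None and start != self._current_start:
            self._flush()
        self._current_start = start
        self._sum += value
        self._count += 1

    def _flush(self) -> None:
        """Close the open bin and store its average."""
        if self._count:
            self.bins.append((self._current_start, self._sum / self._count))
        self._sum, self._count = 0.0, 0

    def fetch(self):
        """Return the closed bins, oldest first."""
        return list(self.bins)


def demo_archive():
    """Feed five minutes of made-up load samples into a 3-row archive."""
    archive = RoundRobinArchive(step=60, rows=3)
    samples = [(0, 1.0), (30, 3.0), (60, 2.0), (90, 2.0),
               (120, 4.0), (150, 6.0), (180, 1.0), (210, 1.0),
               (240, 8.0), (270, 8.0), (300, 0.0)]
    for timestamp, load in samples:
        archive.update(timestamp, load)
    return archive.fetch()
```

With three rows of one-minute bins, `demo_archive()` keeps only the three most recent closed minutes; the first two minutes have aged out and the bin starting at 300 s is still open.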
6
Lemon at CERN
Lemon monitors about 2200 computers in ~100 clusters
On average it collects about 70 metrics from each host
Part of the ELFms tools
Integrated with the Sure alarm system
Collecting about 1.5 GB/day
Integrated with CDB for configuration
Leaf (LHC-Era Automated Fabric) for scheduling of interventions
[Diagram labels: Node Configuration Management, Node Management]
Configuration
–Derived from the Configuration Database (CDB)
–Individual configuration per cluster/host
–Hierarchical structure
–Monitoring state is derived from CDB
–Leaf tools allow scheduled downtimes, interventions, on-demand changes
Alarm system
–Sure: legacy system receiving alarms from Lemon
–Integration with the new LASER system (LHC alarm system) is ongoing
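The figures on this slide can be combined into a rough per-host data rate. The short calculation below is only a back-of-the-envelope sketch: it assumes decimal gigabytes and treats the rounded slide values ("about 2200", "about 70", "about 1.5 GB/day") as exact, so the results are order-of-magnitude estimates.

```python
HOSTS = 2200            # "about 2200 computers"
METRICS_PER_HOST = 70   # "about 70 metrics from each host"
BYTES_PER_DAY = 1.5e9   # "about 1.5 GB/day", assuming 1 GB = 10**9 bytes


def daily_volume():
    """Return (bytes per host per day, bytes per metric per host per day)."""
    per_host = BYTES_PER_DAY / HOSTS
    per_metric = per_host / METRICS_PER_HOST
    return per_host, per_metric
```

Under these assumptions each host contributes roughly 680 kB per day, or a little under 10 kB per metric per day.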
7
Computer Center Overview
Entry page displays a status overview of the key services
Allows choosing an individual cluster, rack, host or other category
8
Use(ful) cases (I)
Kernel upgrade
–The kernel version is "measured" when the machine boots
–Automatic tools for upgrading the kernel on a cluster retrieve this information from Lemon and schedule the reboot of a machine based on it
–The web interface allows monitoring of the progress
[Figure: reboot occurrence history graph]
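The kernel-upgrade use case boils down to comparing the kernel version measured at boot with the version the cluster should be running. The sketch below shows that comparison only; it does not use the real Lemon or upgrade-tool interfaces, and the function names, host names and kernel version strings are hypothetical.

```python
def hosts_needing_reboot(running_kernels, target_kernel):
    """Return hosts whose boot-time kernel measurement differs from the target."""
    return sorted(host for host, kernel in running_kernels.items()
                  if kernel != target_kernel)


def upgrade_progress(running_kernels, target_kernel):
    """Return the fraction of hosts already running the target kernel."""
    if not running_kernels:
        return 1.0
    done = sum(1 for kernel in running_kernels.values()
               if kernel == target_kernel)
    return done / len(running_kernels)


# Made-up measurements, standing in for values retrieved from the repository.
EXAMPLE_KERNELS = {
    "lxb001": "2.4.21-27",
    "lxb002": "2.4.21-32",
    "lxb003": "2.4.21-27",
    "lxb004": "2.4.21-32",
}
```

For the example data and a target of `"2.4.21-32"`, two hosts still need a reboot and the upgrade is half done, which is the kind of progress figure a web view could plot.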
9
Use(ful) case (II)
Searching for a host
–High load, network usage,…
–Metric distributions allow identification of hosts with problematic performance
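One way to pick problematic hosts out of a metric distribution is to flag values far above the cluster median. The slide does not say which criterion Lemon's web pages use, so the sketch below swaps in a common robust rule (median plus a multiple of the median absolute deviation); the function name, the factor `k` and the load values are assumptions.

```python
import statistics


def outlier_hosts(values, k=5.0):
    """Return hosts whose value exceeds median + k * MAD.

    `values` maps host name to one metric value (for example load average).
    MAD is the median absolute deviation from the median.
    """
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return []  # all hosts (nearly) identical: nothing stands out
    threshold = median + k * mad
    return sorted(host for host, v in values.items() if v > threshold)


# Made-up load averages for a five-node cluster.
EXAMPLE_LOADS = {"n1": 1.0, "n2": 1.2, "n3": 0.8, "n4": 1.1, "n5": 25.0}
```

Using the median rather than the mean keeps a single runaway host from shifting the threshold that is supposed to detect it.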
10
Integration of Web interface
The web interface has been adapted, through various plug-ins, to accommodate additional information and links that help in managing the computer center
Examples:
–Configuration database browser (browses external XML config files)
–ITCM (Remedy) tickets: external error tracking database
–CC tracker (synoptic view of the computer center) with XML-defined geometry
–Alarm display
–Metric information display
–Raw data grapher (JPgraph)
External functionalities are customizable
11
Computer Center display
The Lemon Web Interface is interfaced with the Computer Center database of objects
Provides search of objects as well as listing
Interfaced through the XML-defined geometry of the computer center
Generic design
12
Automatic recovery actions
Alarm Sensor
–For defined values of measured metrics, an actuator is called with a predefined action
–An example: ssh daemon dead, action: /sbin/service sshd start
–Definition: metric X, field Y != reference value Z => call actuator
  If success, log only
  Else call action up to max times
–Each occurrence is logged in the Monitoring Repository
–Already about 70 predefined alarms with automatic recovery actions
–After the first month of deployment it reduced the number of problem tickets by half
Correlation engine
–Allows a wide definition of alarms and recovery actions (in development)
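The alarm definition above can be read as a small retry loop. The sketch below is one possible interpretation, not the Lemon alarm sensor itself: it assumes that "success" means the field equals the reference value when measured again after the action, and the names `Alarm`, `run_alarm` and `read_field` are hypothetical. The demo replaces `/sbin/service sshd start` with a fake action so it can run anywhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alarm:
    metric: str                  # metric X
    field_name: str              # field Y
    reference: object            # reference value Z
    action: Callable[[], None]   # predefined recovery action
    max_attempts: int = 3        # "up to max times"


def run_alarm(alarm, read_field, log):
    """Check one alarm; on mismatch, retry the action up to max_attempts.

    `read_field(metric, field_name)` returns the current measurement.
    Every attempt is appended to `log`, standing in for the repository.
    """
    if read_field(alarm.metric, alarm.field_name) == alarm.reference:
        return "ok"
    for attempt in range(1, alarm.max_attempts + 1):
        alarm.action()
        recovered = read_field(alarm.metric, alarm.field_name) == alarm.reference
        log.append((alarm.metric, attempt, recovered))
        if recovered:
            return "recovered"
    return "failed"


def demo_alarm():
    """Simulate a dead ssh daemon that comes back on the second start."""
    state = {"running": 0, "starts": 0}

    def start_sshd():  # stand-in for `/sbin/service sshd start`
        state["starts"] += 1
        if state["starts"] >= 2:
            state["running"] = 1

    def read_field(metric, field_name):
        return state[field_name]

    log = []
    alarm = Alarm("sshd", "running", 1, start_sshd, max_attempts=3)
    return run_alarm(alarm, read_field, log), log
```

In the demo the first restart fails and the second succeeds, so the loop stops early and the log holds one entry per attempt.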
13
Remedy Ticket tracking
Error trending metric with values for the number of interventions/occurrences of problems
–Several categories created by:
  Hardware
  Software
–Clustered by contract type/cluster
–Reports whether problems were scheduled or not and whether the system was rebooted
–Allows tracking of interventions per type of problem
–Web interface to show the trend
[Figure: ITCM (Remedy) tickets occurrence]
14
Database (Oracle) Monitoring
In cooperation with the ADC group at CERN we have developed a sensor for measuring performance entities in an Oracle database:
–Number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, …
Allows identification of bottlenecks and gives an overview of the stability of the system
Works with both Oracle 9i and 10g
Integration into services/RAC
Configuration of the service is integrated with Oracle Enterprise Repository
15
Service challenges, GRID VOs
Lemon allows
–Virtual clusters
  Defined on request by service managers
  Or defined by scripts and updated dynamically on demand
  Or defined for a specific purpose
  An example: Atlas DC04 challenge, network challenges,…
–Clusters defined dynamically
  An example: hosts on the batch cluster running GRID jobs that belong to a given Virtual Organization
Provides hooks in Lemon for defining any dynamic grouping of hosts
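A dynamically defined cluster is, in essence, a predicate evaluated over the current host list. The following sketch illustrates that idea for the Virtual Organization example; it does not reflect the actual Lemon hooks, and the function names, the `job_vos` attribute and the host data are hypothetical.

```python
def dynamic_cluster(hosts, predicate):
    """Return the names of hosts whose attributes satisfy `predicate`."""
    return sorted(name for name, attrs in hosts.items() if predicate(attrs))


def running_jobs_for(vo):
    """Build a predicate matching hosts that currently run jobs for `vo`."""
    return lambda attrs: vo in attrs.get("job_vos", ())


# Made-up snapshot of a batch cluster: which VOs have jobs on each host.
EXAMPLE_BATCH = {
    "lxb001": {"job_vos": ["atlas", "cms"]},
    "lxb002": {"job_vos": []},
    "lxb003": {"job_vos": ["atlas"]},
}
```

Because membership is recomputed from current attributes, calling `dynamic_cluster(EXAMPLE_BATCH, running_jobs_for("atlas"))` on a fresh snapshot gives an up-to-date group without any stored host list.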
16
Summary
Lemon provides monitoring information about the computers in the Computer Center at CERN
Thanks to its integration with Sure (alarm system) it allows fast and easy identification and repair of problems. We will convert to a new accelerator alarm system (LASER) this year. Lemon provides LAG (Lemon Alarm Gateway) to feed alarms into arbitrary alarm systems.
In connection with CDB it allows an easier overview of services and visualisation of their performance
In connection with Remedy (ITCM, problem tracking) it allows an overview of the problems for a given service
It has been a useful tool for general performance monitoring and also for system administrators debugging problems
Lemon is also used and developed elsewhere: BARC institute in India, Accelerator department at CERN, CMS is adopting it for its online farm monitoring,…
Lemon is used for GridIce and can provide data to MonAlisa