Lemon Monitoring
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
CERN-IT/FIO-FS
LCG Operations Workshop, Bologna, May 2005
25/05/2005, LCG Operations Workshop, Bologna

Outline
  – Lemon
  – Structure and design
  – How it works, deployment
  – Use cases, web interface
  – Installation and setup
  – Summary

Lemon – LHC Era Monitoring
Lemon is a set of tools for monitoring the status and performance of computers:
  – Distributed monitoring system scalable to ~10k nodes
  – Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
  – Facilitates early error detection and problem prevention
  – Executes corrective actions and sends notifications
  – Provides persistent storage of the monitoring data
  – Offers a framework for creating further monitoring sensors
  – Site-independent functionality
Link:
Part of the ELFms toolsuite:

Lemon Use
It is used inside and outside CERN by:
  – System administrators, service managers, people responsible for clusters
  – Developers and service/data challenges
  – Managers and general users
Deployments outside CERN:
  – EDG testbeds
  – Accelerator (AB) department at CERN
  – CMS online
  – GridICE
  – BARC India (development partner)

Lemon architecture
[Diagram: Sensors and the Monitoring Agent on the nodes; the Monitoring Repository with its repository backend; Correlation Engines; Lemon CLI; RRDTool / PHP on Apache serving the user's web browser. Link labels in the figure: TCP/UDP, SOAP, HTTP.]

Components
Lemon is a typical client/server application with the following components:
  – MSA – Monitoring Sensor Agent (Lemon Agent)
      – Daemon on a client machine that spawns multiple Monitoring Sensors to measure data at defined intervals and sends the data to the Monitoring Repository
  – MS – Monitoring Sensor
      – Uses a standard C++ or Perl API, so it is easy to write your own sensor
      – Sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security and alarms (260 metrics in total)
  – MR – Monitoring Repository
      – Server application that receives samples and processes/validates them
      – Stores the full monitoring history data
      – Two implementations: flat files or Oracle DB based
  – LRF – Lemon RRD Framework
      – Pre-processes data into RRD files and creates cluster summaries
      – These are used for web graphics
      – Provides service and cluster overviews in its web displays
  – LAG – Lemon Alarm Gateway
      – Generic gateway for alarms (in development)
      – Gateways to MonALISA and GridICE exist
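
The agent/sensor split described above can be illustrated with a small sketch. It is not the Lemon sensor API (which is C++ or Perl); the names Sensor and run_agent, the metric names and the fixed return values are all hypothetical, and a virtual clock replaces real timers so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sensor:
    metric: str                   # hypothetical metric name
    interval: int                 # sampling period in seconds
    measure: Callable[[], float]  # takes one measurement


def run_agent(sensors: List[Sensor], duration: int) -> List[Tuple[int, str, float]]:
    """Step a virtual clock one second at a time and sample every due sensor.

    Returns the (timestamp, metric, value) samples that a real agent
    would forward to the repository.
    """
    samples = []
    for now in range(duration):
        for sensor in sensors:
            if now % sensor.interval == 0:
                samples.append((now, sensor.metric, sensor.measure()))
    return samples


# Two illustrative sensors with different sampling intervals.
sensors = [
    Sensor("load_avg", interval=2, measure=lambda: 0.5),
    Sensor("disk_used_pct", interval=5, measure=lambda: 42.0),
]
samples = run_agent(sensors, duration=10)
for timestamp, metric, value in samples:
    print(timestamp, metric, value)
```

Each sensor keeps its own interval, so cheap metrics can be sampled often and expensive ones rarely, while the agent remains the single point that ships samples to the server.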

Lemon at CERN
Lemon monitors about 2200 computers in ~100 clusters
On average it collects about 70 metrics from each host
Integrated with the Sure alarm system
Collects about 1.5 GB of data per day
LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling
[Figure labels: Node Configuration Management, Node Management]
Configuration
  – Derived from the Quattor Configuration Database (CDB)
  – Individual configuration per cluster/host
  – Hierarchical structure
Alarm system
  – Sure: legacy system receiving alarms from Lemon
  – Integration with the new LASER system (LHC alarm system) via LAG is ongoing
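
As a back-of-the-envelope check, the figures on this slide can be combined into per-metric and per-second rates. This assumes 1 GB = 10**9 bytes and that the daily volume is spread evenly over all hosts and metrics; neither assumption is stated on the slide.

```python
hosts = 2200
metrics_per_host = 70
bytes_per_day = 1.5e9  # assumption: 1 GB = 10**9 bytes

# Number of distinct (host, metric) series being collected.
series = hosts * metrics_per_host

# Average daily volume per series, and average ingest rate overall.
bytes_per_series_per_day = bytes_per_day / series
bytes_per_second = bytes_per_day / 86400

print(f"series: {series}")
print(f"bytes per series per day: {bytes_per_series_per_day:.0f}")
print(f"average ingest rate: {bytes_per_second:.0f} bytes/s")
```

Under these assumptions there are 154000 series, each contributing roughly 9.7 kB per day, for an average ingest rate of about 17 kB/s.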

Web interface
Cluster view displays accumulated statistics and status for all machines in the cluster
Host view gives an overview of the host status with basic metrics
Other views available:
  – Rack view
  – Hardware type view
  – Other views can be added; user-defined views are being worked on
With the newest version (to be released soon):
  – Generic entry page displaying a status overview of the key services
  – Configurable views
In development: database services monitoring with a database-specific view

Use(ful) case
Kernel upgrade
  – Kernel version is “measured” at boot of the machine
  – Automatic tools for upgrading the kernel on a cluster retrieve this information from Lemon and schedule a reboot of each machine based on it
  – Web interface allows monitoring of the progress
[Figure: Reboot occurrence history graph]
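
The kernel upgrade workflow above can be sketched as follows. This is a minimal illustration, not the actual CERN tooling: the function names, host names and kernel version strings are hypothetical, and rebooting in fixed-size batches is an assumed policy that the slide does not describe.

```python
from typing import Dict, List


def hosts_needing_reboot(kernel_by_host: Dict[str, str], target: str) -> List[str]:
    """Return hosts whose measured kernel version differs from the target."""
    return sorted(host for host, kernel in kernel_by_host.items() if kernel != target)


def schedule_in_batches(hosts: List[str], batch_size: int) -> List[List[str]]:
    """Split hosts into reboot batches so only part of the cluster is down at once."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Kernel versions as they might be reported by the monitoring system (made-up values).
measured = {
    "node01": "2.4.21-20",
    "node02": "2.4.21-27",
    "node03": "2.4.21-20",
    "node04": "2.4.21-20",
}
pending = hosts_needing_reboot(measured, target="2.4.21-27")
batches = schedule_in_batches(pending, batch_size=2)
print(batches)
```

Because the kernel version is measured again at each boot, rerunning the first function after a batch completes shows which hosts still need attention.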

Computer Center display
Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, …)
Provides search of objects as well as listing
Interfaced through an XML-defined geometry of the computer center
Generic design that can be used anywhere:
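
To give an idea of what an XML-defined geometry with search and listing could look like, here is a small sketch. The element and attribute names (computer_center, rack, host, id, x, y) are invented for illustration and are not the schema Lemon actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout description; the real Lemon schema is not shown in the slides.
LAYOUT_XML = """
<computer_center name="example">
  <rack id="R01" x="0" y="0">
    <host name="node01"/>
    <host name="node02"/>
  </rack>
  <rack id="R02" x="1" y="0">
    <host name="node03"/>
  </rack>
</computer_center>
"""


def load_racks(xml_text):
    """Parse the layout into {rack_id: {"position": (x, y), "hosts": [...]}}."""
    root = ET.fromstring(xml_text)
    racks = {}
    for rack in root.findall("rack"):
        racks[rack.get("id")] = {
            "position": (int(rack.get("x")), int(rack.get("y"))),
            "hosts": [host.get("name") for host in rack.findall("host")],
        }
    return racks


def find_rack(racks, host):
    """Search: return the id of the rack containing the host, or None."""
    for rack_id, info in racks.items():
        if host in info["hosts"]:
            return rack_id
    return None


racks = load_racks(LAYOUT_XML)
print(sorted(racks))               # listing of objects
print(find_rack(racks, "node03"))  # search for an object
```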

Service challenges, GRID VOs
Lemon allows for:
  – Virtual clusters
      – Clusters defined on request by service managers or defined by scripts; updated dynamically on demand or defined for a specific purpose
      – Examples: Alice MDC, network challenges, …
  – Clusters defined dynamically
      – Example: hosts running GRID jobs on the batch cluster that belong to a given Virtual Organization
      – Hooks in Lemon for defining any dynamic grouping of hosts
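
A dynamically defined cluster is essentially a grouping of hosts computed from current data. The sketch below groups batch hosts by the Virtual Organization of the jobs they run; the function name, the input format and the host and VO names are assumptions made for illustration, not Lemon's actual hooks.

```python
from collections import defaultdict


def group_hosts_by_vo(running_jobs):
    """Build dynamic clusters from (host, vo) pairs describing running jobs.

    A host running jobs for several VOs appears in each of their clusters.
    """
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Made-up snapshot of jobs currently running on the batch cluster.
running_jobs = [
    ("batch001", "alice"),
    ("batch002", "cms"),
    ("batch001", "cms"),
    ("batch003", "alice"),
    ("batch003", "alice"),
]
dynamic_clusters = group_hosts_by_vo(running_jobs)
print(dynamic_clusters)
```

Recomputing the grouping from a fresh snapshot is what makes the cluster dynamic: membership follows the jobs rather than a static host list.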

Automatic recovery actions and Alarms
Alarm Sensor
  – For defined values of measured metrics an actuator is called with a predefined action
  – An example: ssh daemon dead; action: /sbin/service sshd start
  – Definition: metric X, field Y compared with reference value Z => call actuator
      – The comparison can be ==, <, >, regexp, range, etc.
      – If successful, log only; otherwise call the action up to a maximum number of times
  – Each occurrence is logged in the Monitoring Repository
  – Already about 70 predefined alarms with automatic recovery actions
  – After the first month of deployment it halved the number of problem tickets
Correlation engine (CMDaemon)
  – Allows ‘global’ correlations and, in the future, client/server alarms and recovery actions
Lemon Alarm Gateway (LAG)
  – LAG can be used to feed alarms into arbitrary alarm systems (under development)
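
The alarm definition above (metric, field, comparison, reference value, action, maximum number of attempts) can be sketched as a small rule evaluator. This is one possible reading of the slide, not the Lemon alarm sensor itself: AlarmRule, evaluate, the field name process_count and the set of comparators (including < and >) are assumptions, and the recovery action is a stand-in function rather than a real call to /sbin/service.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, List

# Assumed comparator set; the slide lists ==, regexp, range, "etc.".
COMPARATORS = {
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "regexp": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


@dataclass
class AlarmRule:
    metric: str
    field: str
    comparator: str
    reference: object
    action: Callable[[], bool]  # returns True if the recovery succeeded
    max_attempts: int = 3


def evaluate(rule: AlarmRule, sample: dict, log: List[str]) -> bool:
    """Return True if the alarm condition held; run the action with retries."""
    value = sample[rule.field]
    if not COMPARATORS[rule.comparator](value, rule.reference):
        return False  # condition not met, nothing to do
    log.append(f"alarm on {rule.metric}.{rule.field}")
    for attempt in range(1, rule.max_attempts + 1):
        if rule.action():
            log.append(f"action succeeded on attempt {attempt}")
            return True
        log.append(f"action failed on attempt {attempt}")
    log.append("giving up after max attempts")
    return True


attempts = []


def restart_sshd() -> bool:
    # Stand-in for "/sbin/service sshd start": fails once, then succeeds.
    attempts.append(1)
    return len(attempts) >= 2


rule = AlarmRule("sshd", "process_count", "==", 0, restart_sshd)
log: List[str] = []
raised = evaluate(rule, {"process_count": 0}, log)
print(raised, log)
```

In a real deployment every entry appended to log would correspond to an occurrence recorded in the Monitoring Repository.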

Installation and setup (I)
Lemon installation consists of three steps:
  1. Server installation
  2. Client installation
  3. Web interface installation
1. Server installation:
  – Install the edg-fabricMonitoring-server rpm (“flat file” server)
  – Configure the receiving port in /etc/edg-fmon-server.conf
  – Start the server daemon
2. Client installation:
  – Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
  – Configure the server and its port in /etc/edg-fmon-agent.conf
  – Start the client daemon on all monitored hosts

Installation and setup (II)
3. Web interface installation:
  – Install and start an Apache server (with PHP) on your server
  – Install the rrdtool and lrf (Lemon RRD Framework) rpms
  – Configure your clusters in the clusters.conf file and start the lemonmrd daemon
Drink Champagne… you have Lemon up and running! ;-)
  – You can do all this on your laptop!
Possible additional components:
  – Computer center synoptic view through an XML file
  – Problem tracking system integration (through a PHP plug-in to your DB/application)
  – Quattor CDB configuration view through CDB XML profiles
  – Oracle-based Repository (for very large installations, with high scalability and increased functionality)
  – Other, new components are easy to add
View detailed instructions at:

Summary
Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
Lemon provides a framework for recovery actions and alarms.
Lemon is easy to install (…and it is easy to add your own metrics and visualize them).
It is flexible with respect to your needs: you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
It has been a useful tool for general performance monitoring and for system administrators debugging problems.
For more information check