Computing Facilities
CERN IT Department, CH-1211 Geneva 23, Switzerland

CC Monitoring
I. Fedorko on behalf of CF/ASI
18/02/2011

Overview
  – Lemon
  – LAS
  – SLS
  – Recent and coming challenges

CC Monitoring
We have monitoring tools
  – Lemon
  – LAS
  – SLS
New requirements are coming progressively
Environment is changing

Lemon
Purpose:
  – CC monitoring (Linux boxes)
Customers:
  – CF-SAO (sys-admin, operators), IT&PH (SM and VOC)
What is monitored:
  – Performance monitoring (e.g. CPU load)
  – Application (e.g. log parsing)
  – Facilities monitoring (e.g. power, temperature), recently added
Main characteristics:
  – Agent-based system
  – Corrective action performed on node
Overlap:
  – Data collection

Lemon
Needs to be addressed
  – Visualization: currently under development (see Dec 2010 post C5)
      Designed for simple cluster-host hierarchy: limit for facilities monitoring
      Scale limits
      Advanced visualization out of design scope: addressed by CLUMAN
  – Scaling (virtualization)
  – Service monitoring
  – Remote data transfer
  – RFE
      Integration with other monitoring (Windows, Nagios)
      Data aggregation
        – CPU load over all cluster
        – Service state over all nodes and applications
      Remote test/probing
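
The two data aggregation requests (CPU load over all cluster, service state over all nodes) could look roughly like the sketch below. This is not Lemon code: the sample layout, the state names "ok"/"degraded"/"down", the worst-state rule and the cluster name other_cluster are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical severity ordering; Lemon does not define these names.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}


def cluster_cpu_load(samples):
    """Average CPU load per cluster from (cluster, host, load) tuples."""
    per_cluster = defaultdict(list)
    for cluster, _host, load in samples:
        per_cluster[cluster].append(load)
    return {cluster: mean(loads) for cluster, loads in per_cluster.items()}


def service_state(node_states):
    """Aggregate per-node states into one service state (worst state wins)."""
    return max(node_states, key=lambda state: SEVERITY[state])


samples = [
    ("lxbatch", "host1", 2.0),
    ("lxbatch", "host2", 4.0),
    ("other_cluster", "host3", 1.0),
]
cluster_loads = cluster_cpu_load(samples)
overall_state = service_state(["ok", "degraded", "ok"])
```

Taking the worst node state is only one possible policy; a majority or weighted rule would fit the same interface.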

LAS
Purpose:
  – Alarming based on Lemon information
Customers:
  – CF-SAO (sys-admin, operators), IT&PH (SM and VOC)
What is alarmed:
  – Exceptions coming from Lemon agents
Main characteristics:
  – Based on Oracle backend (logic implemented in PL/SQL)
  – Web interface for operator
  – Interacting with ITCM (Remedy)
Needs to be addressed:
  – Integration of information from other monitoring systems (e.g. Windows)
  – Alarms from distributed Lemon instances
  – Migration of ITCM to Service-Now
Overlap:
  – Can we have a basic alarming system infrastructure shared with other monitoring systems?

SLS
Purpose:
  – Display service status information
Customers:
  – IT&PH (SM and VOC), management
What is displayed:
  – Service information (availability, KPI, numerical values) provided by user
Main characteristics:
  – Service information is calculated/estimated and provided by the user
  – Service definition is in SDB
Needs to be addressed:
  – Visualization of dependency between services and hosts
  – Software consolidation: What can we share with Lemon?
  – Migration of (only?) SDB to Service-Now?
  – New service monitoring integrated with Lemon?
Overlap:
  – Can we share dash-board/status-board?

Recent challenges
Scaling
RFE beyond current design
Service monitoring
  – with recovery action, alarming, etc.
Remote monitoring and management
Software consolidation
Integration with others
Any common strategy?

Manpower
Lemon & LAS
  – 50% of staff FTE
  – 1 fellow FTE till the end of June 2011
SLS
  – <1 fellow FTE till the end of June 2011

How to address
In-house development
  – expensive
Replacement
  – No silver bullet (CHEP, EE experiment experience)
  – Tool review needed
  – Nagios, Ganglia, Zenoss …
Combination
  – Replace monitoring component to minimize development
  – Data collection and visualization infrastructure independent from tool(s)
Any solution to build together?

Backup
From now on backup

Visualization in Lemon
[Figure; labels: "3 subclusters of cluster lxbatch", "node"]

Services in Lemon
[Figure: Number of registered metrics during last year]
[Diagram; labels as extracted:]
Host 1: Application A, HW scan
Host 2: Application B, HW scan
Metrics: CPU load, partitions occupancy, is app running, log parsing X, log parsing Y, SMART, IPMI
Metrics: CPU load, partitions occupancy, is app running, log parsing, SMART, IPMI
Exceptions: Except. 1, Except. 2, Except. 3, Except. 4, Except. 5
Current LAS view: Service alarms, Sys-admin alarms

Lemon
[Architecture diagram; labels as extracted:]
Protocols: SQL, TCP/UDP, HTTP
Components: Sensor, Monitoring Agent, Local Cache, Oracle Database, Repository Backend, Application Server, Web Browser, RRD tool / Python, Apache / PHP
Lemon CLI (command line tool to access data)
Lemon-host-check (command line tool for node exceptions)
Groups: Measurement Repository, User Interfaces, Node Monitoring

LAS
[Architecture diagram; labels as extracted:]
Components: Exception Metrics, ITCM, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL)
Users: Operator, Administrator
Notes:
  – High level objects: CPU load over all cluster
  – LAS to include Windows monitoring
  – ITCM will be migrated to Service-Now

SLS
[Architecture diagram; labels as extracted:]
Components: SLS-web, USER, Scripts, XML, SDB, RRD, test/probes, LemonDB, SLS XML, SDB
Open questions:
  – Service Catalog?
  – Infrastructure consolidation: with Lemon, or with something new?
  – Does SDB-SLS cover monitoring of all services?

Discussed Monitoring Task I.
Current activity
  – Lemon-web 2.0 development ongoing
Lemon enhancements under consideration
New Lemon DB schema
  – Increase of monitoring data impacts the size and performance of DB repository
  – Impact on many Lemon components
Lemon repository data export
  – Reduce amount of historical data stored in DB: export to data files
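
The repository data export idea (move old historical data out of the DB into data files) can be sketched as below. The record layout (host, metric, timestamp, value), the CSV format and the 90-day cutoff are hypothetical; the real Lemon schema and export format are not described on the slide.

```python
import csv
import io
from datetime import datetime, timedelta


def export_old_samples(samples, cutoff, out):
    """Write samples older than `cutoff` to `out` as CSV.

    Returns (kept, exported_count): the samples that stay in the
    repository and the number of rows written to the file.
    """
    writer = csv.writer(out)
    writer.writerow(["host", "metric", "timestamp", "value"])
    kept = []
    exported = 0
    for host, metric, ts, value in samples:
        if ts < cutoff:
            writer.writerow([host, metric, ts.isoformat(), value])
            exported += 1
        else:
            kept.append((host, metric, ts, value))
    return kept, exported


# Illustrative data only; hosts and values are made up.
now = datetime(2011, 2, 18)
samples = [
    ("host1", "cpu_load", now - timedelta(days=200), 1.5),
    ("host1", "cpu_load", now - timedelta(days=10), 2.5),
    ("host2", "cpu_load", now - timedelta(days=120), 0.5),
]
archive = io.StringIO()  # stands in for a data file
kept, exported = export_old_samples(samples, now - timedelta(days=90), archive)
```

In a real setup `out` would be an open file and the kept rows would correspond to what remains in the DB after the exported rows are deleted.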

Discussed Monitoring Task II.
Lemon-sensors review/development
  – Pending enhancements on practically all core sensors
  – New sensors (e.g. for SafeHost)
  – Python API
High level objects
  – Trigger alarm if > 40% of cluster nodes are under high load
  – Data aggregation on data collection
Integration with Windows monitoring (one LAS)
Support for virtualization (new instances + federated web)
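
The high level object rule (alarm if > 40% of cluster nodes are under high load) amounts to a simple fraction test. A minimal sketch follows, assuming a per-node load map and a hypothetical high-load threshold of 5.0; node names and values are illustrative, and this is not the LAS implementation.

```python
def cluster_overloaded(loads, high_load, fraction=0.40):
    """Return True if more than `fraction` of nodes exceed `high_load`.

    loads: dict mapping node name to its current CPU load.
    """
    if not loads:
        return False  # no nodes reporting, nothing to alarm on
    high_nodes = sum(1 for load in loads.values() if load > high_load)
    return high_nodes / len(loads) > fraction


# Illustrative values: 3 of 5 nodes are above the assumed threshold.
loads = {
    "node01": 9.5,
    "node02": 1.2,
    "node03": 8.1,
    "node04": 0.7,
    "node05": 7.9,
}
alarm = cluster_overloaded(loads, high_load=5.0)
```

Here 3 of 5 nodes (60%) are above the threshold, so the rule fires.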

Discussed Monitoring Task III.
SLS
  – No pending RFE
  – Propagate deletion of service from SDB to SLS
  – Graphical representation of dependency between services and hosts (and alarms)
  – Ongoing DB backend consolidation
  – Possibly migration of SDB to Service-Now

Lemon Monitoring
~11k monitored entities (~8k nodes)
  – performance monitoring: CPU load, partitions information
  – application monitoring: file, log parsing
  – power, temperature
  – remote (ping, http, snmp)
5 core sensors covering ~60% of performance and application monitoring
~30 misc sensors
  – hw_scan, snmp, castor
>5000 nodes with metrics

Lemon instances
~1.1k metrics, 473 exceptions, 254 classes
~1.7M monitored metrics across Lemon
~300 GB of data / month produced
Covered by 2 servers running in parallel
  – recent data for LAS performance
  – historical data no problem with powerful data

Lemon instances

Monitoring DB
Lemonops (latest only data)
  Size  Used  Avail  Use%  Mounted on
  32G   29G   3.8G   89%   /ORA/dbs03/LEMONOP
Lemonrac (historical data)
  Size  Used  Avail  Use%  Mounted on
  1.6T  1.5T  76G    96%   /ORA/dbs03/LEMONRAC
Data income: ~300 GB/month
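
A back-of-the-envelope estimate of the remaining headroom on the historical volume, using the figures above. It assumes the whole ~300 GB/month income lands on LEMONRAC, a 30-day month, and no cleanup, compression or export in the meantime; none of these assumptions is stated on the slide.

```python
def days_until_full(avail_gb, income_gb_per_month, days_per_month=30.0):
    """Days until the volume fills at a constant data income rate."""
    if income_gb_per_month <= 0:
        raise ValueError("income must be positive")
    daily_income_gb = income_gb_per_month / days_per_month
    return avail_gb / daily_income_gb


# 76 GB available on /ORA/dbs03/LEMONRAC, ~300 GB/month income
remaining_days = days_until_full(76, 300)
print(round(remaining_days, 1))  # 7.6
```

Under these assumptions the free space corresponds to roughly a week of incoming data.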