Published by Elaine Sparks. Modified over 8 years ago.
1
Computing Facilities
CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it
CF
CC Monitoring
I. Fedorko on behalf of CF/ASI, 18/02/2011
Overview
–Lemon
–LAS
–SLS
–Recent and coming challenges
2
CC Monitoring
We have monitoring tools
–Lemon
–LAS
–SLS
New requirements are coming progressively
The environment is changing
3
Lemon
Purpose:
–CC monitoring (Linux boxes)
Customers:
–CF-SAO (sys-admin, operators), IT&PH (SM and VOC)
What is monitored:
–Performance monitoring (e.g. CPU load)
–Application (e.g. log parsing)
–Facilities monitoring (e.g. power, temperature), recently added
Main characteristics:
–Agent-based system
–Corrective action performed on node
Overlap:
–Data collection
4
Lemon
Needs to be addressed:
–Visualization, currently under development (see Dec 2010 post C5)
  Designed for a simple cluster-host hierarchy, which is a limit for facilities monitoring
  Scale limits
  Advanced visualization is out of design scope; addressed by CLUMAN
–Scaling (virtualization)
–Service monitoring
–Remote data transfer
–RFE
  Integration with other monitoring (Windows, Nagios)
  Data aggregation
  –CPU load over all cluster
  –Service state over all nodes and applications
  Remote test/probing
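The two data aggregation requests above can be illustrated with a minimal Python sketch. It is not Lemon code: the function names, the input dictionaries and the three-level state ordering are all assumptions made for illustration. It averages per-node CPU load over a cluster and reports the worst state seen for a service across its nodes.

```python
from statistics import mean

# Hypothetical ordering of service states, best first, worst last.
STATE_ORDER = ["available", "degraded", "unavailable"]


def cluster_cpu_load(load_by_node):
    """Average the latest CPU load sample of every node in the cluster."""
    return mean(load_by_node.values())


def service_state(state_by_node):
    """Aggregate per-node states into one state: the worst one wins."""
    return max(state_by_node.values(), key=STATE_ORDER.index)


if __name__ == "__main__":
    # Made-up sample values, not measurements from the slides.
    loads = {"node01": 1.0, "node02": 3.0}
    states = {"node01": "available", "node02": "degraded"}
    print(cluster_cpu_load(loads))
    print(service_state(states))
```

Taking the worst state is only one possible policy; a real service definition might instead tolerate a fraction of failed nodes.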
5
LAS
Purpose:
–Alarming based on Lemon information
Customers:
–CF-SAO (sys-admin, operators), IT&PH (SM and VOC)
What is alarmed:
–Exceptions coming from Lemon agents
Main characteristics:
–Based on Oracle backend (logic implemented in PL/SQL)
–Web interface for operators
–Interacts with ITCM (Remedy)
Needs to be addressed:
–Integration of information from other monitoring systems (e.g. Windows)
–Alarms from distributed Lemon instances
–Migration of ITCM to Service-Now
Overlap:
–Can we have a basic alarming system infrastructure shared with other monitoring systems?
6
SLS
Purpose:
–Display service status information
Customers:
–IT&PH (SM and VOC), management
What is displayed:
–Service information (availability, KPI, numerical values) provided by the user
Main characteristics:
–Service information is calculated/estimated and provided by the user
–Service definition is in SDB
Needs to be addressed:
–Visualization of dependency between services and hosts
–Software consolidation: what can we share with Lemon?
–Migration of (only?) SDB to Service-Now?
–New service monitoring integrated with Lemon?
Overlap:
–Can we share dashboard/status board?
7
Recent challenges
Scaling
RFE beyond current design
Service monitoring
–with recovery action, alarming, etc.
Remote monitoring and management
Software consolidation
Integration with others
Any common strategy?
8
Manpower
Lemon & LAS
–50% of staff FTE
–1 fellow FTE till the end of June 2011
SLS
–<1 fellow FTE till the end of June 2011
9
How to address
In-house development
–expensive
Replacement
–No silver bullet (CHEP, EE experiment experience)
–Tool review needed
–Nagios, Ganglia, Zenoss …
Combination
–Replace monitoring component to minimize development
–Data collection and visualization infrastructure independent from tool(s)
Any solution to build together?
10
Backup
From now on backup
11
Visualization in Lemon
[Figure labels: 3 subclusters of cluster lxbatch; node]
12
Services in Lemon
[Figure: Number of registered metrics during last year]
[Diagram: metrics per host and the exceptions derived from them]
Host 1: CPU load, partitions occupancy
–Application A: is app running, log parsing X, log parsing Y
–HW scan: SMART, IPMI
Host 2: CPU load, partitions occupancy
–Application B: is app running, log parsing
–HW scan: SMART, IPMI
Exceptions: Except. 1, Except. 2, Except. 3, Except. 4, Except. 5
Current LAS view
Service alarms
Sys-admin alarms
13
Lemon
[Architecture diagram labels]
Node Monitoring: Sensor, Monitoring Agent, Local Cache
Measurement Repository: Repository Backend, Oracle Database, Application Server
User Interfaces:
–Lemon CLI (command line tool to access data)
–Lemon-host-check (command line tool for node exceptions)
–Web Browser (RRD tool / Python, Apache / PHP)
Connections: SQL, TCP/UDP, HTTP
14
LAS
[Diagram labels: Exception Metrics, ITCM, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL), Operator, Administrator]
High-level objects
–CPU load over all cluster
LAS to include Windows monitoring
ITCM will be migrated to Service-Now
15
SLS
[Diagram labels: SLS-web, USER, Scripts, XML, SDB, RRD, test/probes, LemonDB, SLS XML, SDB]
Service Catalog?
Infrastructure consolidation: with Lemon, or with something new?
Does SDB-SLS cover monitoring of all services?
16
Discussed Monitoring Task I.
Current activity
–Lemon-web 2.0 development ongoing
Lemon enhancements under consideration
New Lemon DB schema
–Increase of monitoring data impacts the size and performance of the DB repository
–Impact on many Lemon components
Lemon repository data export
–Reduce the amount of historical data stored in the DB; export to data files
17
Discussed Monitoring Task II.
Lemon-sensors review/development
–Pending enhancements on practically all core sensors
–New sensors (e.g. for SafeHost)
–Python API
High-level objects
–Trigger an alarm if > 40% of cluster nodes are on high load
–Data aggregation on data collection
Integration with Windows monitoring (one LAS)
Support for virtualization (new instances + federated web)
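The high-level object rule on this slide (alarm when more than 40% of cluster nodes are on high load) could look roughly like the Python sketch below. The function name, the input format and the per-node "high load" cut-off of 8.0 are hypothetical; the slides give only the 40% fraction.

```python
def cluster_high_load_alarm(load_by_node, high_load, fraction=0.40):
    """Return True when more than `fraction` of nodes exceed `high_load`."""
    if not load_by_node:
        # An empty cluster cannot be on high load.
        return False
    n_high = sum(1 for load in load_by_node.values() if load > high_load)
    return n_high / len(load_by_node) > fraction


if __name__ == "__main__":
    # Made-up sample: 3 of 5 nodes (60%) are above the assumed cut-off.
    loads = {"node01": 12.0, "node02": 0.5, "node03": 9.1,
             "node04": 0.7, "node05": 8.3}
    print(cluster_high_load_alarm(loads, high_load=8.0))
```

The comparison is strict, so a cluster with exactly 40% of its nodes on high load does not raise the alarm, matching the "> 40%" wording.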
18
Discussed Monitoring Task III.
SLS
–No pending RFE
–Propagate deletion of service from SDB to SLS
–Graphical representation of dependency between services and hosts (and alarms)
–Ongoing DB backend consolidation
–Possibly migration of SDB to Service-Now
19
Lemon Monitoring
~11k monitored entities (~8k nodes)
–performance monitoring: CPU load, partitions information
–application monitoring: file, log parsing
–power, temperature
–remote (ping, http, snmp)
5 core sensors covering ~60% of performance and application monitoring
~30 misc sensors
–hw_scan, snmp, castor
>5000 nodes with 150-200 metrics
20
Lemon instances
~1.1k metrics, 473 exceptions, 254 classes
~1.7M monitored metrics across Lemon
~300 GB of data / month produced
Covered by 2 servers running in parallel
–recent data for LAS performance
–historical data no problem with powerful data
21
Lemon instances
22
Monitoring DB
Lemonops (latest data only)
Size  Used  Avail  Use%  Mounted on
32G   29G   3.8G   89%   /ORA/dbs03/LEMONOP
Lemonrac (historical data)
Size  Used  Avail  Use%  Mounted on
1.6T  1.5T  76G    96%   /ORA/dbs03/LEMONRAC
Data income: ~300 GB/month
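To put the figures above in perspective, here is a rough Python estimate of how long the free space on the historical volume would last. It assumes, beyond what the slide states, that the whole ~300 GB/month lands on Lemonrac, that nothing is exported or deleted in the meantime, and that a month has 30 days; the function name is hypothetical.

```python
def days_until_full(avail_gb, income_gb_per_month, days_per_month=30.0):
    """Days of headroom left at a constant data income rate."""
    return avail_gb / income_gb_per_month * days_per_month


if __name__ == "__main__":
    # 76 GB available on /ORA/dbs03/LEMONRAC, ~300 GB/month of data income.
    print(round(days_until_full(76.0, 300.0), 1))  # 7.6
```

Under those assumptions the headroom is about a week.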