Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM
Giuseppe Misurelli, INFN-CNAF
giuseppe.misurelli cnaf.infn.it
I Corso di formazione INFN per amministratori di siti Grid
Martina Franca, 5-9 November 2007
Disclaimer This presentation is based on materials provided and authorized by the EGEE project and is freely available to download and use according to the terms of the following license:
Outline
Monitoring goals
Monitoring Procedure
Fabric Monitoring
INFNGrid Monitoring tools
Grid Monitoring
Grid monitoring has to provide knowledge of the type, state and features of the resources constituting the Grid by means of:
–Grid Resources Inventory
–Grid Resources Behavior
–Grid Resources Availability
Grid Resources Inventory
An instantaneous picture of the resources constituting the Grid, giving an idea of how Grid resources are shared among sites:
–Number of Computing Elements (CE), Worker Nodes (WN) and Storage Elements (SE)
–Number of jobs running and waiting across the whole Grid, per VO
Grid Resources Behavior
Measuring a set of evolving data to investigate historical/statistical aspects of a Grid:
–Percentage of jobs aborted at a site for a particular Virtual Organization (VO) in a certain period of time
–Duration of a fault situation for a particular service or Grid process
–Percentage of CPU/RAM usage during Grid activity
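The first behavior metric above can be sketched in a few lines. This is only an illustration: the `JobRecord` fields, the status string "Aborted" and the function name are assumptions made for the example, not the data model of any specific monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobRecord:
    # Hypothetical job record; field names are assumptions for this sketch.
    site: str
    vo: str
    status: str          # e.g. "Done" or "Aborted"
    finished: datetime


def aborted_percentage(jobs, site, vo, start, end):
    """Percentage of jobs of `vo` at `site` that finished in [start, end)
    with status "Aborted". Returns None when no job matches."""
    selected = [
        j for j in jobs
        if j.site == site and j.vo == vo and start <= j.finished < end
    ]
    if not selected:
        return None
    aborted = sum(1 for j in selected if j.status == "Aborted")
    return 100.0 * aborted / len(selected)
```

Returning None for an empty selection keeps "no jobs in the period" distinct from "0% aborted", which matters when the figure is used to judge a site.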
Grid Resources Availability
Evaluating the accessibility of the main Grid services at Regional, Site and VO level to improve Grid usage
–Grid services currently down (e.g. CE, WN, SE)
–Grid site components currently not working properly (e.g. authentication and authorization, job submission, data management)
–Current job load at a given site
–Current min/max number of free slots where jobs can be submitted
Day by Day Operations /1
INFNGrid must be monitored daily by both the ROC team and the Site Managers to test its functionality
–Service Level Agreement according to the Memorandum of Understanding: each site must provide a production-level Grid service
Day by Day Operations /2
The monitoring procedure is based on:
–Problem Detection and Diagnosis: use of monitoring tools, low-level checks on site
–Problem Tracking (see next talk on Support Systems): use of a helpdesk ticketing system
Grid Site Monitoring: General Requirements
Scale efficiently as the number of monitored nodes increases
Use lightweight sensors
–Avoid overloading the computers
Publish reliable data
–A hard task in a Grid environment
Send notifications on daemon/machine problems
Take action in case of problems on services
Allow metrics to be added easily
–New interesting parameters must be added without too much work
Be “Grid Aware”
Monitoring Cluster Systems
Use of systems to spot outages and notify system administrators via e-mail, pager or other alarms
Top systems used in Grid Sites
–Ganglia
–Lemon
–Monit
Ganglia
PRO:
Open source project developed at the University of California, Berkeley
Adopted by many sites
Easy to install and manage
Useful charts
–Spikes are easy to detect, thanks to the configurable update time
Easy to add new metrics
CONS:
Alarms and reactions on failures are not available
Problems scaling to hundreds or thousands of nodes with a high sampling frequency
It is not aware of gLite grid services
Data can be stored only in RRD “DB”
–No detailed historical data are available
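Why a round-robin database (RRD) keeps no detailed history can be shown with a toy model. The class below is a simplified sketch, not RRDtool's API: the class name, its parameters and the choice of averaging as the only consolidation function are assumptions made for illustration.

```python
from collections import deque


class RoundRobinArchive:
    """Toy round-robin archive: fixed number of rows, each row being the
    average of `steps_per_row` consecutive samples."""

    def __init__(self, rows, steps_per_row):
        self.rows = deque(maxlen=rows)   # oldest rows are silently overwritten
        self.steps_per_row = steps_per_row
        self._pending = []

    def update(self, value):
        # Collect raw samples until a full row can be consolidated.
        self._pending.append(value)
        if len(self._pending) == self.steps_per_row:
            self.rows.append(sum(self._pending) / len(self._pending))
            self._pending.clear()


if __name__ == "__main__":
    archive = RoundRobinArchive(rows=3, steps_per_row=4)
    for sample in [0, 0, 100, 0]:        # a short spike of 100
        archive.update(sample)
    print(list(archive.rows))            # the spike survives only as an average
```

Storage stays constant in size, but raw samples are lost once they are averaged, and old rows disappear once the archive wraps around.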
Lemon /1
PRO:
Open source project developed by CERN
Its goal is to provide a monitoring system that can scale to thousands of nodes without problems
A detailed history is available when an Oracle DB is used as the RDBMS
Many advanced parameters can be monitored using standard sensors
Less PRO:
LEMON can also be installed without a DB back-end
–With less functionality
It has alarms and reactions on failure
–The complete set of functions is available only with a DB back-end installation
Configuration is already available for some grid services
–It must be customized according to the site
Lemon /2
CONS:
It is not so easy to install and manage
It is not so simple to add metrics or checks
A more “friendly” DB back-end is not available yet
It has no hourly graph, which can make spikes hard to detect
Monit
PRO:
Public open source project
It has a good base of standard checks for well-known services
Lightweight, easy to install, configure and manage
A simple built-in HTTP server to check the status of each machine
CONS:
It is not really a “monitoring system” but an “alert system”
A single web page with the status of all monitored machines is not available yet
No charts available yet
Monitoring Grid Systems
The INFNGrid project adopts three main Grid monitoring tools to check whether its Grid resources and services work as expected
–GridICE
–GSTAT
–SAM
GridICE: Overview
Based on the gLite Information System
–Daily discovery of new GRISes
–Periodic queries to the discovered GRISes (every min)
CE, SE, Site BDII
Standard Glue info published
Extended GRIS (EX GRIS)
Hosts info (e.g. daemon monitoring)
Job monitoring
Computing info gathered from the site Local Resource Management System (LRMS)
–Information collected in a central RDBMS and published on the Web
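The kind of data returned by a query to a GRIS or BDII can be illustrated by parsing a small LDIF fragment. This is a sketch, not GridICE's implementation: the sample entries and host names are invented, the parser ignores LDIF line folding and base64 values, and only the attribute names (GlueCEUniqueID, GlueCEStateRunningJobs, GlueCEStateWaitingJobs) come from the Glue 1.x schema.

```python
# Invented sample of what an LDAP query against a site BDII might return.
SAMPLE = """\
dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-cert,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce01.example.org:2119/jobmanager-lcgpbs-cert
GlueCEStateRunningJobs: 12
GlueCEStateWaitingJobs: 3

dn: GlueCEUniqueID=ce02.example.org:2119/jobmanager-lcgpbs-cert,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce02.example.org:2119/jobmanager-lcgpbs-cert
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
"""


def parse_ldif(text):
    """Split LDIF text into entries (dicts of attribute -> list of values).
    Simplification: folded lines and base64 ("::") values are not handled."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            if line.startswith("#") or ": " not in line:
                continue
            key, value = line.split(": ", 1)
            entry.setdefault(key, []).append(value.strip())
        if entry:
            entries.append(entry)
    return entries


def ce_job_counts(entries):
    """Map each CE identifier to (running jobs, waiting jobs)."""
    counts = {}
    for entry in entries:
        if "GlueCEUniqueID" not in entry:
            continue
        ce = entry["GlueCEUniqueID"][0]
        running = int(entry.get("GlueCEStateRunningJobs", ["0"])[0])
        waiting = int(entry.get("GlueCEStateWaitingJobs", ["0"])[0])
        counts[ce] = (running, waiting)
    return counts
```

A collector would run such a query periodically against every discovered GRIS and store the parsed values in the central database.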
GridICE: Geo View
GridICE: Site View Standard Parameters /1
Downtime status (from GOC DB)
Country information (from GridICE detection mechanism)
Administrative information (from GOC DB)
GridICE: Site View Extended Parameters
Site job load, a measure of how busy the site is: ((CPU# - CPUFree) / CPU#) * 100
Power estimation, calculated by adding the power value (SpecInt) of each CPU of the site
WN and CPU number
CPULoad, computed from the load1min reported by the LRMS for all the WNs
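The site job load formula and the power estimation can be written out as a short sketch. The function and parameter names are assumptions made for the example; only the formulas come from the slide.

```python
def site_job_load(total_cpus, free_cpus):
    """Site job load in percent: ((CPU# - CPUFree) / CPU#) * 100."""
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    if not 0 <= free_cpus <= total_cpus:
        raise ValueError("free_cpus must be between 0 and total_cpus")
    return (total_cpus - free_cpus) / total_cpus * 100


def power_estimation(specint_per_cpu):
    """Site power estimate: sum of the SpecInt value of each CPU."""
    return sum(specint_per_cpu)


if __name__ == "__main__":
    # A site with 200 CPUs, 50 of them free, is 75% busy.
    print(site_job_load(200, 50))
```

The input checks matter in practice: a site that publishes more free CPUs than total CPUs would otherwise show a negative load.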
GridICE: Site View Standard Parameters /2
Number of available gatekeepers (CE)
Number of configured queues on the CE
Running and waiting jobs
GridICE: Site View Standard Storage Parameters
Available space, total space and percentage used on the Storage Element of the site
GridICE: Site View Monitored Hosts
Number of monitored hosts per site
GridICE: Host View General Use Case 2 Grid operator – Site administrator Detecting Resource Brokers with problems
GridICE: Host View Details
GridICE: GRIS View General Use Case 3 Grid operator – Site administrator Detecting GRIS’s status
GridICE: GRIS View Detail
Job View
Job section to track VO user activity in order to:
–Search among a huge number of jobs
–Inspect job resource consumption
–Aggregate job info based on VOMS attributes (next release)
Info selected according to the consumer ID (group/role)
Chart View: Site manager viewpoint
SAM: CE Functionality Tests
You can customize your personal SAM interface with the desired tests, chosen from a list of available ones
–Job submission
–CA certificate version installed on WN
–Middleware version installed on WN
–Host certificate validity
–Replica management tests using lcg-utils
–Accessibility of experiments software directory
–Accessibility of VO management tools
SAM: SE and LFC Functionality Tests
SE functionality tests
–File copy & register from UI using lcg-cr
–File retrieval to the UI using lcg-cp
–File delete using lcg-del
LFC functionality tests
–Directory listing using lfc-ls
–File entry creation
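The SE and LFC test sequence can be sketched as a list of command lines. This is not SAM's own code: the function names, the VO, host names and paths are invented, and the options shown (`--vo`, `-d`, `-l`, `-a`) should be checked against the lcg-utils and LFC man pages installed on the UI. The sketch only builds the commands and executes nothing.

```python
import shlex


def se_test_commands(vo, se_host, lfn, local_src, local_copy):
    """Return the three SE test steps as argv lists (nothing is executed)."""
    return [
        # 1. Copy a local file to the SE and register it in the catalogue.
        ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, f"file:{local_src}"],
        # 2. Retrieve the file back to the UI.
        ["lcg-cp", "--vo", vo, lfn, f"file:{local_copy}"],
        # 3. Delete all replicas and the catalogue entry.
        ["lcg-del", "--vo", vo, "-a", lfn],
    ]


def lfc_listing_command(directory):
    """Return the LFC directory listing step as an argv list."""
    return ["lfc-ls", "-l", directory]


if __name__ == "__main__":
    steps = se_test_commands(
        vo="dteam",
        se_host="se.example.org",
        lfn="lfn:/grid/dteam/sam-test-file",
        local_src="/tmp/sam-test-in",
        local_copy="/tmp/sam-test-out",
    )
    steps.append(lfc_listing_command("/grid/dteam"))
    for argv in steps:
        print(shlex.join(argv))
```

On a UI with a valid proxy, each argv list could be passed to `subprocess.run(argv, check=True)`; a non-zero exit status would then mark the corresponding test as failed.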
SAM: Error Investigation
GSTAT: Overview
Based on the gLite Information System
Uses scripts to generate web-accessible reports
Scripts are executed periodically (every 15 mins) to query and collect information published by each site
The retrieved information is processed by an analysis framework that checks for failures and errors
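The analysis step can be illustrated with a few sanity checks on the values collected from each site. The rules, key names and messages below are hypothetical and chosen only for the example; they are not GSTAT's actual checks.

```python
def analyse_site(name, info):
    """Return a list of problem strings for one site.
    An empty list means no problem was found by these (hypothetical) rules."""
    if info is None:
        return [f"{name}: no answer from the site information system"]
    problems = []
    for key in ("total_cpus", "free_cpus", "running_jobs", "waiting_jobs"):
        value = info.get(key)
        if value is None:
            problems.append(f"{name}: {key} not published")
        elif value < 0:
            problems.append(f"{name}: {key} is negative ({value})")
    total, free = info.get("total_cpus"), info.get("free_cpus")
    if total is not None and free is not None and free > total:
        problems.append(f"{name}: free_cpus ({free}) exceeds total_cpus ({total})")
    return problems


def build_report(collected):
    """Map each site name to its list of problems, sorted by site name."""
    return {name: analyse_site(name, info)
            for name, info in sorted(collected.items())}
```

A periodic job would call `build_report` on the freshly collected data and render the result as a web page.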
GSTAT: General View
GSTAT: Site Details
References
GridICE
–Web site
GSTAT
–Web doc
SAM
–Article: Global Grid Monitoring: the EGEE/WLCG case. High Performance Distributed Computing, Proceedings of the 2007 workshop on Grid monitoring
Overview of Grid Monitoring Tools
–Article: A taxonomy of grid monitoring systems. Future Generation Computer Systems, Volume 21, Issue 1, 1 January 2005, Pages