Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM Giuseppe Misurelli INFN-CNAF giuseppe.misurelli cnaf.infn.it.

Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM
Giuseppe Misurelli, INFN-CNAF
giuseppe.misurelli cnaf.infn.it
1st INFN Training Course for Grid Site Administrators (I Corso di formazione INFN per amministratori di siti Grid)
Martina Franca, 5-9 November 2007

Disclaimer This presentation is based on materials provided and authorized by the EGEE project and is freely available to download and use according to the terms of the following license:

Outline
– Monitoring goals
– Monitoring Procedure
– Fabric Monitoring
– INFNGrid Monitoring tools

Grid Monitoring
Grid monitoring has to provide knowledge of the type, state and features of the resources constituting the Grid by means of:
– Grid Resources Inventory
– Grid Resources Behavior
– Grid Resources Availability

Grid Resources Inventory
Instantaneous picture of the resources constituting the Grid, giving an idea of how Grid resources are shared among sites:
– Number of Computing Elements (CE), Worker Nodes (WN) and Storage Elements (SE)
– Number of jobs running and waiting across the whole Grid, per VO

Grid Resources Behavior
Measuring a set of evolving data to investigate historical/statistical aspects of a Grid:
– Percentage of jobs aborted at a site for a particular Virtual Organization (VO) in a given period of time
– Duration of a fault for a particular service or Grid process
– Percentage of CPU/RAM usage during Grid activity
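
The first behavior metric above can be sketched in a few lines of Python. This is only an illustration: the job record layout (`site`, `vo`, `status`, `ended`) and the site names `SITE-A` and `SITE-B` are assumptions made for the example and are not taken from any monitoring tool.

```python
from datetime import datetime


def aborted_percentage(jobs, site, vo, start, end):
    """Percentage of jobs aborted at `site` for `vo` that ended in [start, end).

    Returns None when no job matches, so "no data" is not confused with 0%.
    """
    selected = [
        job for job in jobs
        if job["site"] == site and job["vo"] == vo and start <= job["ended"] < end
    ]
    if not selected:
        return None
    aborted = sum(1 for job in selected if job["status"] == "Aborted")
    return 100.0 * aborted / len(selected)


# Hypothetical job records (layout assumed for this sketch).
jobs = [
    {"site": "SITE-A", "vo": "cms", "status": "Done", "ended": datetime(2007, 11, 5, 10)},
    {"site": "SITE-A", "vo": "cms", "status": "Aborted", "ended": datetime(2007, 11, 5, 12)},
    {"site": "SITE-A", "vo": "cms", "status": "Done", "ended": datetime(2007, 11, 6, 9)},
    {"site": "SITE-A", "vo": "cms", "status": "Done", "ended": datetime(2007, 11, 7, 9)},
    {"site": "SITE-A", "vo": "atlas", "status": "Aborted", "ended": datetime(2007, 11, 5, 11)},
    {"site": "SITE-B", "vo": "cms", "status": "Aborted", "ended": datetime(2007, 11, 5, 11)},
]
week_start, week_end = datetime(2007, 11, 5), datetime(2007, 11, 10)
print(aborted_percentage(jobs, "SITE-A", "cms", week_start, week_end))
```

Filtering by site, VO and time window before counting keeps the three dimensions named on the slide explicit.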

Grid Resources Availability
Evaluating the accessibility of the main Grid services at Regional, Site and VO level in order to improve Grid usage:
– Grid services currently down (e.g. CE, WN, SE)
– Grid site components currently not working properly (e.g. authentication and authorization, job submission, data management)
– Current job load at a given site
– Current minimum/maximum number of free slots where jobs can be submitted

Day by Day Operations /1
– INFNGrid must be monitored daily by both the ROC team and the Site Managers to test its functionality
  – Service Level Agreement according to the Memorandum of Understanding
    – Each site must provide a production-level Grid service

Day by Day Operations /2
The monitoring procedure is based on:
– Problem detection and diagnosis
  – Use of monitoring tools
  – Low-level checks on site
– Problem tracking (see next talk on Support Systems)
  – Use of a helpdesk ticketing system

Grid Site Monitoring: General Requirements
– Scale efficiently as the number of monitored nodes increases
– Use lightweight sensors
  – Avoid overloading the monitored computers
– Publish reliable data
  – A hard task in a Grid environment
– Send notifications on daemon/machine problems
– Take action in case of problems with services
– Allow metrics to be added easily
  – New parameters of interest must be addable without too much work
– Be “Grid aware”
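
Two of the requirements above, easy addition of metrics and notification on problems, can be illustrated with a small registry of sensors. This is a minimal sketch and not the design of any tool discussed here: the `Monitor` class, the metric names and the thresholds are all hypothetical.

```python
class Monitor:
    """Registry of lightweight sensors with a simple threshold alarm."""

    def __init__(self, notify):
        self._sensors = {}      # metric name -> (sensor callable, maximum allowed value)
        self._notify = notify   # callable that receives one alert message

    def add_metric(self, name, sensor, max_value):
        # Adding a metric only requires a callable that returns a number.
        self._sensors[name] = (sensor, max_value)

    def poll(self):
        """Read every sensor once and notify about values above their threshold."""
        readings = {}
        for name, (sensor, max_value) in self._sensors.items():
            value = sensor()
            readings[name] = value
            if value > max_value:
                self._notify(f"{name}={value} exceeds {max_value}")
        return readings


# Hypothetical sensors returning fixed values, for illustration only.
alerts = []
monitor = Monitor(notify=alerts.append)
monitor.add_metric("load1min", lambda: 0.7, max_value=4.0)
monitor.add_metric("disk_used_pct", lambda: 97.0, max_value=90.0)
readings = monitor.poll()
print(readings, alerts)
```

Passing the notifier as a callable keeps the sketch independent of the delivery channel (e-mail, pager or anything else).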

Monitoring Cluster Systems
– Use of systems that spot outages and notify system administrators via e-mail, pager or other alarms
– Top systems used in Grid sites
  – Ganglia
  – Lemon
  – Monit

Ganglia
PRO:
– Open source project developed at the University of California, Berkeley
– Adopted by many sites
– Easy to install and manage
– Useful charts
  – Spikes are easy to detect, since the update interval can be configured
– Easy to add new metrics
CONS:
– No alarms or reactions on failures
– Problems scaling to hundreds or thousands of nodes with high-frequency sampling
– Not aware of gLite grid services
– Data can be stored only in an RRD “DB”
  – No detailed historical data are available

Lemon /1
PRO:
– Open source project developed by CERN
– Its goal is to provide a monitoring system that scales to thousands of nodes without problems
– Detailed history can be kept by using an Oracle DB as RDBMS
– Many advanced parameters can be monitored using standard sensors
Less PRO:
– LEMON can also be installed without a DB back-end
  – With less functionality
– It has alarms and reactions on failure
  – The complete set of functions is available only with a DB back-end installation
– Configuration is already available for some grid services
  – It must be customized according to the site

Lemon /2
CONS:
– Not so easy to install and manage
– Not so simple to add metrics or checks
– A more “friendly” DB back-end is not available yet
– No hourly graph, which can make spikes hard to detect

Monit
PRO:
– Public open source project
– Good base of standard checks for well-known services
– Lightweight; easy to install, configure and manage
– Simple built-in HTTP server to check the status of each machine
CONS:
– Not really a “monitoring system” but an “alert system”
– A single web page with the status of all monitored machines is not available yet
– No charts available yet

Monitoring Grid Systems
The INFNGrid project adopts three main Grid monitoring tools to check that its Grid resources and services work as expected:
– GridICE
– GSTAT
– SAM

GridICE: Overview
Based on the gLite Information System
– Daily discovery of new GRISes
– Periodic queries to the discovered GRISes (every min)
  – CE, SE, Site BDII: standard Glue info published
  – Extended GRIS (EX GRIS): host info (e.g. daemon monitoring), job monitoring, computing info gathered from the site Local Resource Management System
– Information collected in a central RDBMS and published on the Web
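
To give an idea of what the collected Glue information looks like once a query returns, the sketch below parses a small LDIF-style answer and sums the job counters. It is an illustration under stated assumptions, not GridICE code: the sample text and host names are invented, the attribute names (`GlueCEStateRunningJobs`, `GlueCEStateWaitingJobs`, `GlueCEStateFreeCPUs`) are quoted from memory of the GLUE 1.x schema and should be checked against it, and the parser ignores LDIF features such as line continuation and base64 values.

```python
def parse_ldif(text):
    """Parse simple LDIF text into a list of {attribute: [values]} entries.

    Entries are separated by blank lines; lines starting with '#' are comments.
    """
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current:
                entries.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue
        # Split on the first colon only: values may contain ':' (host:port).
        key, _, value = line.partition(":")
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        entries.append(current)
    return entries


# Invented sample answer; host names are placeholders.
SAMPLE = """\
dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-long,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce01.example.org:2119/jobmanager-lcgpbs-long
GlueCEStateRunningJobs: 12
GlueCEStateWaitingJobs: 3
GlueCEStateFreeCPUs: 20

dn: GlueCEUniqueID=ce02.example.org:2119/jobmanager-lcgpbs-short,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce02.example.org:2119/jobmanager-lcgpbs-short
GlueCEStateRunningJobs: 5
GlueCEStateWaitingJobs: 0
GlueCEStateFreeCPUs: 7
"""

entries = parse_ldif(SAMPLE)
running = sum(int(e["GlueCEStateRunningJobs"][0]) for e in entries)
waiting = sum(int(e["GlueCEStateWaitingJobs"][0]) for e in entries)
print(len(entries), running, waiting)
```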

GridICE: Geo View

GridICE: Site View Standard Parameters /1
– Downtime status (from GOC DB)
– Country information (from the GridICE detection mechanism)
– Administrative information (from GOC DB)

GridICE: Site View Extended Parameters
– Site job load, as a measure of how busy the site is: ((CPU# - CPUFree) / CPU#) * 100
– Power estimation, calculated by adding the power value (SpecInt) of each CPU of the site
– WN and CPU number
– CPULoad, computed by considering the load1min reported by the LRMS for all the WNs
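
The first two extended parameters are plain arithmetic and can be written out as a short Python sketch. The function names and the sample numbers are assumptions made for the example. CPULoad is left out because the slide does not say how the per-WN load1min values are combined.

```python
def site_job_load(cpu_total, cpu_free):
    """Site job load in percent: ((CPU# - CPUFree) / CPU#) * 100."""
    if cpu_total <= 0:
        raise ValueError("cpu_total must be positive")
    return (cpu_total - cpu_free) / cpu_total * 100


def power_estimation(specint_per_cpu):
    """Site power estimate: the sum of the SpecInt value of each CPU."""
    return sum(specint_per_cpu)


# Hypothetical site with 200 CPUs, 50 of them free.
print(site_job_load(200, 50))
# Hypothetical SpecInt ratings for three CPUs.
print(power_estimation([1000, 1000, 1500]))
```

A site with no free CPUs gives a load of 100, and a completely idle site gives 0.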

GridICE: Site View Standard Parameters /2
– Number of available gatekeepers (CE)
– Number of queues configured on the CE
– Running and waiting jobs

GridICE: Site View Standard Storage Parameters
– Available space, total space and percentage used on the Storage Element of the site

GridICE: Site View Monitored Hosts Number of monitored hosts per site

GridICE: Host View
General Use Case 2: Grid operator – Site administrator
– Detecting Resource Brokers with problems

GridICE: Host View Details

GridICE: GRIS View
General Use Case 3: Grid operator – Site administrator
– Detecting the status of the GRISes

GridICE: GRIS View Detail

Job View
Job section to track the activity of VO users in order to:
– Search among a huge number of jobs
– Inspect job resource consumption
– Aggregate job info based on VOMS attributes (next release)
  – Info selected according to the consumer ID (group/role)

Chart View: Site manager viewpoint

SAM: CE Functionality Tests
You can customize your personal SAM interface with the desired tests, chosen from a list of possibilities:
– Job submission
– CA certificate version installed on WN
– Middleware version installed on WN
– Host certificate validity
– Replica management tests using lcg-utils
– Accessibility of the experiment software directory
– Accessibility of VO management tools

SAM: SE and LFC Functionality Tests
SE functionality tests
– File copy & register from UI using lcg-cr
– File retrieval to the UI using lcg-cp
– File delete using lcg-del
LFC functionality tests
– Directory listing using lfc-ls
– File entry creation
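
The SE tests listed above form a natural sequence (copy and register, retrieve, delete), and a test driver for such a sequence can be sketched as follows. This is not SAM code. The steps are stand-in Python functions that operate on an in-memory dictionary; on a real UI each step would wrap the corresponding command (`lcg-cr`, `lcg-cp`, `lcg-del`), whose options are deliberately not shown. Stopping at the first failure is a choice made for this sketch.

```python
def run_sequence(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    A step passes when its callable returns a truthy value without raising.
    """
    results = []
    for name, step in steps:
        try:
            ok = bool(step())
        except Exception:
            ok = False
        results.append((name, "ok" if ok else "error"))
        if not ok:
            break
    return results


# Stand-in storage and steps; names and the LFN are hypothetical.
store = {}
LFN = "lfn:/grid/demo/sam-test-file"


def copy_and_register():
    store[LFN] = b"payload"
    return True


def retrieve():
    return store.get(LFN) == b"payload"


def delete():
    return store.pop(LFN, None) is not None


results = run_sequence([
    ("lcg-cr", copy_and_register),
    ("lcg-cp", retrieve),
    ("lcg-del", delete),
])
print(results)

# If retrieval fails, the later delete step is never attempted.
failed = run_sequence([
    ("lcg-cr", copy_and_register),
    ("lcg-cp", lambda: False),
    ("lcg-del", delete),
])
print(failed)
```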

SAM: Error Investigation

GSTAT: Overview
– Based on the gLite Information System
– Uses scripts to generate web-accessible reports
– Scripts are executed periodically (every 15 mins) to query and collect the information published by each site
– The retrieved information is processed by an analysis framework that checks for failures and errors
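
The last step, an analysis pass over the information each site publishes, might look like the sketch below. The checks shown (missing values, more free CPUs than total CPUs, a negative job count) are examples chosen for illustration and are not GSTAT's actual rule set; the field names and site names are likewise assumptions.

```python
REQUIRED = ("total_cpus", "free_cpus", "waiting_jobs")


def check_site(info):
    """Return a list of problems found in the values published by one site."""
    problems = [f"missing {key}" for key in REQUIRED if key not in info]
    if problems:
        return problems
    if info["free_cpus"] > info["total_cpus"]:
        problems.append("free_cpus exceeds total_cpus")
    if info["waiting_jobs"] < 0:
        problems.append("negative waiting_jobs")
    return problems


def report(sites):
    """Map each site name to its list of problems, or ["ok"] if none was found."""
    return {name: (check_site(info) or ["ok"]) for name, info in sites.items()}


# Hypothetical published values for three sites.
sites = {
    "SITE-A": {"total_cpus": 100, "free_cpus": 40, "waiting_jobs": 2},
    "SITE-B": {"total_cpus": 10, "free_cpus": 25, "waiting_jobs": 0},
    "SITE-C": {"total_cpus": 50},
}
summary = report(sites)
print(summary)
```

Running such a function every 15 minutes and rendering `summary` as a web page would mirror the query, analyze and report cycle described on the slide.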

GSTAT: General View

GSTAT: Site Details

References
GridICE
– Web site
GSTAT
– Web doc
SAM
– Article: Global Grid Monitoring: the EGEE/WLCG case
  – High Performance Distributed Computing. Proceedings of the 2007 workshop on Grid monitoring
Overview of Grid Monitoring Tools
– Article: A taxonomy of grid monitoring systems
  – Future Generation Computer Systems, Volume 21, Issue 1, 1 January 2005, Pages