
1 Overview of monitoring tools for Grid Systems
Varenna, 12 May 2008
Antonio Pierro, INFN-BARI (Italy)
Antonio.pierro ba.infn.it

2 Outline
- Overview of EGEE monitoring tools:
  - SAM (Service Availability Monitoring)
  - GridMap
  - GStat (Global Grid Information Monitoring System)
  - GridView
  - GridICE (infrastructure and application monitoring)

3 Why do we need monitoring?
- Resource utilization and performance evaluation
  - Resource observability is needed for optimized Grid utilization
- Management decisions
  - To reduce the time spent waiting for resource availability
  - To always be aware of what is happening
- Debugging purposes
  - To help the operations team locate and troubleshoot problems
  - Grid resources and services are subject to failures

4 Requirements for a Grid Monitoring tool
- Scalable
- Dynamic
- Robust
- Should be integrated with other Grid technologies and middleware (security infrastructure, resource brokers, schedulers, ...)

5 SAM (introduction)
- Service Availability Monitoring framework (SAM):
  - Monitors all grid services and nodes, not only the CE
  - It is used in the validation process of sites and services
  - SAM wiki: http://goc.grid.sinica.edu.tw/gocwiki/SAM
  - SAM portal: https://lcg-sam.cern.ch:8443/sam/sam.py
- Service and site status are recorded (several snapshots per day)
- Daily, weekly and monthly availability is calculated by integration (averaging) over the given period
- Official evaluation of T0, T1 and T2 sites
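
The availability calculation mentioned above (status snapshots averaged over a period) can be sketched as follows. This is only an illustration, not SAM's actual algorithm: the snapshot format, the status names and the rule that only "ok" counts as available are assumptions made for the example.

```python
from datetime import datetime, timedelta

def availability(snapshots, start, end):
    """Fraction of [start, end) during which the last recorded status was 'ok'.

    snapshots: list of (timestamp, status) tuples (hypothetical format).
    Each status is assumed to hold until the next snapshot; time before
    the first snapshot is counted as not available.
    """
    points = sorted(s for s in snapshots if s[0] < end)
    ok_seconds = 0.0
    for i, (ts, status) in enumerate(points):
        seg_start = max(ts, start)
        seg_end = points[i + 1][0] if i + 1 < len(points) else end
        seg_end = min(seg_end, end)
        if status == "ok" and seg_end > seg_start:
            ok_seconds += (seg_end - seg_start).total_seconds()
    total = (end - start).total_seconds()
    return ok_seconds / total if total > 0 else 0.0

day = datetime(2008, 5, 12)
snaps = [
    (day, "ok"),
    (day + timedelta(hours=6), "error"),
    (day + timedelta(hours=12), "ok"),
]
daily = availability(snaps, day, day + timedelta(days=1))
print(f"daily availability: {daily:.2f}")
```

Weekly or monthly figures follow from the same function by widening the [start, end) window.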

6 SAM (performed tests) 1/2
- CE
  - job submission: UI->RB->CE->WN chain
  - version of the CA certificates installed (on WN!) and of the middleware software (on WN!)
  - replica management tests: using lcg-utils, the default SE defined on the WN and a selected “central” SE
  - accessibility of the experiment software directory: environment variable, directory existence
  - accessibility of VO tag management tools
  - other tests: R-GMA client check, APEL accounting records
- SE, SRM
  - storing a file from the UI: using the lcg-cr command with LFC registration
  - getting the file back to the UI: using the lcg-cp command
  - removing the file: using the lcg-del command with LFC de-registration
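
The SE/SRM sequence above (store, copy back, remove) can be sketched as three command lines. The snippet only builds and prints the commands, it does not run them. The options used (`--vo`, `-d`, `-l`, `-a`) are written from memory of lcg-utils and should be checked against the man pages; the SE host name, LFN and local paths are hypothetical.

```python
import shlex

def se_test_commands(vo, se_host, lfn, local_file, copy_back):
    """Build the three storage test commands as argument lists (nothing is executed)."""
    return [
        # Store a local file on the SE and register it in the LFC
        ["lcg-cr", "--vo", vo, "-d", se_host, "-l", f"lfn:{lfn}", f"file:{local_file}"],
        # Copy the registered file back to the UI
        ["lcg-cp", "--vo", vo, f"lfn:{lfn}", f"file:{copy_back}"],
        # Remove all replicas and de-register the file from the LFC
        ["lcg-del", "--vo", vo, "-a", f"lfn:{lfn}"],
    ]

commands = se_test_commands(
    vo="ops",
    se_host="se.example.org",           # hypothetical SE
    lfn="/grid/ops/sam-test-file",      # hypothetical LFN
    local_file="/tmp/sam-test-in.txt",  # hypothetical local paths
    copy_back="/tmp/sam-test-out.txt",
)
for cmd in commands:
    print(shlex.join(cmd))
```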

7 SAM (performed tests) 2/2
- LFC
  - directory listing: using the lfc-ls command on /grid
  - creating a file entry in the /grid/ area
- FTS
  - checking whether FTS is published correctly in the BDII
  - channel listing: using the glite-transfer-channel-list command with the ChannelManagement service
  - transfer test (in development)
- Standalone tests
  - GSTAT, RB
- VO-specific tests as well

8 SAM - CE sensor Tests France Region, VO OPS

9
- OK: normal status
- Error: subject has failed and problem is localized
- subject may fail soon

*** Running R-GMA client test on alifarm57.ct.infn.it ***
Inserting tuple:
ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')
ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')
Failed
Timeout when executing test CE-sft-rgma after 600 seconds!

10 GridMAP
- It publishes the same data as SAM in a different way
- It is a simple, interactive and user-friendly interface for viewing the state of the Grid
- Sites or services of the Grid are represented by rectangles of different size and colour, allowing two dimensions of data to be visualized simultaneously.
- This representation of monitoring data requires much less space than conventional sorted tables or bar charts.
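
The idea of encoding two dimensions at once, one in the rectangle size and one in its colour, can be sketched as below. This is not GridMap's code: the input format, the choice of CPU count for size, and the colour thresholds are assumptions made for the example, and no actual treemap layout is computed.

```python
def gridmap_cells(sites):
    """Map each site to (name, area_fraction, colour).

    sites: dict name -> (cpu_count, availability in [0, 1]); hypothetical input.
    Area encodes the first dimension (size), colour the second (status).
    """
    total_cpus = sum(cpus for cpus, _ in sites.values())
    cells = []
    # Largest sites first, as a treemap layout would usually place them
    for name, (cpus, avail) in sorted(sites.items(), key=lambda kv: -kv[1][0]):
        if avail >= 0.9:
            colour = "green"
        elif avail >= 0.5:
            colour = "orange"
        else:
            colour = "red"
        cells.append((name, cpus / total_cpus, colour))
    return cells

cells = gridmap_cells({
    "SITE-A": (600, 0.98),
    "SITE-B": (300, 0.70),
    "SITE-C": (100, 0.20),
})
for name, area, colour in cells:
    print(f"{name}: {area:.0%} of the map, {colour}")
```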

11 GridMAP
GridMap Prototype: visualizing the state of the grid (SAM test, daily availability)

12 GridView 1/2
- It is a visualization system for viewing monitoring information
- Approach:
  - Collects monitoring information from different sources, e.g. SAM, GridFTP monitor, RB logs
  - The monitoring records are stored in a central Oracle database at CERN
  - Summary data are visualized through a Web interface
- Target: Grid operators, site administrators, VO managers

13 GridView (web page) 2/2
- Statistics of data transfers
- Jobs running
- Service availability

14 GStat 1/2
- GStat is built from Python scripts that generate web-based reports, used by Grid site administrators to troubleshoot Information System issues or to access usage information.
- The GStat scripts are executed periodically to query and collect the information published by each site in the Grid infrastructure.
- The published information is then processed by an extensible analysis framework that checks for IS failures and errors.
- Target:
  - Grid operators
  - Site administrators
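
The "collect, then check" pattern described above can be sketched as a small sanity check over the attributes each site publishes. This is not GStat's code: the rules are illustrative, the sample data is invented, and the two attribute names are assumed to follow the GLUE schema naming.

```python
def check_site(entry):
    """Return a list of problems found in one site's published attributes.

    entry: dict of attribute name -> string value, as might be read from
    the information system. The rules below are illustrative only.
    """
    problems = []
    for attr in ("GlueCEInfoTotalCPUs", "GlueCEStateFreeCPUs"):
        value = entry.get(attr)
        if value is None:
            problems.append(f"missing {attr}")
        elif not value.isdigit():
            problems.append(f"non-numeric {attr}: {value!r}")
    # Consistency check only makes sense when both values are valid numbers
    if not problems:
        if int(entry["GlueCEStateFreeCPUs"]) > int(entry["GlueCEInfoTotalCPUs"]):
            problems.append("more free CPUs than total CPUs")
    return problems

# Invented sample of what three sites might publish
published = {
    "SITE-A": {"GlueCEInfoTotalCPUs": "200", "GlueCEStateFreeCPUs": "50"},
    "SITE-B": {"GlueCEInfoTotalCPUs": "100", "GlueCEStateFreeCPUs": "150"},
    "SITE-C": {"GlueCEInfoTotalCPUs": "80"},
}
report = {site: check_site(entry) for site, entry in published.items()}
for site, problems in report.items():
    print(site, "OK" if not problems else "; ".join(problems))
```

New checks can be added as further rules inside `check_site`, which is one simple way to read the word "extensible" in the slide.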

15 GStat 2/2
- The main page of GStat shows the overall status and usage statistics for each site.
- GStat site detailed report
- GStat site resource status

16 GridICE: Overview
- It is a distributed monitoring tool for Grid systems
  - it is evolving in the context of EU-EGEE and many other EU Grid projects
  - fully integrated with the gLite-3.x middleware
- Self-configurable collection and presentation
  - just give the URL of the root Grid Information Service (GIS)
- Installed servers are monitoring Grid resources in the scope of: EGEE, EGEE-SWE, RDIG, EGEE-SEE, Grid.it, GILDA, CMS, ATLAS, EUMedGrid, EUChinaGrid, EUIndiaGrid, BalticGrid, LIBI, BioinfoGRID, EELA, OMIIBeGrid

17 Recent evolution of GridICE: lightweight sensor + VOMS information
- Attributes measured by the Job Monitoring sensor
- To reduce its intrusiveness in terms of resource consumption:
  - Two daemons running and a probe executed periodically
  - They listen to a set of log files and collect the relevant information
  - Few LRMS commands are needed to retrieve the job status
  - The status of all jobs is stored in a cache (stateful behaviour)
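
The stateful behaviour described above, where log lines are read as they appear and the last known status of each job is kept in a cache, can be sketched as follows. The log line format and the state names are hypothetical; real LRMS logs differ, and this is not the GridICE sensor itself.

```python
import re

# Hypothetical log line format: "<timestamp> job=<id> state=<STATE>"
LINE_RE = re.compile(r"job=(?P<job>\S+)\s+state=(?P<state>\w+)")

class JobCache:
    """Keeps the last known state of every job seen in the logs."""

    def __init__(self):
        self.states = {}

    def feed(self, line):
        """Update the cache from one log line; return True if a state changed."""
        match = LINE_RE.search(line)
        if not match:
            return False  # line is not about a job
        job, state = match.group("job"), match.group("state")
        changed = self.states.get(job) != state
        self.states[job] = state
        return changed

cache = JobCache()
log_lines = [
    "2008-05-12T10:00:00 job=101 state=QUEUED",
    "2008-05-12T10:01:00 job=102 state=QUEUED",
    "2008-05-12T10:02:00 job=101 state=RUNNING",
    "2008-05-12T10:03:00 job=101 state=RUNNING",
    "2008-05-12T10:04:00 unrelated message",
]
changes = [cache.feed(line) for line in log_lines]
print(cache.states)
```

Because the cache already holds every job's status, the LRMS only has to be queried for the few jobs whose state cannot be derived from the logs, which is where the reduction in intrusiveness would come from.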

18 Integration with local monitoring systems (LEMON)
- Grid monitoring integrated with local monitoring
  - The latest server version is very simple to install
  - The client installation can be turned on in the standard LCG middleware installation (no additional operations are needed)
- The LEMON monitoring system and alarm management are integrated in the new version of the GridICE server
  - The local sensor currently used for farm monitoring can be interfaced with GridICE to collect all the available data
  - The back-end is implemented with LEMON
- Local farm monitoring systems that use LEMON can be integrated with GridICE

19 LRMSinfo
The LRMS Info sensor provides aggregated information about the Local Resource Management System

20 We focus on the following categories of users:
- VO manager
  - the actual set of resources accessible to VO members: “How many jobs submitted by my users are running or queued?” (with details of the VOMS groups and/or single users)
- Grid operator
  - all resources under the responsibility of a Grid Operator Center (“How many resources are available?”)
- Site administrator
  - site resources offered to a Grid (“Is there any service down?”)
- Grid users
  - the status of their jobs on a grid

21 How do we identify the user/role?
- Users are identified by the digital certificate installed in their browser
  - a valid CA certificate
  - a server based on the https protocol
- The new sensors are able to retrieve the VOMS information
  - VOMS information: groups and roles of the users submitting the jobs
- The related role (e.g., site manager, VO manager) can be retrieved from the GridICE database.

22 “Standard user” monitoring (1)
A user who has no jobs submitted and no role registered

23 “Standard user” monitoring (2)
An authenticated user sees only his/her own jobs

24 “Standard user” monitoring (3)
- An authenticated user sees only his/her own jobs
- exit status = 0 => successful jobs
- exit status <> 0 => failed jobs
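
The two rules on this slide, that a user sees only his/her own jobs and that exit status 0 means success, can be sketched as a simple filter. The job record layout (keys `id`, `owner_dn`, `exit_status`) and the sample DNs are hypothetical and do not reflect the GridICE database schema.

```python
def own_jobs_summary(jobs, user_dn):
    """Split the jobs owned by user_dn into successful and failed ones.

    jobs: list of dicts with hypothetical keys 'id', 'owner_dn', 'exit_status'.
    exit_status None means the job has not finished yet and is skipped.
    """
    mine = [j for j in jobs if j["owner_dn"] == user_dn]
    done = [j for j in mine if j["exit_status"] is not None]
    succeeded = [j["id"] for j in done if j["exit_status"] == 0]
    failed = [j["id"] for j in done if j["exit_status"] != 0]
    return succeeded, failed

# Invented sample data; the DNs are placeholders
alice = "/C=IT/O=Example/CN=Alice"
bob = "/C=IT/O=Example/CN=Bob"
jobs = [
    {"id": "j1", "owner_dn": alice, "exit_status": 0},
    {"id": "j2", "owner_dn": alice, "exit_status": 1},
    {"id": "j3", "owner_dn": bob, "exit_status": 0},
    {"id": "j4", "owner_dn": alice, "exit_status": None},
]
succeeded, failed = own_jobs_summary(jobs, alice)
print("successful:", succeeded, "failed:", failed)
```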

25 Grid monitoring from the VO Manager perspective

26 Grid monitoring from the Site Manager perspective

27 Acronyms and Abbreviations (1):
ACL - Access Control List
APEL - Accounting Processor for Event Logs
API - Application Programming Interface
BDII - Berkeley Database Information Index
CA - Certificate Authority
CE - Computing Element: a Grid-enabled computing resource
CERN - European Organisation for Nuclear Research
dCache - (disk pool management system)
DN - Distinguished Name (X.500, LDAP)
EGEE - Enabling Grids for E-sciencE
FTS - File Transfer Service (EGEE)
GARR - Gruppo per l'Armonizzazione delle Reti della Ricerca
GGUS - Global Grid User Support
GIIS - Grid Index Information Service (also Grid Information Index Server). MDS index node. Aggregates information
GILDA - Grid INFN Laboratory for Dissemination Activities
GRIS - Grid Resource Information Service. Collects information for MDS.
IN2P3 - Institut National de Physique Nucléaire et de Physique des Particules
INFN - Istituto Nazionale di Fisica Nucleare (in Italy)
ISO - International Organization for Standardization
JDL - Job Description Language
LB - Logging and Bookkeeping service
LEMON - LHC Era Monitoring

28 Acronyms and Abbreviations (2):
LCG - LHC Computing Grid
LDAP - Lightweight Directory Access Protocol
LDIF - LDAP Data Interchange Format
LDN - Logical Dataset Name
LFC - LCG File Catalog
LFN - Logical File Name
LHC - Large Hadron Collider. Under construction. Hosts CMS, ATLAS, and other experiments.
LRMS - Local Resource Management System
MDS - Meta Directory Service, or Monitoring and Discovery Service (Globus)
MPI - Message Passing Interface (Globus)
PhEDEx - Physics Experiment Data Export (CMS)
RFIO - Remote File I/O
R-GMA - Relational Grid Monitoring Architecture (EGEE). A monitoring system similar to MDS
ROC - Regional Operations Centre
RLS - Replica Location Service
SE - Storage Element
SOAP - Simple Object Access Protocol
SRM - Storage Resource Management
VO - Virtual Organization, e.g., an experiment
VOBOX - VO box
VOMRS - Virtual Organization Management Registration Service
VOMS - Virtual Organization Membership Service
X.509 - (ITU-T standard for public-key and attribute certificate frameworks)

29 References
- SAM
  - http://goc.grid.sinica.edu.tw/gocwiki/SAME_Planning
  - https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=
- GridMap
  - http://gridmap.cern.ch/gm/
  - http://cerncourier.com/cws/article/cnl/31986
- GStat
  - http://goc.grid.sinica.edu.tw/gstat/
- GridView
  - Portal: http://gridview.cern.ch/
  - TWiki: https://twiki.cern.ch/twiki/bin/view/LCG/GridView
- GridICE
  - http://gridice.forge.cnaf.infn.it/

30 Conclusions
- There are several monitoring tools available for the Grid system
- Which tool should you use?
  - It depends on your role in the Grid
  - Sometimes you may need to use several tools at the same time to satisfy your needs

31 Thank You

