Service Availability Monitoring

Service Availability Monitoring
Piotr Nyczyk, CERN/GD
GDB Meeting, CERN, 9th November 2005

Outline
- Aim of service availability monitoring
- Existing monitoring solutions: Site Functional Tests (SFT), GStat
- Integration using R-GMA
- Present use in the grid operations infrastructure: CIC Dashboard, CIC-on-duty operations
- Future plans and responsibilities

Purpose
Service availability monitoring has two main goals:
- Monitor grid services (without going into the site fabric level)
- Verify SLAs and LCG MoUs
It must be site independent and monitor all sites in a uniform way, covering both quantity and quality.
To be used by grid operations teams (the CIC-on-duty team).
Framework plus some metrics already in place: SFT and GStat.

Existing monitoring solutions

Site Functional Tests (SFT)
Submits a short-lived job to all sites (Computing Elements) to test various aspects of functionality (using the WMS – RB):
- Job submission - CE availability: can I submit a job and retrieve the output?
- Basic environment tests: software version, BrokerInfo, CSH scripts
- Security tests: CA certificates, CRLs, ... (work in progress)
- Data management tests: basic operations using lcg-utils on the default Storage Element (SE) and a chosen “central” SE (usually at CERN - 3rd-party replication tests)
- VO environment tests: tag management, software installation directory + VO-specific job submission and tests (maintained by the VOs; example: LHCb and dirac-test)
What is not covered? Only CEs and batch farms are really tested - other services are tested only indirectly (usually by causing failures at all sites at once): SEs, RB, top-level BDII, RLS/LFC, R-GMA registry/schema.
Maintained by the Operations Team at CERN (+ several external developers) and run as a cron job on LXPLUS (using an AFS-based LCG UI).
Jobs are submitted at least every 3 hours (+ on-demand resubmissions).
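
The slide does not show the submission mechanics; as a rough illustration, a cron-driven submission loop could be sketched as below. This is a minimal sketch, not the actual SFT code: it assumes an LCG UI where the LCG-2 edg-job-submit command accepts the usual -r (target CE) and -o (job-id file) options, and the JDL file, CE list file and helper names are illustrative.

```python
#!/usr/bin/env python
# Minimal sketch (not the real SFT): submit one short test job per Computing Element.
# Assumes `edg-job-submit` from an LCG UI is on the PATH; sft.jdl and ce_list.txt
# are illustrative file names.
import subprocess

def submit_test_job(ce, jdl="sft.jdl", jobid_file="jobids.txt"):
    """Submit the test job to a single CE through the RB; return True on success."""
    cmd = ["edg-job-submit", "-r", ce, "-o", jobid_file, jdl]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

def main():
    with open("ce_list.txt") as f:          # one CE endpoint per line
        ces = [line.strip() for line in f if line.strip()]
    for ce in ces:
        ok = submit_test_job(ce)
        print(f"{ce}: {'submitted' if ok else 'submission FAILED'}")

if __name__ == "__main__":
    main()
```

Retrieving the job output and evaluating the individual tests would follow the same pattern with the corresponding job-status and output-retrieval commands.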

SFT - report
- Shows a results matrix with all sites
- A selection of “critical” tests for each VO defines which sites are good/bad
- Detailed test logs available for troubleshooting and debugging
- Deployed on two machines at CERN (load distribution, fault tolerance)

GIIS Monitor (GStat)
Monitoring tool for the Information System:
- Periodically queries all Site BDIIs (but doesn’t monitor Top-level BDIIs)
- Checks if Site BDIIs are available
- Checks the integrity of published information
- Checks for missing entities and attributes
- Detects and reports information about some of the services (RB, MyProxy, LFC) but doesn’t monitor them
- Detects duplicated services in some cases (e.g. two global LFC servers for a single VO)
Cron job maintained and run by Min-Hong Tsai in Taipei - updates every 5 minutes.
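
A Site BDII is an LDAP server (conventionally on port 2170), so the basic availability check GStat performs can be illustrated with a small LDAP query. This is a sketch under assumptions: it uses the ldap3 Python library, and the host name and Glue base DN are illustrative.

```python
# Sketch of a Site BDII availability check (assumes the ldap3 library;
# the host and site name below are illustrative).
from ldap3 import Server, Connection

def site_bdii_available(host, site_name, timeout=10):
    """Return True if the Site BDII answers an LDAP query and publishes CE entries."""
    server = Server(host, port=2170, connect_timeout=timeout)
    try:
        conn = Connection(server, auto_bind=True)        # anonymous bind
        conn.search(search_base=f"mds-vo-name={site_name},o=grid",
                    search_filter="(objectClass=GlueCE)",
                    attributes=["GlueCEUniqueID"])
        return len(conn.entries) > 0                     # at least one CE published
    except Exception:
        return False                                     # unreachable or malformed reply

if __name__ == "__main__":
    print(site_bdii_available("bdii.example-site.org", "EXAMPLE-SITE"))
```

The integrity checks (missing entities or attributes, duplicated services) would then inspect the returned entries rather than only counting them.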

Integration using R-GMA
- R-GMA is now mature enough to be used as the “universal bus” for monitoring information
- SFT and GStat now publish their results to R-GMA
- The GOC DB is used to get the list of sites and nodes to monitor, which is published to R-GMA together with scheduled downtime information
- We managed to solve the most obvious scalability problems:
  - ~170 sites
  - about 3.5M tuples for 1 month of history at full detail
  - after one month, only summary information is kept
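
R-GMA producers publish tuples with SQL-style INSERT statements. Purely as an illustration of what a published test-result tuple might look like (the table layout, column names and the publish() helper below are hypothetical, not the actual SFT schema or the real R-GMA API):

```python
# Hypothetical illustration only: the table and column names are invented for this
# sketch, and publish() stands in for the real R-GMA primary-producer call.
from datetime import datetime, timezone

def publish(insert_statement):
    """Placeholder for the R-GMA producer insert call."""
    print(insert_statement)   # a real producer would send this to the R-GMA server

def publish_sft_result(site, ce, test_name, status):
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    publish(
        "INSERT INTO SFTResults (SiteName, CE, TestName, Status, Timestamp) "
        f"VALUES ('{site}', '{ce}', '{test_name}', '{status}', '{ts}')"
    )

publish_sft_result("EXAMPLE-SITE", "ce01.example-site.org", "JobSubmission", "ok")
```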

Prototype site availability metric
- Using our current data schema and R-GMA, we managed to integrate the monitoring information from SFT and GStat
- The summary generator uses the list of critical tests to generate a summary per site: a binary value (good/bad) produced every hour
- The metric generator integrates the summaries over a time period (1 day, …) to produce the availability metric
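
The two steps can be illustrated with a short sketch (function names and the in-memory data layout are made up for the example; the real generators read from and write back to R-GMA):

```python
# Illustrative sketch of the summary and metric generators described above.
def hourly_summary(test_results, critical_tests):
    """Binary good/bad summary for one site and one hour:
    'good' only if every critical test passed."""
    return all(test_results.get(t) == "ok" for t in critical_tests)

def availability(hourly_summaries):
    """Availability over a period = fraction of hours in which the site was 'good'."""
    if not hourly_summaries:
        return 0.0
    return sum(1 for good in hourly_summaries if good) / len(hourly_summaries)

# Example: a site that passed its critical tests in 20 of 24 hourly samples
critical = ["JobSubmission", "ReplicaManagement", "CAcerts"]
good_hour = {"JobSubmission": "ok", "ReplicaManagement": "ok", "CAcerts": "ok"}
samples = [hourly_summary(good_hour, critical)] * 20 + [False] * 4
print(f"daily availability: {availability(samples):.2f}")   # -> 0.83
```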

Prototype site availability metric
Shows only the availability of computing resources.

CIC Dashboard
- Main tool for the CIC-on-duty; makes the CIC-on-duty job much easier
- Integrated view of the monitoring tools (summary) - shows only failures and assigned tickets
- Detailed site view with a table of open tickets and links to monitoring results
- Single tool for ticket creation and notification emails, with detailed problem categorisation and templates
- Ticket browser that highlights expired tickets
- Well maintained - adapts quickly to new requirements/suggestions (thanks to the developers!)

CIC Dashboard (screenshot)
Callouts on the screenshot: problem categories, sites list (reporting new problems), test summary (SFT, GStat), GGUS ticket status.

CIC-on-duty operations
- CIC-on-duty: currently 6 teams (CERN, IN2P3, RAL, INFN, Russia, Taipei) working in weekly shifts
- The operators look at emerging alarms (CIC Dashboard) and at the monitoring tools (for details), and report problems
- Problems are submitted as tickets to GGUS (Remedy based); both the ROC and the sites (Resource Centres – RC) are notified
- The ROC is responsible for timely problem resolution - otherwise the ticket is escalated
- Priorities and deadlines for tickets are set depending on site size (number of CPUs)
- Everything here is described in detail in the Operations Manual - the primary document for the CIC-on-duty
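
The slide does not give the actual thresholds; purely as an illustration of a size-dependent escalation policy, such a rule could be sketched as follows (the CPU thresholds and deadlines are invented for the example; the real values are defined in the Operations Manual):

```python
# Illustrative only: the thresholds and deadlines below are invented for this sketch.
def ticket_deadline_days(site_cpu_count):
    """Larger sites get shorter deadlines before a ticket is escalated."""
    if site_cpu_count >= 1000:
        return 1        # large site
    if site_cpu_count >= 100:
        return 3        # medium site
    return 5            # small site

print(ticket_deadline_days(1200))   # -> 1
```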

CIC-on-duty operations
- But! Currently only failures of computing resources (CEs) are dealt with automatically. Other services are only partially covered, as they are tested indirectly by the current tools (all sites failing the same test)
- Top-down approach: the CIC-on-duty just spots and reports high-level problems and doesn’t solve them
- Problems are then dispatched to lower-level instances (ROC, RC) for further analysis and resolution
- The CIC-on-duty provides expertise for the ROCs (and sites) if necessary

Roadmap to service monitoring
What do we have?
- Framework and tests for computing resources (SFT) and the information system (GStat)
- Initial data schema that can be used for integration (R-GMA)
- Basic display and metric report for computing resources - a good example and starting point for other services
What do we need?
- Extend the SFT framework to meet the requirements of service sensors
- Sensors measuring availability and performance for all grid services
- Integrated display
- Alarm system
- Integrate all the pieces into the existing operations tools

Responsibilities
- Framework: being discussed between CERN and Lyon
- Coordination of sensors: Piotr (CERN)

Service | Responsible | Class | Comments
SRM 2.1 | Dave Kant | C | monitoring of Storage Elements t.b.d.
LFC | James Casey | C/H |
FTS | FTS support | |
CE | Piotr Nyczyk | | monitored by SFT today
RB | | | job monitor exists (few modifications needed)
Top-level BDII | Min-Hong Tsai | | can be integrated with GStat
Site BDII | | H | monitored by GStat today
MyProxy | Maarten Litmaath | |
VOMS | Valerio Venturi | |
R-GMA | Laurence Field | |

A detailed plan with dates is being worked out and agreed with the responsible parties.