Service Availability Monitoring Piotr Nyczyk, CERN/GD GDB Meeting CERN, 9th November 2005
Outline Aim of service availability monitoring Existing monitoring solution: Site Functional Tests GStat Integration using R-GMA Present use in grid operation infrastructure CIC Dashboard CIC-on-duty operations Future plans and responsibilities
Purpose Service availability monitoring: Two main goals: Monitor grid services (without going into the site fabric level) Verify SLAs and LCG MoUs It must be site independent and monitor all sites in a uniform way Monitor both quantity and quality To be used by grid operation teams: CIC on duty team Framework plus some metrics already in place: SFT and GStat
Existing monitoring solution
Site Functional Tests (SFT) Submits a short-lived job to all sites (Computing Elements) to test various aspects of functionality (using WMS – RB): Job submission - CE availability: Can I submit a job and retrieve the output? Basic environment tests: software version, BrokerInfo, CSH scripts Security tests: CA certificates, CRLs, ... (work in progress) Data management tests: basic operations using lcg-utils on the default Storage Element (SE) and a chosen “central” SE (usually at CERN - 3rd party replication tests) VO environment tests: tag management, software installation directory + VO specific job submission and tests (maintained by VOs, example: LHCb and dirac-test) What is not covered? Only CEs and batch farms are really tested - other services are tested only indirectly (a failure in one of them usually shows up as failures at all sites at once): SEs, RB, top-level BDII, RLS/LFC, R-GMA registry/schema Maintained by the Operations Team at CERN (+ several external developers) and run as a cron job on LXPLUS (using the AFS based LCG UI) Jobs submitted at least every 3 hours (+ on demand resubmissions)
SFT - report Shows results matrix with all sites Selection of “critical” tests for each VO to define which sites are good/bad Detailed test log available for troubleshooting and debugging Deployed on two machines at CERN (load distribution, fault tolerance)
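The per-VO selection of critical tests can be sketched as follows. This is a minimal illustration, not the SFT implementation: the VO names, the test names and the rule that a missing result counts as a failure are all assumptions made for the example.

```python
from typing import Dict, Set

# Hypothetical VO names and test names, for illustration only.
CRITICAL_TESTS: Dict[str, Set[str]] = {
    "vo_a": {"job_submission", "ca_certs"},
    "vo_b": {"job_submission", "replica_management"},
}


def site_is_good(results: Dict[str, bool], vo: str) -> bool:
    """Return True only if every critical test selected by the VO passed.

    Assumption: a critical test with no recorded result counts as failed.
    """
    return all(results.get(test, False) for test in CRITICAL_TESTS[vo])


# One site, one row of the results matrix (test name -> passed?).
site_results = {
    "job_submission": True,
    "ca_certs": True,
    "replica_management": False,
}

print(site_is_good(site_results, "vo_a"))  # True
print(site_is_good(site_results, "vo_b"))  # False
```

The same site can therefore be good for one VO and bad for another, depending only on which tests each VO marks as critical.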
GIIS Monitor (GStat) Monitoring tool for the Information System: Periodically queries all Site BDIIs (but doesn’t monitor Top-level BDIIs) Checks if Site BDIIs are available Checks integrity of published information Checks for missing entities, attributes Detects and reports information about some of the services (RB, MyProxy, LFC) but doesn’t monitor them Detects duplicated services in some cases (e.g. 2 global LFC servers for a single VO) Cron job maintained and run by Min-Hong Tsai in Taipei - updates every 5 minutes
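Two of the checks listed above, missing attributes and duplicated services, can be sketched as below. This is not GStat code: the entry layout, the required attribute names and the host names are hypothetical, chosen only to show the shape of the checks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Hypothetical required attributes; the real schema is not given in the slides.
REQUIRED_ATTRIBUTES: Set[str] = {"id", "type", "endpoint"}


def missing_attributes(entry: Dict[str, str]) -> Set[str]:
    """Return the required attributes that are absent or empty in an entry."""
    return {name for name in REQUIRED_ATTRIBUTES if not entry.get(name)}


def duplicated_services(
    entries: Iterable[Dict[str, str]], service_type: str
) -> Dict[str, List[str]]:
    """Map each VO to its endpoints of one service type, if it has several."""
    endpoints_by_vo: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        if entry.get("type") == service_type:
            vo = entry.get("vo", "unknown")
            endpoints_by_vo[vo].append(entry.get("endpoint", ""))
    return {vo: eps for vo, eps in endpoints_by_vo.items() if len(eps) > 1}


# Hypothetical published entries.
entries = [
    {"id": "lfc-1", "type": "LFC", "vo": "vo_a", "endpoint": "lfc1.example.org"},
    {"id": "lfc-2", "type": "LFC", "vo": "vo_a", "endpoint": "lfc2.example.org"},
    {"id": "rb-1", "type": "RB", "vo": "vo_a"},  # endpoint not published
]

print(missing_attributes(entries[2]))       # {'endpoint'}
print(duplicated_services(entries, "LFC"))  # vo_a has two LFC endpoints
```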
Integration using R-GMA R-GMA is now mature enough to be used as the “universal BUS” for monitoring information SFT and GStat are now publishing results to R-GMA GOC DB is used to get the list of sites and nodes to monitor, which are published to R-GMA together with scheduled downtime information We managed to solve the most obvious scalability problems: ~170 sites About 3.5M tuples for 1 month of history with full detail After one month only summary information is kept
Prototype sites availability metric Using our current data schema and R-GMA we managed to integrate monitoring information from SFT and GStat Summary generator uses the list of critical tests to generate a summary per site - a binary value (good/bad) generated every 1h Metric generator integrates the summaries over a time period (1 day…) to generate the availability metric
Prototype sites availability metric Shows only availability of computing resources
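The integration step of the metric generator can be sketched as a simple average of the hourly binary summaries. This is a minimal sketch under stated assumptions: the function name is hypothetical, every hour is weighted equally, and scheduled downtime is not treated specially.

```python
from typing import Sequence


def availability(hourly_summaries: Sequence[bool]) -> float:
    """Fraction of hourly summaries in the period where the site was good.

    Assumption: each hourly summary has equal weight and scheduled
    downtime is not excluded from the period.
    """
    if not hourly_summaries:
        raise ValueError("no summaries in the selected period")
    return sum(hourly_summaries) / len(hourly_summaries)


# One day of hourly summaries: good for 18 hours, bad for 6 hours.
one_day = [True] * 18 + [False] * 6
print(availability(one_day))  # 0.75
```

A longer period is handled the same way by passing more hourly values, for example 168 values for one week.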
CIC Dashboard Main tool for CIC-on-duty Makes the CIC-on-duty job much easier Integrated view of monitoring tools (summary) - shows only failures and assigned tickets Detailed site view with table of open tickets and links to monitoring results Single tool for ticket creation and notification emails with detailed problem categorisation and templates Ticket browser with highlighting of expired tickets Well maintained - adapts quickly to new requirements/suggestions (thanks to developers!)
CIC Dashboard (screenshot labels) Problem categories Sites list (reporting new problems) Test summary (SFT, GStat) GGUS ticket status
CIC-on-duty operations CIC-on-duty: currently 6 teams (CERN, IN2P3, RAL, INFN, Russia, Taipei) working in weekly shifts The operators look at emerging alarms (CIC Dashboard) and the monitoring tools (for details) and report problems Problems are submitted as tickets to GGUS (Remedy based) and both ROC and sites (Resource Centers – RC) are notified ROC is responsible for timely problem solution - otherwise the ticket is escalated Priorities and deadlines for tickets are set depending on site size (number of CPUs) Everything here is described in detail in the Operations Manual - the primary document for CIC-on-duty
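The rule that ticket priority and deadline depend on site size, with escalation when the deadline passes, can be sketched as below. The CPU thresholds, priority labels and deadlines are invented for the example; the actual values are defined in the Operations Manual and are not given in these slides.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Hypothetical rules: (minimum number of CPUs, priority label, days allowed).
# Ordered from the largest sites to the smallest.
PRIORITY_RULES = [
    (500, "high", 1),
    (100, "medium", 3),
    (0, "low", 5),
]


def ticket_policy(cpus: int) -> Tuple[str, timedelta]:
    """Return (priority, time allowed) for a site with the given CPU count."""
    for min_cpus, priority, days in PRIORITY_RULES:
        if cpus >= min_cpus:
            return priority, timedelta(days=days)
    raise ValueError("CPU count must not be negative")


def needs_escalation(
    opened: datetime, allowed: timedelta, now: datetime, solved: bool
) -> bool:
    """A ticket is escalated if it is still unsolved after its deadline."""
    return not solved and now > opened + allowed


opened = datetime(2005, 11, 1, 9, 0)
priority, allowed = ticket_policy(600)
print(priority, allowed)
print(needs_escalation(opened, allowed, datetime(2005, 11, 3, 9, 0), solved=False))
```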
CIC-on-duty operations But! Currently only failures of computing resources (CEs) are dealt with automatically. Other services are partially covered as they are indirectly tested by the current tools - all sites failing the same test Top down approach: CIC-on-duty just spots and reports high-level problems and doesn’t solve them Problems are then dispatched to lower-level instances (ROC, RC) for further analysis and resolution CIC-on-duty provides expertise for ROCs (and sites) if necessary
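The indirect detection mentioned above, where all sites fail the same test, can be sketched as a simple check over the results matrix. This is an illustration only: the function name, the test names and the threshold parameter are assumptions, and a real operator would still have to work out which central service is at fault.

```python
from typing import Dict, List


def globally_failing_tests(
    results: Dict[str, Dict[str, bool]], threshold: float = 1.0
) -> List[str]:
    """Return tests whose failure fraction across sites reaches the threshold.

    results maps site name -> test name -> passed?. With threshold=1.0
    only tests that failed at every site that ran them are returned.
    """
    ran: Dict[str, int] = {}
    failed: Dict[str, int] = {}
    for site_results in results.values():
        for test, passed in site_results.items():
            ran[test] = ran.get(test, 0) + 1
            if not passed:
                failed[test] = failed.get(test, 0) + 1
    return sorted(t for t in ran if failed.get(t, 0) / ran[t] >= threshold)


# Hypothetical results: replica management fails everywhere, which hints
# at a central service problem rather than three separate site problems.
matrix = {
    "site_a": {"job_submission": True, "replica_management": False},
    "site_b": {"job_submission": False, "replica_management": False},
    "site_c": {"job_submission": True, "replica_management": False},
}

print(globally_failing_tests(matrix))  # ['replica_management']
```

Lowering the threshold below 1.0 tolerates a few sites that happened to pass or did not report, at the cost of more false alarms.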
Roadmap to service monitoring What do we have? Framework and tests for computing resources (SFT) and information system (GStat) Initial data schema that can be used for integration (R-GMA) Basic display and metric report for computing resources - good example and starting point for other services What do we need? Extend the SFT framework to meet the requirements of service sensors Sensors measuring availability and performance for all grid services Integrated display Alarm system Integrate all the pieces into existing operation tools
Responsibilities
Framework: being discussed between CERN and Lyon
Coordination of sensors: Piotr (CERN)

Service | Responsible | Class | Comments
SRM 2.1 | Dave Kant | C | monitoring of Storage Elements t.b.d.
LFC | James Casey | C/H | -
FTS | FTS support | - | -
CE | Piotr Nyczyk | - | monitored by SFT today
RB | - | - | job monitor exists (few modifications needed)
Top-level BDII | Min-Hong Tsai | - | can be integrated with GStat
Site BDII | - | H | monitored by GStat today
MyProxy | Maarten Litmaath | - | -
VOMS | Valerio Venturi | - | -
R-GMA | Laurence Field | - | -

Detailed plan with dates being worked out and agreed with the people responsible