CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/i t GDB CERN, 4 th March 2008 James Casey WLCG Monitoring – some worked examples.

Slides:



Advertisements
Similar presentations
CERN IT Department CH-1211 Genève 23 Switzerland t Messaging System for the Grid as a core component of the monitoring infrastructure for.
Advertisements

CERN IT Department CH-1211 Genève 23 Switzerland t Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES News on monitoring for CMS distributed computing operations Andrea.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
CERN IT Department CH-1211 Geneva 23 Switzerland t The Experiment Dashboard ISGC th April 2008 Pablo Saiz, Julia Andreeva, Benjamin.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services GS group meeting Monitoring and Dashboards section Activity.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Simply monitor a grid site with Nagios J.
Monitoring the Grid at local, national, and Global levels Pete Gronbech GridPP Project Manager ACAT - Brunel Sept 2011.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
CERN IT Department CH-1211 Geneva 23 Switzerland t Open projects in Grid Monitoring IT-GS-MDS Section Meeting 25 th January 2008.
CERN IT Department CH-1211 Genève 23 Switzerland t MSG status update Messaging System for the Grid First experiences
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Overlook of Messaging.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
James Casey, CERN, IT-GT-TOM 1 st ROC LA Workshop, 6 th October 2010 Grid Infrastructure Monitoring.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Nagios for Grid Services E. Imamagic, SRCE.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team James Casey EGEE’08.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Multi-level monitoring - an overview James.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Service Availability Monitoring – Status.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Site Monitoring with Nagios E. Imamagic,
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
CERN IT Department CH-1211 Geneva 23 Switzerland t GDB CERN, 4 th March 2008 James Casey A Strategy for WLCG Monitoring.
Julia Andreeva, CERN IT-ES GDB Every experiment does evaluation of the site status and experiment activities at the site As a rule the state.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Wojciech Lapka SAM Team CERN EGEE’09 Conference,
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN.
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
CERN IT Department CH-1211 Geneva 23 Switzerland t CCRC’08 Tools for measuring our progress CCRC’08 F2F 5 th February 2008 James Casey, IT-GS-MND.
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation in EGEE-III What does.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MSG - A messaging system for efficient and.
CERN IT Department CH-1211 Genève 23 Switzerland t MSG Status update Daniel Rodrigues.
XROOTD AND FEDERATED STORAGE MONITORING CURRENT STATUS AND ISSUES A.Petrosyan, D.Oleynik, J.Andreeva Creating federated data stores for the LHC CC-IN2P3,
SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.
Visualization Ideas for Management Dashboards
Testing and integrating the WLCG/EGEE middleware in the LHC computing Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia.
CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.
ATP Future Directions Availability of historical information for grid resources: It is necessary to store the history of grid resources as these resources.
Julia Andreeva on behalf of the MND section MND review.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Andrea Sciabà Hammercloud and Nagios Dan Van Der Ster Nicolò Magini.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Monitoring Tools E. Imamagic, SRCE CE.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
GridView - A Monitoring & Visualization tool for LCG Rajesh Kalmady, Phool Chand, Kislay Bhatt, D. D. Sonvane, Kumar Vaibhav B.A.R.C. BARC-CERN/LCG Meeting.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Regional Nagios Emir Imamagic /SRCE EGEE’09,
New solutions for large scale functional tests in the WLCG infrastructure with SAM/Nagios: The experiments experience ES IT Department CERN J. Andreeva.
INFSO-RI Enabling Grids for E-sciencE Operations Parallel Session Summary Markus Schulz CERN IT/GD Joint OSG and EGEE Operations.
CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,
CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
1 Models for Monitoring James Casey, CERN WLCG Service Reliability Workshop 27th November, 2007.
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Author etc Alarm framework requirements Andrea Sciabà Tony Wildish.
CERN IT Department CH-1211 Genève 23 Switzerland t Towards end-to-end debugging for data transfers Gavin McCance Javier Conejero Banon Sophie.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
CERN IT Department CH-1211 Geneva 23 Switzerland t Michel Jouvin (GRIF/LAL) on behalf of James Casey (CERN) (All materials from J. Casey)
CERN IT Department CH-1211 Geneva 23 Switzerland t LHCOPN Meeting Madrid, 11 th March 2008 James Casey WLCG Monitoring – An overview.
Daniele Bonacorsi Andrea Sciabà
NGI and Site Nagios Monitoring
POW MND section.
Evolution of SAM in an enhanced model for monitoring the WLCG grid
A Messaging Infrastructure for WLCG
Monitoring of the infrastructure from the VO perspective
Presentation transcript:

CERN IT Department CH-1211 Geneva 23 Switzerland t GDB CERN, 4 th March 2008 James Casey WLCG Monitoring – some worked examples

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Overview What are the actual commodity solutions we can use? Messaging System –How does it perform ? How does the strategy affect existing systems? What new systems do/might we need ? WLCG Monitoring – some worked examples - 2

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Messaging Systems Flexible architecture: –Deliver messages, either in point to point (queue)… –… or multicast mode (topics) –Support Synchronous or Asynchronous communication. Reliable delivery of messages: –Provide reliability to the senders if required –Configurable persistency / Master-Slave. Highly Scalable: –Network of Brokers WLCG Monitoring – some worked examples - 3

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services ActiveMQ Mature open-source implementation of these ideas –Top-level Apache project –Commercial support available from IONA Easy to integrate –Multiple language + transport protocol support Good performance characteristics –See later … Work done to integrate into our environment –RPMs, Quattor components + templates, LEMON alarms WLCG Monitoring – some worked examples - 4

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services ActiveMQ WLCG Monitoring – some worked examples - 5

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Evaluated Parameters: 1) Number of Producers 2) Number of Consumers 3) Message Size 4) Message Number Measurement of timestamps: 1)Message Sent 2)Message on Broker 3)Message Received Results analysis: 1)Logs containing all information for each message 2)From logs, extract messages/second… 3)… and messageLag Evaluated Parameters: 1) Number of Producers 2) Number of Consumers 3) Message Size 4) Message Number Measurement of timestamps: 1)Message Sent 2)Message on Broker 3)Message Received Results analysis: 1)Logs containing all information for each message 2)From logs, extract messages/second… 3)… and messageLag Testing of ActiveMQ performance

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Test Summary Results summary: –Running for 6 weeks with no crashes –50 Million messages of various sizes (0 to 10 kB) forwarded to consumers –12 Million incoming messages from producers –Up to 40 Producers and 80 Consumers connected at the same time –Stable under highly irregular test pattern: Number of clients change Frequent client process kills Daily number of tests vary

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Results : Throughput > Consumers > Throughput ?? Consumer Bottleneck! With a larger number of producers, even more messages per second saturating the consumer.

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Results : Flow Control Effective flow control Flow Control balances load on producers and consumers

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services 3 Consumers Memory Overflow ( Slow consumers ) Test Run batch change Max 180 s! Results : Message Lag

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Failover … Currently testing failover strategies for high- availability –JDBC Based –Shared Filesystem –Pure Master-Slave Evaluation of performance characteristics and recovery operations needed This is an essential feature to have high- reliability service –First tests look promising WLCG Monitoring – some worked examples - 11

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Usages of Messaging We use it as an ‘integration bus’ –Use when systems want to share information E.g VO transfer systems publishing data rates to WLCG It’s another string to our bow When the application model fits well, then use it –E.g. Async communications, broadcast messages Don’t force applications to use it –Have other solutions too E.g “RESTful” web services a.la SAM Programmatic Interface WLCG Monitoring – some worked examples - 12

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services ‘Standard’ Integration Patterns The same patterns are repeated in many of the following examples: –Gather results at many points –Collect the raw results and store in a database –Perform some operation on the raw results Summarisation, availability calculation, … –Publish the summarised results to many clients E.g. site monitoring, dashboards, … –Store historical data in a database and visualize via web client We provide ‘standard’ components to make this plug’n’play for many workflows WLCG Monitoring – some worked examples - 13

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Standard Pattern Example – Gridftp log analysis WLCG Monitoring – some worked examples - 14 Application Database archiver component Transparent Broker Network Messaging System Adaptor Standard process Standard components

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Examples Existing Systems –Service availability monitoring –gStat –Site Monitoring –Usage reporting New systems –Gridview report publication –MoU reporting –ServiceMap – VO metric integration All fit into the standard pattern mentioned before –Now for details… WLCG Monitoring – some worked examples - 15

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services SAM – Current Architecture WLCG Monitoring – some worked examples - 16

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Migrating to the regions WLCG Monitoring – some worked examples - 17

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Messaging based archiving and reporting WLCG Monitoring – some worked examples - 18

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services In Production - OSG RSV to SAM RSV – Resource and Service Validation –Uses Gratia as native transport –And OSG GOC provide a bridge to SAM WLCG Monitoring – some worked examples - 19

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Workplan Port ‘useful’ SAM tests into WLCG probe format –And supplement where needed Move with 1 or 2 nagios-expert ROCs to have them run the SAM tests for their region –Setting up a standard execution environment at ROC will be key Slowly migrate ROCs over the course of EGEE-III Provide VO specific nagios at CERN for experiment specific tests (where needed) WLCG Monitoring – some worked examples - 20

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Site Monitoring WLCG working group focus on site monitoring –Problem reporting close to the source SAM becoming regional doesn’t change this –Hybrid monitoring strategy Site tests often only relevant for site admin –Not for availability/reliability calculation WLCG Monitoring – some worked examples - 21 Source# checks / service Type Central1-2Network monitoring, Service ‘Ping’ Regional5-10User-oriented actions (e.g existing SAM tests) Site local10-30Detailed functional tests

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Nagios EGEE CEE work on Nagios-based testing for grid services –Basis for future SAM execution framework –Uses WLCG probe format Compatible with tests from OSG RSV –Publishes into ActiveMQ for subset of tests Can be used to augment availability New version (ncg v2) is ready –Meet needs of sites to integrate into existing nagios installs No need for gLite UI on Nagios server Partitioned config file generated – include what you need WLCG Monitoring – some worked examples - 22

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services New Feature :Multiple VO reporting All SAM tests for all VOs supported at your site WLCG Monitoring – some worked examples - 23

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Workplan Documentation to be finished Then deployment is key –Do we need YAIM for Nagios to help small sites? Add features as requested from site admins –E.g. Nagios acknowledgements become entries in a related GGUS ticket WLCG Monitoring – some worked examples - 24

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Usage Reporting Other main part of monitoring – Usage statistics –Gridftp transfers, FTS transfers, job records, … Used to calculate throughput and reliability Currently handled in GridView, Dashboards –Use messaging system to unite these efforts Delegate parsing/routing of specific information back to experts –L&B, FTS, … WLCG Monitoring – some worked examples - 25

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Architecture WLCG Monitoring – some worked examples - 26

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Workplan Dashboard team working on getting non-RB job information into L&B direct from Condor submission Gridview converting gridftp publishers to messaging system Message formats defined for records –First records have been sent Will add new message types, such as SRM transfers WLCG Monitoring – some worked examples - 27

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services MB Reporting Currently a post-processing of results and graphs in Excel –Trying to implement it directly on the GridView DB Using a mature open-source reporting toolkit – JasperReports –UI Report builder – iReports –Web-based report server - OpenReports WLCG Monitoring – some worked examples - 28

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services OpenReports WLCG Monitoring – some worked examples - 29

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services MoU compliance reporting We agreed to try and measure MoU metrics during CCRC’08 –To evaluate if we can actually do it ! 30 ServiceMaximum delay in responding to operational problemsAverage availability measured on an annual basis Service interruptionDegradation of the capacity of the service by more than 50% Degradation of the capacity of the service by more than 20% During accelerator operation At all other times Acceptance of data from the Tier-0 Centre 12 hours 24 hours99%n/a Networking service to the Tier-0 Centre during accelerator operation 12 hours24 hours48 hours98%n/a Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres 24 hours48 hours 98% All other services – prime service hours 2 hour 4 hours98% All other services – other times 24 hours48 hours 97%

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Response time reporting workflow 31 Site Acknowledge New Problem! Site Fixed VO Confirmed Problem Solved ! :30 – Site Acknowledged – working on it ! :30 – New Problem. VO: Atlas, MoU Area: Distribution of data toTier-1 centres, Site: CERN-PROD - SRM not working :49 – Site Fixed – We’ve found the problem in the endpoint, restarted :43 – VO Confirmed – All working again, thanks ! Problem Report: Issue ID #42 : :30 : MoU Area: CERN-PROD/ Distribution of data to Tier-1 Centres Time to First Response : 1:00 Time to Problem resolved : 1:29 Time to VO confirmation : 2:23

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Measuring MoU availability 32 Experiment Framework/ Dashboard View Operational Testing (SAM/SLS) View “Human” ` View - the control

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Mapping to MoU Services 33 Tier-1 Grid Service ArcCE BDII CE FTS LFC MYPX OSGCE RB RGMA SE SRM SRMv2 VOBOX gCE gRB sBDII MoU Category Acceptance of data from Tier-0 * Networking Services to Tier-0 * Data-intensive analysis service, including networking to Tier-0 All Other Services Map grid services status (from SAM) to MoU categories –These are “custom” service availability calculations Use the CMS SAM portal framework as basis for implementing this ? –And send results direct to Tier-1 Nagios

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services MoU Portal 34

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services ServiceMap What’s a ServiceMap? –It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure What’s the CCRC’08 ServiceMap? –Service ‘readiness’ –Service availability For VO critical services –Experiment Metrics A single place to see both the VO and the infrastructure view of the grid 35

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services CCRC’08 ServiceMap …Demo… 36

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services VO Metric monitoring Use messaging as interface between VO and the ServiceMap VO publishes statistics from dashboard /framework from their perspectice –Job reliability –Data transfer rates –Links back to their application for more information ServiceMap becomes launchpad for getting to all operational monitoring data –For the infratructures and the VOs WLCG Monitoring – some worked examples - 37

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Julia Andreeva, CERN, F2F meeting 38 What are the VO functional blocks ? Functional blocks for LHC experiments are similar to a large extent –Allows for a site to compare the service they provide for different experiments –e.g - functional blocks for ATLAS and CMS for CCRC08 Data archiving at T0 Data processing at T0 Data transfer from T0 CAF Data archiving at Tier1 Processing at Tier1 Data transfer T1- T1 Data transfer T1- T2 MC production at T2 Analysis at T2 Data transfer T2- T1 MC production at T2 Analysis at T2 Data transfer T2- T1 Data archiving at T0 Data processing at T0 Data transfer from T0 Processing at Tier1 Data transfer T1- T1 Data transfer T1- T2 CMS ATLAS T0 T1 T2

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Julia Andreeva, CERN, F2F meeting 39 Monitoring of the Experiments workflows Phedex CMS Dashboard ATLAS Dashboard Monalisa repository for ALICE DIRAC UI Based on GridMap view application Metrics and targets are defined and measured by the experiments according to their workflows Repository for common aggregated metrics GRID publisher

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Julia Andreeva, CERN, F2F meeting 40 Tier 0 Data archiving At T0 Data processing At T0 Transfer T0-T1CAF Tier 1 Tier 2 Data Archiving at T1 Processing At T1 Data transfer T1-T2 Data transfer T1-T1 MC production at T2 Analysis at T2 Data transfer T2-T1 Example of “Gridmap” visualisation for CCRC08 workflow for CMS (random choice of colour) CCRC08 target for number of slots for data processing Number of jobs running in parallel for data processing XXX Max, min, avg of number of parallel jobs over last 24 hours Data processing success rate X% Three top errors Subset of the same info for every Tier1 Moving mouse over T1_DE_FZK T1_FR_IN2P3 T1_IT_CNAF T1_TW_ ASGC T1_UK_ RAL T1_ ES_ PIC T1_US_FNAL New window On click List of useful URLs to the detailed data Moving mouse over

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Summary Have a standard model for many of the monitoring data/workflows –And components that can help us Now it’s an integration exercise –Augment existing components to talk together A lot of work to do But it will benefit us in the long term –Better sharing of knowledge/resources –Reduce operational effort –Easier to acquire personnel who know the commodity technologies WLCG Monitoring – some worked examples - 41

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Links ActiveMQ Nagios JasperReports OSG RSV CCRC’08 ServiceMap WLCG Monitoring – some worked examples - 42

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Extra Slides WLCG Monitoring – some worked examples - 43

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services gStat – monitoring the BDIIs gStat focuses on monitoring at Site BDII level Need to consider: –Validation of information close to source (GRIS) –Monitoring of performance of top-level BDII New work to analyse top level BDII performance (Laurence Field/ Felix Ehm) –Use Nagios as execution framework –WLCG probe-format compatible probes (in progress) –Custom presentation layer WLCG Monitoring – some worked examples - 44 Good example of using the commodity software approach to speed up development and deployment !

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Top BDII Summary WLCG Monitoring – some worked examples - 45 How many hosts behind this load-balanced alias ? How many CE clusters reference this top-level BDII? How many services in this BDII? Response time ?

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Drill to Nagios for details WLCG Monitoring – some worked examples - 46 Drill-down to historical view

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services WLCG Monitoring – some worked examples - 47