WLCG Monitoring – An overview
James Casey, CERN IT Department, CH-1211 Geneva 23, Switzerland
LHCOPN Meeting, Madrid, 11th March 2008

Presentation transcript:


The WLCG Monitoring Vision
Show stakeholders the state of the global WLCG infrastructure, and its historical evolution, in order to improve the availability and reliability of this infrastructure.

What is monitoring for us?
Service Availability/Reliability
  – Service status provided
  – Availability + reliability calculated
Usage records
  – GridFTP, SRM, Grid job execution
  – One record per task, or one record per state change
Accounting information
  – Daily rollups of usage
Right now, distributed debugging/service management is not in scope
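The step from usage records to accounting information ("daily rollups of usage") can be sketched as below. This is a minimal illustration, not the WLCG accounting code: the record fields (`site`, `service`, `end_time`, `bytes`) and the site name are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_rollup(records):
    """Count tasks and sum bytes per (UTC day, site, service)."""
    totals = defaultdict(lambda: {"tasks": 0, "bytes": 0})
    for rec in records:
        # Bucket each usage record by the UTC day on which the task ended.
        day = datetime.fromtimestamp(rec["end_time"], tz=timezone.utc).date().isoformat()
        key = (day, rec["site"], rec["service"])
        totals[key]["tasks"] += 1
        totals[key]["bytes"] += rec["bytes"]
    return dict(totals)


# Hypothetical usage records, one per completed task.
MARCH_11 = 1205193600  # 2008-03-11 00:00:00 UTC
usage_records = [
    {"site": "SITE-A", "service": "gridftp", "end_time": MARCH_11 + 3600, "bytes": 1000},
    {"site": "SITE-A", "service": "gridftp", "end_time": MARCH_11 + 7200, "bytes": 2000},
    {"site": "SITE-A", "service": "gridftp", "end_time": MARCH_11 + 86400 + 60, "bytes": 500},
]

rollup = daily_rollup(usage_records)
for key, value in sorted(rollup.items()):
    print(key, value)
```

One record per task keeps the rollup trivial; with one record per state change, the records for a task would first have to be reduced to its final state.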

Double vision
Sites and the experiments
  – Both have a view of this data
  – Very complicated stack to trace through
Try to connect the two perspectives
  – Important for site managers and project management
  – Especially sites which support more than one experiment

Complexity!

Strategy to simplify
Operations
  – Delegate to regional entities
  – Provide some tools to have a “global” view
Tools
  – Delegate to experts
  – Standardize on information interchange schemas + protocols
Reporting
  – Lightweight metric collection
      – Of metrics that are useful for site managers or project management
  – Reporting on top of this
      – For project management

Infrastructure and tools - Message Bus
WLCG Monitoring Working Group
  – Aim to consolidate the current monitoring effort
Single message bus for data interchange
  – With reliable message delivery
  – Message persistence
Isolate producers and consumers from each other
  – Define the message schemas and protocols
Provide bridges/adaptors as needed
  – NMWG?
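The idea of a single bus that isolates producers from consumers, persists messages, and enforces an agreed schema can be illustrated with a small in-memory stand-in. This is only a sketch: it is not a real broker, the class name `MessageBus`, the topic name, and the required fields are all hypothetical, and persistence here is just an in-process list.

```python
from collections import defaultdict

# Hypothetical message schema: the fields every status message must carry.
REQUIRED_FIELDS = {"service", "site", "status", "timestamp"}


class MessageBus:
    """In-memory stand-in for a broker with per-topic persistence."""

    def __init__(self):
        self._log = defaultdict(list)          # persisted messages per topic
        self._subscribers = defaultdict(list)  # consumer callbacks per topic

    def publish(self, topic, message):
        # Reject messages that do not follow the agreed schema.
        missing = REQUIRED_FIELDS - set(message)
        if missing:
            raise ValueError(f"message missing fields: {sorted(missing)}")
        self._log[topic].append(dict(message))
        for callback in self._subscribers[topic]:
            callback(message)

    def subscribe(self, topic, callback, replay=True):
        # A consumer that connects late can still receive persisted messages.
        if replay:
            for message in self._log[topic]:
                callback(message)
        self._subscribers[topic].append(callback)


bus = MessageBus()
received = []

# The producer publishes before any consumer exists.
bus.publish("service.status", {"service": "SRMv2", "site": "SITE-A",
                               "status": "ok", "timestamp": 1205193600})
# The consumer subscribes later and gets the persisted message replayed.
bus.subscribe("service.status", received.append)
bus.publish("service.status", {"service": "FTS", "site": "SITE-A",
                               "status": "down", "timestamp": 1205193660})
print([m["service"] for m in received])
```

Neither side knows about the other; they share only the topic name and the schema, which is the property the slide asks for.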

Broker at the centre...

Leverage the underlying infrastructures
WLCG is a virtual infrastructure built on top of other physical infrastructures
Added value?
  – From interoperation and exchange of information between the systems
  – Provide information not available from any one system alone
Don’t add too many layers
  – Enough exist already!
  – E.g. our MoUs should be defined in relation to the SLA/MoU of the infrastructures

LHCOPN and monitoring
Availability/Reliability
  – Provide E2E link status
  – Create bridge from LHCOPN monitoring to WLCG monitoring
Usage records
  – At the individual flow level is too detailed
  – Summary statistics should be ok
      – Aggregate rates as seen E2E
      – No need to expose internal complexity
  – Always ask “How could a site admin use this?”
Reporting
  – Operational statistics
  – MoU reporting
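Reducing per-flow records to the aggregate end-to-end rates suggested above might look like the following sketch. The flow record layout (`src`, `dst`, `bytes`) and the one-hour reporting interval are assumptions for illustration; the site names are taken from the CCRC08 gridmap slide.

```python
from collections import defaultdict


def summarize_flows(flows, interval_seconds):
    """Aggregate per-flow byte counts into one average rate (Mbit/s) per E2E pair."""
    totals = defaultdict(int)
    for flow in flows:
        # Only the end points are kept; the path inside the network is not exposed.
        totals[(flow["src"], flow["dst"])] += flow["bytes"]
    return {
        pair: total_bytes * 8 / interval_seconds / 1e6
        for pair, total_bytes in totals.items()
    }


# Hypothetical flow records collected over one hour.
flows = [
    {"src": "T0", "dst": "T1_DE_FZK", "bytes": 200_000_000},
    {"src": "T0", "dst": "T1_DE_FZK", "bytes": 250_000_000},
    {"src": "T0", "dst": "T1_UK_RAL", "bytes": 900_000_000},
]

rates = summarize_flows(flows, interval_seconds=3600)
for pair, rate in sorted(rates.items()):
    print(pair, f"{rate:.1f} Mbit/s")
```

A site admin sees one number per link and interval, which is the level of detail the slide argues for.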

LHCOPN and Operations
What are the requirements of LHCOPN?
  – Notification of grid ‘users’ of
      – Service interruptions
      – Status of problem investigations
  – Mechanism for grid users to raise problems against LHC OPN
GGUS is too complicated for the problem
  – ‘300 supporters’, TPM in the loop
Perhaps a simpler solution works for notifications
  – “Dashboard” (from Dan’s presentation)
  – Good experiences in CCRC’08

MoU compliance reporting
We agreed to try and measure MoU metrics during CCRC’08
  – To evaluate if we can actually do it!

MoU service targets
Maximum delay in responding to operational problems, in slide order: service interruption; degradation of the capacity of the service by more than 50%; degradation of the capacity of the service by more than 20%.
Average availability measured on an annual basis, in slide order: during accelerator operation; at all other times.

Service | Maximum delay | Average availability
Acceptance of data from the Tier-0 Centre | 12 hours, 24 hours | 99%, n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours, 24 hours, 48 hours | 98%, n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres | 24 hours, 48 hours | 98%
All other services – prime service hours | 2 hours, 4 hours | 98%
All other services – other times | 24 hours, 48 hours | 97%
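Checking measured availability against the MoU targets is a simple comparison, sketched below. The two targets are the "during accelerator operation" values from the table above; the measured figures and the function name `compliance_report` are hypothetical and only serve the example.

```python
# Availability targets during accelerator operation, from the MoU table.
MOU_AVAILABILITY_TARGETS = {
    "Acceptance of data from the Tier-0 Centre": 0.99,
    "Networking service to the Tier-0 Centre during accelerator operation": 0.98,
}


def compliance_report(measured, targets=MOU_AVAILABILITY_TARGETS):
    """Return True/False per service: does measured availability meet the target?"""
    return {
        service: measured[service] >= target
        for service, target in targets.items()
        if service in measured  # services without a measurement are left out
    }


# Hypothetical measured availabilities for one site over the reporting period.
measured = {
    "Acceptance of data from the Tier-0 Centre": 0.995,
    "Networking service to the Tier-0 Centre during accelerator operation": 0.96,
}

report = compliance_report(measured)
for service, ok in report.items():
    print(f"{'OK  ' if ok else 'FAIL'} {service}")
```

The response-time targets in the same table cannot be checked this way; they need ticket or notification timestamps rather than availability figures.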

Mapping to MoU Services
Tier-1 grid services: ArcCE, BDII, CE, FTS, LFC, MYPX, OSGCE, RB, RGMA, SE, SRM, SRMv2, VOBOX, gCE, gRB, sBDII
MoU categories:
  – Acceptance of data from Tier-0 *
  – Networking Services to Tier-0 *
  – Data-intensive analysis service, including networking to Tier-0
  – All Other Services
* LHC OPN
Current availability is per-service
Map grid services status (from SAM) to MoU categories
  – These are “custom” service availability calculations
LHC OPN
  – Can provide “Networking services to/from T1”
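Mapping per-service status onto MoU categories could be sketched as follows. The slide's service-to-category matrix did not survive in the transcript, so the mapping below is purely hypothetical, as is the rule that a category is "ok" only when every mapped service is "ok".

```python
# HYPOTHETICAL mapping: the real service-to-category matrix is not in the transcript.
MOU_CATEGORY_SERVICES = {
    "Acceptance of data from Tier-0": ["SRMv2", "FTS"],
    "All Other Services": ["CE", "sBDII"],
}


def category_status(service_status, mapping=MOU_CATEGORY_SERVICES):
    """Derive one status per MoU category from per-service status values."""
    result = {}
    for category, services in mapping.items():
        known = [service_status[s] for s in services if s in service_status]
        if not known:
            result[category] = "unknown"   # no test results for any mapped service
        elif all(status == "ok" for status in known):
            result[category] = "ok"
        else:
            result[category] = "down"      # assumed rule: one failing service fails the category
    return result


# Hypothetical per-service test results for one site.
sam_status = {"SRMv2": "ok", "FTS": "down", "CE": "ok", "sBDII": "ok"}
categories = category_status(sam_status)
print(categories)
```

Other combination rules (for example, requiring only one of several equivalent services) would fit the same structure by changing the `all(...)` test.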

ServiceMap
What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure
What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
      – For VO critical services
  – VO functional blocks
A single place to see both the VO and the infrastructure view of the grid
  – For all stakeholders

CCRC’08 ServiceMap
…Demo…

What are the VO functional blocks?
(Julia Andreeva, CERN, F2F meeting)
Functional blocks for LHC experiments are similar to a large extent
  – Allows for a site to compare the service they provide for different experiments
  – e.g. functional blocks for ATLAS and CMS for CCRC08

CMS
  – T0: Data archiving at T0, Data processing at T0, Data transfer from T0, CAF
  – T1: Data archiving at Tier1, Processing at Tier1, Data transfer T1-T1, Data transfer T1-T2
  – T2: MC production at T2, Analysis at T2, Data transfer T2-T1
ATLAS
  – T0: Data archiving at T0, Data processing at T0, Data transfer from T0
  – T1: Processing at Tier1, Data transfer T1-T1, Data transfer T1-T2
  – T2: MC production at T2, Analysis at T2, Data transfer T2-T1

Example of “Gridmap” visualisation for CCRC08 workflow for CMS (random choice of colour)
(Julia Andreeva, CERN, F2F meeting)
Blocks shown
  – Tier 0: Data archiving at T0, Data processing at T0, Transfer T0-T1, CAF
  – Tier 1: Data archiving at T1, Processing at T1, Data transfer T1-T2, Data transfer T1-T1
  – Tier 2: MC production at T2, Analysis at T2, Data transfer T2-T1
Moving mouse over a block
  – CCRC08 target for number of slots for data processing
  – Number of jobs running in parallel for data processing: XXX
  – Max, min, avg of number of parallel jobs over last 24 hours
  – Data processing success rate: X%
  – Three top errors
Moving mouse over Tier 1
  – Subset of the same info for every Tier1: T1_DE_FZK, T1_FR_IN2P3, T1_IT_CNAF, T1_TW_ASGC, T1_UK_RAL, T1_ES_PIC, T1_US_FNAL
On click
  – New window with a list of useful URLs to the detailed data

Summary
WLCG monitoring needs for LHCOPN are modest
Providing service status information would satisfy MoU availability requirements
  – We calculate availability/reliability according to our algorithms
  – Need downtime information too for this
How to satisfy MoU response time?
  – This is still a wider problem for us
Test some simpler notification systems
  – elogger + RSS feed?
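Why downtime information is needed alongside service status can be seen in a small calculation. The talk does not spell out the algorithms, so the definitions below are an assumption: availability is taken as up time over known time, and reliability as up time over known time excluding scheduled downtime. The status labels and the sample day are hypothetical.

```python
def availability_reliability(samples):
    """Compute (availability, reliability) from equal-length status samples.

    Each sample is one of 'up', 'down', 'scheduled' (scheduled downtime)
    or 'unknown' (no monitoring result). Assumed definitions:
      availability = up / (total - unknown)
      reliability  = up / (total - unknown - scheduled)
    """
    total = len(samples)
    up = samples.count("up")
    scheduled = samples.count("scheduled")
    unknown = samples.count("unknown")
    known = total - unknown
    unscheduled = known - scheduled
    availability = up / known if known else None
    reliability = up / unscheduled if unscheduled else None
    return availability, reliability


# Hypothetical day of hourly link status: 20 h up, 2 h scheduled maintenance,
# 1 h unscheduled outage, 1 h with no monitoring data.
day = ["up"] * 20 + ["scheduled"] * 2 + ["down"] + ["unknown"]
availability, reliability = availability_reliability(day)
print(f"availability={availability:.3f} reliability={reliability:.3f}")
```

Without the scheduled-downtime samples the two figures could not be told apart, which is why status alone is not enough for the reliability number.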