CERN IT Department, CH-1211 Geneva 23, Switzerland
WLCG Monitoring – An overview
James Casey
LHCOPN Meeting, Madrid, 11th March 2008

The WLCG Monitoring Vision
Show stakeholders the state of the global WLCG infrastructure, and its historical evolution, in order to improve the availability and reliability of this infrastructure

What is monitoring for us?
Service Availability/Reliability
  – Service Status provided
  – Availability + Reliability calculated
Usage records
  – Gridftp, SRM, Grid Job execution
  – One record per task, or one record per state change
Accounting information
  – Daily rollups of Usage
Right now, distributed debugging/service management not in scope
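
The "daily rollups of usage" idea can be sketched as a small aggregation step. This is only an illustration: the record fields (`time`, `site`, `service`, `bytes`) and the function name `daily_rollup` are assumptions, not the schema used by WLCG accounting.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_rollup(records):
    """Sum bytes and count usage records per (UTC day, site, service)."""
    totals = defaultdict(lambda: {"records": 0, "bytes": 0})
    for rec in records:
        # Normalise to UTC so that a "day" means the same thing at every site.
        day = rec["time"].astimezone(timezone.utc).date().isoformat()
        key = (day, rec["site"], rec["service"])
        totals[key]["records"] += 1
        totals[key]["bytes"] += rec["bytes"]
    return dict(totals)


# Hypothetical usage records, one per completed task.
usage = [
    {"time": datetime(2008, 3, 11, 9, 0, tzinfo=timezone.utc),
     "site": "SITE_A", "service": "gridftp", "bytes": 1000},
    {"time": datetime(2008, 3, 11, 17, 30, tzinfo=timezone.utc),
     "site": "SITE_A", "service": "gridftp", "bytes": 2500},
    {"time": datetime(2008, 3, 12, 8, 15, tzinfo=timezone.utc),
     "site": "SITE_A", "service": "gridftp", "bytes": 400},
]
rollup = daily_rollup(usage)
```

With one record per state change instead of one per task, the same rollup would need to count only terminal states to avoid double counting.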

Double vision
Sites and the experiments
  – Both have a view of this data
  – Very complicated stack to trace through
Try to connect the two perspectives
  – Important for site managers and project management
  – Especially sites which support more than one experiment

Complexity!

Strategy to simplify
Operations
  – Delegate to regional entities
  – Provide some tools to have a “global” view
Tools
  – Delegate to experts
  – Standardize on information interchange schemas + protocols
Reporting
  – Lightweight metric collection
      Of metrics that are useful for site managers or project management
  – Reporting on top of this
      For project management

Infrastructure and tools - Message Bus
WLCG Monitoring Working Group
  – Aim to consolidate the current monitoring effort
Single message bus for data interchange
  – With reliable message delivery
  – Message persistence
Isolate producers and consumers from each other
  – Define the message schemas and protocols
Provide bridges/adaptors as needed
  – NMWG?
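
How a bus isolates producers from consumers can be shown with a toy, in-process sketch. It is not the WLCG message bus and says nothing about the real broker; the class `MessageBus`, the topic name and the message fields are all hypothetical. Persistence is imitated by an in-memory log that late subscribers can replay.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process bus: producers and consumers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks
        self._log = defaultdict(list)          # topic -> stored messages

    def publish(self, topic, message):
        # Store first, so a consumer that subscribes later can still catch up.
        self._log[topic].append(message)
        for callback in list(self._subscribers[topic]):
            callback(message)

    def subscribe(self, topic, callback, replay=True):
        # Optionally deliver everything already stored for this topic.
        if replay:
            for message in self._log[topic]:
                callback(message)
        self._subscribers[topic].append(callback)


bus = MessageBus()
first = {"service": "SRM", "site": "SITE_A", "status": "OK"}
second = {"service": "SRM", "site": "SITE_A", "status": "DOWN"}

bus.publish("service.status", first)           # no consumer exists yet
received = []
bus.subscribe("service.status", received.append)
bus.publish("service.status", second)
```

The producer never learns who consumes its messages, which is the point of agreeing only on schemas and protocols.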

Broker at the centre

Leverage the underlying infrastructures
WLCG is a virtual infrastructure built on top of other physical infrastructures
Added value?
  – From interoperation and exchange of information between the systems
  – Provide information not available from any one system alone
Don’t add too many layers
  – Enough exist already!
  – E.g. our MoUs should be defined in relation to the SLAs/MoUs of the infrastructures

LHCOPN and monitoring
Availability/Reliability
  – Provide E2E link status
  – Create bridge from LHCOPN monitoring to WLCG monitoring
Usage records
  – The individual flow level is too detailed
  – Summary statistics should be ok
      Aggregate rates as seen E2E
No need to expose internal complexity
  – Always ask “How could a site admin use this?”
Reporting
  – Operational statistics
  – MoU reporting
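
One way to provide E2E link status without exposing internal complexity is to collapse per-segment states into a single value. The sketch below assumes a "worst segment wins" rule and three states; both are assumptions for illustration, not the LHCOPN definition.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical link states, ordered from best to worst."""
    UP = 0
    DEGRADED = 1
    DOWN = 2


def e2e_status(segment_statuses):
    """Return the worst status along the path.

    An empty path is reported as DOWN, since nothing is known to work.
    """
    if not segment_statuses:
        return Status.DOWN
    return max(segment_statuses)


# Hypothetical path made of three segments between two sites.
path = [Status.UP, Status.DEGRADED, Status.UP]
link = e2e_status(path)
```

A site admin only sees the single `link` value; the per-segment detail stays inside the network monitoring.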

LHCOPN and Operations
What are the requirements of LHCOPN?
  – Notification of grid ‘users’ of
      Service interruptions
      Status of problem investigations
  – Mechanism for grid users to raise problems against LHC OPN
GGUS is too complicated for the problem
  – ‘300 supporters’, TPM in the loop
Perhaps a simpler solution works for notifications
  – “Dashboard” (from Dan’s presentation)
  – Good experiences in CCRC’08

MoU compliance reporting
We agreed to try and measure MoU metrics during CCRC’08
  – To evaluate if we can actually do it!

Column groups:
  – Maximum delay in responding to operational problems: Service interruption; Degradation of the capacity of the service by more than 50%; Degradation of the capacity of the service by more than 20%
  – Average availability measured on an annual basis: During accelerator operation; At all other times

Service | Maximum delay (values as listed) | Average availability (values as listed)
Acceptance of data from the Tier-0 Centre | 12 hours, 24 hours | 99%, n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours, 24 hours, 48 hours | 98%, n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres | 24 hours, 48 hours | 98%
All other services, prime service hours | 2 hours, 4 hours | 98%
All other services, other times | 24 hours, 48 hours | 97%
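
Checking measured availability against the MoU figures could look like the sketch below. The targets are the first availability value listed for each service on the slide; the function name, the shortened category labels and the measured numbers are hypothetical.

```python
# Availability targets taken from the slide (first value listed per service).
MOU_AVAILABILITY_TARGETS = {
    "Acceptance of data from the Tier-0 Centre": 0.99,
    "Networking service to the Tier-0 Centre": 0.98,
    "Data-intensive analysis services": 0.98,
    "All other services (prime service hours)": 0.98,
    "All other services (other times)": 0.97,
}


def mou_compliance(measured):
    """Compare measured availability (fractions in 0..1) with the targets."""
    report = {}
    for category, target in MOU_AVAILABILITY_TARGETS.items():
        value = measured.get(category)
        if value is None:
            report[category] = "no data"
        elif value >= target:
            report[category] = "met"
        else:
            report[category] = "missed"
    return report


# Hypothetical measurements, for illustration only.
measured = {
    "Acceptance of data from the Tier-0 Centre": 0.995,
    "Networking service to the Tier-0 Centre": 0.97,
}
report = mou_compliance(measured)
```

Reporting "no data" separately from "missed" matters here, because part of the exercise is to find out whether the metrics can be measured at all.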

Mapping to MoU Services
Current availability is per-service
Map grid services status (from SAM) to MoU categories
  – These are “custom” service availability calculations
LHC OPN
  – Can provide “Networking services to/from T1”

[Table: Tier-1 grid services against MoU categories; the per-cell marks are not recoverable]
Tier-1 Grid Service: ArcCE, BDII, CE, FTS, LFC, MYPX, OSGCE, RB, RGMA, SE, SRM, SRMv2, VOBOX, gCE, gRB, sBDII
MoU Category: Acceptance of data from Tier-0 (*); Networking Services to Tier-0 (*); Data-intensive analysis service, including networking to Tier-0; All Other Services
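
A "custom" availability calculation per MoU category could be sketched as follows. The service-to-category mapping below is purely illustrative and is not the mapping from the slide's table; the rule that a category is up only when all of its services pass is also an assumption.

```python
# Illustrative mapping only: NOT the actual assignment from the slide's table.
ILLUSTRATIVE_CATEGORY_SERVICES = {
    "Acceptance of data from Tier-0": ["SRMv2", "FTS"],
    "All Other Services": ["CE", "sBDII", "LFC"],
}


def category_up(service_ok, category_services=ILLUSTRATIVE_CATEGORY_SERVICES):
    """A category counts as up only if every mapped service passed its test.

    A service with no test result is treated as not OK.
    """
    return {
        category: all(service_ok.get(name, False) for name in services)
        for category, services in category_services.items()
    }


def category_availability(samples, category_services=ILLUSTRATIVE_CATEGORY_SERVICES):
    """Fraction of test rounds in which each category was up."""
    if not samples:
        raise ValueError("need at least one sample")
    up_counts = {category: 0 for category in category_services}
    for sample in samples:
        for category, is_up in category_up(sample, category_services).items():
            if is_up:
                up_counts[category] += 1
    return {category: n / len(samples) for category, n in up_counts.items()}


# Hypothetical per-service results for four test rounds.
all_ok = {"SRMv2": True, "FTS": True, "CE": True, "sBDII": True, "LFC": True}
samples = [
    all_ok,
    {**all_ok, "SRMv2": False},
    {**all_ok, "CE": False},
    {"SRMv2": True, "FTS": True, "CE": True, "sBDII": True},  # LFC result missing
]
availability = category_availability(samples)
```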

ServiceMap
What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure
What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
      For VO critical services
  – VO Functional blocks
A single place to see both the VO and the infrastructure view of the grid
  – For all stakeholders

CCRC’08 ServiceMap
…Demo…

What are the VO functional blocks?
(Julia Andreeva, CERN, F2F meeting)
Functional blocks for LHC experiments are similar to a large extent
  – Allows for a site to compare the service they provide for different experiments
  – E.g. functional blocks for ATLAS and CMS for CCRC08

[Figure: functional blocks per tier (T0, T1, T2) for CMS and ATLAS]
CMS
  – T0: Data archiving at T0, Data processing at T0, Data transfer from T0, CAF
  – T1: Data archiving at Tier1, Processing at Tier1, Data transfer T1-T1, Data transfer T1-T2
  – T2: MC production at T2, Analysis at T2, Data transfer T2-T1
ATLAS
  – T0: Data archiving at T0, Data processing at T0, Data transfer from T0
  – T1: Processing at Tier1, Data transfer T1-T1, Data transfer T1-T2
  – T2: MC production at T2, Analysis at T2, Data transfer T2-T1

Example of “Gridmap” visualisation for CCRC08 workflow for CMS (random choice of colour)
(Julia Andreeva, CERN, F2F meeting)

[Figure: gridmap with one area per tier]
Tier 0: Data archiving at T0, Data processing at T0, Transfer T0-T1, CAF
Tier 1: Data archiving at T1, Processing at T1, Data transfer T1-T2, Data transfer T1-T1
Tier 2: MC production at T2, Analysis at T2, Data transfer T2-T1

Moving the mouse over a block shows, for data processing:
  – CCRC08 target for number of slots for data processing
  – Number of jobs running in parallel for data processing: XXX
  – Max, min, avg of number of parallel jobs over last 24 hours
  – Data processing success rate: X%
  – Three top errors
Moving the mouse over a Tier1 shows a subset of the same info for every Tier1
  – T1_DE_FZK, T1_FR_IN2P3, T1_IT_CNAF, T1_TW_ASGC, T1_UK_RAL, T1_ES_PIC, T1_US_FNAL
On click: new window with a list of useful URLs to the detailed data
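
The numbers shown in the mouse-over box (max, min and average of parallel jobs, success rate, three top errors) are simple summary statistics. The sketch below assumes hypothetical inputs: a list of parallel-job counts sampled over the last 24 hours, and one entry per finished job that is `None` on success or an error label on failure.

```python
from collections import Counter


def processing_summary(parallel_jobs, job_results):
    """Summarise job activity for one functional block.

    parallel_jobs: sampled counts of jobs running in parallel (non-empty).
    job_results:   one entry per finished job, None for success,
                   otherwise a short error label.
    """
    finished = len(job_results)
    succeeded = sum(1 for result in job_results if result is None)
    errors = Counter(result for result in job_results if result is not None)
    return {
        "max": max(parallel_jobs),
        "min": min(parallel_jobs),
        "avg": sum(parallel_jobs) / len(parallel_jobs),
        "success_rate": succeeded / finished if finished else None,
        "top_errors": [label for label, _count in errors.most_common(3)],
    }


# Hypothetical samples and results, for illustration only.
parallel_jobs = [120, 150, 90, 140]
job_results = [
    None, None, "stage-out failed", None, "timeout",
    "stage-out failed", "timeout", "stage-out failed", "bad input", None,
]
summary = processing_summary(parallel_jobs, job_results)
```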

Summary
WLCG monitoring needs for LHCOPN are modest
Providing service status information would satisfy MoU availability requirements
  – We calculate availability/reliability according to our algorithms
  – Need downtime information too for this
How to satisfy MoU response time?
  – This is still a wider problem for us
Test some simpler notification systems
  – elogger + RSS feed?
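
Why downtime information is needed can be seen from a small calculation. The slides do not spell out the algorithm, so the definitions below are an assumption: availability is uptime over total time, and reliability is uptime over total time minus scheduled downtime.

```python
def availability_reliability(total_hours, up_hours, scheduled_down_hours):
    """Return (availability, reliability) as fractions.

    Assumed definitions, not taken from the slides:
      availability = up / total
      reliability  = up / (total - scheduled downtime)
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    availability = up_hours / total_hours
    expected_hours = total_hours - scheduled_down_hours
    # With no expected uptime at all, reliability is undefined.
    reliability = up_hours / expected_hours if expected_hours > 0 else None
    return availability, reliability


# Hypothetical 30-day month: 684 hours up, 24 hours of scheduled downtime.
availability, reliability = availability_reliability(720, 684, 24)
```

Without the scheduled-downtime figure only availability can be computed, and a planned intervention is indistinguishable from a failure.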