WLCG Monitoring Roadmap Julia Andreeva, CERN 25.04.2008, WLCG workshop, CERN.

Slides:

Advertisements

Similar presentations

CERN IT Department CH-1211 Genève 23 Switzerland t Messaging System for the Grid as a core component of the monitoring infrastructure for.

Advertisements

LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.

CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.

OSG Operations and Interoperations Rob Quick Open Science Grid Operations Center - Indiana University EGEE Operations Meeting Stockholm, Sweden - 14 June.

CERN IT Department CH-1211 Geneva 23 Switzerland t The Experiment Dashboard ISGC th April 2008 Pablo Saiz, Julia Andreeva, Benjamin.

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services GS group meeting Monitoring and Dashboards section Activity.

CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.

Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VO-specific systems for the monitoring of.

EGEE-III INFSO-RI Enabling Grids for E-sciencE Julia Andreeva CERN (IT/GS) CHEP 2009, March 2009, Prague New job monitoring strategy.

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.

Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Overlook of Messaging.

Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team James Casey EGEE’08.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Multi-level monitoring - an overview James.

EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.

CERN IT Department CH-1211 Geneva 23 Switzerland t GDB CERN, 4 th March 2008 James Casey A Strategy for WLCG Monitoring.

Julia Andreeva, CERN IT-ES GDB Every experiment does evaluation of the site status and experiment activities at the site As a rule the state.

1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.

CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.

CERN IT Department CH-1211 Geneva 23 Switzerland t CCRC’08 Tools for measuring our progress CCRC’08 F2F 5 th February 2008 James Casey, IT-GS-MND.

Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.

EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MSG - A messaging system for efficient and.

INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.

XROOTD AND FEDERATED STORAGE MONITORING CURRENT STATUS AND ISSUES A.Petrosyan, D.Oleynik, J.Andreeva Creating federated data stores for the LHC CC-IN2P3,

Site Manageability & Monitoring Issues for LCG Ian Bird IT Department, CERN LCG MB 24 th October 2006.

SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.

RSV: OSG Grid Fabric Monitoring and Interoperation with WLCG Monitoring Systems Rob Quick, Arvind Gopu, and Soichi Hayashi Computing in High Energy and.

Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF Agile Infrastructure Monitoring HEPiX Spring th April.

Testing and integrating the WLCG/EGEE middleware in the LHC computing Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia.

CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.

ATP Future Directions Availability of historical information for grid resources: It is necessary to store the history of grid resources as these resources.

Julia Andreeva on behalf of the MND section MND review.

Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Andrea Sciabà Hammercloud and Nagios Dan Van Der Ster Nicolò Magini.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI User-centric monitoring of the analysis and production activities within.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Ops Portal New Requirements.

CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.

MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.

Global ADC Job Monitoring Laura Sargsyan (YerPhI).

FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.

Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009.

GridView - A Monitoring & Visualization tool for LCG Rajesh Kalmady, Phool Chand, Kislay Bhatt, D. D. Sonvane, Kumar Vaibhav B.A.R.C. BARC-CERN/LCG Meeting.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Regional Nagios Emir Imamagic /SRCE EGEE’09,

Enabling Grids for E-sciencE Grid monitoring from the VO/User perspective. Dashboard for the LHC experiments Julia Andreeva CERN, IT/PSS.

Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.

Open Science Grid OSG Resource and Service Validation and WLCG SAM Interoperability Rob Quick With Content from Arvind Gopu, James Casey, Ian Neilson,

CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,

CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.

1 Models for Monitoring James Casey, CERN WLCG Service Reliability Workshop 27th November, 2007.

INFSO-RI Enabling Grids for E-sciencE File Transfer Software and Service SC3 Gavin McCance – JRA1 Data Management Cluster Service.

Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES The Common Solutions Strategy of the Experiment Support group.

SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.

WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,

Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Author etc Alarm framework requirements Andrea Sciabà Tony Wildish.

Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.

Monitoring Working Group Update Grid Deployment Board 5 th December, CERN Ian Neilson.

Accounting Review Summary and action list from the (pre)GDB Julia Andreeva CERN-IT WLCG MB 19th April

Daniele Bonacorsi Andrea Sciabà

James Casey, CERN IT-GD WLCG Workshop 1st September, 2007

Key Activities. MND sections

POW MND section.

Evolution of SAM in an enhanced model for monitoring the WLCG grid

Experiment Dashboard overviw of the applications

A Messaging Infrastructure for WLCG

Monitoring of the infrastructure from the VO perspective

Presentation transcript:

WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN

Julia Andreeva, CERN WLCG Workshop 2 Too many monitoring tools or not enough? How do I find what I need? Improve publishing of the monitoring data aiming to satisfy users with various roles and different level of expertise Whether I can trust what I found? Improve reliability and completeness of the provided monitoring information How long does it take to be notified about the problem I am supposed to fix ? Decrease the problem response time and provide max relevant information to resolve the problem And finally if we can give positive answers to all these questions – how much does it cost to provide this functionality? Decrease effort for supporting of the existing monitoring applications and for development of the new ones

Julia Andreeva, CERN WLCG Workshop 3 Regarding the near future The European model for operating of the infrastructure is moving towards the distributed system with significantly less effort. The strategy for WLCG monitoring should consider these factors and contribute effectively to the task of operating of the infrastructure Accent to Site monitoring Data taking! Good monitoring is one of the vital conditions for providing of the reliable service where data will be processed LHC VOs rely on the VO Monitoring Systems. Publish information collected there in a consistent way for all 4 VOs, making it available for the LHC community outside particular VO.

Julia Andreeva, CERN WLCG Workshop 4 Main components of monitoring system Local sensors and Remote tests Transfer mechanism Data repository Data processing (Aggregation, archiving..) Data publishing Many, but not always providing reliable information about the health of the service Need to share the experience and avoid duplication of effort No middleware component in place for this purpose Variety of possible solutions are tried, Multiple repositories No concept of proper sharing of information Not always consistent Multiple interfaces but no global consistent view of what is going on and what is the status of the infrastructure APIs very often do not exist How to improve?

Julia Andreeva, CERN WLCG Workshop 5 Main components of monitoring system Local sensors and Remote tests Transfer mechanism Data repository Data processing (Aggregation, archiving..) Data publishing Central repository of sensors Sharing experience in developing local probes and SAM tests Providing standards and guidelines (Grid monitoring probe specification) Scalable and reliable Grid messaging system based on ActiveMQ (mature open source) Common format for Data exchange Avoid copying of data of common Interest (topology) in various DBs (source of Inconsistencies), rather use DB Cash Process data inside DB, relying on ORACLE functionality In addition to UI always provide information in machine readable format, which can be consumed by applications. Consistent with Proposed standards These improvements are the outcome of The WLCG Grid Service Monitoring Working Group chaired by James Casey and Ian Neilson

Julia Andreeva, CERN WLCG Workshop 6 ActiveMQ James Casey Apache top-level Project Reliable Scalable Easy to integrate Good performance

Julia Andreeva, CERN WLCG Workshop 7 Test Summary Results summary: –Running for 6 weeks with no crashes –50 Million messages of various sizes (0 to 10 kB) forwarded to consumers –12 Million incoming messages from producers –Up to 40 Producers and 80 Consumers connected at the same time –Stable under highly irregular test pattern:  Number of clients change  Frequent client process kills  Daily number of tests vary James Casey

Julia Andreeva, CERN WLCG Workshop 8 In Production - OSG RSV to SAM RSV – Resource and Service Validation –Uses Gratia as native transport –And OSG GOC provide a bridge to SAM James Casey The common design pattern: Message oriented monitoring system Other use cases: Reporting usage information (jobs, transfer) to central usage repository or to Experiment Dashboards Reporting information from Experiment specific Monitoring systems to central metrics store …… Other clients can subscribe to a particular topic

Julia Andreeva, CERN WLCG Workshop 9 Julia Andreeva, CERN, WLCG workshop 9 Publishing monitoring info to the site fabric monitoring The goal is to decrease the time needed for detecting and fixing of the problems at the site and not to force site admin to look in multiple monitoring pages in order to understand the situation with Grid services at his own site. Results of SAM tests are published into local fabric monitoring system ( Nagios ). Can be experiment specific tests (Caltech for CMS). Based on these results alarm can be raised The prototype is put in place and was tried by several sites For more details see yesterday monitoring tutorial:

Julia Andreeva, CERN WLCG Workshop 10 Increasing reliability of the provided information Should be addressed in various directions. One of the examples is job reliability information. Big room for improvement Using of the Grid messaging system should improve reliability of job monitoring information. Currently jobs submitted via condor_g (bypassing RB) are not taken into account by the monitoring systems, except Experiment Dashboard. Condor_g submitter is instrumented to report job status changes to any monitoring system keeping track of jobs (Gratia, GridView, Dashboard…). Various transport protocols including Grid messaging system can be used.

Julia Andreeva, CERN WLCG Workshop 11 Using common patterns, tools, frameworks Where possible decrease number of implementations of the monitoring applications and use common solutions. The Experiment Dashboard framework is a good example of the framework which was successfully adopted by other systems, like ATLAS Data Management. Currently couple of new applications external to the Dashboard monitoring system are being developed in the Dashboard framework : the interface for administration of the FTS channels, FTS monitoring for Tier0. Another good example is “Brian’s” library, the graphical library which was developed by Brian Bockelman for CMS Phedex system. Had been integrated with Dashboard and widely used there. Most probably will be used in the new SAM portal being developed in the Dashboard framework

Julia Andreeva, CERN WLCG Workshop 12 User Community and User Interfaces User community is not homogeneous => different requirements regarding monitoring tools in general and UI in particular. Main categories: VOs = > Experiment Specific Monitoring systems (Dashboard, Phedex, Dirac, MonAlisa). Keep a lot of data about VO activity as well as about the state of the infrastructure from the VO perspective Service Providers => Set of probes allowing to define the health of the service should be provided as a part of the middleware, or at least the developers should be involved in the development of probes. Guidelines are defined by the Grid Service Monitoring Working Group Site administrators =>Local Fabric monitoring with publishing monitoring information from central monitoring tools like SAM in local fabric As well need to understand whether VOs are able to perform their workflow at the site WLCG Management => Need the global view to understand how it is going + generated reports It might be not possible to develop interface which would make everyone happy Good interface requires several iterations and feedback from the community The main idea we are trying to follow is providing the global view with possibility to drill down for details Gridmap is a good example

Julia Andreeva, CERN WLCG Workshop 13 Automatic report generation for WLCG/EGEE management and operations Developed by GridView Using JasperReport, popular open source report generation tool Supports various formats like pdf, rtf, xls, csv, html etc Simple web portal available ao all can generate relevant reports

Julia Andreeva, CERN WLCG Workshop 14 Julia Andreeva, CERN, WLCG workshop 14 Gridmap for monitoring of the services defined as critical by the LHC VOs Gridmap visualization tool developed by Max Boehm provides the global view with the possibility to drill down for detail. For CCRC08 put in place prototype to monitor services defined as critical by the LHC VOs. First step is to make sure that every box of the map has monitoring behind (SAM test, lemon sensor, etc…) Second step is to make sure that monitoring is reliable (service is really in a prefect health when it is green on the map) Currently information sources which are used are SAM or SLS (Panda can be added?) Working group of people having experience in monitoring field representing all LHC VOs chaired by Andrea Sciaba

Julia Andreeva, CERN WLCG Workshop 15 Gridmap for monitoring of the services defined as critical by the LHC VOs ATLAS ALICE CMS LHCb ATLAS

Julia Andreeva, CERN WLCG Workshop 16 Julia Andreeva, CERN, WLCG workshop 16 Gridmap for monitoring of the workflows of the Experiments The goal is to show whether the targets of the experiments for the functional blocks are met. This information is available in the monitoring systems of the experiments and currently is not being published anywhere in the consistent way. The most complicated part is how to define the status (colour) for every functional block. The targets are not defined for every item ( for example not necessary for every T1-T2 link), the target can change from one day to another. But for data taking we will have more stable workflow, this won’t be a problem any more

Julia Andreeva, CERN WLCG Workshop 17 Click on the box corresponding to the distributed service Get another map For a given activity at all Sites of T1 Gridmap for monitoring of the workflows of the Experiments

Julia Andreeva, CERN WLCG Workshop 18 Click on the link to navigate to other monitoring tool, Dashboard for more details Gridmap for monitoring of the workflows of the Experiments

Julia Andreeva, CERN WLCG Workshop 19 Shows current status of production jobs running at FNAL Gridmap for monitoring of the workflows of the Experiments

Julia Andreeva, CERN WLCG Workshop 20 The way the map is created is very flexible and easily configurable. Any other combinations are possible: -Combining in one map functional blocks of all experiments served by a given site -Clicking on a functional block get the list of services involved … -Multiple possibilities for drilling in and guiding user who can be not at all familiar with monitoring tools specific to the VO Gridmap for monitoring of the workflows of the Experiments

Julia Andreeva, CERN WLCG Workshop 21 Julia Andreeva, CERN, WLCG workshop 21 E-logger and GGUS During the first phase of CCRC08 one E-logger for observations and another one for MoU targets had been used. Proved to be very helpful However, there is a common feeling that there is an overlapping between e-logger and GGUS, people should not submit multiple bug-reports, there should be a single system in place. In fact it was an overlap by design to understand requirements and pass them to GGUS development team GGUS functionality will be extended to cover the MoU targets but not for May. Foresee using E-log for May run The first version of RSS exists as well, keeps the track of tickets which had been opened last 24 hours Please, try link and send your feedback:

Julia Andreeva, CERN WLCG Workshop 22 Define metrics During the discussion at the workshop there was a suggestion to reduce 3 ‘metrics’ we are currently using for CCRC08 : Critical services, Functional blocks and Scaling factors, MoU targets by just Functional blocks with Critical services as drill down We might need to get more experience with the new tools and to understand better correlations between the functional blocks and services in a real challenge environment to decide which metrics would give us better idea of how well we are performing.

Julia Andreeva, CERN WLCG Workshop 23 Summary Monitoring is one of the core conditions for improving of reliability of the infrastructure and for reducing operational effort As an outcome of the work done by the WLCG Monitoring Working Groups the main strategy for organization of the monitoring infrastructure had been defined. Now Monitoring WG is over, need to implement and deliver what had been defined We have a lot of tools in place, now need to tie them together There is still a lot to do and we have to use the CCRC08 challenge to check whether the monitoring system can provide the required functionality