EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks VO-specific systems for the monitoring of.

Presentation transcript:

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VO-specific systems for the monitoring of the LHC computing activities on the GRID Julia Andreeva, CERN (IT/GS) NEC09, September 2009, Varna, Bulgaria

Outline
- Monitoring from the VO perspective: motivation
- Overview of the existing VO-specific monitoring systems and their role in operating the WLCG infrastructure
- Experiment Dashboard as an example of a common application used by all LHC VOs
- High-level cross-VO view based on data from VO-specific systems
- Summary

Monitoring from the VO perspective: why is it so important?
- We are still accumulating practical experience in operating the Grid infrastructure.
- We are not yet aware of all the problems that could affect the infrastructure as a whole or its individual components.
- As a consequence, we do not yet know enough to build a perfect monitoring system that would raise an alarm in every critical situation or, even better, predict such a situation before it happens.
- All this implies considerable involvement of the user community in operations.
- As experience so far shows (CCRC08, STEP09 and beyond), it is as a rule the VO communities, in particular the people taking computing shifts, who detect problems first.
- VO monitoring tools are the main monitoring instrument for the moment. They aggregate and promptly adapt to new experience in operating the Grid infrastructure.

Main areas covered by the VO monitoring tools
- Job processing (sharing and usage of resources, performance, reasons for failures and the related problems with the Grid services or VO applications involved)
- Data transfer (throughput, efficiency, reasons for failures and the related problems with the Grid services involved)
- Overall status of the sites serving a given VO (site commissioning, computing shifts)

Variety of tools used by the LHC VOs
- ALICE
  - MonALISA for job processing
  - MonALISA and Experiment Dashboard for data transfer
- ATLAS
  - PanDA and Experiment Dashboard for job processing
  - Experiment Dashboard for data transfer
- CMS
  - ProdMon and Experiment Dashboard for job processing
  - PhEDEx for data transfer
- LHCb
  - DIRAC for job processing
  - DIRAC for data transfer
All experiments are using:
  - SAM and Experiment Dashboard for monitoring of site status and the status of the services at the sites
  - SLS for monitoring of services at Tier0

ALICE example
The ALICE monitoring system is based on MonALISA. MonALISA services run at all ALICE sites for site-level monitoring, and a MonALISA repository provides a high-level view across the ALICE VO.

Monitoring of ATLAS DDM
Monitoring of ATLAS DDM is implemented in the Dashboard framework. The information sources are the ATLAS DDM services at the sites. The data repository is implemented in an ORACLE backend located at CERN. It is widely used by the ATLAS community: up to 1K unique visitors per month and more than 100K page views daily.

CMS example
Monitoring of CMS transfers is coupled with PhEDEx, the CMS data distribution system. It provides information about the transfer rate, transfer quality, the status of the transfer request queue, etc. For CMS job monitoring, see the next talk by Irina Sidorova.

Monitoring of the LHCb computing activities by DIRAC
In LHCb, both data transfer and job monitoring are provided by DIRAC.

Experiment Dashboard as an example of a system used by the 4 LHC VOs
- Experiment Dashboard is in production for the 4 LHC VOs.
- Widely used by the experiments in their everyday work (3K unique visitors, counted as unique IP addresses, on the CMS production server in August 2009).
- Covers the full range of LHC computing activities.
- Works transparently across various Grid infrastructures.
- Developed through the joint effort of the Dashboard team, developers in the LHC experiments and other monitoring projects, in collaboration with institutes from Taiwan, Russia, France and Great Britain.

Collaboration with JINR and other Russian institutions
- Russia actively participates in the WLCG monitoring activity, in particular by contributing to the Dashboard project. On the Russian side this work is coordinated by Vladimir Korenkov.
- Strong contribution from JINR: Irina Sidorova, Elena Tikhonenko, Sergey Belov, Sergey Mitsyn, Alexander Uzhinskiy, Andrey Nechaevskiy. Many of our JINR colleagues are young developers who recently graduated from Dubna University.
- We very much hope that this collaboration will continue.

Experiment Dashboard applications
Generic applications:
- Job monitoring
- Task monitoring for analysis users
- Site availability based on SAM tests
- Site Status Board
VO-specific applications:
- ALICE data transfer monitoring
- ATLAS data management monitoring
- ATLAS production monitoring
- Central repository for production monitoring data for CMS

Development principles
- Do not develop and deploy new sensors unless nothing is in place for a given purpose.
- Where possible, use common solutions (technology and implementation). All Dashboard applications, regardless of their functionality and information sources, are developed in the Dashboard framework.
- Involve users in the development process.

Experiment Dashboard Framework
[Diagram components: information sources; Dashboard data collecting agents; data storage and aggregation; DB access layer (DAO); UI; machine-readable format publisher; Dashboard agents; other applications]
The system is modular, which allows a flexible approach to implementing the needs of the customers.
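
The modular layering named on this slide can be illustrated with a minimal Python sketch. It is not the Dashboard code: the names `Storage`, `run_collector`, `publish_json` and `example_source` are hypothetical, the storage is an in-memory list rather than a database, and JSON stands in for whatever machine-readable format a given application publishes.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class Storage:
    """In-memory stand-in for the data storage and aggregation layer."""
    rows: List[Record] = field(default_factory=list)

    def save(self, records: Iterable[Record]) -> None:
        self.rows.extend(records)

    def count_by(self, key: str) -> Dict[str, int]:
        # Aggregate stored records by the value of one field.
        counts: Dict[str, int] = {}
        for row in self.rows:
            value = str(row[key])
            counts[value] = counts.get(value, 0) + 1
        return counts


def run_collector(source: Callable[[], Iterable[Record]], storage: Storage) -> None:
    """Collecting agent: pull records from one information source and store them."""
    storage.save(source())


def publish_json(storage: Storage, key: str) -> str:
    """Publisher: expose an aggregate in a machine-readable format."""
    return json.dumps(storage.count_by(key), sort_keys=True)


def example_source() -> List[Record]:
    # Hypothetical information source returning job records.
    return [
        {"site": "SiteA", "status": "done"},
        {"site": "SiteB", "status": "failed"},
        {"site": "SiteA", "status": "done"},
    ]


storage = Storage()
run_collector(example_source, storage)
summary = publish_json(storage, "status")
print(summary)
```

Because each layer only talks to its neighbour, an application can use a subset of them, for example only the storage and publisher while another team provides the UI.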

Experiment Dashboard Framework (examples)
For ATLAS Data Management Monitoring, all framework components are implemented.

Experiment Dashboard Framework (examples)
For CMS Production Monitoring, the Dashboard is used to store, aggregate and archive data and to publish it in XML format, while the UI is developed by the CMS Production Team.

Experiment Dashboard Framework (examples)
For the new SAM portal, information is not imported into the Dashboard DB. Some additional tables are created in the SAM DB, and the availability calculations are implemented inside the ORACLE SAM instance. The Dashboard is used only to create the monitoring display and to publish data in a machine-readable format.

Make users take part in the development
Monitoring applications are successful when they are developed in close collaboration with the user community. Good examples are the Site Status Board and the Dashboard Site Availability application based on SAM tests.
- Over the last year the CMS experiment put a lot of effort into site commissioning.
- Monitoring is a vital component of this process.
- New applications were developed in close collaboration between the Dashboard team and the members of the CMS community involved in site commissioning.
- Initially developed for CMS, the Dashboard Site Availability application was requested by the other LHC VOs. It is now in production for all 4 LHC VOs.
- The same holds for the Site Status Board: developed for CMS, it was later requested by ALICE and LHCb.
[Figure: Dashboard plots demonstrating the improvement in the quality of the sites used by CMS.]

Make users take part in the development: Site Status Board
It is the users (people taking part in computing shifts and in site commissioning) who define the set of columns, their content, which metrics are considered for the overall status of the site, the validity interval of a given metric, which columns are shown in the UI by default, alternative views, etc. The Dashboard provides a framework to be filled in with this customized information. Historical information is available, as is straightforward navigation to the primary information source.
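
A rough sketch of how user-defined metrics with validity intervals could be combined into an overall site status is given below. The slide does not specify the actual rule, so everything here is an assumption made for illustration: the status vocabulary (`ok`, `warning`, `error`), the "worst status wins" rule, the treatment of an expired metric as `unknown`, and all names such as `Metric` and `overall_status`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed status vocabulary, ordered from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "error": 2}


@dataclass
class Metric:
    name: str
    status: str                      # one of the SEVERITY keys
    measured_at: datetime
    validity: timedelta              # how long the measurement stays valid
    counts_for_overall: bool = True  # user decides if it affects the site status

    def is_valid(self, now: datetime) -> bool:
        return now - self.measured_at <= self.validity


def overall_status(metrics: List[Metric], now: datetime) -> str:
    """Worst status among the considered metrics; 'unknown' if any has expired."""
    considered = [m for m in metrics if m.counts_for_overall]
    if not considered or any(not m.is_valid(now) for m in considered):
        return "unknown"
    return max((m.status for m in considered), key=SEVERITY.__getitem__)


now = datetime(2009, 9, 1, 12, 0)
metrics = [
    Metric("sam_availability", "ok", now - timedelta(minutes=30), timedelta(hours=2)),
    Metric("job_success", "warning", now - timedelta(hours=1), timedelta(hours=6)),
    # Displayed as a column, but excluded from the overall status by the user.
    Metric("free_disk", "error", now - timedelta(hours=1), timedelta(hours=6),
           counts_for_overall=False),
]
print(overall_status(metrics, now))
```

Keeping the choice of considered metrics and validity intervals in data rather than in code is what lets each VO customize the board without changing the framework.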

High-level cross-VO view
- The VO-specific monitoring systems work within the scope of a single experiment.
- Non-expert users, or users external to a given VO, do not know how to find the information they need.
- It is difficult, if possible at all, to compare and correlate information from different VOs. A global cross-VO view is missing.
- Recent development aims to solve this problem.
- Systems providing a high-level view are being designed. They are based on the integration of the experiment-specific monitoring systems, the Dashboard framework and the GridMap visualization system.

GridMap visualization system
- The GridMap visualization tool was developed in the context of the CERN openlab collaboration between CERN and the company EDS.
- The main motivation for GridMap is to provide a high-level view of the monitoring data collected from the distributed infrastructure in an intuitive and useful way.
- The requirements for visualizing a distributed hierarchical infrastructure are a perfect match for GridMap visualization.

Use cases for GridMap
- Multiple use cases were defined for GridMap:
  - GridMap for experiment workflows
  - GridMap for the status of services defined as critical by the LHC VOs
  - GridMap for the Site Status Board
- One of the main conclusions from analyzing the results of CCRC08 and STEP09 was that sites are somewhat disoriented regarding monitoring. There are too many monitoring tools. Which ones to use? Which ones to trust? How can a site tell whether the VOs it serves are happy with its performance? The Siteview application aims to provide an estimate of site performance from the VO perspective.

High-level monitoring system for sites serving LHC VOs
[Diagram labels: ATLAS, ALICE, CMS, LHCb; central repository for common metrics (transfer rate, parallel jobs, success rate, etc.); GridMap for a particular site; common metrics distributions by time]
EGEE'08 - Julia Andreeva (CERN, IT/GS)

Siteview (1/4)
The map is split into 4 groups:
- Overall status of the site from the VO perspective
- Job processing activity
- Incoming data transfer
- Outgoing data transfer
The size of a cell is defined by the scale of the given activity; its colour is defined by the success rate.
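
The cell encoding described on this slide (size from the scale of the activity, colour from the success rate) can be sketched as follows. This is only an illustration: the function names, the colour thresholds of 0.9 and 0.6, and the example numbers are assumptions, not values taken from Siteview or GridMap.

```python
from typing import Dict


def cell_colour(success_rate: float) -> str:
    """Map a success rate in [0, 1] to a colour (thresholds are assumed)."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate >= 0.9:
        return "green"
    if success_rate >= 0.6:
        return "yellow"
    return "red"


def cell_areas(activity: Dict[str, float], total_area: float) -> Dict[str, float]:
    """Split the available area between cells in proportion to activity scale."""
    total = sum(activity.values())
    if total == 0:
        # No activity at all: give every cell an equal share.
        return {name: total_area / len(activity) for name in activity}
    return {name: total_area * value / total for name, value in activity.items()}


# Hypothetical numbers of running jobs per VO at one site.
running_jobs = {"ATLAS": 300, "CMS": 100}
areas = cell_areas(running_jobs, total_area=100.0)
print(areas, cell_colour(0.95), cell_colour(0.4))
```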

Siteview (2/4)
Assists users in navigating to the primary information source.

Siteview (3/4)
Click to get more information about failures.

Siteview (4/4)
Click to get to the primary information source.

Integration with Google Earth
- Experiment-specific monitoring systems provide the input data.
- Dashboard agents publish this information in the KML format.
- Strong contribution to the development from Sergey Mitsyn (JINR).
- The application will be shown during the LHC demo at the EGEE conference in Barcelona.
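
As an illustration of publishing monitoring data as KML, the sketch below builds a small KML document with Python's standard `xml.etree.ElementTree`. It is not the Dashboard agent: the function `sites_to_kml`, the record fields, the site names and the coordinates are all made up for the example, and the OGC KML 2.2 namespace is assumed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

KML_NS = "http://www.opengis.net/kml/2.2"  # assumed KML version


def sites_to_kml(sites: List[Dict[str, object]]) -> str:
    """Render one Placemark per site; KML coordinates are 'longitude,latitude'."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for site in sites:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = str(site["name"])
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = (
            f"Running jobs: {site['running_jobs']}"
        )
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = (
            f"{site['lon']},{site['lat']}"
        )
    return ET.tostring(kml, encoding="unicode")


# Hypothetical sites with illustrative coordinates and job counts.
example_sites = [
    {"name": "ExampleSite-A", "lon": 6.05, "lat": 46.23, "running_jobs": 1200},
    {"name": "ExampleSite-B", "lon": 37.19, "lat": 56.74, "running_jobs": 300},
]
kml_text = sites_to_kml(example_sites)
print(kml_text)
```

The resulting text can be saved with a `.kml` extension and opened in a KML viewer.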

Summary
- Practical experience in operating the Grid infrastructure and in its use by the LHC community (in particular during CCRC08 and STEP09) proved that VO-specific monitoring systems are a vital part of operations and are currently the main source of monitoring information.
- Even with a wide range of VO-specific monitoring tools in place, we were still missing a high-level view of the computing activities of all the LHC experiments together, both at the global and at the site level.
- These issues are being addressed in the current development. The Siteview application is being developed and evaluated by the LHC community.