Enabling Grids for E-sciencE. Grid monitoring from the VO/User perspective: Dashboard for the LHC experiments. Julia Andreeva, CERN IT/PSS, on behalf of the Dashboard team: J. Andreeva, S. Belov, C. Cirstoiu, Y. Chen, B. Gaidioz, J. Herrala, G. Maier, R. Pezoa Rivera, R. Rocha, P. Saiz, I. Sidorova. CHEP 2007, Victoria, Canada.

Table of contents
– Common requirements for Grid monitoring from the VO/User perspective
– Motivation and evolution of the Experiment Dashboard project
– Overview of the current functionalities
– Future plans
– Conclusions

Common requirements regarding VO monitoring
– Provide a transparent and complete picture of the experiment's activities on the Grid, regardless of the underlying infrastructure where the actual job/transfer/service is running
– Combine Grid monitoring data with experiment/application/activity-specific information of interest to the VO
– Be able to identify problems of any nature (Grid or application)
– Satisfy users with various roles (various areas of activity, different scope)
– Provide a high level of flexibility to allow rapid adaptation to new requirements
– Scale and perform well

Experiment Dashboard concept
Information sources:
– Generic Grid services and experiment-specific services
– Experiment workload management and data management systems
– Jobs instrumented to report monitoring information
– Monitoring and accounting systems (R-GMA, GridICE, SAM, ICRTMDB, MonALISA, BDII, APEL, Gratia…)
The Dashboard collects data of VO interest coming from the various sources, stores it in a single location, provides a UI following the VO requirements, and analyzes the collected data for VO users with various roles. Potentially other clients: PanDA, ATLAS production.

Dashboard Framework
– Web/HTTP interface: a web application based on Apache + mod_python, with multiple output formats (plain text, CSV, XML, XHTML) and GSI support using GridSite
– Data Access Layer (DAO): all DB reading and writing goes via the DAO layer, with connection pooling; it is easy to add an interface for a backend other than the current Oracle DB
– Agents: run on a regular basis, collecting data from different sources, generating/analyzing statistics and managing alarms; common configuration, management and monitoring mechanism
– Dashboard clients: scripts (pycurl…), CLI (optparse + pycurl), shell-based (curl…); a minimal client sketch follows
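The clients mentioned above are thin HTTP wrappers around the web application. The following is a minimal sketch of such a client in Python using pycurl; the endpoint URL and the query/format parameter names are illustrative assumptions, not the actual Dashboard API.

    import io
    import pycurl
    from urllib.parse import urlencode

    def fetch_report(base_url, params, fmt="xml"):
        """Fetch a monitoring report in one of the supported output
        formats (plain text, csv, xml, xhtml)."""
        url = "%s?%s" % (base_url, urlencode(dict(params, format=fmt)))
        buf = io.BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        # For a GridSite-protected instance, a GSI/X.509 credential would
        # be attached here, e.g. via pycurl.SSLCERT and pycurl.SSLKEY.
        c.perform()
        c.close()
        return buf.getvalue().decode()

    # Hypothetical example: one user's jobs at one site, as CSV.
    print(fetch_report("https://dashboard.example.cern.ch/jobs",
                       {"user": "jdoe", "site": "CERN-PROD"}, fmt="csv"))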

Development team
The tool is developed by the ARDA team (CERN) in collaboration with the MonALISA developers (Caltech) and with the participation of ASGC (Taiwan), MSU and JINR (Russia) and LAL (France). Valuable contributions came this year from CERN summer students.
People who contributed to the development: J. Andreeva, S. Belov, A. Berejnoj, C. Cirstoiu, Y. Chen, T. Chen, S. Chiu, M. De Francisco De Miguel, A. Ivanchenko, B. Gaidioz, J. Herrala, M. Janulis, O. Kodolova, G. Maier, E. J. Maguire, C. Munro, R. Pezoa Rivera, R. Rocha, P. Saiz, I. Sidorova, F. Tsai, E. Tikhonenko, E. Urbah

Evolution of the project (timeline covering 07/05 to 08/07)
– First prototype for CMS job monitoring
– Transfer monitoring for ALICE in production
– Job monitoring for LHCb in production
– Job monitoring for ALICE in production
– Job monitoring for CMS and ATLAS in production
– ATLAS Data Management in production
– 06/07: Dashboard for VLEMED (BioMed)
– Monitoring for ATLAS production (prototype)
– Task monitoring for CMS user analysis in production
– 08/07: Site/service availability based on SAM tests (CMS)

The Dashboard covers a wide range of the activities of the LHC experiments.
Common applications (ALICE, ATLAS, CMS, LHCb, VLEMED):
– Job monitoring
– Site reliability
Experiment-specific applications:
– Transfer monitoring for ALICE
– Data management monitoring for ATLAS
– Production monitoring for ATLAS and CMS (prototypes)
– I/O rate monitoring between WN and SE (prototype)
– Site availability based on the results of SAM tests (prototype)
– Job Robot monitoring
– Accounting information from APEL and Gratia for ATLAS (prototype)
– Task monitoring for CMS analysis users (ATLAS on the way)
– CMS integration and commissioning

Job monitoring
What is the status of the jobs
– belonging to an individual user/group/VO,
– submitted to a given site or Grid flavour, or via a given resource broker,
– reading a certain data sample, running a certain application…
If they are pending/running: for how long, and where? If they have finished, did they fail or run properly? If they failed, why?

Information flow for job monitoring
– Job submission tools (CRAB, ProdAgent, PanDA, Ganga) provide META information about the user task at submission time, submission information for individual jobs, and job status information while retrieving output or checking job status
– Running jobs at the WNs report their progress via the MonALISA service (see the sketch after this list)
– Grid monitoring systems (R-GMA, ICRTM, GridICE, BDII) provide Grid status information only for jobs submitted via a resource broker (R-GMA, ICRTM), and job status according to the local batch system (only where GridICE is running)
– Experiment-specific monitoring systems (the production system in ATLAS, DIRAC monitoring)
– In collaboration with the Condor-G team, we are currently working on reporting of job status information from the Condor-G submitter to the Dashboard via MonALISA
Due to the multiple information sources, the Dashboard job monitoring application is not limited to a given middleware flavour or to a given submission method.
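The "jobs report their progress" path above relies on lightweight messages sent to a MonALISA service. Below is a minimal sketch using the ApMon Python client library; the destination host/port, the cluster/node naming convention and the parameter names are illustrative assumptions.

    from apmon import ApMon

    # Hypothetical MonALISA destination; real jobs get this via configuration.
    apm = ApMon(["monalisa.example.cern.ch:8884"])

    # Convention sketched here: cluster = task identifier, node = job identifier.
    apm.sendParameters("cms_analysis_task_42", "job_0007", {
        "StatusValue": "running",   # current job state
        "EventsProcessed": 1500,    # application-level progress
        "WallClockTime": 320,       # seconds elapsed so far
    })
    apm.free()  # stop ApMon's background threads cleanly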

Example of Job monitoring UI

Monitoring of the analysis tasks for the CMS users
– Meta information about the task
– Detailed info about all jobs of a given group
– Distribution of jobs of a given group by site, CE or RB

Site Reliability Application
– 'Site of the day': daily report on the number of successful/failed job attempts (submission via RB)
– Site performance: evolution of a site over a period of time
– Error list (Grid errors related to job processing by the RB): list of the most common error messages, with pointers to documentation, and the evolution of each error over time
– Waiting time: the time users have to wait from the moment they submit a job until they get the results back
– Aggregated reports: automatic monthly reports and multi-VO reports (a minimal aggregation sketch follows this list)
For more details about the Job Reliability Application see the "Grid reliability" presentation of Pablo Saiz.
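To make the "site of the day" report concrete, here is a hedged sketch of that style of aggregation; the record layout and the output format are invented for the example, not the production schema.

    from collections import defaultdict

    def daily_site_report(attempts):
        """attempts: iterable of (site, succeeded) pairs for one day."""
        stats = defaultdict(lambda: {"ok": 0, "failed": 0})
        for site, succeeded in attempts:
            stats[site]["ok" if succeeded else "failed"] += 1
        for site, s in sorted(stats.items()):
            total = s["ok"] + s["failed"]
            print("%-12s %4d attempts, %5.1f%% successful"
                  % (site, total, 100.0 * s["ok"] / total))

    # Invented sample records for two sites.
    daily_site_report([("CERN-PROD", True), ("CERN-PROD", False),
                       ("FNAL-T1", True), ("FNAL-T1", True)])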

Data Management Monitoring for ATLAS
Tied to the ATLAS Distributed Data Management (DDM) system; used successfully both in the production and Tier-0 test environments.
Data sources:
– DDM site services: the main source, providing all the transfer and placement information
– SAM tests: for correlation of DDM results with the state of the Grid fabric services
– Storage space availability: currently from the BDII, but soon including other available tools
Views over the data:
– Global: site overview covering different metrics (throughput, files/datasets completed, …); summary of the most common errors (transfer and placement)
– Detailed: starting from the dataset state, to the state of each of its files, to the history of each single file placement (all state changes)
For more details about ATLAS Data Management Monitoring see Thursday's presentation of Ricardo Rocha.

Integration and commissioning for CMS (1): site and service availability based on the results of SAM tests (prototype)
– Results of SAM tests are imported in real time into the Dashboard DB
– Service and site availability are calculated according to the experiment's requirements (a sketch of one plausible rule follows)
– The user can select a site or set of sites, and the service types, to be included in the report
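As an illustration of "calculated according to the experiment's requirements", one plausible rule is: a service counts as available only if every test the experiment marks as critical has passed, and a site is available only if all selected services are. The critical-test sets below are invented placeholders, not CMS's actual definitions.

    # Illustrative critical-test definitions; a real VO would configure these.
    CRITICAL = {"CE": {"job-submit"}, "SRM": {"put", "get"}}

    def service_available(service_type, test_results):
        """test_results maps test name to a boolean pass/fail."""
        return all(test_results.get(t, False)
                   for t in CRITICAL.get(service_type, ()))

    def site_available(services):
        """services maps service type to that service's test results."""
        return all(service_available(st, res) for st, res in services.items())

    print(site_available({"CE": {"job-submit": True},
                          "SRM": {"put": True, "get": False}}))  # False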

Integration and commissioning for CMS (2): monitoring of the I/O rate between the WN and the SE (prototype)
Currently only analysis and JobRobot jobs report their I/O rate to the Dashboard.

Integration and commissioning for CMS (3): monitoring of the Job Robot jobs using the Dashboard interactive UI, which highlights sites having trouble.

Monitoring for the production systems of ATLAS and CMS
In the case of ATLAS, the Dashboard will provide a user interface to the monitoring data stored in the ATLAS production DB. The UI is in an active development phase; a first prototype is available.
In the case of CMS, the Dashboard DB is used as a central repository for monitoring data. The Dashboard collector for monitoring information from CMS ProdAgent instances is in production; the UI to the central monitoring repository is being developed.

ATLAS production UI (prototype)

ATLAS accounting information
Information is retrieved from APEL and Gratia. The UI allows the data to be shown taking into account the experiment topology.

Experiment Dashboard plans
Implementation:
– An important schema modification is required to support pilot jobs; it will cause changes in the data-feeding part and the user interface
– Secure access where relevant (X.509 authentication)
– Improvement of data completeness and reliability
– Enabling reporting of job status information from Condor-G (on the way)
Development of new applications:
– Monitoring for the production systems of ATLAS and CMS
– Service availability based on sanity-check reports sent by the experiment jobs (LHCb)
Improvement of effectiveness for troubleshooting (a classification sketch follows this list):
– Analyzing information about failures (Grid and application)
– Decoupling application failures caused by errors in the user code from failures caused by problems of the Grid services
– Collecting troubleshooting recipes and making them available in the Dashboard UI
– Correlating failures, where relevant, with the results of the SAM tests
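As a sketch of the planned decoupling of Grid failures from application failures: the Grid status tells whether the infrastructure delivered and ran the job, while the application exit code tells whether the payload itself succeeded. The status string and exit-code values are invented for the example.

    def classify_failure(grid_status, app_exit_code):
        """Attribute a finished job's outcome to the Grid or the application."""
        if grid_status != "Done":      # aborted or cancelled by Grid services
            return "grid"
        if app_exit_code != 0:         # the job ran, but the user code failed
            return "application"
        return "success"

    print(classify_failure("Aborted", 0))   # grid
    print(classify_failure("Done", 8001))   # application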

Conclusions
The Experiment Dashboard is used by all 4 LHC experiments and is evolving very fast to match their requirements. The Job Monitoring and Site Reliability applications are in production for the VLEMED VO, outside the LHC community. The tool has proved to provide reliable and useful VO-oriented monitoring data, with the needed level of detail, available in various formats. Give it a try!

Acknowledgements
We would like to thank Stefano Belforte, Massimo Lamanna and Iosif Legrand; without their support and guidance the project would not have started or progressed. We also thank our collaborators in Taiwan, Russia and France for their valuable contributions; the Oracle support team for excellent DB support and useful advice; the SAM, ICRTM, R-GMA, GridICE, Condor-G, EIS, CERN IT FIO, Gratia and APEL teams for fruitful collaboration and prompt responses to our requests; the developers of the job submission tools, production systems and data management systems of the LHC experiments for their contributions; and the LHC user community for useful feedback.