CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it, Internet Services


Job Monitoring for the LHC experiments
Irina Sidorova (CERN, JINR) on behalf of the Dashboard team

Table of contents
- Importance of Job Monitoring
- Overview of the Dashboard Job Monitoring applications
- Monitoring of user analysis
- Conclusions

Importance of Job Monitoring
- Data distribution and data processing are the two main computing activities of the VOs running on the WLCG infrastructure.
- The quality of job processing gives an estimate of the quality of the infrastructure in general and determines the overall success of the VOs' computing activities.
- Detailed and reliable job monitoring also helps to improve the computing models of the LHC VOs.

Dashboard main goals
The goal of the Experiment Dashboard is to monitor the activities of the LHC experiments on the distributed infrastructure, providing monitoring data from the virtual organization (VO) and user perspectives. The LHC experiments use various Grid infrastructures (LCG/EGEE, OSG, NDGF), each with its own middleware flavor and job submission methods. The main task is to provide a uniform and complete view of activities such as job processing, data movement and publishing, and access to distributed databases, regardless of the underlying Grid flavor.

Overview of the Dashboard job monitoring applications
- ATLAS ProdSys Monitoring
- Central repository for CMS ProdAgent monitoring data
- Generic Job Monitoring
- Monitoring of user analysis jobs

Data flow for Dashboard Job Monitoring
- Information sources: LB, CEMon, Condor-G, jobs instrumented to report their progress, and the job submission tools of the experiments.
- Transport: MonALISA at present; a switch to the Messaging System for the Grid (MSG) is planned.
- Data is available in various formats and can be presented to different categories of users: VO managers, computing shifters, MC production teams, site commissioning teams, and LHC physicists running their jobs on the Grid.

- Dashboard job monitoring is designed as a common solution for any virtual organization.
- To provide a complete view of job processing from both the Grid and the application point of view, the VO job submission tools should be instrumented to report job status information.
- Dashboard job monitoring for CMS is the most advanced, since all CMS submission tools are well instrumented for Dashboard reporting.

Dashboard Job Monitoring functionality
Interactive view: shows what is going on now.
- Distribution of jobs by site, CE, user, submission tool, application version, dataset, etc.
- Distribution of jobs by status
- Success rate, CPU and wall clock time, number of processed events
Historical interface: job statistics distributed over time.
Dashboard Task Monitoring: provides complete information about analysis job processing. Serves the needs of the analysis community and of the analysis support team.
Quick Analysis of Error Sources: automatically detects failing grid components and offers solutions to the problems.
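The interactive-view quantities listed above (distribution of jobs by an attribute, success rate) can be illustrated with a minimal sketch. The record layout, the field names and the site labels SITE_A and SITE_B are hypothetical and chosen only for illustration; they are not the Dashboard schema.

```python
from collections import Counter, defaultdict

# Hypothetical job records; field names are illustrative, not the Dashboard schema.
JOBS = [
    {"site": "SITE_A", "user": "u1", "status": "failed"},
    {"site": "SITE_A", "user": "u2", "status": "succeeded"},
    {"site": "SITE_B", "user": "u1", "status": "succeeded"},
    {"site": "SITE_B", "user": "u3", "status": "running"},
    {"site": "SITE_B", "user": "u3", "status": "succeeded"},
]


def distribution(jobs, key):
    """Count jobs per value of the given attribute (site, user, status, ...)."""
    return Counter(job[key] for job in jobs)


def success_rate(jobs, key):
    """Fraction of succeeded jobs among terminated jobs, per attribute value."""
    done = defaultdict(int)
    ok = defaultdict(int)
    for job in jobs:
        # Jobs that are still running are excluded from the rate.
        if job["status"] in ("succeeded", "failed"):
            done[job[key]] += 1
            if job["status"] == "succeeded":
                ok[job[key]] += 1
    return {value: ok[value] / done[value] for value in done}


print(distribution(JOBS, "status"))
print(success_rate(JOBS, "site"))
```

Grouping by a different key ("user", or any other attribute present in the records) gives the corresponding view without changing the functions.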

Dashboard Task Monitoring
- Distributed user analysis on the WLCG infrastructure is currently the main challenge of LHC computing.
- With data taking approaching, the number of analysis users will increase dramatically.
- User-friendly, complete and reliable monitoring of analysis task processing is an important factor in the successful organization of distributed analysis.
- The Task Monitoring application is developed on top of the common job monitoring repository.
- The main users of the application are LHC physicists, distributed analysis support teams and site administrators.

CMS Task Monitoring for analysis users
- Provides transparent monitoring regardless of the submission method or middleware platform
- Detailed view of user tasks, including failure diagnostics, processing efficiency and resubmission history
- Low latency: updates come from the worker node where the job is running
- User-driven development
- Progress of processing in terms of processed events
- Distribution of jobs by their current status
- Very detailed per-job information
- Failure diagnostics for Grid and application failures
- Distribution of efficiency by site

CMS Dashboard usage by application
- The application is currently in production for the CMS VO.
- It has become very popular in the CMS physics community.
- It has received very positive feedback from users.
- Up to 150 physicists use the application every day.

Trying to understand the failure reasons
- VO jobs are instrumented to report the application exit codes. Unfortunately, the exit codes do not always point to a particular failure reason; in most cases they are rather obscure.
- A job failure can have many different causes:
  - Error in the user code
  - Misconfiguration of the site (misconfiguration of the worker nodes, corrupted distribution of the experiment software, problems accessing the shared area from the worker node, etc.)
  - Problem accessing input data
  - Problem saving output files to the remote storage
- To address the problem, it is necessary to understand the underlying reason. Possible ways to achieve this are:
  - better diagnostics published by the user jobs in case of failure
  - analysis of the failure statistics
- The Dashboard team works in both directions, in the first case in collaboration with the developers of the workload management systems of the experiments.
- Ideally, users do not need to open the log file to understand what went wrong with a job, but can get all the necessary information from the user interface.

Quick Analysis of Error Sources (QAOES)
- A problem detection system based on the Association Rule Mining algorithm
- Expert system
- The aim is to reduce fault detection time and to improve grid reliability.
- The QAOES prototype is in production for the CMS analysis job monitoring data.
- The tool is being evaluated by the CMS distributed analysis support team.
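To make the idea of association rule mining concrete, here is a deliberately simplified sketch that is not the QAOES implementation. It only considers single-attribute rules of the form "attribute = value implies failed" and scores them by support and confidence. The record layout, the thresholds and the labels SITE_A, SITE_B, u1, u2, u3 are assumptions made for illustration.

```python
from collections import Counter


def mine_failure_rules(jobs, attributes, min_support=0.3, min_confidence=0.8):
    """Return rules (attribute, value) -> failed with support and confidence.

    support    = fraction of all jobs that have the value and failed
    confidence = fraction of jobs with the value that failed
    """
    total = len(jobs)
    seen = Counter()
    failed = Counter()
    for job in jobs:
        for attr in attributes:
            item = (attr, job[attr])
            seen[item] += 1
            if job["failed"]:
                failed[item] += 1
    rules = []
    for item, n_failed in failed.items():
        support = n_failed / total
        confidence = n_failed / seen[item]
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, support, confidence))
    # Strongest rules first: by confidence, then by support.
    return sorted(rules, key=lambda r: (r[2], r[1]), reverse=True)


# Hypothetical sample: every job at SITE_A fails, whichever user submits it.
JOBS = [
    {"site": "SITE_A", "user": "u1", "failed": True},
    {"site": "SITE_A", "user": "u2", "failed": True},
    {"site": "SITE_A", "user": "u3", "failed": True},
    {"site": "SITE_B", "user": "u1", "failed": False},
    {"site": "SITE_B", "user": "u2", "failed": False},
]

rules = mine_failure_rules(JOBS, ["site", "user"])
print(rules)
```

In this sample only the rule for SITE_A passes both thresholds, so the site is flagged as the likely faulty component. A full miner would also consider combinations of attributes.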

QAOES use case 1

QAOES use case 1
Figure: jobs overview on the T2_FR_IPHC site, sorted by activity, for the last 6 hours.
User analysis jobs have a very low success rate. Let's see whether these jobs belong to one particular user.

Figure: jobs overview on the T2_FR_IPHC site, sorted by user, for the last 6 hours.
Almost all users have jobs that failed from the application point of view. Let's check why two users do not have this problem.

QAOES use case 1
Figure: jobs overview on the T2_FR_IPHC site, sorted by dataset, for the last 6 hours.
Jobs with an "unknown" dataset do not use any input data. The other jobs failed with error code 8020, which indicates a data access problem. The automatically generated rule correctly detected the faulty component: the site, and more precisely a data access problem at the site.

QAOES use case 2

QAOES use case 2
Figure: distribution of the user's jobs per site for the last 6 hours.
A large number of the user's jobs failed, were cancelled or were aborted at different sites. Let's check whether this happened with one particular task of the user. Sort the jobs by task.

Figure: distribution of the user's jobs per task.
The user's jobs fail at different sites and in different tasks. It could be an input data problem. Let's check.

Figure: distribution of the user's jobs per dataset.
The user's jobs are failing at various sites, running different tasks and reading different datasets. This is a clear indication of an error in the user code, which is consistent with the automatically generated rule.
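The manual drill-down used in the two use cases (check whether failures concentrate on one site, one user or one dataset) can be written down as a rough heuristic. This is a sketch of the reasoning on the slides, not the rule generated by QAOES. The `dominance` threshold, the field names and all labels are hypothetical.

```python
from collections import Counter


def diagnose(failed_jobs, dominance=0.8):
    """Guess the faulty component from how failures spread across attributes."""
    if not failed_jobs:
        return "no failures"

    def concentrated(key):
        # True when a single value of `key` accounts for most failed jobs.
        counts = Counter(job[key] for job in failed_jobs)
        return counts.most_common(1)[0][1] / len(failed_jobs) >= dominance

    site = concentrated("site")
    user = concentrated("user")
    data = concentrated("dataset")
    if site and not user:
        return "site"        # many users fail at one site (use case 1)
    if user and not site and not data:
        return "user code"   # one user fails everywhere (use case 2)
    if data and not site:
        return "input data"  # failures follow one dataset across sites
    return "undetermined"


# Use case 1: five users, several datasets, all failing at the same site.
case1 = [{"site": "SITE_A", "user": f"u{i}", "dataset": f"d{i % 2}"} for i in range(5)]
# Use case 2: one user failing at different sites on different datasets.
case2 = [{"site": s, "user": "u1", "dataset": d}
         for s, d in [("SITE_A", "d1"), ("SITE_B", "d2"), ("SITE_C", "d3")]]

print(diagnose(case1))
print(diagnose(case2))
```

The threshold trades sensitivity against false alarms. Real monitoring data would also need a minimum number of failed jobs before any conclusion is drawn.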

Add a solution

Conclusions
- Monitoring of job processing is one of the main indicators of the overall quality of the Grid infrastructure.
- User-friendly, reliable and complete monitoring is vital for the effective organization of distributed data analysis.
- Developed in close collaboration with the user community, the Dashboard job monitoring applications provide the functionality required for LHC offline computing activities.
- Future development and improvements are driven by the feedback and suggestions of LHC users.

Thank you for your attention!