CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,

Slides:



Advertisements
Similar presentations
Experience In Developing Dynamic Web Interfaces: The Case Study of the ALICE Job Reliability Dashboard Eamonn Maguire IT-PSS 30-Aug
Advertisements

Client/Server Grid applications to manage complex workflows Filippo Spiga* on behalf of CRAB development team * INFN Milano Bicocca (IT)
Analysis demos from the experiments. Analysis demo session Introduction –General information and overview CMS demo (CRAB) –Georgia Karapostoli (Athens.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES News on monitoring for CMS distributed computing operations Andrea.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
CERN IT Department CH-1211 Geneva 23 Switzerland t The Experiment Dashboard ISGC th April 2008 Pablo Saiz, Julia Andreeva, Benjamin.
CERN IT Department CH-1211 Genève 23 Switzerland t Service Management GLM 15 November 2010 Mats Moller IT-DI-SM.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services GS group meeting Monitoring and Dashboards section Activity.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES P. Saiz (IT-ES) AliEn job agents.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VO-specific systems for the monitoring of.
Real Time Monitor of Grid Job Executions Janusz Martyniak Imperial College London.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Julia Andreeva CERN (IT/GS) CHEP 2009, March 2009, Prague New job monitoring strategy.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES PhEDEx Monitoring Nicolò Magini CERN IT-ES-VOS For the PhEDEx.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
CERN IT Department CH-1211 Genève 23 Switzerland t Monitoring: Tracking your tasks with Task Monitoring PAT eLearning – Module 11 Edward.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
Dashboard program of work Julia Andreeva GS Group meeting
Julia Andreeva, CERN IT-ES GDB Every experiment does evaluation of the site status and experiment activities at the site As a rule the state.
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN.
CERN IT Department CH-1211 Geneva 23 Switzerland t CCRC’08 Tools for measuring our progress CCRC’08 F2F 5 th February 2008 James Casey, IT-GS-MND.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Priorities update Andrea Sciabà IT/GS Ulrich Schwickerath IT/FIO.
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.
CERN IT Department CH-1211 Genève 23 Switzerland t DM Database Monitoring Tools Database Developers' Workshop CERN, July 8 th, 2008 Dawid.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
Visualization Ideas for Management Dashboards
CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.
Julia Andreeva on behalf of the MND section MND review.
CERN IT Department CH-1211 Genève 23 Switzerland t Experiment Operations Simone Campana.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Andrea Sciabà Hammercloud and Nagios Dan Van Der Ster Nicolò Magini.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI User-centric monitoring of the analysis and production activities within.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
Global ADC Job Monitoring Laura Sargsyan (YerPhI).
Pavel Nevski DDM Workshop BNL, September 27, 2006 JOB DEFINITION as a part of Production.
Enabling Grids for E-sciencE Grid monitoring from the VO/User perspective. Dashboard for the LHC experiments Julia Andreeva CERN, IT/PSS.
Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.
CERN IT Department CH-1211 Geneva 23 Switzerland t OIS Tim Bell CERN IT/OIS 7 th September 2010 Service Management Meeting.
CERN - IT Department CH-1211 Genève 23 Switzerland CASTOR F2F Monitoring at CERN Miguel Coelho dos Santos.
D.Spiga, L.Servoli, L.Faina INFN & University of Perugia CRAB WorkFlow : CRAB: CMS Remote Analysis Builder A CMS specific tool written in python and developed.
WLCG Transfers Dashboard A unified monitoring tool for heterogeneous data transfers. Alexandre Beche.
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
Status of gLite-3.0 deployment and uptake Ian Bird CERN IT LCG-LHCC Referees Meeting 29 th January 2007.
MND section. Summary of activities Job monitoring In collaboration with GridView and LB teams enabled full chain from LB harvester via MSG to Dashboard.
Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities XVIII International Conference on Computing in High Energy and Nuclear.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES The Common Solutions Strategy of the Experiment Support group.
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
CERN IT Department CH-1211 Geneva 23 Switzerland t OIS Operating Systems & Information Services CERN IT Department CH-1211 Geneva 23 Switzerland.
1 DIRAC Project Status A.Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille 10 March, DIRAC Developer meeting.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS Section input to GLM For GLM attended by Director for Computing.
CERN IT Department CH-1211 Genève 23 Switzerland t Load testing & benchmarks on Oracle RAC Romain Basset – IT PSS DP.
WLCG Transfers monitoring EGI Technical Forum Madrid, 17 September 2013 Pablo Saiz on behalf of the Dashboard Team CERN IT/SDC.
CERN IT Department CH-1211 Genève 23 Switzerland t EGEE09 Barcelona ATLAS Distributed Data Management Fernando H. Barreiro Megino on behalf.
CERN IT Department CH-1211 Geneva 23 Switzerland t LHCOPN Meeting Madrid, 11 th March 2008 James Casey WLCG Monitoring – An overview.
HPDC Grid Monitoring Workshop June 25, 2007 Grid monitoring from the VO/user perspectives Shava Smallen.
Key Activities. MND sections
POW MND section.
New monitoring applications in the dashboard
Monitoring of the infrastructure from the VO perspective
Presentation transcript:

CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu, B. Gaidioz, J. Herrala, E.J. Maguire, G. Maier, R. Rocha, P. Saiz

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 2 Table of Content What is Grid reliability? How do we do it? What do we do? –Data Management –Workload Management To Do list Conclusions Useful links

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 3 Grid Reliability Our goal: –Provide tools to detect, investigate and solve all the possible Grid errors. How: –Using the experiment’s dashboards Monitoring the user jobs Deliverables: –Efficiency tables Site performances as seen by selected applications Tools to monitor the sites “day-by-day” and augment the available information for more efficient debugging

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 4 How we collect data DASHBOARD RGMA MonALISA IC RuntimeMonitor RB For some RB: edg-get-logging-info –v 2 edg-get-status -v 2 Already deployed for ALICE, ATLAS, LHCb and CMS VleMed in its way…

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 5 Displaying the data DASHBOARD HTML pages CSV list XML files We can display the same data in different formats For more details, see Julia Andreeva’s talk, Wednesday at 17:10: Grid Monitoring from the VO/ User perspective. Dashboard for the LHC experiments.

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 6 Grid Reliability Workload management –Deployed for ALICE, ATLAS, CMS and LHCb –Monitor jobs through RGMA and Imperial College Runtime Monitor Data management: –FTS for ALICE Deployed in September 2006 Heavily used during the service challenges –DDM for ATLAS For more details, see Ricardo Rocha’s talk, Thursday at 17:10: Monitoring the Atlas Distributed Data Management System

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 7 Investigating job workflow Looking at the final state: –Simple: –Reliability of the whole system Looking at all the status changes: –More information –Possibility to catch errors solved by the middleware –Reliability of different sites

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 8 Job vs. Job Attempt There is only one job. However, we split it in four job attempts, and study each one independently

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 9 Tools for Job Reliability ‘Site of the day’: –daily report on number of successful/failed job attempts Site performance –Evolution of a site over a period of time Error list –Most Common list of error messages, with pointers to documentation –Evolution of the error over time Waiting time –Time that users have to wait from the moment they submit the job until they get the results back Aggregated reports: –Monthly reports –Multi vo reports

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 10 Web portal

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 11 Site of the day Ranking of the efficiency of the sites

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 12 Site of the day For each VO:

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 13 Getting the job ids (and history) of jobs that failed. Site of the day

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 14 Site performance Reliability of a site over a period of time

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 15 Error list Select the list of most common error Progression of error over time Pointers to the gocwiki (wherever possible) Restrict to a site and/or month

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 16 Waiting time Total time (from submission to completion) for a type of jobs

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 17 Aggregated reports Daily multi-VO report for a selected number of sites (T1) –See at a glance if everything is working Monthly automatic reports with: –Efficiency tables –Summary of job attempts per site

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 18 Multi VO report Clicking on any cell expands the information Available since January

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 19 Automatic reports Reports automatically created at the end of each month

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 20 ALICE Data Management Daily reports on successful/failed transfers Last 24 hours report updated every hour

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 21 ALICE FTD-FTS

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 22 To do list: Study job workflow from other sources: –Pass the information from DIRAC (LHCb) X509 authentication in the web interface Group similar job attempts into patterns Data management reports for ATLAS Support its usage by sites and VOs –Always open to suggestions –Adjust the tools according to the requests

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 23 Conclusions We provide several tools to investigate the grid reliability Data Management: –FTS reliability for ALICE Workload Management: –‘Site of the day’, ‘Error messages’, ‘Site performance’ –Aggregated views Multi VO and automatic d monthly reports Already in use by ALICE, ATLAS, CMS and LHCB –Deployment for Vlemed on its way

CERN - IT Department CH-1211 Genève 23 Switzerland t CHEP2007,Victoria, Canada Grid Reliability - 24 Useful links: FTS : WMS: