Download presentation
Presentation is loading. Please wait.
Published byLee Cobb Modified over 8 years ago
1
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu, B. Gaidioz, J. Herrala, E.J. Maguire, G. Maier, R. Rocha, P. Saiz
2
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 2 Table of Content What is Grid reliability? How do we do it? What do we do? –Data Management –Workload Management To Do list Conclusions Useful links
3
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 3 Grid Reliability Our goal: –Provide tools to detect, investigate and solve all the possible Grid errors. How: –Using the experiment’s dashboards Monitoring the user jobs Deliverables: –Efficiency tables Site performances as seen by selected applications Tools to monitor the sites “day-by-day” and augment the available information for more efficient debugging
4
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 4 How we collect data DASHBOARD RGMA MonALISA IC RuntimeMonitor RB For some RB: edg-get-logging-info –v 2 edg-get-status -v 2 Already deployed for ALICE, ATLAS, LHCb and CMS VleMed in its way…
5
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 5 Displaying the data DASHBOARD HTML pages CSV list XML files We can display the same data in different formats For more details, see Julia Andreeva’s talk, Wednesday at 17:10: Grid Monitoring from the VO/ User perspective. Dashboard for the LHC experiments.
6
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 6 Grid Reliability Workload management –Deployed for ALICE, ATLAS, CMS and LHCb –Monitor jobs through RGMA and Imperial College Runtime Monitor Data management: –FTS for ALICE Deployed in September 2006 Heavily used during the service challenges –DDM for ATLAS For more details, see Ricardo Rocha’s talk, Thursday at 17:10: Monitoring the Atlas Distributed Data Management System
7
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 7 Investigating job workflow Looking at the final state: –Simple: –Reliability of the whole system Looking at all the status changes: –More information –Possibility to catch errors solved by the middleware –Reliability of different sites
8
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 8 Job vs. Job Attempt There is only one job. However, we split it in four job attempts, and study each one independently
9
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 9 Tools for Job Reliability ‘Site of the day’: –daily report on number of successful/failed job attempts Site performance –Evolution of a site over a period of time Error list –Most Common list of error messages, with pointers to documentation –Evolution of the error over time Waiting time –Time that users have to wait from the moment they submit the job until they get the results back Aggregated reports: –Monthly reports –Multi vo reports
10
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 10 Web portal
11
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 11 Site of the day Ranking of the efficiency of the sites
12
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 12 Site of the day For each VO:
13
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 13 Getting the job ids (and history) of jobs that failed. Site of the day
14
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 14 Site performance Reliability of a site over a period of time
15
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 15 Error list Select the list of most common error Progression of error over time Pointers to the gocwiki (wherever possible) Restrict to a site and/or month
16
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 16 Waiting time Total time (from submission to completion) for a type of jobs
17
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 17 Aggregated reports Daily multi-VO report for a selected number of sites (T1) –See at a glance if everything is working Monthly automatic reports with: –Efficiency tables –Summary of job attempts per site
18
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 18 Multi VO report Clicking on any cell expands the information Available since January
19
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 19 Automatic reports Reports automatically created at the end of each month
20
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 20 ALICE Data Management Daily reports on successful/failed transfers Last 24 hours report updated every hour
21
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 21 ALICE FTD-FTS
22
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 22 To do list: Study job workflow from other sources: –Pass the information from DIRAC (LHCb) X509 authentication in the web interface Group similar job attempts into patterns Data management reports for ATLAS Support its usage by sites and VOs –Always open to suggestions –Adjust the tools according to the requests
23
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 23 Conclusions We provide several tools to investigate the grid reliability Data Management: –FTS reliability for ALICE Workload Management: –‘Site of the day’, ‘Error messages’, ‘Site performance’ –Aggregated views Multi VO and automatic d monthly reports Already in use by ALICE, ATLAS, CMS and LHCB –Deployment for Vlemed on its way
24
CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/i t CHEP2007,Victoria, Canada Pablo.Saiz@cern.ch Grid Reliability - 24 Useful links: http://dashboard.cern.ch FTS : http://dboard-gr.cern.ch/dashboard/data/fts/index.html WMS: http://dashb-alice.cern.ch/jr.html http://dashb-atlas-job.cern.ch/jr.html http://dashb-lhcb.cern.ch/jr.html
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.