New monitoring applications in the dashboard

Slides:



Advertisements
Similar presentations
Client/Server Grid applications to manage complex workflows Filippo Spiga* on behalf of CRAB development team * INFN Milano Bicocca (IT)
Advertisements

Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES News on monitoring for CMS distributed computing operations Andrea.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
Nightly Releases and Testing Alexander Undrus Atlas SW week, May
CERN IT Department CH-1211 Geneva 23 Switzerland t The Experiment Dashboard ISGC th April 2008 Pablo Saiz, Julia Andreeva, Benjamin.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services GS group meeting Monitoring and Dashboards section Activity.
Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
Enabling Grids for E-sciencE EGEE-III INFSO-RI Using DIANE for astrophysics applications Ladislav Hluchy, Viet Tran Institute of Informatics Slovak.
CERN IT Department CH-1211 Genève 23 Switzerland t Monitoring: Tracking your tasks with Task Monitoring PAT eLearning – Module 11 Edward.
PanDA Monitor Development ATLAS S&C Workshop by V.Fine (BNL)
DDM Monitoring David Cameron Pedro Salgado Ricardo Rocha.
ATLAS Experience with GGUS Guido Negri INFN – Milano Italy.
E-infrastructure shared between Europe and Latin America FP6−2004−Infrastructures−6-SSA gLite Information System Pedro Rausch IF.
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
ATLAS Production System Monitoring John Kennedy LMU München CHEP 07 Victoria BC 06/09/2007.
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.
8 th CIC on Duty meeting Krakow /2006 Enabling Grids for E-sciencE Feedback from SEE first COD shift Emanoil Atanassov Todor Gurov.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Progress on first user scenarios Stephen.
ATLAS Dashboard Recent Developments Ricardo Rocha.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
Site Manageability & Monitoring Issues for LCG Ian Bird IT Department, CERN LCG MB 24 th October 2006.
G.Govi CERN/IT-DB 1 September 26, 2003 POOL Integration, Testing and Release Procedure Integration  Packages structure  External dependencies  Configuration.
The GridPP DIRAC project DIRAC for non-LHC communities.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.
LCG WLCG Accounting: Update, Issues, and Plans John Gordon RAL Management Board, 19 December 2006.
Kati Lassila-Perini EGEE User Support Workshop Outline: – CMS collaboration – User Support clients – User Support task definition – passive support:
Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.
Distributed Analysis Tutorial Dietrich Liko. Overview  Three grid flavors in ATLAS EGEE OSG Nordugrid  Distributed Analysis Activities GANGA/LCG PANDA/OSG.
K. Harrison CERN, 21st February 2005 GANGA: ADA USER INTERFACE - Ganga release Python client for ADA - ADA job builder - Ganga release Conclusions.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,
The GridPP DIRAC project DIRAC for non-LHC communities.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
Breaking the frontiers of the Grid R. Graciani EGI TF 2012.
EGEE is a project funded by the European Union under contract IST Issues from current Experience SA1 Feedback to JRA1 A. Pacheco PIC Barcelona.
ConTZole Tomáš Kubeš, 2010 atlas-tz-monitoring.cern.ch An Interactive ATLAS Tier-0 Monitoring.
II EGEE conference Den Haag November, ROC-CIC status in Italy
Acronyms GAS - Grid Acronym Soup, LCG - LHC Computing Project EGEE - Enabling Grids for E-sciencE.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI GLUE 2: Deployment and Validation Stephen Burke egi.eu EGI OMB March 26 th.
Joe Foster 1 Two questions about datasets: –How do you find datasets with the processes, cuts, conditions you need for your analysis? –How do.
Availability of ALICE Grid resources in Germany Kilian Schwarz GSI Darmstadt ALICE Offline Week.
Daniele Bonacorsi Andrea Sciabà
Monitoring Evolution and IPv6
James Casey, CERN IT-GD WLCG Workshop 1st September, 2007
Project Center Use Cases Revision 2
Project Center Use Cases
Key Activities. MND sections
ALICE Monitoring
POW MND section.
CREAM Status and Plans Massimo Sgaravatto – INFN Padova
ATLAS support in LCG.
FTS Monitoring Ricardo Rocha
Readiness of ATLAS Computing - A personal view
Savannah to Jira Migration
Testing for patch certification
LCGAA nightlies infrastructure
1 VO User Team Alarm Total ALICE ATLAS CMS
Operations & Coordination Tools
Project Center Use Cases Revision 3
Project Center Use Cases Revision 3
Pole 3 – Dashboard Assessment COD 20 - Helsinki
LCG Operations Workshop, e-IRG Workshop
Monitoring of the infrastructure from the VO perspective
Leigh Grundhoefer Indiana University
Status and plans for bookkeeping system and production tools
Presentation transcript:

New monitoring applications in the dashboard Benjamin Gaidioz (CERN) EGEE User Forum, Clermont-Ferrand 2008

Outline Dashboard Updates on job monitoring, framework, Production monitoring for ATLAS Conclusion

The dashboard: a grid monitoring project Showing VO usage of the grid (jobs, data), Developed at CERN Main applications: Job monitoring, data transfers monitoring (ATLAS), Efficiency of grid sites, http://dashboard.cern.ch Recent developments: User analysis tasks, HEP production monitoring and shifters help.

Some updates about existing applications and components

The job monitoring application A tool helping the VO to see its jobs on the grid, Multi grid, VO metadata associated to jobs, RGMA, IC-XML, LCG BDII and VO data

The job monitoring application A tool helping the VO to see its jobs on the grid, Multi grid, VO metadata associated to jobs, RGMA, IC-XML, LCG BDII and VO data

Update about the job monitoring application Standalone installation by NIKHEF for VL-eMed (VO), Running since a year, Now in collaboration with them to upgrade and include vlemed metadata in the monitoring information More info: dashboard poster, fMRI analysis demo

User task monitoring A user view for the job monitoring information, Hopefully more practical for distributed analysis users.

User task monitoring A user view for the same data (show progress of tasks)

A user view for the same data (show progress of tasks) User task monitoring A user view for the same data (show progress of tasks) Integrated with the job monitoring application

dashboard framework: common base for applications: Apache, mod_python, XSL/XML, Build system, release system, documentation (dev guide, demo application)

dashboard framework: common base for applications: Very practical flexibility in output (text/xml, text/csv, HTML, png) Common base for API and CLI.

dashboard framework: current developments: Monitoring of components integrated with Lemon, Nagios, Integration with SAM (service monitoring in EGEE): e.g. alert if ATLAS production is bad on site Improve the flexibility of web applications for faster developments (more component based, AJAX, etc.).

Monitoring of the production for ATLAS

ATLAS production activities: 75,000 jobs per day, on three middlewares (EGEE, OSG, Nordugrid), A central “queue” of jobs to run, dequeued by several submission systems (now converging to one), A couple of experts and many volunteers shifters, Knowledge transfer from experts to trainee shifters: Procedures, workbook, Tutorials during software weeks, Permanent mailing-list, logbook, chat, etc.

Monitoring: ATLAS is using a dashboard application for data transfers (DDM monitoring, for DMM see their presentation and poster), Had a working monitoring application for production (John Kennedy, ATLAS), Asked to have a dashboard application instead: Maintenance was becoming an issue, Potential integration with data transfer monitoring, Prototype developed from feb to sep 2007

Developments: Several meetings with production system developers and expert shifters, Private discussions with expert shifters, Identification of missing features in the current system, etc. Observations: People were really willing to explain and give advices, but it's very time consuming for them (explanations, e-mails, etc.).

Prototype: Provided practical browsing functionalities (missing in the original application), Overview displays (number of jobs, errors today, etc.), User's guide, FAQ Observations: In principle all possible views were accessible in three clicks, Everyone seemed to like.

terminated jobs today (main four) select a view main four errors today dig into the information today, walltime summary go to the error page

go to the job details display first three errors of each

click a job to get more details

For automatization instead of browsing a web page: A python API For automatization instead of browsing a web page: from dashboard.api.production import ProductionQuery query = ProductionQuery() errors = query.errors(site=”IN2P3-LPC”, grouping='clusters') for error in errors: # etc.

Integration in the tutorials: ATLAS was organizing tutorials in sep 2007, Dashboard part of it (as “next monitoring tool, please send feedback”), Demo based on the instructions found in tutorial slides, Howto in the user's guide. Lots of feedback (most from managers and expert shifters), However the existing tools were still the ones to use (sep 2007).

Developments then (sep 2007 to dec 2007): Integration of middleware information systems (LCG BDII, OSG, Nordugrid ARC) to recognize grids, sites, from the cluster name reported by the production system, Manager's display (select dates, plots of success rate, share between the three grids, etc.), Integration with the monitoring of submission tools (links to log files for example).

Daily statistics

More filtering: grid, sites, etc.

Official switch to the dashboard (jan 2008): Big warning in the former tool, Tutorial slides based on the dashboard, Shifter's workbook referring to the dashboard, New feedback from shifters (non experts), The developer did a shift (using the dashboard), Suggested by the developer of the former ATLAS tool, B. Gaidioz did two days of shifts following the workbook, using the dashboard.

Outcome after two days of shifts: Three clicks each time is too much! Lots of cut & paste from dashboard to GGUS, etc. One needs an immediate display matching the shifter's workbook: “for error X, submit a savannah ticket, write into it this and that”, “for error Y, submit a GGUS ticket with these details”, “look at clouds (tier1+ its tier2) with high failures and do that”. All the information was in the dashboard But the interface is practical enough.

Prototype after some days: Software installation errors Data copy errors by site Prototype after some days: ATLAS software errors by task Main errors for each middleware Cloud success rate

Feedback for the shifter's display: Very positive feedback from everyone. Other improvements to come: Simple ones (but were never expressed before): Simple information missing (need other tool or three more clicks), Need links to some tools (logbook, etc.) with URL params ready, Need a pre-filled template for GGUS ticket / savannah bug, More sophisticated improvements: Error codes are not always sufficient, need to check some correlations before taking actions

Summary & conclusion

Summary Job monitoring installed in VL-eMed, User task monitoring implemented with job monitoring, The dashboard framework getting stronger and integrated with Lemon, Nagios, SAM. A specific monitoring application for ATLAS production Now an operation tool.

Conclusion Requirements came naturally after two days of operation, Dashboard: “monitoring the grid from the VO point of view”, (Obviously, running the VO systems teaches you a lot). Need flexibility in web interface: Because we will need very often custom displays, XSL web pages are hard to design as “components” Push for using the python APIs: Permits to get very specific information in zero click, Perhaps integrate it in the VO system.

More info: http://dashboard.cern.ch Experience with the VL-eMed installation (poster: the experiment dashboard for medical applications).