EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org Overview of STEP09 monitoring issues Julia Andreeva, IT/GS 09.07.2009 STEP09 Postmortem.


EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem Workshop

Jumping to conclusions
- A variety of tests ran during STEP09, so a variety of monitoring systems were used.
- We were certainly not running blind and could follow what was going on pretty well.
- To follow the experiment activities, the VO-specific monitoring systems were used in most cases.
- To check the health of the services and of the sites, the VOs mostly relied on centrally provided monitoring systems such as SAM and SLS.

Short questionnaire sent to all 4 experiments
- Do you think that your VO had all the necessary monitoring tools, and did they provide the functionality required to follow STEP09? What has to be improved?
- Which monitoring systems were used for each particular test?
- Was it possible to see the overall picture (all 4 experiments)?
- Wish list …
Many thanks to everyone who provided input and sent answers.

ALICE
- ALICE did not suffer from any lack of monitoring information and was able to follow the STEP09 activities pretty well.
- For both transfer (rate) and job processing, ALICE used its native monitoring service based on MonAlisa.
- For transfer efficiency and errors, ALICE used the Dashboard.
- For the overall picture regarding transfers, ALICE used GridView.
- No particular requests regarding monitoring.

ATLAS
In general, ATLAS had the monitoring infrastructure necessary to follow STEP09, though some issues were seen and there is room for improvement (my conclusion from the ATLAS answers).

ATLAS transfer monitoring
- The Dashboard was used for data transfer monitoring. It is good for the overall picture of data transfer.
- Problem noticed: a single error can be magnified so much that it is hard to see anything else (filtering out known problems would be useful).
- What is missing for the specific needs of operations:
1. Monitoring of broken subscriptions
2. Monitoring of queues of subscriptions
3. Monitoring of subscriptions not picked up
4. Information ordered by source
5. Development of drill-down plots giving the efficiency and the bandwidth consumed in a given time period
6. Some work on pre-stage monitoring, especially for staged files and datasets
- The work Ricardo did on the 2D plots, generated on the client side, looks like a very healthy development. This is probably the way to go for the more flexible monitoring ATLAS needs in the future.
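Two of the requests above, filtering out known problems and drilling down to efficiency and bandwidth per time period and source, can be illustrated with a small sketch. Everything in it is hypothetical: the record fields (`ts`, `source`, `ok`, `bytes`, `error`), the `KNOWN_ERRORS` set and the function names are invented for illustration and do not describe the actual ATLAS Dashboard schema or API.

```python
from collections import Counter, defaultdict

# Hypothetical transfer records; the field names are assumptions, not a real schema.
TRANSFERS = [
    {"ts": 0,    "source": "SITE_A", "ok": True,  "bytes": 2_000_000_000, "error": None},
    {"ts": 600,  "source": "SITE_A", "ok": False, "bytes": 0, "error": "KNOWN_TIMEOUT"},
    {"ts": 1200, "source": "SITE_A", "ok": False, "bytes": 0, "error": "KNOWN_TIMEOUT"},
    {"ts": 1800, "source": "SITE_B", "ok": False, "bytes": 0, "error": "CHECKSUM_MISMATCH"},
    {"ts": 4000, "source": "SITE_B", "ok": True,  "bytes": 1_000_000_000, "error": None},
]

# Errors that operators already know about (hypothetical label).
KNOWN_ERRORS = {"KNOWN_TIMEOUT"}


def error_summary(transfers, known=frozenset()):
    """Count failed transfers per error message, skipping known errors."""
    return Counter(
        t["error"] for t in transfers
        if not t["ok"] and t["error"] not in known
    )


def drill_down(transfers, bin_seconds=3600):
    """Return efficiency and average throughput (bytes/s) per (time bin, source)."""
    stats = defaultdict(lambda: {"ok": 0, "total": 0, "bytes": 0})
    for t in transfers:
        key = (t["ts"] // bin_seconds, t["source"])
        stats[key]["total"] += 1
        stats[key]["ok"] += int(t["ok"])
        stats[key]["bytes"] += t["bytes"]
    return {
        key: {
            "efficiency": s["ok"] / s["total"],
            "throughput": s["bytes"] / bin_seconds,
        }
        for key, s in sorted(stats.items())
    }
```

Calling `error_summary(TRANSFERS, KNOWN_ERRORS)` leaves only the errors that still need attention, while `drill_down(TRANSFERS)` keys the results by source, which also covers the "information ordered by source" item in this toy setting.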

ATLAS job processing monitoring
- PANDA and the Dashboard were used for production and analysis.
- Production monitoring is in good shape. PANDA is very useful for debugging possible problems; the Dashboard provides better historical views.
- Monitoring of the analysis jobs needs considerable improvement.
- Problems seen with Dashboard job monitoring for analysis:
1). Instability of the MonAlisa server, which had to be rebooted almost every day. This might be a configuration problem, since the CMS MonAlisa server works perfectly under a much higher load than the ATLAS one. To be checked with the MonAlisa experts.
2). In general, the ATLAS version of Dashboard job monitoring differs from the CMS one, which is constantly improving (with work on both the CRAB and the Dashboard side). The modifications made to the CMS Dashboard have to be applied to the ATLAS instance.

ATLAS (continuation)
Monitoring of the central services
- ATLAS considers SLS a good infrastructure for service monitoring and uses it to monitor its own services.
Looking at the overall picture (4 VOs)
- Not so much. The WLCG daily operations meetings usually communicated the necessary information.
General comments regarding future development
- At the moment all monitoring is an aggregation of lower-level information. ATLAS needs to find some way of building an ATLAS Grid Dashboard that looks at higher-level metrics, e.g. the number of functional test datasets subscribed in the last 6 hours (if this is low, there is trouble).
- In the future ATLAS foresees slow control systems built on this monitoring, so all monitoring systems should provide a machine-readable format, not just plots.
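The higher-level metric mentioned above, together with the request for machine-readable output, could look roughly like the following sketch. The function name, the metric name, the `min_expected` threshold and the JSON layout are all assumptions made for illustration; the slides do not define a threshold or an output format.

```python
import json


def subscriptions_metric(subscription_times, now, window_hours=6, min_expected=10):
    """Summarise how many functional test datasets were subscribed recently.

    subscription_times: subscription timestamps in epoch seconds (hypothetical input).
    min_expected: assumed alarm threshold; a lower count is reported as "trouble".
    """
    window_start = now - window_hours * 3600
    count = sum(1 for ts in subscription_times if window_start <= ts <= now)
    return {
        "metric": "functional_test_datasets_subscribed",
        "window_hours": window_hours,
        "value": count,
        "status": "ok" if count >= min_expected else "trouble",
    }


if __name__ == "__main__":
    now = 1_000_000
    # Subscriptions made 1, 2, 7 and 8 hours before "now".
    times = [now - 3600 * h for h in (1, 2, 7, 8)]
    # JSON is one possible machine-readable format a control system could consume.
    print(json.dumps(subscriptions_metric(times, now, min_expected=3), sort_keys=True))
```

Returning a plain dictionary keeps the metric usable both for a plot and for an automated consumer, which is the point made on the slide about not providing plots only.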

CMS
Same as for ATLAS. In general, CMS has in place the monitoring infrastructure necessary to follow its computing activities in detail, though some work and improvements are foreseen.

CMS transfer monitoring
- PHEDEX was used.
- No particular issues were mentioned in the CMS reports regarding transfer monitoring.

CMS production monitoring
- CMS used multiple systems: T0AST for T0 monitoring, and the native glideins monitoring and the CMS Dashboard for monitoring of the reprocessing.
Known issues (in fact known since CCRC08)
- Insufficient reporting from ProdAgent to the Dashboard. ProdAgent (PA) does not report to the Dashboard the job status information available from the user interface, for example when a job is killed or aborted. CPU time, wall-clock time and the number of processed events are not reported from ProdAgent to the Dashboard either.

CMS analysis monitoring
- Users mainly rely on the output of the 'CRAB -status' command and on Dashboard Task monitoring. Dashboard Task monitoring is used extensively by users every day (… distinct analysis users submit their jobs to the Grid daily).
- For STEP09 the overall picture was required. The CMS Dashboard interactive UI, the CMS Dashboard programmatic interface and the native glideins monitoring were used.
Issues
- Reporting to the Dashboard from jobs submitted via the CRAB server to condor-glideins was still being debugged during STEP09. As a result, the Dashboard statistics for glidein jobs were somewhat higher than in reality.
- Dashboard historical views provide information in terms of jobs, not in terms of CPU or wall-clock time. CPU and wall-clock distributions are being added in the new version of the historical view, which is under development.
Improvements foreseen
- Understand and provide a comprehensive picture for the Analysis Support team. Most of the information needed exists in the Dashboard. The Dashboard team is working together with CMS to come up with an appropriate interface for the Analysis Support shifters. The twiki page created by CMS for the STEP09 analysis test also provides good input for the Dashboard developers.

CMS (continuation)
Looking at the overall picture (all 4 experiments)
- Same as ATLAS. CMS was too busy to see what the other experiments were doing.
- When CMS needed to understand issues at a particular site, it mostly relied on input provided by the site administrators.
- CMS did not have a chance to validate new systems such as SiteView, mostly due to time restrictions.

LHCb
- For both transfer and data processing monitoring, LHCb used the DIRAC portal, which provided sufficient information to follow the STEP09 activities.
- For the status of the CEs at the sites, LHCb used the SAM portal and the Dashboard interface for VO-specific SAM tests.
- Foreseen improvements: correlate monitoring and accounting information from DIRAC, SAM test results, the GGUS portal and GOCDB downtime information for more automated management of the LHCb computing resources, for example to avoid situations in which a site is banned without good reason.
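A minimal sketch of such a correlation is shown below. It is purely illustrative: the `SiteEvidence` fields, the failure threshold and the rule that at least two independent sources must agree before a site is banned are assumptions made here, not LHCb policy, and none of the real DIRAC, SAM, GGUS or GOCDB interfaces are used.

```python
from dataclasses import dataclass


@dataclass
class SiteEvidence:
    """Per-site summary assumed to be collected from the four sources."""
    job_failure_rate: float      # from DIRAC monitoring/accounting (0.0 to 1.0)
    sam_tests_failing: bool      # VO-specific SAM test results
    open_ggus_tickets: int       # tickets currently open against the site
    in_scheduled_downtime: bool  # downtime declared in GOCDB


def ban_decision(evidence, failure_threshold=0.5):
    """Return (ban, reason). The two-source rule is a hypothetical policy."""
    if evidence.in_scheduled_downtime:
        return True, "scheduled downtime declared in GOCDB"

    reasons = []
    if evidence.job_failure_rate >= failure_threshold:
        reasons.append("high job failure rate in DIRAC")
    if evidence.sam_tests_failing:
        reasons.append("VO-specific SAM tests failing")
    if evidence.open_ggus_tickets > 0:
        reasons.append("open GGUS ticket")

    # Require corroboration so that a single noisy signal does not ban a site.
    if len(reasons) >= 2:
        return True, "; ".join(reasons)
    return False, "insufficient corroborating evidence"
```

With this rule, a site showing only a high job failure rate would stay in use until a second source confirms the problem, which is one way to reduce bans made without good reason.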

Conclusions
- The existing monitoring systems, though not perfect, did provide the information necessary to follow the STEP09 activities.
- The issues and problems seen during STEP09 define the short-term development plans in the monitoring area.