EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org Overview of STEP09 monitoring issues Julia Andreeva, IT/GS 09.07.2009 STEP09 Postmortem.

1 EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org Overview of STEP09 monitoring issues Julia Andreeva, IT/GS 09.07.2009 STEP09 Postmortem Workshop

2 Jumping to conclusions
- A variety of tests ran during STEP09, so a variety of monitoring systems were used.
- We were certainly not running blind and could follow what was going on quite well.
- To follow experiment activities, the VO-specific monitoring systems were used in most cases.
- To check the health of services and sites, the VOs mostly relied on centrally provided monitoring systems such as SAM and SLS.

3 Short questionnaire sent to all 4 experiments
- Do you think that your VO had all the necessary monitoring tools, and did they provide the functionality required to follow STEP09? What has to be improved?
- Which monitoring systems were used for each particular test?
- Was it possible to see the overall picture (all 4 experiments)?
- Wish list …
Many thanks to everyone who provided input and sent answers.

4 ALICE
- ALICE did not suffer from any lack of monitoring information and was able to follow STEP09 activities quite well.
- For both transfer rate and job processing, ALICE used its native monitoring service based on MonALISA.
- For transfer efficiency and errors, ALICE used Dashboard.
- For the overall picture regarding transfers, ALICE used GridView.
- No particular requests regarding monitoring.

5 ATLAS
In general, ATLAS had the monitoring infrastructure needed to follow STEP09, though some issues were seen and there is room for improvement (my conclusion from the ATLAS answers).

6 ATLAS transfer monitoring
Dashboard was used for data transfer monitoring. It is good for the overall data transfer picture.
Noticed problem: it can magnify a single error so much that it is hard to see anything else (filtering out known problems would be useful).
What is missing for specific operational needs:
1. Monitoring of broken subscriptions
2. Monitoring of queues of subscriptions
3. Monitoring of subscriptions not picked up
4. Information ordered by source
5. Development of drill-down plots giving efficiency and bandwidth consumed in a given time period
6. Some work on the pre-stage monitoring, especially for staged files and datasets
The work Ricardo did on the 2D plots, generated on the client side, looks like a very healthy development. This is probably the way to go for the more flexible monitoring ATLAS needs in the future.
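The requests for filtering out known problems and for information ordered by source can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record fields, the site names, the error strings and the `summarize_by_source` helper are hypothetical and do not reflect the actual Dashboard data model.

```python
from collections import Counter, defaultdict

# Hypothetical transfer records; the field names are illustrative only.
transfers = [
    {"source": "SITE_A", "ok": True, "error": None},
    {"source": "SITE_A", "ok": False, "error": "SRM timeout"},
    {"source": "SITE_B", "ok": False, "error": "known: tape backlog"},
    {"source": "SITE_B", "ok": False, "error": "known: tape backlog"},
    {"source": "SITE_B", "ok": True, "error": None},
]

# Errors an operator has already acknowledged (hypothetical values).
KNOWN_ERRORS = {"known: tape backlog"}


def summarize_by_source(records, known_errors=frozenset()):
    """Return {source: (efficiency, Counter of remaining errors)}.

    Failures whose error is in known_errors are skipped entirely,
    so a single well-understood problem cannot dominate the view.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    errors = defaultdict(Counter)
    for rec in records:
        if not rec["ok"] and rec["error"] in known_errors:
            continue  # already understood, leave it out of the summary
        total[rec["source"]] += 1
        if rec["ok"]:
            ok[rec["source"]] += 1
        else:
            errors[rec["source"]][rec["error"]] += 1
    return {src: (ok[src] / total[src], errors[src]) for src in total}


summary = summarize_by_source(transfers, KNOWN_ERRORS)
for source, (efficiency, remaining) in sorted(summary.items()):
    print(source, f"{efficiency:.0%}", dict(remaining))
```

Whether known failures should be excluded from the efficiency figure or only hidden from the error list is a policy choice; this sketch excludes them from both.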

7 ATLAS job processing monitoring
PanDA and Dashboard were used for production and analysis.
Production monitoring is in good shape. PanDA is very useful for debugging possible problems; Dashboard provides better historical views.
Monitoring of analysis jobs needs considerable improvement.
Problems seen with Dashboard job monitoring for analysis:
1. Instability of the MonALISA server, which had to be rebooted almost every day. This may be a configuration problem, since the CMS MonALISA server works perfectly well under a much higher load than the ATLAS one. To be checked with the MonALISA experts.
2. The ATLAS version of Dashboard job monitoring differs from the CMS one, which is constantly being improved (with work on both the CRAB and Dashboard sides). The modifications made to the CMS Dashboard have to be applied to the ATLAS instance.

8 ATLAS (continued)
Monitoring of the central services
- ATLAS considers SLS a good infrastructure for service monitoring and uses it to monitor its own services.
Looking at the overall picture (4 VOs)
- Not so much. The WLCG daily operations meetings usually communicated the necessary information.
General comments regarding future development
- At the moment all monitoring is an aggregation of lower-level information. ATLAS needs to find a way of building an ATLAS Grid Dashboard that looks at higher-level metrics, e.g. the number of functional test datasets subscribed in the last 6 hours (if this is low, there is trouble).
- In the future ATLAS foresees slow control systems built on this monitoring, so all monitoring systems should provide a machine-readable format, not just plots.
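A higher-level metric of the kind described, published in machine-readable form, might look like the sketch below. The function name, the JSON field names and the alarm threshold `min_expected` are assumptions made for illustration; the slides specify only the 6-hour window and that a low count signals trouble.

```python
import json
from datetime import datetime, timedelta, timezone


def subscription_metric(subscribed_at, now, window_hours=6, min_expected=3):
    """Count subscriptions made within the window and flag a low count.

    min_expected is a hypothetical threshold, not a value from ATLAS.
    """
    cutoff = now - timedelta(hours=window_hours)
    count = sum(1 for t in subscribed_at if cutoff <= t <= now)
    return {
        "metric": "functional_test_subscriptions",
        "window_hours": window_hours,
        "count": count,
        "status": "ok" if count >= min_expected else "alarm",
    }


# Illustrative timestamps: subscriptions made 1, 2, 7 and 9 hours ago.
now = datetime(2009, 6, 10, 12, 0, tzinfo=timezone.utc)
timestamps = [now - timedelta(hours=h) for h in (1, 2, 7, 9)]

report = subscription_metric(timestamps, now)
print(json.dumps(report))  # JSON that a control system could consume
```

Emitting a small JSON document rather than a plot is what would let a slow control system act on the metric without a person reading a web page.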

9 CMS
Same as for ATLAS. In general, CMS has the monitoring infrastructure in place to follow its computing activities in detail, though some work and improvements are foreseen.

10 CMS transfer monitoring
PhEDEx was used. No particular issues regarding transfer monitoring were mentioned in the CMS reports.

11 CMS production monitoring
CMS used multiple systems: T0AST for T0 monitoring, and native glideins monitoring and the CMS Dashboard for monitoring of the reprocessing.
Known issues (in fact known since CCRC08)
- Insufficient reporting from ProdAgent (PA) to Dashboard. ProdAgent does not report job status information from the user interface to Dashboard, for example when a job is killed or aborted. CPU time, wall clock time and the number of processed events are not reported from ProdAgent to Dashboard either.

12 CMS analysis monitoring
Users mainly rely on the output of the 'crab -status' command and on Dashboard Task monitoring. Dashboard Task monitoring is used extensively, by 50-100 users daily (80-130 distinct analysis users submit jobs to the Grid each day).
For STEP09 the overall picture was required. The CMS Dashboard interactive UI, the CMS Dashboard programmatic interface and native glideins monitoring were used.
Issues
- Reporting to Dashboard from jobs submitted via the CRAB server to condor-glideins was still being debugged during STEP09. As a result, the Dashboard statistics for glideins jobs were somewhat higher than in reality.
- Dashboard historical views provide information in terms of jobs, not in terms of CPU or wall clock time. CPU and wall clock distributions are being added in the new version of the historical view, which is under development.
Improvements foreseen
- Understand the needs of the Analysis Support team and provide it with a comprehensive picture. Most of the information needed already exists in Dashboard. The Dashboard team is working with CMS to come up with an appropriate interface for Analysis Support shifters. The twiki page created by CMS for the STEP09 analysis test also provides good input for the Dashboard developers.

13 CMS (continued)
Looking at the overall picture (all 4 experiments)
- Same as ATLAS. CMS was too busy to see what the other experiments were doing.
- When CMS needed to understand issues at a particular site, it mostly relied on input provided by the site administrators.
- CMS did not have a chance to validate new systems such as SiteView, mostly due to time restrictions.

14 LHCb
- For both transfer and data processing monitoring, LHCb used the DIRAC portal, which provided sufficient information to follow STEP09 activities.
- For the status of CEs at the sites, LHCb used the SAM portal and the Dashboard interface for VO-specific SAM tests.
Foreseen improvements
- Correlate monitoring and accounting information from DIRAC, SAM test results, the GGUS portal and GOCDB downtime information for more automated management of LHCb computing resources, for example to avoid situations where a site is banned without good reason.
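The kind of correlation LHCb foresees could be sketched as a small decision rule that requires several sources to agree before a site is banned. The `SiteStatus` fields, the `recommend_action` function, its return values and the failure-rate threshold are all hypothetical; they only stand in for data that would come from DIRAC, SAM and GOCDB, and GGUS ticket information is left out for brevity. This is not a description of how DIRAC actually manages sites.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    name: str
    sam_tests_passing: bool       # stand-in for VO-specific SAM results
    in_scheduled_downtime: bool   # stand-in for GOCDB downtime information
    job_failure_rate: float       # stand-in for DIRAC accounting, 0.0 to 1.0


def recommend_action(site, max_failure_rate=0.5):
    """Suggest an action using an illustrative policy, not an LHCb one."""
    if site.in_scheduled_downtime:
        # Failures during a declared downtime are expected, not ban-worthy.
        return "hold"
    sam_bad = not site.sam_tests_passing
    jobs_bad = site.job_failure_rate > max_failure_rate
    if sam_bad and jobs_bad:
        return "ban"          # two independent sources agree
    if sam_bad or jobs_bad:
        return "investigate"  # a single signal is not enough to ban
    return "ok"


sites = [
    SiteStatus("SITE_A", sam_tests_passing=False, in_scheduled_downtime=True, job_failure_rate=0.9),
    SiteStatus("SITE_B", sam_tests_passing=False, in_scheduled_downtime=False, job_failure_rate=0.8),
    SiteStatus("SITE_C", sam_tests_passing=True, in_scheduled_downtime=False, job_failure_rate=0.7),
]
for site in sites:
    print(site.name, recommend_action(site))
```

Requiring agreement between sources is one simple way to avoid banning a site on the strength of a single noisy signal.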

15 Conclusions
- The existing monitoring systems, though not perfect, did provide the information needed to follow the STEP09 activities.
- The issues and problems seen during STEP09 define the short-term development plans in the monitoring area.

