Critical Services and Monitoring - CMS
Andrea Sciabà, CERN
WLCG Service Reliability Workshop, 26–30 November 2007
Outline
- Monitoring for the Monte Carlo production
- Monitoring for the user analysis
- Monitoring for data transfer
- Monitoring of central Grid services
- Conclusions
Monte Carlo production
What information is needed?
- Status of the computing resources
  - Are they working? How busy are they?
- Status of the local storage resources
  - Are they working? How full are they?
- Status of the central Grid services
  - LCG RB / gLite WMS: is it working? How busy is it? How long does it take for a job to be assigned to a CE?
  - VOMS: is it working?
  - BDII: does it contain reliable information?
Status of the CE (I)
Is the CE working?
- The SAM tests can be used to answer the question
  - Basic sanity checks
  - More specific tests, e.g.: is it possible to stage out a file from a WN to the local SRM? Is FroNTier working at the site? Are the required versions of CMSSW installed?
- The time granularity of SAM tests is too coarse
  - It is not reasonable to run them more often than once an hour (currently they run every two hours)
  - Finer-grained information is possible by running the tests from the site itself, but how should that information be made available?
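The site-local, fine-grained testing suggested above could look like the following sketch. All names here (`check_stageout`, `check_cmssw`, the result fields) are illustrative assumptions, not the actual SAM test implementation; the stage-out check is a stub where a real copy to the local SRM would go.

```python
import time

def check_stageout():
    """Stand-in for a stage-out test: copy a small file from a WN
    to the local SRM. A real check would invoke the site's SRM client."""
    return True  # stub result

def check_cmssw(installed, required):
    """Verify that every required CMSSW release is installed at the site."""
    return all(release in installed for release in required)

def run_checks(installed_releases, required_releases):
    """Run the site-local sanity checks and return a timestamped summary
    that could be published to a central collector."""
    return {
        "stageout": check_stageout(),
        "cmssw": check_cmssw(installed_releases, required_releases),
        "timestamp": int(time.time()),
    }
```

Because the checks run at the site, they can be scheduled every few minutes rather than every two hours; publishing the summary back to the experiment is the open question the slide raises.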
Status of the CE (II)
How busy is the CE?
- This information is needed to make the right choice about how to distribute jobs
- Source #1: the information system
  - Must contain reliable information
  - Sites must be able to verify that they are publishing the correct information
- Source #2: fabric monitoring
  - Could be used to cross-check the information in the IS
  - Currently impossible to make it available to the experiment at all sites in a homogeneous way
Grid monitoring as used by the MC production system
Now:
- Automatic exclusion of CEs from the BDII by FCR; "OK" roughly means "can run a hello world job"
- The SAM test results are periodically checked by people to maintain a list of good/bad CEs
- The list of RB/WMS instances to be used is similarly maintained by hand, based on reports about malfunctions or downtimes
Ideally:
- Automatic ranking of CEs based on SAM tests and accurate resource usage reporting
  - E.g. to submit jobs to a CE as a function of the CE load, and to avoid black holes
- Automatic selection of good RB/WMS instances, possibly also based on SAM
- Calculation of the used/free space on the local storage
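The "ideal" automatic ranking could be sketched as below. The scoring scheme, the 0.9 pass-rate cutoff, and the field names are assumptions for illustration, not CMS's actual policy; the point is the shape of the logic: exclude unreliable CEs outright (the black-hole guard), then prefer reliable, lightly loaded ones.

```python
def rank_ces(ces):
    """Rank CEs for job submission.

    ces: list of dicts with 'name', 'sam_ok' (fraction of recent SAM tests
    passed, 0..1) and 'load' (occupied fraction of slots, 0..1).
    CEs failing too many SAM tests are excluded entirely, so a broken CE
    that eats jobs quickly ("black hole") never attracts submissions.
    """
    good = [ce for ce in ces if ce["sam_ok"] >= 0.9]  # assumed cutoff
    # Most reliable first; among equals, least loaded first
    return sorted(good, key=lambda ce: (-ce["sam_ok"], ce["load"]))
```

A submission tool would then send jobs to the head of the list, or spread them across it in proportion to free capacity.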
Analysis
- Users want a clear and simple picture of which CEs are working and which are not
  - They do not need to know the status of the WMS, but CRAB does
- It is useful to have an estimate of how long it will take for their jobs to start
- These requirements can be satisfied at the Dashboard level
FTS monitoring
- The main monitoring for data movement is the PhEDEx monitoring
  - It already collects quite a lot of logging information from FTS
  - PhEDEx is self-regulating in case of transfer failures
- However, it would be useful to have this information:
  - Channel configuration parameters
  - Channel status (number of active and pending transfers, etc.)
  - Load by VO
  - "Estimated time of start" for a new job, by VO
  - Current transfer rate for ongoing transfers
  - A callback from the FTS API, to know as soon as possible that there was a failure
- The FTS monitoring should also have:
  - A unique entry point
  - The same information for all servers
  - Easy remote access to transfer logs
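The requested "estimated time of start" could be derived from the channel status figures the slide asks for. This is a naive queueing sketch under assumed inputs (pending-job count, number of concurrent transfer slots, mean per-transfer duration), not anything FTS actually exposed:

```python
def estimated_start(pending_jobs, active_slots, mean_job_seconds):
    """Naive estimate of when a newly queued transfer job would start:
    pending jobs drain through the channel's concurrent slots in "waves",
    each wave lasting roughly the mean per-job duration."""
    if active_slots <= 0:
        raise ValueError("channel has no active transfer slots")
    waves = pending_jobs // active_slots
    if pending_jobs % active_slots:
        waves += 1  # a partial wave still takes a full mean duration
    return waves * mean_job_seconds
```

Real transfer durations vary widely with file size and link quality, so a production estimate would use per-VO, per-channel statistics rather than a single mean.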
SRM monitoring
- Some information is already available via dedicated SAM tests
  - LFN → PFN conversion following CMS rules, as published in the Trivial File Catalogue
  - Copying a file back and forth between a UI and a remote SRM
- SAM could be used to store information from the higher-level PhEDEx monitoring
- Information that would be nice to have from SRM:
  - Clearer error messages
  - A reliable way to understand whether a transfer is really ongoing
  - A better report on the transfer: X seconds to prepare, Y to move the file, Z to close out
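The Trivial File Catalogue performs a rule-based LFN → PFN rewrite: the first regex rule that matches the LFN produces the PFN. A minimal sketch of that mechanism, with a made-up sample rule (the hostname and path below are illustrative, not a real site's TFC):

```python
import re

def lfn_to_pfn(lfn, rules):
    """Apply TFC-style rewrite rules: rules is a list of (pattern, result)
    pairs, and '$1' in result is replaced by the first captured group of
    the first pattern that matches the LFN."""
    for pattern, result in rules:
        m = re.match(pattern, lfn)
        if m:
            return result.replace("$1", m.group(1))
    raise LookupError("no TFC rule matches " + lfn)

# Illustrative rule, not a real site configuration
rules = [(r"^/store/(.*)$", "srm://srm.example.org/castor/cms/store/$1")]
```

A SAM test can then verify that the conversion produced at the site agrees with the rules published in the TFC.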
WMS
- Ideally, the middleware should be able to find out by itself which services to contact
  - In the UI there is a simple random choice from a list, with an automatic retry if WMProxy is not able to accept the submission request
- Better load balancing would be desirable, for example using the "least loaded" WMS
  - Currently a WMS refuses new jobs if its load is too high
- If the middleware does not provide this functionality, the application must implement it
- This requires the right monitoring information:
  - WMS daemon status
  - Load
  - Number of jobs still in the task queue
  - Free disk space on the WMS
  - Job latency: submitted → matched to a CE → submitted to the batch system → starts execution
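If the application has to implement the load balancing itself, the selection could look like this sketch: prefer the least loaded working WMS, and fall back to the UI's current behavior (random choice) when no load figures are available. The `alive` and `load` fields are assumed monitoring outputs, not real WMS attributes:

```python
import random

def choose_wms(instances):
    """Pick a WMS instance for submission.

    instances: list of dicts, each with a 'name', an assumed 'alive' flag
    from monitoring, and optionally a 'load' figure (lower is better).
    """
    alive = [w for w in instances if w.get("alive", False)]
    if not alive:
        raise RuntimeError("no WMS instance available")
    with_load = [w for w in alive if "load" in w]
    if with_load:
        return min(with_load, key=lambda w: w["load"])  # least loaded
    return random.choice(alive)  # fallback: the UI's random choice
```

Combined with a retry on submission failure, this avoids hammering a WMS that is already refusing jobs because its load is too high.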
Example of WMS monitoring
[slide shows a screenshot of a WMS monitoring display]
VOMS, MyProxy, BDII
- No special monitoring needs: either they work, or they do not
- Problems with them have an immediate effect on all activities
Conclusions
- Most of the monitoring information which is needed is already available in some way or another
- What is needed is for it to be:
  - Accurate and up to date
  - Easy for applications to retrieve: programmatic interfaces, a standard format (XML) uniform across different sources, well documented
  - Fast to retrieve: the least possible use of authentication, possibly using caching servers so as not to flood the data sources with requests
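The caching layer the conclusions call for could be as simple as the sketch below: a time-to-live cache in front of the real (XML-returning) query, so many applications polling the same source generate one upstream request per TTL window. `fetch`, the URL, and the 300-second TTL are all illustrative assumptions:

```python
import time

class CachedSource:
    """TTL cache in front of a monitoring data source, so repeated
    queries from applications do not flood the source itself."""

    def __init__(self, fetch, ttl=300):
        self.fetch = fetch      # callable(url) -> payload (e.g. XML text)
        self.ttl = ttl          # seconds a cached payload stays valid
        self.cache = {}         # url -> (timestamp, payload)

    def get(self, url):
        now = time.time()
        hit = self.cache.get(url)
        if hit and now - hit[0] < self.ttl:
            return hit[1]       # fresh enough: serve from cache
        payload = self.fetch(url)
        self.cache[url] = (now, payload)
        return payload
```

Deployed as a shared service rather than per client, the same idea gives the "caching servers" the slide mentions.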