Enabling Grids for E-sciencE (http://arda.cern.ch)

Grid monitoring from the VO/User perspective: Dashboard for the LHC experiments

Julia Andreeva, CERN IT/PSS
On behalf of the Dashboard team: J. Andreeva, S. Belov, C. Cirstoiu, Y. Chen, B. Gaidioz, J. Herrala, G. Maier, R. Pezoa Rivera, R. Rocha, P. Saiz, I. Sidorova
CHEP 2007, Victoria, Canada. Julia Andreeva, CERN.

Table of contents
- Common requirements for Grid monitoring from the VO/User perspective
- Motivation and evolution of the Experiment Dashboard project
- Overview of the current functionality
- Future plans
- Conclusions
Common requirements for VO monitoring
- Provide a transparent and complete picture of the experiment's activities on the Grid, regardless of the underlying infrastructure where the actual job/transfer/service runs
- Combine Grid monitoring data with experiment/application/activity-specific information of interest to the VO
- Be able to identify problems of any nature (Grid or application)
- Satisfy users with various roles (different areas of activity, different scope)
- Provide a high level of flexibility to allow rapid integration of new requirements
- Scale and perform well
Experiment Dashboard concept
Information sources:
- Generic Grid services
- Experiment-specific services: experiment workload management and data management systems
- Jobs instrumented to report monitoring information
- Monitoring and accounting systems (RGMA, GridIce, SAM, ICRTMDB, MonALISA, BDII, APEL, Gratia, ...)
The Dashboard collects data of VO interest coming from these various sources, stores it in a single location, provides a UI following VO requirements, and analyzes the collected data.
Consumers: VO users with various roles; potentially other clients (PanDA, ATLAS production).
Dashboard Framework
- Oracle DB; DB reading and writing via a Data Access Layer (DAO), with connection pooling; easy to add an interface for a different backend
- Agents run on a regular basis: collecting data from different sources, generating/analyzing statistics, managing alarms; common configuration, management and monitoring mechanism
- Web application based on Apache + mod_python; multiple output formats: plain text, CSV, XML, XHTML; GSI support using GridSite
- Dashboard clients: scripts (pycurl, ...), CLI (optparse + pycurl), shell-based (curl, ...)
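The framework above serves the same query result as plain text, CSV, XML or XHTML, so that shell clients (curl), scripts and browsers can all consume it. A minimal sketch of that idea, assuming a hypothetical per-site job summary; the data and function names are illustrative, not the actual Dashboard code:

```python
import csv
import io
from xml.etree import ElementTree as ET

# Hypothetical query result, e.g. job counts per site (illustrative data).
ROWS = [
    {"site": "SiteA", "running": 120, "failed": 4},
    {"site": "SiteB", "running": 80, "failed": 1},
]

def render_csv(rows):
    """Render the result as CSV, one of the formats the framework serves."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["site", "running", "failed"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def render_xml(rows):
    """Render the same result as XML."""
    root = ET.Element("jobsummary")
    for row in rows:
        item = ET.SubElement(root, "site", name=row["site"])
        item.set("running", str(row["running"]))
        item.set("failed", str(row["failed"]))
    return ET.tostring(root, encoding="unicode")

def render_text(rows):
    """Plain-text rendering for shell-based clients."""
    return "\n".join(
        f"{r['site']}: {r['running']} running, {r['failed']} failed" for r in rows
    )
```

The point of the design is that the renderers share one data access path, so adding a new output format does not touch the DB layer.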
Development team
The tool is developed by the ARDA team (CERN) in collaboration with the MonALISA developers (Caltech), with participation of ASGC (Taiwan), MSU and JINR (Russia) and LAL (France). Valuable contributions this year from CERN summer students.
People who contributed to the development: J. Andreeva, S. Belov, A. Berejnoj, C. Cirstoiu, Y. Chen, T. Chen, S. Chiu, M. De Francisco De Miguel, A. Ivanchenko, B. Gaidioz, J. Herrala, M. Janulis, O. Kodolova, G. Maier, E.J. Maguire, C. Munro, R. Pezoa Rivera, R. Rocha, P. Saiz, I. Sidorova, F. Tsai, E. Tikhonenko, E. Urbah
Evolution of the project (timeline, 07/05 to 08/07)
- First prototype for CMS job monitoring
- Transfer monitoring for ALICE in production
- Job monitoring for LHCb in production
- Job monitoring for ALICE in production
- Job monitoring for CMS and ATLAS in production
- ATLAS Data Management in production
- 06/07: Dashboard for VLEMED (BioMed)
- Monitoring for ATLAS production (prototype)
- Task monitoring for CMS user analysis in production
- 08/07: Site/service availability based on SAM tests (CMS)
The Dashboard covers a wide range of activities of the LHC experiments
Common applications (ALICE, ATLAS, CMS, LHCb, Vlemed):
- Job monitoring
- Site reliability
Experiment-specific applications:
- Transfer monitoring for ALICE
- Data management monitoring for ATLAS
- Production monitoring for ATLAS and CMS (prototypes)
- I/O rate monitoring between WN and SE (prototype)
- Site availability based on the results of SAM tests (prototype)
- Job Robot monitoring (CMS integration and commissioning)
- Accounting information from APEL and Gratia for ATLAS (prototype)
- Task monitoring for CMS analysis users (ATLAS on the way)
Job monitoring
What is the status of the jobs:
- belonging to an individual user/group/VO
- submitted to a given site or Grid flavour, or via a given resource broker
- reading a certain data sample, running a certain application, ...
If they are pending/running: for how long, and where?
If they are finished: did they fail or run properly?
If they failed: why?
Information flow for job monitoring
- Job submission tools (CRAB, ProdAgent, PanDA, Ganga): META information about the user task at submission time; submission info for individual jobs; job status info while retrieving output or checking job status
- Jobs at the WNs: running jobs report their progress via the MonALISA service
- Grid monitoring systems (RGMA, ICRTM, GridIce, BDII): Grid status info, only for jobs submitted via an RB (RGMA, ICRTM); job status according to the local batch system (only where GridIce is running)
- Experiment-specific monitoring systems (production system in ATLAS, DIRAC monitoring)
In collaboration with the condor_g team, we are currently working on reporting of job status information from the condor_g submitter to the Dashboard via MonALISA.
Thanks to its multiple information sources, the Dashboard job monitoring application is not limited to a given middleware flavour or to a given submission method.
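Since status reports for the same job arrive from several independent sources (submission tools, the job itself, Grid monitoring systems), the collector has to merge overlapping reports. A hedged sketch of one plausible merge rule, where the most recent report wins; the record layout and names are assumptions for illustration, not the actual Dashboard schema:

```python
# Each report: (job_id, source, timestamp, status). For a given job the
# most recent report wins, regardless of which source produced it.
REPORTS = [
    ("job-1", "submission_tool", 100, "SUBMITTED"),
    ("job-1", "grid_rb",         140, "RUNNING"),
    ("job-1", "job_wrapper",     180, "DONE"),
    ("job-2", "submission_tool", 110, "SUBMITTED"),
]

def merge_reports(reports):
    """Return the latest known status per job id, across all sources."""
    latest = {}  # job_id -> (timestamp, source, status)
    for job_id, source, ts, status in reports:
        if job_id not in latest or ts > latest[job_id][0]:
            latest[job_id] = (ts, source, status)
    return {job: status for job, (_, _, status) in latest.items()}
```

This is why the application stays independent of any single middleware flavour: a job whose Grid-level reports are missing (e.g. not submitted via an RB) can still be tracked from the reports the job itself sends.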
Example of the job monitoring UI (screenshot)
Monitoring of analysis tasks for CMS users
- Meta information about the task
- Detailed info about all jobs of a given group
- Distribution of jobs of a given group by site, CE or RB
Site Reliability Application
- 'Site of the day': daily report on the number of successful/failed job attempts (submission via RB)
- Site performance: evolution of a site over a period of time
- Error list (Grid errors related to job processing by the RB): the most common error messages, with pointers to documentation; evolution of each error over time
- Waiting time: the time users have to wait from the moment they submit a job until they get the results back
- Aggregated reports: automatic monthly reports; multi-VO reports
For more details about the Site Reliability Application see the "Grid reliability" presentation by Pablo Saiz.
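The per-site metrics above (success/failure counts, waiting time) reduce to simple aggregations over job-attempt records. A minimal sketch under assumed data, with hypothetical record and field names:

```python
from collections import defaultdict

# Hypothetical job-attempt records: (site, succeeded?, waiting time in hours).
ATTEMPTS = [
    ("SiteA", True, 1.0),
    ("SiteA", True, 2.0),
    ("SiteA", False, 5.0),
    ("SiteB", True, 0.5),
]

def site_report(attempts):
    """Per-site success ratio and average waiting time, as in a daily report."""
    stats = defaultdict(lambda: {"ok": 0, "failed": 0, "wait": []})
    for site, ok, wait in attempts:
        stats[site]["ok" if ok else "failed"] += 1
        stats[site]["wait"].append(wait)
    return {
        site: {
            "success_ratio": s["ok"] / (s["ok"] + s["failed"]),
            "avg_wait_hours": sum(s["wait"]) / len(s["wait"]),
        }
        for site, s in stats.items()
    }
```

A monthly or multi-VO report is then just the same aggregation run over a wider selection of attempts.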
Data Management Monitoring for ATLAS
- Tied to the ATLAS Distributed Data Management (DDM) system
- Used successfully in both the production and Tier-0 test environments
- Data sources:
  - DDM site services: the main source, providing all the transfer and placement information
  - SAM tests: for correlating DDM results with the state of the Grid fabric services
  - Storage space availability: currently from the BDII, but soon including other available tools
- Views over the data:
  - Global: site overview covering different metrics (throughput, files/datasets completed, ...); summary of the most common errors (transfer and placement)
  - Detailed: starting from the dataset state, down to the state of each of its files, down to the history of each single file placement (all state changes)
For more details about ATLAS Data Management Monitoring see Thursday's presentation by Ricardo Rocha.
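The global view's "datasets/files completed" metric follows directly from the dataset-to-files hierarchy the detailed view exposes. A small sketch, assuming an invented in-memory state layout (this is not the actual DDM schema):

```python
# Hypothetical DDM-style state: a dataset is a set of files, each with a
# placement state per site. Layout invented for illustration only.
DATASET = {
    "file1": {"SiteA": "DONE", "SiteB": "DONE"},
    "file2": {"SiteA": "DONE", "SiteB": "TRANSFERRING"},
    "file3": {"SiteA": "FAILED", "SiteB": "DONE"},
}

def completion(dataset, site):
    """Fraction of the dataset's files already placed (DONE) at a site."""
    done = sum(1 for states in dataset.values() if states.get(site) == "DONE")
    return done / len(dataset)
```

The per-file placement history mentioned in the slide would simply extend each leaf of this structure with a list of timestamped state changes.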
Integration and commissioning for CMS (1)
Site and service availability based on the results of SAM tests (prototype)
- Results of SAM tests are imported in real time into the Dashboard DB
- Service and site availability are calculated according to the experiment's requirements
- The user can select a site or set of sites, and the service types to be included in the report
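"Calculated according to the experiment's requirements" here means the experiment decides which SAM tests are critical for each service type. A hedged sketch of that rule; the test names, service types and data layout are illustrative assumptions:

```python
# Hypothetical SAM-style results per (site, service); which tests count as
# critical is defined per experiment, as the slide describes.
RESULTS = {
    ("SiteA", "CE"):  {"job_submit": "ok", "software_area": "ok"},
    ("SiteA", "SRM"): {"put_get": "error"},
    ("SiteB", "CE"):  {"job_submit": "ok", "software_area": "ok"},
    ("SiteB", "SRM"): {"put_get": "ok"},
}
CRITICAL = {"CE": ["job_submit"], "SRM": ["put_get"]}

def service_ok(results, site, service):
    """A service is available if all experiment-critical tests passed."""
    tests = results.get((site, service), {})
    return all(tests.get(t) == "ok" for t in CRITICAL[service])

def site_available(results, site):
    """A site is available if every required service type is available."""
    return all(service_ok(results, site, svc) for svc in CRITICAL)
```

Two experiments can therefore disagree on a site's availability simply by declaring different test sets critical, without changing the imported SAM data.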
Integration and commissioning for CMS (2)
Monitoring of the I/O rate between WN and SE (prototype)
- Currently only analysis and Job Robot jobs report their I/O rate to the Dashboard
Integration and commissioning for CMS (3)
Monitoring of Job Robot jobs using the interactive Dashboard UI, which highlights sites having trouble
Monitoring for the production systems of ATLAS and CMS
- For ATLAS, the Dashboard will provide a user interface to the monitoring data stored in the ATLAS production DB. The UI is in an active development phase; a first prototype is available.
- For CMS, the Dashboard DB is used as a central repository for monitoring data. The Dashboard collector for monitoring information from CMS ProdAgent instances is in production; the UI to the central monitoring repository is being developed.
ATLAS production UI (prototype, screenshot)
ATLAS accounting information
- Information is retrieved from APEL and Gratia
- The UI allows the data to be shown taking the experiment topology into account
Experiment Dashboard plans
Implementation:
- An important schema modification is required to support pilot jobs; it will cause modifications in the data-feeding part and the user interface
- Secure access where relevant (X509 authentication)
Improvement of data completeness and reliability:
- Enabling reporting of job status information from condor_g (under way)
Development of new applications:
- Monitoring for the production systems of ATLAS and CMS
- Service availability based on sanity-check reports sent by the experiment jobs (LHCb)
Improvement of effectiveness for troubleshooting:
- Analyzing information about failures (Grid and application)
- Decoupling application failures caused by errors in the user code from failures caused by problems of the Grid services
- Collecting troubleshooting recipes and making them available in the Dashboard UI
- Correlating failures, where relevant, with the results of the SAM tests
Conclusions
- The Experiment Dashboard is used by all 4 LHC experiments and is evolving very fast to match their requirements
- The Job Monitoring and Site Reliability applications are in production for the VLEMED VO, outside the LHC community
- The tool has proven to provide reliable and useful VO-oriented monitoring data, with the needed level of detail, available in various formats
Give it a try! http://dashboard.cern.ch
Acknowledgements
We would like to thank:
- Stefano Belforte, Massimo Lamanna and Iosif Legrand; without their support and guidance the project would not have started or progressed
- Our collaborators in Taiwan, Russia and France for their valuable contributions
- The ORACLE support team for excellent DB support and useful advice
- The SAM, ICRTM, RGMA, GridIce, condor_g, EIS, FIO (CERN IT), Gratia and APEL teams for fruitful collaboration and prompt responses to our requests
- The developers of the job submission tools, production systems and data management systems of the LHC experiments for their contributions, and the LHC user community for useful feedback