LHC Experiment Dashboard

Main areas covered by the Experiment Dashboard:
- Data processing monitoring (job monitoring)
- Data transfer monitoring
- Site/service monitoring from the VO perspective

Thanks to Julia Andreeva and E. Karavakis for the slides.
Dashboard for Monitoring the Computing Activities of the LHC

- Analysis + production
- Real-time and historical views
- Data transfer
- Data access
- Site Status Board
- Site usability
- SiteView
- WLCG GoogleEarth Dashboard

14/04/2011, Monitoring of the LHC computing activities during the first year of data taking
Common Solutions

Applications shared across ATLAS, CMS, LHCb and ALICE:
- Job monitoring (multiple applications)
- Site Status Board
- Site Usability Monitoring
- DDM Monitoring
- Global transfer monitoring system (planned for 2011)
- SiteView & GoogleEarth
Job Monitoring

Aimed at different types of users: individual scientists using the Grid for data analysis, user-support teams, site admins, VO managers, and managers of different computing projects.

Works transparently across different middleware, submission methods and execution backends.
Job monitoring

During 2010, Dashboard job monitoring for ATLAS was completely redesigned. Most applications are shared with CMS: the shared components are the data schema of the data repositories and the user interfaces. The information sources are different, so the collectors are different as well:
- For CMS, the CMS job submission tools (servers and the jobs themselves) are instrumented to report job status information to the Dashboard.
- For ATLAS, the Dashboard is integrated with the PANDA job monitoring DB; the Dashboard collector retrieves data from the PANDA DB every 5 minutes.
- Jobs submitted via Ganga through WMS or to local batch systems are instrumented to report their status via the Messaging System for the Grid (MSG), which is based on ActiveMQ.
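The status reporting described above can be sketched as follows. This is a minimal illustration of the kind of job status record an instrumented job wrapper might publish to the MSG (ActiveMQ) broker for a Dashboard collector to consume; all field names here are assumptions for illustration, not the real Dashboard message schema.

```python
import json

def make_job_status_message(job_id, task_id, site, status):
    """Build a job status event as a JSON string, ready to publish to a
    message broker. Field names are illustrative, not the real schema."""
    event = {
        "job_id": job_id,    # grid job identifier
        "task_id": task_id,  # parent task/submission identifier
        "site": site,        # execution site (e.g. a WLCG site name)
        "status": status,    # e.g. "submitted", "running", "done", "failed"
    }
    return json.dumps(event)

msg = make_job_status_message("job-001", "task-42", "CERN-PROD", "running")
print(msg)
```

In the real system the serialized event would be sent to an ActiveMQ topic, and the collector on the other side would deserialize it and update the central repository.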
The following applications were enabled for ATLAS:
- Interactive view
- Historical view
- Task monitoring (first prototype)

CMS job monitoring was extended to collect file access information, which is used by the Data Popularity service.

During the next half of the year:
- A new version of the Historical view will be enabled for CMS
- Continued effort to improve performance, both for data collectors and UIs
- Development of a new version of ATLAS task monitoring for analysis users, with the possibility to resubmit/kill jobs via the monitoring UI
Task Monitoring

User / user-support perspective, with a wide selection of plots (CMS & ATLAS; >350 CMS users daily):
- Distribution by site
- Detailed job information
- Distribution by status
- Processed events over time
- Failure diagnostics for Grid and application failures
- Efficiency distribution by site
Job Summary & Historical Views

Job Summary:
- Shifter, expert and site perspective
- Real-time job metrics by site, activity, …

Historical Views:
- Site and management perspective
- Job metrics as a function of time
Data transfer monitoring

A new version of the ATLAS Distributed Data Management monitoring (ATLAS DDM Dashboard) provides improved visualization: a matrix that allows data transfers to be monitored by source or by destination, making transfer problems easier to spot. The first prototype was released in May and is already in use by the ATLAS community; the first feedback is very positive.
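The source-by-destination aggregation behind such a transfer matrix can be sketched as below: each cell accumulates attempts and successes for one (source, destination) pair, so a problematic source shows up as a bad row and a problematic destination as a bad column. The site names and event format are assumptions for illustration.

```python
from collections import defaultdict

def build_transfer_matrix(events):
    """Aggregate transfer events into a matrix keyed by (source, destination).

    events: iterable of (source, destination, succeeded) tuples.
    """
    matrix = defaultdict(lambda: {"attempts": 0, "successes": 0})
    for src, dst, ok in events:
        cell = matrix[(src, dst)]
        cell["attempts"] += 1
        if ok:
            cell["successes"] += 1
    return dict(matrix)

events = [
    ("CERN-PROD", "BNL", True),
    ("CERN-PROD", "BNL", False),
    ("CERN-PROD", "RAL", True),
]
matrix = build_transfer_matrix(events)
print(matrix[("CERN-PROD", "BNL")])  # {'attempts': 2, 'successes': 1}
```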
During the second half of 2011, development of the global transfer monitoring system will start in collaboration with the GT group of the CERN IT department. The distributed FTS instances will be instrumented to report data transfer events via MSG. A Dashboard collector will consume these events, record them in the central data repository, generate overall transfer statistics, and expose this information to the user community via UIs and APIs. Most of the ATLAS DDM Dashboard code should be re-used for the new data transfer monitoring system. The detailed roadmap for this project is not yet defined.
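The statistics step of such a collector can be sketched as below: given FTS-style transfer-completion events, compute an overall success rate and the aggregate throughput for the time window. The field names (`bytes`, `duration_s`, `final_state`) are assumptions, not the actual FTS event schema.

```python
def summarize_transfers(events):
    """Compute overall statistics from transfer-completion events.

    events: list of dicts with 'bytes', 'duration_s', 'final_state'.
    Field names are illustrative assumptions, not the real FTS schema.
    """
    done = [e for e in events if e["final_state"] == "DONE"]
    total_bytes = sum(e["bytes"] for e in done)
    total_time = sum(e["duration_s"] for e in done)
    return {
        "success_rate": len(done) / len(events) if events else 0.0,
        # bits transferred divided by total transfer time, in megabits/s
        "throughput_mbps": (total_bytes * 8 / 1e6 / total_time) if total_time else 0.0,
    }

events = [
    {"bytes": 10_000_000, "duration_s": 4.0, "final_state": "DONE"},
    {"bytes": 10_000_000, "duration_s": 6.0, "final_state": "DONE"},
    {"bytes": 5_000_000, "duration_s": 2.0, "final_state": "FAILED"},
]
stats = summarize_transfers(events)
print(stats)
```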
Site/service monitoring

Includes the following applications:
- Site Usability (based on the results of SAM tests)
- Site Status Board
- WLCG Google Earth Dashboard
Site Usability Monitoring (SUM)

During 2010 and the beginning of 2011, the SAM framework was completely redesigned; the new version is based on Nagios, and the LHC VOs started to submit remote tests via Nagios. The Dashboard Site Usability application is being redesigned to be compatible with the new SAM architecture. The first prototype was deployed on the validation server in April and should be validated by the LHC VOs. The new SUM should be deployed to production by the end of summer 2011.
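The core of a SUM-style usability computation can be sketched as below: a site counts as usable in a given time bin only if all of its critical SAM tests passed. The test names and result format here are assumptions for illustration.

```python
def site_usability(results, critical_tests):
    """Return True if every critical test passed for this site.

    results: dict mapping test name -> True/False (passed/failed);
    a missing test counts as failed. Test names are illustrative.
    """
    return all(results.get(test, False) for test in critical_tests)

critical = ["CE-job-submit", "SRM-put", "SRM-get"]
print(site_usability(
    {"CE-job-submit": True, "SRM-put": True, "SRM-get": True}, critical))   # True
print(site_usability(
    {"CE-job-submit": True, "SRM-put": False, "SRM-get": True}, critical))  # False
```

Aggregating these per-bin booleans over a day or a month yields the availability and reliability figures shown in the SUM plots.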
SUM Snapshots
Site Status Board (SSB)

During the second half of 2010, many improvements were implemented for SSB:
- A new version of the collectors, which solved the DB locking problem and provided the necessary level of performance, was deployed in production (February 2011)
- A new version of the UI, with improved performance and extended functionality, was deployed in production (spring 2011)

Both ATLAS and CMS use SSB for computing shifts and the site commissioning activity. Further development will follow the needs and requests of the LHC VOs.
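The per-site evaluation behind an SSB column can be sketched as below: each site's value for a metric is mapped to a status colour by thresholds, which is what makes problematic sites easy to spot at a glance. The metric, threshold values, and colour names are illustrative assumptions.

```python
def ssb_colour(value, ok_threshold=90.0, warn_threshold=70.0):
    """Map a site metric (e.g. efficiency in %) to an SSB-style colour.
    Thresholds are illustrative, not the real SSB configuration."""
    if value >= ok_threshold:
        return "green"
    if value >= warn_threshold:
        return "yellow"
    return "red"

sites = {"SiteA": 95.0, "SiteB": 75.0, "SiteC": 40.0}
board = {site: ssb_colour(value) for site, value in sites.items()}
print(board)  # {'SiteA': 'green', 'SiteB': 'yellow', 'SiteC': 'red'}
```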
SSB Snapshot

Screenshot callouts: grouped sites, maintenance periods, and easy identification of sites with problems.
WLCG Google Earth Dashboard

The GoogleEarth Dashboard is integrated with all the VO-specific monitoring systems, including DIRAC and MonALISA, so it shows activities for all four experiments. Recent development focused on improving the robustness and reliability of the application.