Published by Cornelia Griffith. Modified over 9 years ago.
1
CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Internet Services
GS group meeting 07.03.08
Monitoring and Dashboards section: Activity Overview
2
Monitoring and Dashboards section: Members
Julia Andreeva, James Casey, Catalin Cirstoiu, Benjamin Gaidioz, Anastasia Ivanchenko, Gerhild Maier, Andrey Nechaevskiy, Daniel Rodrigues, Ricardo Rocha, Pablo Saiz, Irina Sidorova, Alexander Uzhinskiy, Sergey Belov
5 staff, 4 project associates, 1 Openlab fellow funded by EDS, 1 PhD student, 1 technical student, 1 visitor (collaboration with Dubna)
3
Main directions of work
Dashboard project
- Covering 4 LHC experiments, various areas of activity and monitoring aspects: job monitoring, data management monitoring, monitoring of sites and services
Architecture of monitoring solutions for WLCG
Coordination of collaboration with GridView and monitoring activity in OSG
CCRC08 monitoring
4
Julia (Section Leader)
Coordinating the Dashboard project
Management and support of the Dashboard for CMS
Contributing to the development of the CMS Dashboard applications
- Job monitoring
- Site availability based on the results of the SAM tests
- CMS MC production monitoring
Redesign of the Dashboard job monitoring application (schema, collectors, UI) to support pilot jobs
Chairing the System Analysis Working Group
Coordinating development of the common application for monitoring the LHC experiments' workflows for CCRC08 and beyond
5
James
Architecture of monitoring solutions for WLCG
Focus on:
- Site monitoring with Nagios
- ActiveMQ messaging system as transport layer
- APIs/protocols to present the information to other tools
Manage the Gridview collaboration for CERN
- With Rajesh Kalmady from BARC
Collaborate with OSG on interoperation of monitoring for WLCG
CCRC'08
- ServiceMap
- MoU reporting
6
Ricardo
Dashboard Framework
- Development of the common components
- Configuration and logging, database access, command line tools, messaging and RPC APIs, web application, agent startup and management, …
Dashboard Build
- Python oriented, based on distutils
- Enforces common procedures on developers: module structure, package naming and versioning; no need for direct CVS interaction for tagging or branching
- In return, gives automatic generation of binary tools (like CLIs), documentation and deliverables
- Multiple release branches with RPMs and tarballs, APT and YUM repositories
Support of both
- Within the Dashboard team
- Within the ATLAS DDM team, which uses the Dashboard Build and several framework components: configuration/logging, messaging, agent configurator
7
Ricardo
ATLAS DDM Dashboard
- Monitoring of the ATLAS Distributed Data Management system
- Single entry point to get an overview of dataset subscriptions, transfer throughput, transfer and registration errors, site services health
- But also detailed information regarding individual transfer attempts
Coordination of ATLAS monitoring activities
- Tools like the Dashboards: DDM, Production system, Job and Task monitoring, Panda
- Integration with other components like SAM, software installations, file consistency checks, WLCG monitoring
8
Benjamin
Dashboard activities:
- Job monitoring: reimplementation two years ago; maintenance for ATLAS, LHCb and ALICE (with Pablo Saiz); installation guide (installed and maintained outside CERN by VleMED in NIKHEF)
- ATLAS dashboard: production monitoring, assistance to shifters, CLI, API, user's guide, tutorials, etc.
- Framework (level: guru): a couple of code contributions, also the developer's guide and tutorials
9
Pablo in the Dashboard
One of the Dashboard developers:
- Grid reliability for the 4 LHC VOs (thanks to Eamonn!!)
- CMS Site status board
- CMS Input Collections
Monthly site efficiency report
- Taken over from Massimo
10
Pablo in ALICE
Main developer of several AliEn components:
- File & metadata catalogue
- TaskQueue and JobAgent system
- File Transfer Daemon
Support for the previous components
One of the 'on-call grid experts' (together with Patricia and Fabrizio)
11
Pablo's current activities
Please, don't hate me… (please excuse me for not presenting this myself…)
12
Daniel Rodrigues :: Openlab fellow funded by EDS
Objectives:
- Validation and testing of a new grid messaging system for deployment within WLCG
- Re-engineering of components within the WLCG Service to use the new messaging system
Tasks summary:
- Testing of Apache ActiveMQ performance and features for usage as broker: https://twiki.cern.ch/twiki/bin/view/LCG/GridPublisherDevelopment
- Integrating existing components to use the messaging system: Gridview gridftp logs + Dashboard. https://twiki.cern.ch/twiki/bin/view/LCG/GridPublisherSpecificationGridView
- Writing best practices for using a messaging system within the WLCG context, for other developers
Presentation:
- Openlab, January 08: http://dfrodrig.web.cern.ch/dfrodrig/AnOverviewOnAMessagingSystemForTheGrid_v1.2.pps
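As an illustration of what publishing a monitoring record to a broker can look like on the wire, here is a minimal sketch that serializes one STOMP SEND frame. This is an assumption made for the example: the slides name ActiveMQ as the broker under test but do not say which protocol or client library was used. ActiveMQ does accept STOMP, which is why it is picked here. The function name, topic name and record fields are hypothetical.

```python
def build_stomp_send(destination, body, headers=None):
    """Serialize one STOMP SEND frame: command, headers, blank line, body, NUL byte."""
    lines = ["SEND", "destination:" + destination]
    # Extra headers are optional; sorted only to make the output deterministic.
    for key, value in sorted((headers or {}).items()):
        lines.append(f"{key}:{value}")
    payload = body.encode("utf-8")
    # content-length lets the receiver read the body without scanning for NUL.
    lines.append(f"content-length:{len(payload)}")
    head = "\n".join(lines) + "\n\n"
    return head.encode("utf-8") + payload + b"\x00"


# Hypothetical transfer record and topic name, for illustration only.
body = '{"site": "EXAMPLE-SITE", "bytes": 1024}'
frame = build_stomp_send(
    "/topic/grid.example.transfers",
    body,
    {"content-type": "application/json"},
)
print(frame)
```

The sketch only builds the frame; opening the connection, the CONNECT handshake and error handling are left out.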
13
Catalin
Development of the monitoring system for ALICE based on MonALISA
MonALISA-related support for Dashboards (ATLAS and CMS), including installation and support of the MonALISA servers and repositories
Work on PhD "Optimization Framework for Data Intensive Applications in Large Scale Distributed Systems": a framework for optimization of data transfers based on MonALISA
Unfortunately, Catalin is leaving us soon. GOOD LUCK!
14
Irina
Activities 2007
- Implemented a central repository for CMS MC production monitoring information
- Set up procedures for aggregation of monitoring data in summary tables used by the UI
- Developed the API for data retrieval in XML format
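The aggregation and XML retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the Dashboard code: the record fields (site, status), the element names and the function names are all hypothetical, and an in-memory list stands in for the central repository.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def summarize(jobs):
    """Aggregate raw job records into a (site, status) -> count summary table."""
    return Counter((job["site"], job["status"]) for job in jobs)


def summary_to_xml(summary):
    """Render the summary table as XML, one <row> element per (site, status) pair."""
    root = ET.Element("summary")
    for (site, status), count in sorted(summary.items()):
        ET.SubElement(root, "row", site=site, status=status, count=str(count))
    return ET.tostring(root, encoding="unicode")


# Hypothetical production records; real ones would come from the repository.
jobs = [
    {"site": "SITE_A", "status": "done"},
    {"site": "SITE_A", "status": "failed"},
    {"site": "SITE_A", "status": "done"},
    {"site": "SITE_B", "status": "done"},
]
xml_text = summary_to_xml(summarize(jobs))
print(xml_text)
```

Precomputing the summary keeps UI queries cheap, since the UI reads a few aggregated rows instead of scanning every job record.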
15
Irina
Current activities and plans
- Take part in the redesign of the Dashboard schema for job monitoring. The new schema should support pilot job submission, which is increasingly used by the LHC experiments
- Take part in the development of data collection for the Dashboard job monitoring
16
Andrey and Alexander
Operations Procedures
- Daily log: tracking of current problems and open issues
- Weekly Report: summary report for the Joint Operations Meeting
- Weekly Tier-0: summary of issues noticed on the Castor Tier-0 service
17
Andrey & Alexander: Summary
Operating activities
DashBoard installation
- A new FTS SLC4 pilot has been installed
- We are now developing a new schema or schema patch, with the DB part for the monitoring prototype (presumably it will be separate from the FTS schema and will be installed as a module)
Next plans:
- Test the new pilot
- Start implementing our monitoring tools in the DashBoard
18
Gerhild: PhD Work
Automatic Detection of Error Sources of Failed Grid Jobs
- Reported exit codes ≠ description of error source
- First: distinction between user's fault and site's fault
[Diagram: Dashboard Database (data: user, site, exit code, …) -> data mining, with additional knowledge -> patterns and rules -> representation (report, web page, alert system, …). Looking at the data gives a list of observations, e.g. "all jobs of user X fail".]
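The first step, telling a user's fault from a site's fault, can be illustrated with a simple rule of the "all jobs of user X fail" kind. This is a sketch under stated assumptions, not the method developed in the PhD work: the record fields, thresholds and function name are hypothetical. A user is flagged when nearly all of their jobs fail across several sites, and a site is flagged when nearly all jobs from several users fail there.

```python
from collections import defaultdict


def classify_failures(jobs, threshold=0.9, min_jobs=3):
    """Flag users and sites whose jobs fail almost everywhere (hypothetical rule)."""
    by_user = defaultdict(list)
    by_site = defaultdict(list)
    for job in jobs:
        by_user[job["user"]].append(job)
        by_site[job["site"]].append(job)

    def failing(groups, other_key):
        flagged = set()
        for name, group in groups.items():
            failed = sum(1 for j in group if j["exit_code"] != 0)
            # Require failures spread over more than one site (or user), so a
            # single bad site is not blamed on the user and vice versa.
            spread = len({j[other_key] for j in group})
            if len(group) >= min_jobs and spread > 1 and failed / len(group) >= threshold:
                flagged.add(name)
        return flagged

    return {
        "user_fault": failing(by_user, "site"),
        "site_fault": failing(by_site, "user"),
    }


# Hypothetical records: alice fails everywhere, and everyone fails at S4.
jobs = [
    {"user": "alice", "site": "S1", "exit_code": 1},
    {"user": "alice", "site": "S2", "exit_code": 1},
    {"user": "alice", "site": "S3", "exit_code": 1},
    {"user": "bob", "site": "S1", "exit_code": 0},
    {"user": "bob", "site": "S2", "exit_code": 0},
    {"user": "bob", "site": "S4", "exit_code": 134},
    {"user": "carol", "site": "S1", "exit_code": 0},
    {"user": "carol", "site": "S4", "exit_code": 134},
    {"user": "dave", "site": "S4", "exit_code": 134},
    {"user": "dave", "site": "S3", "exit_code": 0},
]
result = classify_failures(jobs)
print(result)
```

Note that the rule looks only at who fails where, not at the meaning of the exit code, in line with the slide's point that a reported exit code does not describe the error source.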
19
Gerhild: Dashboard Work
Web interfaces in production for CMS:
Daily Job Summary: information about jobs visualized with plots
- terminated, submitted, pending, running jobs
- status of terminated jobs
- failed jobs by reason (grid errors, application errors)
- status of site load
- parallel running jobs
Task Monitoring: detailed information about a user's tasks (also in production for ATLAS)
SAM Test Result Visualization
- latest test results, historical test results
- site and service availability plots
- test history
Future work: maintenance, improvements
20
Anastasia
Current tasks:
● implement job summary plots using the GraphTools library
● implement the Dashboard home page
Future tasks:
● take part in the development of the common system for monitoring the workflows of the LHC experiments
21
Sergey
Job monitoring for Condor
The goal is to obtain extra information about jobs submitted via Condor-G, even before the job starts
Extended job information from the Condor event log
- The event of interest is a job status change
- It is possible to get user-specified attributes from the ClassAd
The tool runs as a job on the Condor submission host
Data is prepared to be consistent with other Dashboard job information
Job information is sent to the collection server (using the messaging system or MonALISA)
The current task is to finalize the development and to provide the tool for real tests on production sites
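Extracting job status changes from a Condor event log can be sketched as follows. This is a minimal illustration, not the tool described on the slide: it assumes the classic text user log layout, where each event starts with a three-digit event code, a (cluster.proc.subproc) job id and a timestamp, and it maps only a few event codes. The function name, output fields and sample log are hypothetical.

```python
import re

# One event header line: "<code> (<cluster>.<proc>.<subproc>) <date> <time> <text>"
EVENT_LINE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")

# Subset of event codes that correspond to a job status change.
EVENT_STATUS = {
    "000": "submitted",
    "001": "running",
    "005": "terminated",
    "009": "aborted",
}


def parse_status_changes(log_text):
    """Return one record per status-change event found in the log text."""
    changes = []
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line)
        if not match:
            continue  # detail lines and "..." separators are skipped
        code, cluster, proc, timestamp, _text = match.groups()
        status = EVENT_STATUS.get(code)
        if status:
            changes.append(
                {"job_id": f"{int(cluster)}.{int(proc)}", "time": timestamp, "status": status}
            )
    return changes


# Hypothetical log excerpt with documentation-range addresses.
SAMPLE_LOG = """\
000 (123.000.000) 03/07 10:00:00 Job submitted from host: <192.0.2.1:9618>
...
001 (123.000.000) 03/07 10:05:30 Job executing on host: <192.0.2.2:9618>
...
005 (123.000.000) 03/07 10:45:10 Job terminated.
\t(1) Normal termination (return value 0)
...
"""
changes = parse_status_changes(SAMPLE_LOG)
for change in changes:
    print(change)
```

Because the submit event is written before the job starts, a tool reading the log this way can report a job to the Dashboard ahead of its execution, which matches the goal stated on the slide.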