CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it
Monitoring the ATLAS Distributed Data Management System
Ricardo Rocha (CERN) on behalf of the ARDA Dashboard team
CHEP2007, Victoria, Canada ricardo.rocha@cern.ch
Outline
Dashboard project
ATLAS DDM system
DDM Dashboard
Monitoring for operators
Monitoring for end users
Conclusion
Dashboard Project
Started inside the ARDA group of the EGEE/LCG project in 2005
– Initially covering only job monitoring for CMS
Evolved into a Python framework providing a set of flexible tools that allow coverage of other grid application areas
The framework consists of a set of different components
– Data access layer (DAO)
– Service configuration (agents)
– Web application
– Command line tools
– APIs
Strong focus on allowing easy access to the information
– HTTP query interface
– Output in HTML (web interfaces), but also XML and CSV for integration with external tools
Applications currently cover job monitoring (for all HEP experiments + VLEMED/Biomed), data management, site efficiency / reliability, and many others
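As an illustration of the "same data, several output formats" idea above, here is a minimal sketch using only the Python standard library. It is not the Dashboard's actual code: the row fields, the function names and the format keys are all hypothetical, chosen only to show how one result set could be rendered as CSV or XML for external tools.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical result rows; the field names are illustrative,
# not the real Dashboard schema.
ROWS = [
    {"site": "SITE_A", "transfers_ok": 120, "transfers_failed": 3},
    {"site": "SITE_B", "transfers_ok": 45, "transfers_failed": 10},
]
FIELDS = ["site", "transfers_ok", "transfers_failed"]


def render_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def render_xml(rows):
    """Render rows as a flat XML document: <rows><row><field>...</field></row></rows>."""
    root = ET.Element("rows")
    for row in rows:
        item = ET.SubElement(root, "row")
        for field in FIELDS:
            ET.SubElement(item, field).text = str(row[field])
    return ET.tostring(root, encoding="unicode")


# Hypothetical mapping from a requested format name to a renderer.
RENDERERS = {"csv": render_csv, "xml": render_xml}


def render(rows, fmt):
    """Pick a renderer by format name; unknown formats are rejected."""
    if fmt not in RENDERERS:
        raise ValueError("unsupported format: %s" % fmt)
    return RENDERERS[fmt](rows)
```

Keeping the renderers behind a single lookup means a new output format can be added without touching the code that produces the rows.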
ATLAS DDM System
Distributes all the data in the ATLAS infrastructure
Data is organized in datasets (collections of files)
Users issue subscriptions on these datasets
Different agents take care of the several tasks required for the successful movement of the data
ATLAS DDM System
Each set of agents serves one site or a group of sites (typically related to each other: a cloud)
Initial deployment involved setting up these services on the VO boxes at each Tier1
Debugging was extremely difficult (still is…)
– Involved logging into each of the machines
– And correlating this information…
Accounting and monitoring were virtually impossible
A central point where all the information would be made available became vital
DDM Dashboard
Main focus on ATLAS-specific services (DQ2 system), receiving information from the different agents via HTTP callbacks
– Transfer state changes
– Dataset complete
– Transfer complete
– Transfer / registration errors
But also on grid fabric services
– Data management related services up and running
– Storage space availability
Data is put together in a structured way
– Oracle database at CERN
Different tools (agents) are responsible for generating statistics and metrics
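The flow described above (agents report events, events land in a structured store, other agents compute statistics) can be sketched as follows. This is only an illustration under stated assumptions: an in-memory sqlite3 database stands in for the Oracle database at CERN, and the table layout, event type names and function names are hypothetical rather than taken from DQ2 or the Dashboard.

```python
import sqlite3

# Hypothetical event type names, loosely following the callback kinds
# listed on the slide; the real DQ2 names are not given in the source.
EVENT_TYPES = {
    "transfer_state",
    "transfer_complete",
    "dataset_complete",
    "error",
}


def open_store():
    """Create an in-memory store (sqlite3 here, standing in for Oracle)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " site TEXT NOT NULL,"
        " dataset TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " detail TEXT)"
    )
    return conn


def record_callback(conn, site, dataset, event_type, detail=None):
    """Store one event as it might arrive from an agent callback."""
    if event_type not in EVENT_TYPES:
        raise ValueError("unknown event type: %s" % event_type)
    conn.execute(
        "INSERT INTO events (site, dataset, event_type, detail)"
        " VALUES (?, ?, ?, ?)",
        (site, dataset, event_type, detail),
    )


def site_summary(conn):
    """Count events per site and type, as a statistics agent might do."""
    cursor = conn.execute(
        "SELECT site, event_type, COUNT(*) FROM events"
        " GROUP BY site, event_type ORDER BY site, event_type"
    )
    return cursor.fetchall()
```

A summary query like `site_summary` is the kind of aggregate that lets an operator compare sites without logging into each VO box.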
DDM Dashboard
Serves different sets of use cases, coming from different types of users
Site / system operators
– “How is the overall system doing?”
– “How is site X doing?”
– “What is the most common error, and what is triggering it?”
End users / production coordinators
– “What is the status of this (set of) dataset subscription(s)?”
– “When will the data become available?”
– Essential to have real time information
How much data? A lot!
– Millions of file transfers, each reporting the different steps
– An average week means 2 million hits (90%+ are bulk reports with up to 500 items)
– Especially critical when systems misbehave (more errors)
– A lot of work on partitioning the data, optimizing the database and the web server setup (Apache)
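To put the load figures above in perspective, the sketch below converts 2 million hits per week into an average request rate and shows one way a reporting agent could group events into bulk reports of at most 500 items. The 500-item limit and the weekly hit count come from the slide; the `chunked` helper and its name are hypothetical, not the agents' actual batching code.

```python
HITS_PER_WEEK = 2_000_000          # figure quoted on the slide
SECONDS_PER_WEEK = 7 * 24 * 3600   # 604800 seconds
MAX_ITEMS_PER_REPORT = 500         # bulk report size limit quoted on the slide

# Average request rate seen by the web server: roughly 3.3 hits per second.
# Peaks are higher, since errors produce extra reports when systems misbehave.
AVERAGE_HITS_PER_SECOND = HITS_PER_WEEK / SECONDS_PER_WEEK


def chunked(items, max_items=MAX_ITEMS_PER_REPORT):
    """Yield consecutive slices of `items`, each with at most `max_items` entries."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    for start in range(0, len(items), max_items):
        yield items[start:start + max_items]
```

For example, 1201 pending events would be sent as three reports of 500, 500 and 201 items, so one HTTP hit carries many state changes instead of one.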
Monitoring for operators
“How is the whole system performing?”
Monitoring for operators
“What is wrong with site X?”
Monitoring for operators
“What files are causing error ‘…’?”
Monitoring for end users
“I subscribed to dataset X on site Y. What is the status?”
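The end-user question above boils down to summarizing the per-file transfer states of one dataset at one site. A minimal sketch of such a summary follows; the state names, the `FileTransfer` class and the `subscription_status` function are hypothetical and are not the Dashboard's real data model.

```python
from dataclasses import dataclass


@dataclass
class FileTransfer:
    """One file of a subscribed dataset at the destination site."""
    name: str
    state: str  # hypothetical states: "queued", "transferring", "done", "failed"


def subscription_status(files):
    """Summarize how far a dataset subscription has progressed."""
    total = len(files)
    done = sum(1 for f in files if f.state == "done")
    failed = sum(1 for f in files if f.state == "failed")
    return {
        "total": total,
        "done": done,
        "failed": failed,
        # A subscription with no files is reported as not complete.
        "complete": total > 0 and done == total,
        "percent_done": 100.0 * done / total if total else 0.0,
    }
```

A dataset with four files, of which three are done and one has failed, would report 75.0 percent done and not complete.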
Conclusion
Essential tool for all ATLAS operations
Currently used by a large number of people, from those responsible for sites and systems to end users
Usage goes beyond the web interface
– Data is queried by different external tools for automating operations (catalog cleanup, consistency checks, alarms and notifications, …)
What is coming next
– Integration with the site-specific monitoring tools
– More alarms and notifications
– Automated reaction to specific events
– More focus on the end user
Tier 0 export: http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
Production: http://dashb-atlas-data.cern.ch/dashboard/request.py/site
Homepage: http://dashboard.cern.ch
Contact: dashboard-support@cern.ch