DDM Monitoring
David Cameron, Pedro Salgado, Ricardo Rocha

- How it works
- DDM Monitoring (today)
- Dashboard Monitoring (near future)

How It Works

Monitoring information is sent using callbacks:
- callbacks are sent regularly to a central monitoring service, based on the same code as the DQ2 central catalogs: an Apache web service with a MySQL DB backend
- the callback parameter can be configured on the site services installation
- extra callbacks on events/errors can be specified on the registerDatasetSubscription command:

    dq2 registerDatasetSubscription my_ds CERNCAF

A sketch of how such a callback could be delivered over HTTP follows.
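A minimal sketch, assuming the site services deliver each callback as a plain HTTP POST of key/value pairs to the central Apache service. The endpoint URL and the payload field names are illustrative assumptions, not the actual DQ2 protocol:

    # Hedged sketch: how a site services agent might POST a state-transition
    # callback to the central monitoring service. URL and field names are
    # illustrative assumptions, not the real DQ2 wire format.
    import urllib.parse
    import urllib.request

    def send_callback(base_url, payload):
        """POST one monitoring callback; payload is a dict of event fields."""
        data = urllib.parse.urlencode(payload).encode()
        req = urllib.request.Request(base_url, data=data)
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status  # assume the central service answers 200 OK

    if __name__ == "__main__":
        # Example: report one successfully transferred file for a subscription.
        send_callback(
            "http://ddm-monitor.example.cern.ch/monitoring/",  # hypothetical URL
            {"site": "CERNCAF", "dataset": "my_ds",
             "event": "file-done", "guid": "some-file-guid"},
        )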

How It Works

Types of callbacks:
- --monitor: executed each time there is a state transition in the site services agent
- --callbacks-file-done: executed each time a file of that subscription is successfully transferred
- --callbacks-vuid-complete: executed when all files of a dataset subscription are copied
- --callbacks-error: executed when an error occurs while processing the subscription; sends the error type and description

Examples: --callbacks-file-done=

A minimal receiver for such a callback is sketched below.
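A minimal sketch of an endpoint that a --callbacks-file-done URL could point at. The real central service ran under Apache with a MySQL backend; this standalone WSGI stub only logs what it receives, and the parsed field names are illustrative assumptions:

    # Hedged sketch: a toy receiver for file-done callbacks. Field names in
    # the parsed payload are assumptions, not the actual DQ2 schema.
    from urllib.parse import parse_qs
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode())
        print("file-done callback:", fields)  # e.g. {'site': ['CERNCAF'], ...}
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()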

How It Works

The database is composed of three tables:
- file states
- errors
- summary information per hour (just throughput, number of files copied and number of errors)

Data is deleted from the first two tables after a week; only file DONE states are deleted from the file table - we keep HOLD states.

For the web pages:
- on the 'current transfers' pages, data is read directly from the DB
- cron jobs run every 10 minutes to produce the plots on the main page and fill the summary table

A sketch of what the hourly roll-up could look like follows.
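A hedged sketch of the kind of hourly aggregation the 10-minute cron job could run to fill the summary table. Table and column names are assumptions (the slides only say the summary holds throughput, files copied and errors), and sqlite3 stands in for the MySQL backend:

    # Hedged sketch: roll up the last hour of DONE file states into the
    # summary table. Schema names are assumptions, not the real DDM schema.
    import sqlite3

    FILL_SUMMARY = """
    INSERT INTO summary (site, hour, bytes_done, files_done)
    SELECT site,
           strftime('%Y-%m-%d %H:00', done_at) AS hour,
           SUM(size),
           COUNT(*)
      FROM file_states
     WHERE state = 'DONE'
       AND done_at >= datetime('now', '-1 hour')
     GROUP BY site, hour
    """

    def fill_summary(db_path="monitoring.db"):
        con = sqlite3.connect(db_path)
        with con:  # commits on success
            con.execute(FILL_SUMMARY)
        con.close()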

DDM Monitoring

The main page shows what is going on overall, with throughput plots to each site for the last 4 hours, last day and last week.

On the current transfers page for each site we can see:
- a summary of the status of the files subscribed to the site
- the datasets currently being processed and the datasets completed recently
- the recent throughput in numbers and the ratio of successful transfers to errors
- the last events and errors that occurred
- for each dataset, the state of each file
- for each file, all its attributes and errors

DDM Monitoring

There were about 220,000 accesses to the webpage in the last 2 months: ~180,000 from googlebot.com… but the rest quite distributed around the world.

- Dataset overview with different colours
- 'Last 100 / last 2h' errors page, but the window should be variable
- It's fast, apart from when the errors table gets very full (see next slide)

DDM Monitoring

The errors table grew to about 3 million rows after 2 months, which made error queries slow:
- this was before automatic deletion after 1 week (a sketch of such a cleanup follows after this list)
- much of this was because errors were continually re-sent (unlimited retries)

Planned improvements:
- site overviews of errors
- automatic notification of errors to those responsible
- measurements of the quality of the transfers
- we have an open Savannah task with more feature improvements
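A hedged sketch of the weekly cleanup described above; table and column names are assumptions, and the slides only state that errors and DONE file states older than a week are pruned while HOLD states are kept:

    # Hedged sketch: prune monitoring rows older than 7 days.
    import sqlite3

    PRUNE = [
        # errors older than a week go away entirely
        "DELETE FROM errors WHERE at < datetime('now', '-7 days')",
        # only DONE file states are pruned; HOLD states are kept
        "DELETE FROM file_states WHERE state = 'DONE' "
        "AND done_at < datetime('now', '-7 days')",
    ]

    def prune(db_path="monitoring.db"):
        con = sqlite3.connect(db_path)
        with con:
            for stmt in PRUNE:
                con.execute(stmt)
        con.close()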

DDM Monitoring

Open issues:
- We need to get a feeling for how a site is doing overall, rather than looking at individual file error messages, i.e. the equivalent of the site functional tests, but for DQ2. This is difficult: e.g. how to tell the difference between no data being subscribed to a site and a problem at the site?
- How to synchronise monitoring with a site when the site DB is recreated?
- The success/error ratio is not realistic: a 'File exists' error on file registration in the LRC can be repeated 1000s of times
- The callbacks put quite a heavy load on the site services
- (A wider issue) the coordination of downtimes and the blacklisting of sites

Dashboard Monitoring

Dashboard Monitoring

Based on the currently available monitoring. For the same data being collected, you get in addition:
- different output formats for the same data you see in the webpage (CSV, XML, ...)
- command line tools for all available queries [testing]
- more flexible queries when dealing with large amounts of data (you can provide a starting date, an offset and limit, and a subset of states to be retrieved)
- a new web interface

This has been running successfully, collecting data from 2 sites: ASGCDISK and ASGCTAPE. A sketch of such a query over HTTP follows.
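A hedged sketch of how a client might ask for the same data in CSV rather than HTML, combining the output formats and the offset/limit/state filters listed above. The host, path and query parameter names are illustrative assumptions, not the real dashboard API:

    # Hedged sketch: fetch file records as CSV via HTTP content negotiation.
    # URL and parameter names are assumptions for illustration only.
    import urllib.request

    def fetch_csv(site, state, offset=0, limit=100):
        url = ("http://dashb-atlas-data.example.cern.ch/files"  # hypothetical
               f"?site={site}&state={state}&offset={offset}&limit={limit}")
        req = urllib.request.Request(url, headers={"Accept": "text/csv"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read().decode()

    if __name__ == "__main__":
        print(fetch_csv("ASGCDISK", "HOLD_NO_REPLICAS"))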

Dashboard Monitoring

But you also get:
- an ORACLE backend for the collected data (service provided by CERN IT)
- LCG SAM (Service Availability Monitoring) critical test results for the fabric services at each site (FTS, LFC, SRM, SE, ...)
- site status evaluation, taking different sources into account (DDM Site Services events, SAM data, other tests...); from here we can also derive the state of clouds, not only of individual sites (see the sketch after this slide)

When? The backend and data querying are finished and the CLI tools are completed; we are now fixing the last issues with the web interface.
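A hedged sketch of combining DDM site services events and SAM results into one site verdict, then rolling sites up into a cloud state. The decision rule here (all SAM critical tests OK, recent transfer error ratio below a threshold) is an illustrative assumption, not the real evaluation logic:

    # Hedged sketch: toy site and cloud status evaluation from two sources.
    def site_status(sam_ok: bool, recent_errors: int, recent_files: int) -> str:
        if not sam_ok:
            return "DOWN"  # a failed SAM critical test marks the site down
        if recent_files and recent_errors / recent_files > 0.5:
            return "DEGRADED"  # threshold is an arbitrary illustration
        return "OK"

    def cloud_status(site_states: list[str]) -> str:
        # In this toy roll-up a cloud is only as healthy as its worst site.
        for bad in ("DOWN", "DEGRADED"):
            if bad in site_states:
                return bad
        return "OK"

    print(cloud_status([site_status(True, 2, 100), site_status(True, 80, 100)]))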

Dashboard Monitoring - Plans

- Immediate: provide historical information on site behaviour/performance, like in the current monitoring, but also allow dynamic queries
- Immediate: deploy the different CLI tools on AFS (will be done by the end of the week)
- Monitor fabric services individually, especially interesting in the case of FTS
- Implement the missing features listed in the Savannah task (most of them are already available)
- Add dataset browsing capabilities (especially dataset location) to the interface

Dashboard Command Line

List complete datasets in ASGCDISK (retrieves dataset and dataset location information):

    dashb-dataset-list -S COMPLETE ASGCDISK

Dashboard Command Line

File summary on ASGCDISK (retrieves a summary of the state of files):

    dashb-file-summary ASGCDISK

Dashboard Command Line

List files with state HOLD_NO_REPLICAS in ASGCDISK (retrieves file and file location information):

    dashb-file-list -L 6 -S HOLD_NO_REPLICAS ASGCDISK

Dashboard Command Line

List errors in ASGCDISK (retrieves the errors collected):

    dashb-site-errors ASGCDISK

Dashboard Command Line

Summary of error types in ASGCDISK (retrieves a summary of the error types collected):

    dashb-site-summary ASGCDISK

Dashboard Command Line

Retrieve the status of each site:

    dashb-site-status

Dashboard Command Line

- all monitoring information is available from a command line tool
- the user can page results
- multiple output formats: text/xml, application/xhtml+xml, text/csv
- filtering of results by file state and by date of the last event

A combined usage sketch follows.
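A hedged sketch of chaining the CLI tools from a script. Only the commands and flags that appear in the slides above are used; parsing of the output is left out since its exact text format is not shown here:

    # Hedged sketch: drive the dashb-* tools shown in the preceding slides.
    import subprocess

    def run(*cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    site = "ASGCDISK"
    print(run("dashb-site-status"))                                  # all sites
    print(run("dashb-file-summary", site))                           # file states
    print(run("dashb-file-list", "-L", "6", "-S", "HOLD_NO_REPLICAS", site))
    print(run("dashb-site-errors", site))                            # raw errors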