FTS monitoring work
WLCG Service Reliability Workshop, November 2007
Alexander Uzhinskiy, Andrey Nechaevskiy

Outline
 Monitoring strategy
 Prototype monitoring tool
 (Prototype) operational procedures

New FTS error classification
 The current version of FTS 2.0 has reasonable error categorization (much improved over 1.5) – you see it in the current error messages
 The new development version of FTS 2.0 has better categorization
No schema update – the current FTS 2.0 already has the necessary schema
Exposing differing levels of detail:
 Scope: basic, determines where the error occurred
SOURCE, DESTINATION, TRANSFER
 Category: over 30 error classification codes
GRIDFTP_ERROR, DEST_PREP_SRM_ERROR, FILE_EXISTS, NO_SPACE_LEFT, TRANSFER_TIMEOUT
 Phase: determines when in the file lifecycle the error occurred
e.g. ALLOCATION, TRANSFER_PREPARATION, TRANSFER, TRANSFER_FINALIZATION
 Message: the error message detail – and pattern matching on the error message (>350 messages so far)
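The scope/category/phase/message scheme above can be sketched in code. This is a minimal illustration, not the actual FTS implementation: the regular expressions and the fallback bucket are invented, and only the category names quoted on the slide are reused.

```python
import re

# Illustrative sketch of the classification described above: scope and phase
# come from the transfer lifecycle, while the category is derived by pattern
# matching on the raw error message. Patterns here are invented examples.
ERROR_PATTERNS = [
    (re.compile(r"No space left on device", re.I), "NO_SPACE_LEFT"),
    (re.compile(r"failed to prepare (source|destination) file", re.I), "REQUEST_TIMEOUT"),
    (re.compile(r"file exists", re.I), "FILE_EXISTS"),
]

def classify(scope, phase, message):
    """Return a (scope, category, phase) triple for one raw error message."""
    for pattern, category in ERROR_PATTERNS:
        if pattern.search(message):
            return (scope, category, phase)
    return (scope, "UNCLASSIFIED", phase)  # illustrative fallback bucket
```

In the real service the >350 known message patterns play the role of `ERROR_PATTERNS`, giving a much finer `detail` level below the ~30 category codes.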

General strategy
 All our monitoring components should agree on these classifications
Based on a detailed study of all the errors seen over the last 9 months on CERN-PROD
These classifications are being stored in the DB
 Summarizations and their visualizations should make use of these, as necessary
For seeing where the problems really are
Differing levels of detail
 Increasing levels of detail should be available if you want to drill down
Even with reasonable categories, you (often) have to pattern-match the errors to distinguish problems

General strategy
 Different people want different views
Try to generate different summaries from the same data
 The ‘raw’ data is being stored in the DB
Generated summaries for the different views should also be stored in the DB (and then exposed however)
As much as possible, the (CPU) processing for these summaries should be done in the DB
 Make more use of the DB’s CPU
 PL/SQL-based Oracle analytics is rather efficient: this is what Oracle is for
But … still prototyping
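The "do the summarisation inside the DB" idea above amounts to deriving the per-view summaries with set-based SQL rather than application code. A minimal sketch, with sqlite standing in for Oracle and invented table/column names:

```python
import sqlite3

# Sketch of in-database summarisation: one GROUP BY produces the
# per-channel, per-category counts that a view would then render.
# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE t_transfer_error (
                    channel TEXT, scope TEXT, category TEXT, phase TEXT)""")
conn.executemany(
    "INSERT INTO t_transfer_error VALUES (?,?,?,?)",
    [("CERN-RAL",   "SOURCE",      "REQUEST_TIMEOUT", "TRANSFER_PREPARATION"),
     ("CERN-RAL",   "SOURCE",      "REQUEST_TIMEOUT", "TRANSFER_PREPARATION"),
     ("CERN-IN2P3", "DESTINATION", "NO_SPACE_LEFT",   "TRANSFER")])

# The DB does the counting; the client only fetches the finished summary.
summary = conn.execute("""
    SELECT channel, category, COUNT(*) AS n
    FROM t_transfer_error
    GROUP BY channel, category
    ORDER BY n DESC
""").fetchall()
```

In the production setup the equivalent aggregation would run as a scheduled PL/SQL job and the results would be stored back into summary tables for the web views.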

FTS service monitoring
 Different types of monitoring are being prototyped for different purposes
 Summary statistics (FTS report)
Daily and weekly summary statistics – good for tracking the general situation, but can’t be used for real-time monitoring
 Snapshot summary
Snapshot of the current status – statistics about current transfers. Useful for getting a general view of the situation
 Current transfers
Status of the actual transfers submitted during the last hour. Good for tracking a certain class of problems, but there is no summarization
 Channel settings
Current channel settings on the Tier-0 service – provides the same information that you can get from the glite-transfer CLI, but in an easier way
 Agent status
Status and location of channel agent daemons – same as above, but provides information about the agent daemons
 Log-mining monitoring
Detailed log-mining tool – covered in a later section

FTS monitoring prototype
 Tracking, summarisation and visualization of problems
 Summarisation data is persistent (i.e. history is available)
 1st prototype implementation: Perl+Shell -> MySQL -> PHP+XHTML
Development version uses PL/SQL -> Oracle
The same local visualisation: PHP + XHTML
 Uses standardised error categorisation
 Detail available: more than 350 known ‘detail’ errors (pattern matching)
 Views: flexible selection of channel, error, time period and result form
 Results presented as tables, graphs or charts

Prototype: current architecture
 1st prototype used log-file analysis
Procedure: every hour the log files are copied to the web server, where they’re parsed by scripts. All the summary information is written to the DB and is accessible through the web interface
 Current prototype gets the information from the DB
Procedure: every so often, a PL/SQL job runs, summarising the information in the DB (which is the same as that in the logs). All the summary information is written to the DB and is accessible through the web interface
Selected summary views may be broadcast externally using the mechanisms defined by the WLCG monitoring group
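The first-prototype step of parsing copied log files can be sketched as follows. The log-line format and field names here are invented for illustration; the actual FTS agent log format differs.

```python
import re

# Rough sketch of the 1st-prototype approach: scan copied FTS agent log
# lines and extract (channel, error message) pairs for later summarisation.
# The "channel=... msg=..." layout is an assumed example, not the real format.
LINE_RE = re.compile(r'ERROR .*channel=(?P<channel>\S+).*msg="(?P<msg>[^"]+)"')

def parse_log(lines):
    """Return a list of (channel, message) pairs found in the log lines."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            hits.append((m.group("channel"), m.group("msg")))
    return hits

sample = ['2007-11-20 12:00:01 ERROR channel=CERN-RAL '
          'msg="failed to prepare source file in 180 seconds"']
```

The current prototype replaces this hourly copy-and-parse loop entirely: the same error records are already in the FTS schema, so the summarisation runs where the data lives.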

1st monitoring prototype
1. Main module – provides information about errors, individually or grouped by category, on all FTS channels. Different time views: latest, last 24 hours, fixed period

1st monitoring prototype
2. Error monitoring module – allows you to filter and get information about only the most interesting errors

1st monitoring prototype
3. Statistics – identifies the channel with the biggest number of errors. Ratings of errors by frequency or total quantity per channel

1st monitoring prototype
4. Reports – e.g. top errors on Tier-1 sites, weekly Tier-0 Castor report

Current prototype  New version of system is more integrated into the FTS schema, so we don’t have to parse log files, and has extra features: 2 dimensions of error representation – error categories and specific errors Views for 4 different users group (FTS operations, Storage site admins, VO admins, Management) 5 monitored objects channels, sites, host, errors, VOs General information Advanced admin panel Alarm mechanism...

Current prototype
Main interface – alarms, general information, objects, options

Current prototype
Admin interface – an easy way to manage objects and change settings!

Current prototype
Cross-module links – the possibility to drill down. Categories or errors – you choose! (screenshots: CERN-PROD information by category, CERN-PROD information by error)

Current prototype
We maintain the functionality of the previous version, but add extra views! (screenshots: top 3 channels with the biggest number of errors; top 10 errors sorted by total count)

Other tools
 (Several) other tools exist
Developed by T1 sites
 Some will be shown today
They collect information in various ways
They cover different parts of the ‘service monitoring’ areas discussed
Different visualisations
 Question (for discussion): how do we integrate these?

Debugging procedures
 We have worked out a daily procedure at Tier-0: tracking the current situation during the day (e.g. the ‘latest info’ option) helps us to identify new problems on FTS channels
Then follow up: e.g. a GGUS ticket could be opened, or we could try to contact the responsible person (but we have only several contacts)
 ‘Top errors’ help us get an overall view of the general situation, and also point out major problems:
SOURCE during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds
DESTINATION during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare Destination file in 180 seconds
the server sent an error response: Can't open data connection. timed out() failed
SOURCE during PREPARATION phase: [GENERAL_FAILURE] CastorStagerInterface.c:2162 Required tape segments are not all accessible
DESTINATION during PREPARATION phase: [GENERAL_FAILURE] CastorStagerInterface.c:2507 Device or resource busy
 By working with the monitoring tools and following a daily procedure we can get a notion of the evolution of certain errors, trends and dependencies
 Reporting tools provide us with information about errors at a specific site, so ‘standard’ questions can be answered, e.g. reports for the Joint Operations Meeting
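The ‘top errors’ view used in the daily procedure is, at its core, a frequency ranking over classified error strings. A tiny sketch, reusing two of the real messages quoted above as sample data:

```python
from collections import Counter

# Toy "top errors" ranking: count occurrences of each classified error
# string and report the most frequent ones. The sample data reuses the
# real messages from the slide; the counts are invented.
errors = [
    "SOURCE/PREPARATION: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds",
    "SOURCE/PREPARATION: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds",
    "DESTINATION/PREPARATION: [GENERAL_FAILURE] Device or resource busy",
]
top = Counter(errors).most_common(2)  # (error, count), most frequent first
```

In the prototype the equivalent ranking is computed over a day or week of classified errors, per channel or per site, which is what makes a dominant problem (e.g. a timeout storm on one channel) stand out immediately.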

Operations procedures
1. Daily log – tracking of current problems and open issues
2. Daily log archive – history of the situation on channels from February ’07 until the present
3. Weekly report – summary report for the Joint Operations Meeting
4. Weekly Tier-0 – summary of issues noticed on the Castor Tier-0 service

Summary  Strategy is to quickly prototype tools Production versions (ideally) should move into DB  We understand the information we have in the schema and some of the summaries, views and reports we’d like to make from it – need input for more!  CERN and T1 sites developed some prototype tools following this model  We’re working out and testing a set of operational procedures based on these tools  We need input for more!