FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy
Outline Monitoring strategy Prototype monitoring tool (Prototype) operational procedures
New FTS error classification Current version of FTS 2.0 has reasonable error categorization (much improved over 1.5) – you see it in the current error messages New dev version of FTS 2.0 has better categorization No schema update – the current FTS 2.0 already has the necessary schema Exposing differing levels of detail: Scope: basic, determines where the error occurred SOURCE, DESTINATION, TRANSFER Category: over 30 error classification codes GRIDFTP_ERROR, DEST_PREP_SRM_ERROR, FILE_EXISTS, NO_SPACE_LEFT, TRANSFER_TIMEOUT Phase: determines when in the file lifecycle the error occurred e.g. ALLOCATION, TRANSFER_PREPARATION, TRANSFER, TRANSFER_FINALIZATION Message: the error message detail – and pattern matching on the error message (>350 messages so far)
General strategy All our monitoring components should agree on these classifications Based on a detailed study of all the errors seen over the last 9 months on CERN-PROD These classifications are being stored in the DB Summarizations and their visualizations should make use of these, as necessary For seeing where the problems really are Differing levels of detail Increasing levels of detail should be available if you want to drill down Even with reasonable categories, you (often) have to pattern match the errors to distinguish problems
General strategy Different people want different views Try to generate different summaries from same data The ‘raw’ data is being stored in the DB Generated summaries for the different views should be stored in the DB (and then exposed however) As much as possible, the (CPU) processing for these summaries should be in the DB Make more use of DB’s CPU PL/SQL based Oracle analytics is rather efficient: this is what Oracle is for But … still prototyping
FTS service monitoring Different types of monitoring are being prototyped for different purposes Summary statistics (FTS report) Daily and weekly summary statistics – good for tracking general situation but can’t be use in real time monitoring. Summary statistics Snapshot summary Snapshot of current status – statistics about current transfers. Is useful to get general view of the situation Snapshot summary Current transfers Status of the actual transfers submitted during the last hour. Good for tracking certain class of problems, but there is no summarization. Current transfers Channel settings Current channel settings on the tier-0 service - provide the same information which you can get from the CLI using glite-transfer- interface but in an easier way. Channel settings Agent status Status and location of channel agent daemons – same as above but provides information about agent daemons Agent status Log mining monitoring Detailed log-mining tool – later section. Log mining monitoring
Tracking, summarisation and visualization of problems Summarisation data is persistent (i.e. history is available) 1 st prototype implementation: Perl+Shell -> MySQL -> PHP+XHTML Development version uses PL/SQL -> Oracle The same local visualisation: PHP + XHTML Uses standardised error categorisation Detail available: more than 350 known ‘detail’ errors (pattern matching) Views: Flexible setting in selecting channel, error, time period and result form. Presentation result as tables, graphics or charts FTS monitoring prototype
Prototype: current architecture 1 st prototype used log-file analysis Current prototype gets the information from the DB Procedure: every hour the log files are copied to the web-server where they’re parsed by scripts. All the summary information is written to the DB and is accessible through the web-interface. Procedure: every so often, a PL/SQL job runs summarising the information in the DB (which is the same as that in the logs). All the summary information is written to the DB and is accessible through the web-interface. Selected summary views may be broadcast externally using the mechanisms defined by the WLCG monitoring group.
1 st monitoring prototype 1. Main module provides information about individual errors or grouped by category on all FTS channels. Different time views: latest, last 24 hours, fixed period
1 st monitoring prototype 2. Error monitoring module allows you to filter and get information about only the most interesting errors
1 st monitoring prototype 3. Statistics. Define the channel with biggest number of errors. Ratings of errors by frequency or total quantity on channels.
1 st monitoring prototype 4. Reports. e.g. Top errors on T1-sites, weekly Tier-0 Castor report
Current prototype New version of system is more integrated into the FTS schema, so we don’t have to parse log files, and has extra features: 2 dimensions of error representation – error categories and specific errors Views for 4 different users group (FTS operations, Storage site admins, VO admins, Management) 5 monitored objects channels, sites, host, errors, VOs General information Advanced admin panel Alarm mechanism...
Current prototype 4. Main interface, alarms, general information, objects, options.
Current prototype 4. Admin interface – easy way to manage objects and change settings!
Current prototype 4. Cross-module links – possibility to drill down. Categories or errors – you choose! CERN-PROD information about categories for CERN-PROD information about errors for 11.24
Current prototype 4. We maintain functionality of previous version, but add extra views! TOP 3 channels with biggest amounts of error – TOP 10 errors for sorted by total amount
Other tools (Several) other tools exist Developed by T1 sites Some will be shown today They collect information in various ways They cover different parts of the ‘service monitoring’ areas discussed Different visualisations Question (for discussion) of how to integrate these ?
Debugging procedures Have worked out a daily procedure at Tier-0: tracking current situation during the day (e.g. the ‘latest info’ option) helps us to identify new problems on FTS channels Then follow up: e.g. GGUS ticket could be opened or we could try contact with responsible person (but we have only several s) ‘Top errors’ help us get a overall view on general situation, and also point out major problems SOURCE during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds DESTINATION during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare Destination file in 180 seconds the server sent an error response: Can't open data connection. timed out() failed SOURCE during PREPARATION phase: [GENERAL_FAILURE] CastorStagerInterface.c:2162 Required tape segments are not all accessible DESTINATION during PREPARATION phase: [GENERAL_FAILURE] CastorStagerInterface.c:2507 Device or resource busy By working with monitoring tools and following a daily procedure we can get notion about the evolution of certain errors, tendencies and dependences Reporting tools provide us information about errors on specific site, so ‘standard’ questions can be answered e.g. reports for Joint Operations Meeting
Operations procedures 1.Daily log – tracking of current problems and open issues 2.Daily log Archive – history of situation on channels from February ‘07 till present time 3.Weekly Report – summary report for the Joint Operations Meeting 4.Weekly Tier-0 – summary of issues noticed on the Castor Tier-0 service
Summary Strategy is to quickly prototype tools Production versions (ideally) should move into DB We understand the information we have in the schema and some of the summaries, views and reports we’d like to make from it – need input for more! CERN and T1 sites developed some prototype tools following this model We’re working out and testing a set of operational procedures based on these tools We need input for more!