Published by Flora Carr. Modified over 8 years ago.
1
Tape Monitoring
Vladimír Bahyl, IT DSS TAB
Storage Analytics Seminar, February 2011
Data & Storage Services (DSS), CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
2
Overview
From low level
  – Tape drives; libraries
Via middle layer
  – LEMON
  – Tape Log DB
To high level
  – Tape Log GUI
  – SLS
TSMOD
What is missing?
Conclusion
3
Low level – towards the vendors
Oracle Service Delivery Platform (SDP)
  – Automatically opens tickets with Oracle
  – We also receive notifications
  – Requires a “hole” in the firewall, but quite useful
IBM TS3000 console
  – Central point collecting all information from 4 (out of 5) libraries
  – Calls home via Internet (not modem)
  – Engineers come on site to fix issues
4
Low level – CERN usage
SNMP
  – Using it (traps) whenever available
  – Need MIB files with SNMPTT actuators
  – IBM libraries send traps on errors
  – ACSLS sends activity traps
ACSLS
  – Event log messages on multiple lines concatenated into one
  – Forwarded via syslog to central store
  – Useful for tracking issues with library components (PTP)
Example SNMPTT entry:
    EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 "ibm3584Trap" CRITICAL
    FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
    EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
    NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
    SDESC
    Trap for library TapeAlert 004.
    DESC
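The concatenation of multi-line event log messages could look roughly like the sketch below. The slides do not show the actual ACSLS log layout, so the timestamp pattern that marks the start of an event, the function name `join_events`, and the sample lines are all assumptions made for illustration.

```python
import re

# Assumption: a new event starts with a "YYYY-MM-DD HH:MM:SS" timestamp and
# every other line continues the previous event. The real ACSLS layout may differ.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")


def join_events(lines):
    """Yield one single-line string per (possibly multi-line) event."""
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if EVENT_START.match(line) and current:
            yield " ".join(current)  # previous event is complete
            current = []
        current.append(line)
    if current:
        yield " ".join(current)  # flush the last event


# Hypothetical sample input, not taken from a real ACSLS log.
sample = [
    "2011-01-28 15:33:05 EVENT 1234",
    "  component: PTP",
    "  status: degraded",
    "2011-01-28 15:34:10 EVENT 1235",
    "  component: drive",
]
events = list(join_events(sample))
```

Each joined string is then a single record that can be handed to syslog as one message.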
5
Middle layer – LEMON
Actuators constantly check local log files
4 situations covered:
  1. Tape drive not operational
  2. Request stuck for at least 3600 seconds
  3. Cartridge is write protected
  4. Bad MIR (Media Information Record)
Ticket is created = email is sent
  – All relevant information is provided within the ticket to speed up the resolution
Workflow is followed to find a solution
Example ticket:
    Dear SUN Tape Drive maintainer team,
    this is to report that a tape drive T10B661D@tpsrv963 has became non-operational.
    Tape T05653 has been disabled.

    PROBABLE ERRORS
    01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
    01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
    01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational

    IDENTIFICATION
    Drive Name: T10B661D
    Location: acs0,6,1,13
    Serial Nr:
    Volume ID: T05653
    Library: SL8600_1
    Model: T10000
    Producer: STK
    Density: 1000GC
    Free Space: 0
    Nb Files: 390
    Status: FULL|DISABLED
    Pool Name: compass7_2
    Tape Server: tpsrv963
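A log-checking actuator for the four situations might be sketched as follows. Only the "not operational" pattern mirrors the TP033 line in the sample ticket; the other two patterns and all function names are hypothetical placeholders, and this is not the actual LEMON actuator code.

```python
import re

# Only the first pattern is modelled on the sample ticket from the slide;
# the other two are hypothetical stand-ins for the real log messages.
PATTERNS = {
    "drive_not_operational": re.compile(r"TP033 - drive (\S+) not operational"),
    "write_protected": re.compile(r"write protected", re.IGNORECASE),
    "bad_mir": re.compile(r"bad MIR", re.IGNORECASE),
}

STUCK_AFTER_SECONDS = 3600  # threshold quoted on the slide


def classify(line):
    """Return the name of the first matching situation, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def is_stuck(request_start, now):
    """True if a request has been running for at least the threshold (seconds)."""
    return now - request_start >= STUCK_AFTER_SECONDS


line = ("01/28 15:33:05 10344 rlstape: TP033 - drive "
        "T10B661D@tpsrv963.cern.ch not operational")
situation = classify(line)
```

A matched situation would then trigger ticket creation with the surrounding log lines attached.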
6
Middle layer – Tape Log DB
CASTOR log messages from all tape servers are processed and forwarded to a central database
Allows correlation of independent errors (not a complete list):
  – X input/output errors with Y tapes on 1 drive
  – X write errors on Y tapes on 1 drive
  – X positioning errors on Y tapes on 1 drive
  – X bad MIRs for 1 tape on Y drives
  – X write/read errors on 1 tape on Y drives
  – X positioning errors on 1 tape on Y drives
  – Too many errors on a library
All logs are archived for 120 days, split by VID and tape server
  – Q: What happened to this tape?
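The correlation idea (many tapes failing on one drive points at the drive, one tape failing on many drives points at the tape) can be sketched as below. The record keys `drive` and `vid`, the threshold of 3, and the sample identifiers are assumptions; the real Tape Log DB does this in the database and its schema is not shown in the slides.

```python
from collections import defaultdict


def distinct_partners(errors, key, partner):
    """Group error records by `key` and collect the distinct `partner` values."""
    groups = defaultdict(set)
    for err in errors:
        groups[err[key]].add(err[partner])
    return groups


def suspects(errors, key, partner, threshold):
    """Return keys that had errors with at least `threshold` distinct partners."""
    groups = distinct_partners(errors, key, partner)
    return sorted(k for k, seen in groups.items() if len(seen) >= threshold)


# Hypothetical error records: drive D1 fails with three different tapes,
# tape T9 fails on three different drives.
errors = [
    {"drive": "D1", "vid": "T1"},
    {"drive": "D1", "vid": "T2"},
    {"drive": "D1", "vid": "T3"},
    {"drive": "D2", "vid": "T9"},
    {"drive": "D3", "vid": "T9"},
    {"drive": "D4", "vid": "T9"},
]

bad_drives = suspects(errors, key="drive", partner="vid", threshold=3)
bad_tapes = suspects(errors, key="vid", partner="drive", threshold=3)
```

Counting distinct partners rather than raw error counts keeps a single bad tape that is retried repeatedly from condemning a healthy drive.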
7
Tape Log – the data
Origin: rtcpd & taped log messages
  – All tape servers sending data in parallel
Content: various file state information
Volume:
  – Depends on the activity of the tape infrastructure
  – Past 7 days: ~30 GB of text files (raw data)
Frequency:
  – Depends on the activity of the tape infrastructure
  – Easily > 1000 lines / second
Format: plain text
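As a rough back-of-the-envelope check, the weekly volume can be turned into an average data rate. This assumes "~30 GB" means decimal gigabytes and that the quoted week is representative; both are assumptions, and the result is an average, not the burst rate.

```python
# "~30 GB" over the past 7 days, read as decimal gigabytes (assumption).
RAW_BYTES_7_DAYS = 30e9
SECONDS_7_DAYS = 7 * 24 * 3600  # 604800 seconds

average_rate = RAW_BYTES_7_DAYS / SECONDS_7_DAYS  # bytes per second
print(f"average: {average_rate / 1e3:.1f} KB/s")  # prints "average: 49.6 KB/s"
```

Since the volume depends on tape activity, short bursts can be well above this average.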
8
Tape Log – data transport
Protocol: (r)syslog log messages
Volume: ~150 KB/second
Accepted delays: YES/NO
  – YES: If the tape log server cannot upload processed data into the database, it will try later, as it has a local text log file
  – NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed
Losses acceptable: YES (to some small extent)
  – The system is only used for statistics or slow reactive monitoring
  – A serious problem will reoccur elsewhere
  – We use TCP in order not to lose messages
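The "try later" behaviour can be illustrated with a small spool-and-retry sketch. Here a Python list stands in for the local text log file and `upload` stands in for the database insert; both, along with the function name `upload_pending`, are assumptions and not the actual tape log server code.

```python
def upload_pending(spool, offset, upload):
    """Upload records from spool[offset:] and return the new offset.

    Stops at the first failure, so the remaining records stay in the
    spool and are retried on the next call.
    """
    for record in spool[offset:]:
        try:
            upload(record)
        except ConnectionError:
            break  # database unavailable: keep the offset, retry later
        offset += 1
    return offset


# Hypothetical stand-ins for the local log file and the database.
spool = ["rec1", "rec2", "rec3"]
database = []
db_up = {"ok": False}


def upload(record):
    if not db_up["ok"]:
        raise ConnectionError("database unavailable")
    database.append(record)


offset = upload_pending(spool, 0, upload)       # database down: offset stays 0
db_up["ok"] = True
offset = upload_pending(spool, offset, upload)  # retry uploads all three records
```

Because the offset only advances after a successful upload, a database outage delays data but does not lose it, which matches the YES case above.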
9
Tape Log – data storage
Medium: Oracle database
Data structure: 3 main tables
  – Accounting
  – Errors
  – Tape history
Amount of data in store:
  – 2 GB
  – 15-20 million records (2 years' worth of data)
Aging: no, data kept forever
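To make the three-table layout concrete, here is a sketch using Python's built-in `sqlite3` in place of Oracle. The slides name only the tables; every column name, the sample rows, and the helper `tape_timeline` are hypothetical, so treat this as an illustration of the idea rather than the real schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Oracle database.
conn = sqlite3.connect(":memory:")

# Table names follow the slide; all columns are hypothetical.
conn.executescript("""
CREATE TABLE accounting   (ts TEXT, vid TEXT, drive TEXT, bytes INTEGER);
CREATE TABLE errors       (ts TEXT, vid TEXT, drive TEXT, message TEXT);
CREATE TABLE tape_history (ts TEXT, vid TEXT, drive TEXT, event TEXT);
""")

# Made-up history rows for illustration only.
conn.executemany(
    "INSERT INTO tape_history VALUES (?, ?, ?, ?)",
    [
        ("2011-01-28 15:33:05", "T05653", "T10B661D", "disabled"),
        ("2011-01-28 15:20:00", "T05653", "T10B661D", "mounted"),
        ("2011-01-27 09:00:00", "T00001", "T10B661D", "mounted"),
    ],
)


def tape_timeline(conn, vid):
    """Answer "what happened to this tape?" as a time-ordered list of events."""
    rows = conn.execute(
        "SELECT ts, drive, event FROM tape_history WHERE vid = ? ORDER BY ts",
        (vid,),
    )
    return rows.fetchall()


timeline = tape_timeline(conn, "T05653")
```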
10
Tape Log – data processing
No additional post-processing once data is stored in the database
Data mining and visualization done online
  – Can take up to a minute
11
High level – Tape Log GUI
Oracle APEX on top of data in DB
Trends
  – Accounting
  – Errors
  – Media issues
Graphs
  – Performance
  – Problems
http://castortapeweb
12
High level – Tape Log GUI
13
Tape Log – pros and cons
Pros
  – Used by DG in his talk!
  – Using standard transfer protocol
  – Only uses in-house supported tools
  – Developed quickly; requires little/no support
Cons
  – Charting limitations
    Can live with that; see point 1 – not worth supporting something special
  – Does not really scale
    OK if only looking at last year’s data
14
High level – SLS
Service view for users
Live availability information as well as capacity/usage trends
  – Partially reuses Tape Log DB data
Information organized per VO
  – Text and graphs
Per day/week/month
15
High level – SLS
16
TSMOD
Tape Service Manager on Duty
  – Weekly changing role to
    Resolve issues
    Talk to vendors
    Supervise interventions
Acts on twice-daily summary e-mail which monitors:
  – Drives stuck in (dis-)mounting
  – Drives not in production without any reason
  – Requests running or queued for too long
  – Queue size too long
  – Supply tape pools running low
  – Too many disabled tapes since the last run
Goal: have one common place to watch
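A summary of this kind could be assembled along the lines of the sketch below, which covers four of the listed checks. The state dictionary keys, the threshold values in `LIMITS`, and the sample identifiers are all hypothetical; the slides do not say how the real summary is produced.

```python
# Hypothetical thresholds; the real limits are not given in the slides.
LIMITS = {"max_queue": 100, "min_free_tapes": 50, "max_newly_disabled": 10}


def build_summary(state, limits=LIMITS):
    """Return the lines of a summary e-mail for the given service state."""
    lines = []
    for drive in state.get("stuck_drives", []):
        lines.append(f"Drive stuck in (dis-)mounting: {drive}")
    queue = state.get("queue_size", 0)
    if queue > limits["max_queue"]:
        lines.append(f"Queue too long: {queue} requests")
    for pool, free in sorted(state.get("free_tapes", {}).items()):
        if free < limits["min_free_tapes"]:
            lines.append(f"Supply pool running low: {pool} ({free} tapes left)")
    disabled = state.get("newly_disabled", 0)
    if disabled > limits["max_newly_disabled"]:
        lines.append(f"Too many disabled tapes since the last run: {disabled}")
    return lines or ["Nothing to report"]


# Made-up state for illustration.
state = {
    "stuck_drives": ["drive42"],
    "queue_size": 250,
    "free_tapes": {"poolA": 12, "poolB": 400},
    "newly_disabled": 3,
}
summary = build_summary(state)
```

Collecting every check into one message serves the stated goal of having one common place to watch.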
17
What is missing?
We often need the full chain
  – When was the tape last successfully read?
  – On which drive?
  – What was the firmware of that drive?
Users hidden within upper layers
  – We do not know which exact user is reading/writing right now
  – The only information we have is the experiment name, and that is deduced from the stager hostname
Detailed investigations often require the request ID
18
Conclusion
CERN has extensive tape monitoring covering all layers
The monitoring is fully integrated with the rest of the infrastructure
It is flexible enough to support new hardware (e.g. higher capacity media)
The system is being improved as new requirements arise