CERN - IT Department, CH-1211 Genève 23, Switzerland
Tape Operations Update
Vladimír Bahyl, IT FIO-TSI, CERN

Agenda
  Progress on issues (since the last meeting)
  Current equipment and challenges
  Development changes
  Operational changes
  Conclusion

Progress on issues
  NI_FAILURE
    – Problem still present
    – Simple procedure exists, so no need to reinstall
  tplabel command
    – By default, existing labels are not overwritten
    – -f option introduced to force relabelling
  Cmonitd
    – No longer used at CERN

Equipment today
  25 PB total (around 50% free)
  IBM
    – 2 libraries
    – ~ slots; 700 GB each
    – 60 TS1120 drives
  Sun
    – 4 libraries
    – ~ slots; 500 GB each
    – 60 T10000A drives

Equipment near future
  Tape space sufficient for 2008
    – Unbalanced
  New drives
    – IBM TS1130: ~160 MB/s, 1 TB cartridges
    – Sun T10000B: ~130 MB/s, 1 TB cartridges
  IBM High density frame
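
To put the quoted drive rates in perspective, the following is a back-of-the-envelope sketch of how long one full 1 TB cartridge takes to write at the native rate. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), no compression and a sustained stream; the function name is made up for illustration.

```python
# Rough estimate only: time to fill one cartridge at the quoted native rate.
# Assumptions (not from the slides): decimal units, no compression,
# the drive streams at the quoted rate the whole time.

def hours_to_fill(capacity_tb: float, rate_mb_s: float) -> float:
    """Return the hours needed to write capacity_tb at rate_mb_s."""
    seconds = (capacity_tb * 10**12) / (rate_mb_s * 10**6)
    return seconds / 3600


for name, rate in [("IBM TS1130", 160), ("Sun T10000B", 130)]:
    print(f"{name}: {hours_to_fill(1, rate):.2f} h per 1 TB cartridge")
```

Under these assumptions a cartridge takes roughly 1.7 h (TS1130) or 2.1 h (T10000B) to fill.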

Challenges
  Low ATLAS write rate, partially caused by additional mounts due to a CASTOR policy bug
  ALICE rate affected by small files from users writing to the default pool

Development 1/3
  Patch free kernel version ( )
    – Goal: by SLC5 do not use any CASTOR specific kernel patches
    – All necessary settings moved to CASTOR tape layer
    – New SCSI tape driver options introduced:
        TAPE ST_ASYNC_WRITES 0
        TAPE ST_BUFFER_WRITES 0
        TAPE ST_LONG_TIMEOUT 3600
        TAPE ST_READ_AHEAD 0
        TAPE ST_TIMEOUT 900
    – Testing on few machines already on SLC4
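
The option lines above follow a "CATEGORY NAME VALUE" pattern. As an illustration only, here is a minimal sketch that reads such lines into a nested dictionary. The function name, the treatment of "#" as a comment marker and the whitespace-separated three-field layout are assumptions made for this example, not a description of how CASTOR itself parses its configuration.

```python
# Minimal sketch: read "CATEGORY NAME VALUE" option lines into a dict.
# Assumptions (not confirmed by the slides): fields are whitespace
# separated, and "#" starts a comment.

def parse_options(text: str) -> dict:
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 2)  # category, name, rest of line as value
        if len(parts) != 3:
            raise ValueError(f"expected 'CATEGORY NAME VALUE', got: {raw!r}")
        category, name, value = parts
        options.setdefault(category, {})[name] = value
    return options


SAMPLE = """\
TAPE ST_ASYNC_WRITES 0
TAPE ST_BUFFER_WRITES 0
TAPE ST_LONG_TIMEOUT 3600
TAPE ST_READ_AHEAD 0
TAPE ST_TIMEOUT 900
"""

opts = parse_options(SAMPLE)
print(opts["TAPE"]["ST_TIMEOUT"])
```

Values are kept as strings so that the sketch makes no claim about the type or unit of any option.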

Development 2/3
  Library failure handling ( )
    – Now possible to overcome short temporary failures of Sun libraries
    – Options introduced:
        TAPE ACS_MOUNT_LIBRARY_FAILURE_HANDLING retry
        TAPE ACS_UNMOUNT_LIBRARY_FAILURE_HANDLING retry
  Use non-labeled tapes ( )
    – By default, we use AUL ( ) tape labels
        (AUL: American National Standard label and American National Standard user label)
    – NL tapes are now also supported
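
The "retry" setting suggests that a mount or unmount request is repeated when the library fails briefly. The sketch below shows the general retry pattern only; the exception class, function names, attempt count and delay are all hypothetical and do not reflect the actual CASTOR tape layer code.

```python
import time


class TransientLibraryError(Exception):
    """Hypothetical error standing in for a short, temporary library failure."""


def with_retry(operation, attempts=3, delay_s=0.0, sleep=time.sleep):
    """Call operation(); on a transient failure wait and try again.

    Re-raises the last error once all attempts are used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientLibraryError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)
    raise last_error


# Usage example with a fake mount that fails twice, then succeeds.
calls = {"n": 0}

def fake_mount():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientLibraryError("library temporarily unavailable")
    return "mounted"

print(with_retry(fake_mount))
```

Only the error type marked as transient is retried; any other exception propagates immediately, which keeps permanent failures visible.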

Development 3/3
  Option to log to SysLog ( )
    – See the talk of Giuseppe Lo Re
    – Can log to DLF since the last meeting
    – SysLog now also supported
        Uses local0 and local1 facilities
    – Options needed:
        TAPE TPLOGGER SYSLOG
        local0.*;local1.* /var/log/castor-tape.log
    – Log example:
        Jun 6 15:52:23 tpsrv623 rtcpd[16828]: "TYPE"="RT044 – Request statistics", "FUNC"="rtcpd_FreeResources", "MESSAGE"="Request statistics", "REQUESTTYPE"="READ", "VID"="T07106", "MOUNTTIME"="163", "SERVICETIME"="209", "WAITTIME"="164", "TRANSFERTIME"="7", "POSITIONTIME"="36", "DATAVOLUMEMB"=" ", "DATARATEMBS"=" ", "FILES"="1", "DGN"="T10KR1", "VOLREQID"="77219", "CLIENTNAME"="stage", "CLIENTUID"="14029", "CLIENTGID"="1474", "CLIENTHOST"="c2publicsrv102.cern.ch", "TPVID"="T07106", "REQUESTSTATE"="successful"
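
Because the log message is a sequence of "KEY"="VALUE" pairs, it is easy to process with standard tools. A minimal sketch, assuming the quoting shown in the example holds for every message and that values never contain a double quote; the sample line is an abbreviated copy of the one above and the function name is made up.

```python
import re

# One "KEY"="VALUE" pair; assumes values never contain a double quote.
PAIR = re.compile(r'"([^"]+)"="([^"]*)"')


def parse_fields(line: str) -> dict:
    """Return all "KEY"="VALUE" pairs found in a log line as a dict."""
    return dict(PAIR.findall(line))


# Abbreviated copy of the log example (some fields left out for brevity).
SAMPLE = (
    'Jun 6 15:52:23 tpsrv623 rtcpd[16828]: '
    '"TYPE"="RT044 – Request statistics", "REQUESTTYPE"="READ", '
    '"VID"="T07106", "MOUNTTIME"="163", "SERVICETIME"="209", '
    '"WAITTIME"="164", "TRANSFERTIME"="7", "POSITIONTIME"="36", '
    '"FILES"="1", "DGN"="T10KR1"'
)

fields = parse_fields(SAMPLE)
timings = {key: int(fields[key])
           for key in ("MOUNTTIME", "WAITTIME", "TRANSFERTIME", "POSITIONTIME")}
print(fields["VID"], fields["REQUESTTYPE"], timings)
```

The text before the first quoted pair (timestamp, host, process) is ignored here; it could be split off separately if needed.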

Operational changes 1/2
  RTCPD self monitor enabled
    – RTCP daemon sometimes gets stuck
    – Self monitor terminates the job and does proper cleanup
        RTCOPYD SELF_MONITOR YES
        RTCOPYD MOUNT_TIME 900
  SNMP traps handling
    – IBM libraries send SNMP traps directly
        Volser CLN168JA, A Enterprise Tape cleaning cartridge has expired.
    – ACSLS sends traps on behalf of Sun libraries
        ACSLS info Lsm 0,7 number of drives changed from 6 to 7. Lsm will be updated.
    – LEMON creates alarms
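
The self monitor idea can be pictured as a simple watchdog: note when an operation starts and flag it once it has run longer than a limit. The sketch below is purely illustrative. Reading MOUNT_TIME 900 as a limit in seconds is an assumption, and the class and its methods are hypothetical, not the RTCPD implementation.

```python
import time


class Watchdog:
    """Hypothetical watchdog: flags an operation that exceeds limit_s."""

    def __init__(self, limit_s, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock      # injectable so the sketch can be tested
        self.started = None

    def start(self):
        """Record the moment the monitored operation begins."""
        self.started = self.clock()

    def expired(self):
        """True once the operation has run longer than limit_s."""
        if self.started is None:
            return False
        return self.clock() - self.started > self.limit_s


# Usage example with a fake clock instead of real waiting.
now = [0.0]
dog = Watchdog(limit_s=900, clock=lambda: now[0])  # 900 assumed to be seconds
dog.start()
now[0] = 600.0
print(dog.expired())
now[0] = 901.0
print(dog.expired())
```

A real monitor would, on expiry, terminate the stuck job and release its resources; that part depends on the daemon and is left out here.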

Operational changes 2/2
  TSMOD (Tape Service Manager on Duty)
    – Receives daily report
        TD01E | Drive Down Without Reason | DN 3592B2 DOWN (No_dedication) None
        TD03E | Job running for too long | DA 994BR0 RUNNING (No_dedication) P17080 P17080 R
        TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is seconds
        TQ02E | Queue Request Too Old | Q T10KR1 T13388 R
    – Follows procedures according to the error code
    – Handles most other common issues
        E.g. contacting vendors for problems
    – Weekly rotation
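
Since each report line has the form "code | title | detail", looking up the procedure for an error code can be sketched in a few lines. The report text is copied from the slide as it appears there; the procedure table contains placeholders only, because the actual TSMOD procedures are not given, and the function name is made up.

```python
# Report lines as shown on the slide ("code | title | detail").
REPORT = """\
TD01E | Drive Down Without Reason | DN 3592B2 DOWN (No_dedication) None
TD03E | Job running for too long | DA 994BR0 RUNNING (No_dedication) P17080 P17080 R
TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is seconds
TQ02E | Queue Request Too Old | Q T10KR1 T13388 R
"""

# Placeholder mapping: the real procedures are not part of the slides.
PROCEDURES = {
    "TD01E": "placeholder: procedure for TD01E",
    "TD03E": "placeholder: procedure for TD03E",
    "TQ01E": "placeholder: procedure for TQ01E",
    "TQ02E": "placeholder: procedure for TQ02E",
}


def parse_report(text: str) -> list:
    """Split each non-empty line into code, title and detail."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code, title, detail = (part.strip() for part in line.split("|", 2))
        entries.append({"code": code, "title": title, "detail": detail})
    return entries


entries = parse_report(REPORT)
for entry in entries:
    action = PROCEDURES.get(entry["code"], "no procedure listed")
    print(f'{entry["code"]}: {entry["title"]} -> {action}')
```

Unknown codes fall through to "no procedure listed", which matches the idea that the person on duty handles other issues by hand.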

Conclusion
  Tape capacity sufficient for 2008
  New tape-related CASTOR features are constantly being put into production
  We are trying to simplify our setup and automate problem handling