CERN Tape Status
Tape Operations Team, IT/FIO, CERN

Agenda
– Hardware
– Low-Level Tape Software
– Tape Logging and Metrics
– Repack Experience and Tools
– Some Thoughts on the Future

Hardware
Migration of all drives to 1 TB to be completed by end Q
– 60 IBM Jaguar 3 (700 GB to 1 TB, 160 MB/s)
– 70 Sun T10000B (500 GB to 1 TB, 130 MB/s)
IBM High Density robot in production
– Improves GB/m² by a factor of 3
Move to blade-based tape servers
– Improves power efficiency
– Reduces unused memory and CPU
Visited IBM and Sun US labs in Q4/2008
– Tape market is strong, especially with new regulatory requirements

IBM HD Frame
Testing the IBM TS3500 S24 High Density frame
[Slide figure: high-density frame layout with cartridge tiers T00 to T04; one tier labelled as cache]

Low-Level Tape Changes
During the past 6 months, the following changes have been included in the X tape server:
– Blank tape detection has been improved and CERN now uses tplabel
– The st/sg device mapping has been fixed to cope with drive removal and re-insertion (which changed the device order in /proc/scsi/scsi)
– Large messages from network security scans no longer crash the SCSI media changer daemon (rmc)
– Added an option to ignore errors on unload
– Added an option to detect and abort overly long positioning operations (where the drive returns an OK status but positions the tape incorrectly and is therefore suspected of overwriting data)
– Added support for the 1000GC tapes
– Added central syslog logging to the rmc daemon for central tape logging
CERN moved the tape servers to the new release last week
– Running well with Castor stagers
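The st/sg fix above addresses device-order churn in /proc/scsi/scsi. As an illustration only (not the actual CASTOR tape server code), the sketch below shows one way to pair each SCSI tape device (st) with its generic device (sg) by their Host:Bus:Target:LUN address in sysfs, so the pairing survives a drive being removed and re-inserted; the paths are standard Linux sysfs.

```python
import os

SYSFS = "/sys/class"

def scsi_address(dev_class, name):
    # The "device" symlink of an st/sg node points at its SCSI device,
    # whose directory name is the Host:Bus:Target:LUN address.
    return os.path.basename(
        os.path.realpath(os.path.join(SYSFS, dev_class, name, "device")))

def st_to_sg_mapping():
    # Index the generic (sg) nodes by SCSI address first ...
    sg_by_addr = {scsi_address("scsi_generic", sg): sg
                  for sg in os.listdir(os.path.join(SYSFS, "scsi_generic"))}
    # ... then pair every plain rewind device (st0, st1, ...) with the sg
    # node at the same address, instead of trusting enumeration order,
    # which changes when a drive is removed and re-inserted.
    return {st: sg_by_addr.get(scsi_address("scsi_tape", st))
            for st in os.listdir(os.path.join(SYSFS, "scsi_tape"))
            if st.startswith("st") and st[2:].isdigit()}

if __name__ == "__main__":
    for st, sg in sorted(st_to_sg_mapping().items()):
        print(f"/dev/{st} <-> /dev/{sg}")
```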

Tape Log Database
Provides a central log of all tape-related messages and performance data:
– LHCC metrics to SLS
  Number of drives per VO
  File sizes...
– Problem investigations
  When was this tape mounted recently?
  Has this drive reported errors?...
– Automated problem reporting and action
  Library failure
  Tape or drive disable...
– GUI for data visualisation (work in progress)
  Graphs
  Annotated comments such as 'sent tape for repair'
Note: this is a CERN tape operations tool rather than part of the Castor development deliverables. The source code is available if other sites want to use it for inspiration, but no support is provided.
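To make the investigation use cases concrete, here is a minimal sketch of the kind of queries such a log database answers. The schema, table and column names are hypothetical illustrations, not the real CERN tool.

```python
import sqlite3

conn = sqlite3.connect("tapelog.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS tape_events (
    ts       TEXT,   -- ISO-8601 timestamp from syslog
    tape_vid TEXT,   -- volume ID, e.g. 'T01234'
    drive    TEXT,   -- drive name
    event    TEXT,   -- 'mount', 'dismount', 'error', ...
    detail   TEXT
);
""")

def recent_mounts(vid, limit=10):
    # "When was this tape mounted recently?"
    return conn.execute(
        "SELECT ts, drive FROM tape_events "
        "WHERE tape_vid = ? AND event = 'mount' "
        "ORDER BY ts DESC LIMIT ?", (vid, limit)).fetchall()

def drive_errors(drive, since):
    # "Has this drive reported errors?"
    return conn.execute(
        "SELECT ts, tape_vid, detail FROM tape_events "
        "WHERE drive = ? AND event = 'error' AND ts >= ?",
        (drive, since)).fetchall()
```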

Repack – Configuration
Dedicated Castor instance for repack:
– Isolates the load from other stagers, following the experience of sharing with the public instance
– Simplifies debugging
– Allows easier intervention planning and upgrades
Configuration:
– 20 dedicated disk servers (12 TB raw each)
– 150 service classes
– Single headnode
– No LSF
– Dedicated policies, service classes and tape pools for repack, to ensure no mixing of data between pools
22,000 tapes at the start of the intensive run in December 2008 with Castor

Repack – At Work

Repack – File Sizes
The outlook is better than we thought in 2008, since there has been no data taking so far
The basic problem remains that file sizes, especially for legacy files, are small, which leads to slow write performance
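To see why small files hurt tape write rates, consider the fixed per-file overhead (tape marks, labels, synchronisation) a drive pays regardless of file size. The sketch below works this through; the nominal speed and overhead values are illustrative assumptions, not measured CERN figures.

```python
NOMINAL_MBPS = 120.0        # assumed native drive write rate, MB/s
PER_FILE_OVERHEAD_S = 3.0   # assumed cost of tape marks/labels per file

def effective_rate(file_size_mb):
    # Time moving data plus fixed per-file overhead.
    transfer = file_size_mb / NOMINAL_MBPS
    return file_size_mb / (transfer + PER_FILE_OVERHEAD_S)

for size in (10, 100, 1_000, 10_000):   # MB
    print(f"{size:>6} MB file -> {effective_rate(size):6.1f} MB/s")
# Under these assumptions a 10 MB file achieves ~3 MB/s while a 10 GB
# file approaches the nominal rate, which is why small legacy files
# dominate repack time.
```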

Repack – Progress So Far
Projections are based on 20 drives, as this was the most we felt we could use while also doing data taking
Repacking 60% of the tapes would take a year
Completion is unlikely before the next tape drive model
Actual repack progress is close to projected, but.....
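A back-of-the-envelope version of this projection is sketched below. The tape count and drive count come from the deck; the average data per tape and the effective per-drive throughput (well below nominal drive speed because of mounts, positioning and small files) are assumptions chosen for illustration.

```python
TAPES = 22_000          # from the deck: tapes at the December 2008 start
DRIVES = 20             # from the deck: planned concurrent drives
AVG_TAPE_GB = 500       # assumed average data per legacy tape
EFFECTIVE_MBPS = 10     # assumed sustained rate per repack drive

seconds = TAPES * 0.60 * AVG_TAPE_GB * 1024 / (EFFECTIVE_MBPS * DRIVES)
print(f"Repacking 60% of the tapes: ~{seconds / 86_400 / 365:.1f} years")
# With an effective ~10 MB/s per drive this lands near the one year
# quoted on the slide; at the nominal drive speed it would be far less,
# which shows how much the per-file and mount overheads cost.
```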

Repack – Drive Usage
Twice as many drives as planned are needed for the projected rate
It is not clear why the data rate is slower
Luckily, drive demand is currently low, but it is expected to increase soon

Repack – Success Rate
The success percentage is defined as the fraction of tapes which can be repacked on the first attempt without any human intervention
Around 1 in 5 tapes fails to repack completely:
– Stalled repacks when streams became blocked
– Multiple-copy tapes
– Bad name server file sizes compared to the tape segments
– Media/tape errors (occasional)
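The operations scripts mentioned two slides below "try to guess the failure causes"; as a toy illustration of that idea, the sketch below buckets failure messages into the categories listed above. The error strings and patterns are hypothetical examples, not real CASTOR log formats.

```python
import re

RULES = [  # first matching pattern wins
    (re.compile(r"stream.*(blocked|stalled)", re.I), "stalled stream"),
    (re.compile(r"segment size|file size mismatch", re.I), "bad NS file size"),
    (re.compile(r"copy number|dual copy", re.I), "multiple-copy tape"),
    (re.compile(r"read error|media", re.I), "media/tape error"),
]

def classify(error_text):
    for pattern, cause in RULES:
        if pattern.search(error_text):
            return cause
    return "unknown (needs human intervention)"

for msg in ("Stream 42 stalled waiting for disk copy",
            "File size mismatch between name server and segment"):
    print(f"{msg!r} -> {classify(msg)}")
```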

Repack – Difficult Case Example
Repack of a tape fails on 1 file
Check the logs, which indicate a difference between the name server and tape segment file sizes
Recall the file from tape using tpread and check its size and checksum
Set the file size in the name server and clear the checksum... is it the right file/owner?
Repack the tape: the stager reports a bad file size
Fix the file size in the stager and remove the staged copy using SQL scripts
Repack the tape: the file still does not migrate
Report the issue to development (sr#107802)
Manually stage the file: completed OK...
– 5 hours of work to recover one file...
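A hypothetical helper for the "check size and checksum" step above: after recalling the file with tpread, compare its on-disk size and Adler-32 checksum (the checksum type CASTOR records) against the name server values before touching any metadata. The name-server values are passed in by the operator, since querying them is site-specific.

```python
import os
import zlib

def adler32_of(path, chunk_size=1 << 20):
    # Stream the recalled file and compute its Adler-32 checksum.
    value = 1  # Adler-32 seed
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

def verify_recalled_file(path, ns_size, ns_adler32):
    size_ok = os.path.getsize(path) == ns_size
    cksum_ok = adler32_of(path) == ns_adler32
    if size_ok and cksum_ok:
        print("name server metadata matches the recalled file")
    elif cksum_ok:
        print("checksum matches: only the size needs fixing in the name server")
    else:
        print("checksum mismatch: investigate before any fix-up")
    return size_ok and cksum_ok
```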

Repack – Thoughts
Repack has been an effective stress test for Castor, finding many issues in the name server, stager and tape stream handling
The basic repack engine requires regular surveillance to keep it busy but not overloaded:
– 5K LOC of scripts to select, submit, reclaim and try to guess the failure causes. A small part of this functionality is now included in the repack server
– Keeping streams balanced across pools to avoid long queues, device group hot spots and user starvation
– 1 FTE required to tweak the load, analyse the problems and clean up failed repacks
Selecting the large-file tapes first has helped (see the sketch below):
– Larger files give good data rates
– A 10,000-file tape can take several hours to get started
We have been able to benefit from the delayed start-up, but it is not likely to continue so quietly in the next months
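A minimal sketch of the "large-file tapes first" selection policy, under the assumption that candidate tapes can be ranked by average file size; the tape records here are invented, and in the real system this information would come from the CASTOR catalogue.

```python
tapes = [  # (volume ID, data on tape in GB, number of files) - hypothetical
    ("T00001", 480, 9_800),   # many small files: slow to start and write
    ("T00002", 495, 600),     # few large files: streams well
    ("T00003", 300, 12_000),
]

def avg_file_gb(tape):
    vid, data_gb, nfiles = tape
    return data_gb / nfiles

# Submit the tapes with the largest average file size first, so drives
# spend their time streaming data rather than paying per-file overhead
# on 10,000-file tapes.
for vid, data_gb, nfiles in sorted(tapes, key=avg_file_gb, reverse=True):
    print(f"submit {vid}: {data_gb} GB in {nfiles} files")
```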

Crystal Ball – 2009/2010
Main goals for the year:
– 1 TB everywhere
– Repack as much as we can
– SLC5
Finish phasing out:
– Sun 9940B infrastructure
– Sun T10000A drives
– IBM Jaguar 2 (upgrade to Jaguar 3)
The extended run is planned to produce 30 PB:
– Need 15,000 new library slots + media
– Installation must be non-disruptive if after September
Tape Operations team will be reduced to 2 FTE in 2010 (was 4 FTE in 2008)

Further Out
Expecting 15 PB/year
Upgrade the remaining IBM libraries to HD frames
– How to do it while they are still full of tapes?
New contract for tape robotics:
– New computer centre
– New drives, new media, same libraries?
– LTO-5?
Media re-use?
– Repack 50,000 cartridges (and re-sell them if possible)?
– Or buy more libraries?

Conclusions
2008 saw major changes in drive technology, which are now complete
The CERN tape infrastructure is ready for data taking this year
Getting bulk repack into production has been hard work, but it should benefit overall Castor stability
Repacking to completion seems very unlikely during 2009 and will have to compete with the experiments for drive time
Continued pressure on staffing forces continued investment in automation

Backup Slides

Repack – 2008 Theory
[Chart of projected repack completion; the 'aul,max80' curve corresponds to the AUL label format at an 80 MB/s read rate, taking around 3 years to complete]

Repack – Read Drive Usage

Repack – Write Drive Usage

Free Space Trends