CASTOR at RAL in 2016 Rob Appleyard

Contents
–Current Status
–Staffing
–Upgrade plans
–Questions
–Conclusion

Current Status

RAL:
–Tier 1: four instances, 13 disk pools, 12 PB disk, 14 PB tape (ATLAS > LHCb > CMS > ALICE/’Gen’)
–Local facilities: small D0T1 disk instance with a large (8 PB) tape backend.
–Condor batch farm.
–Running pretty well. Good availability over the last year.

Changes: Staffing
–Shaun & Juan have left
–Meet George – new CASTOR person
–Andrey now sole DBA

Changes: Local Facilities
Facilities setup:
–Disk cache is many small nodes (11 × 8 TiB). Old hardware, but good performance.
–User wanted to stage large quantities of data…
–…but it was getting GC-ed before the user got around to retrieving it. Sad user.
–Too expensive to scale up with many small nodes.
–Mixing 8 TiB old nodes with new big nodes seems like asking for trouble.

Changes: Local Facilities
Small, high-performance disk cache is great for migration…
–…but not for users who want to stage large amounts of data.
We don’t want to throw away our migration-optimised cache, so we need to find a way to accommodate recalls.

Changes: Local Facilities
The solution: a dedicated recall cache.
–A few large nodes; total capacity bigger than the maximum anticipated user recall.
–Conventional D0T1 GC.
Now have a generic migration cache and 2 recall caches for specific users.
–Possibly not necessary to have 2. Works OK for now.
–Is there anything we should be aware of? User wants to run D1T0 and manage his own deletion.
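The reasoning above can be made concrete with a short sketch. This is not CASTOR’s actual garbage collector; it is a minimal Python illustration of conventional D0T1-style eviction (only files already safe on tape are candidates, oldest first, triggered by a high-water mark), with all names and thresholds invented for the example.

from collections import namedtuple

CachedFile = namedtuple("CachedFile", "name size_bytes last_access on_tape")

HIGH_WATER = 0.95   # start evicting above 95% of capacity (illustrative)
LOW_WATER = 0.85    # stop once usage is back below 85% (illustrative)

def run_gc(files, capacity_bytes):
    """Evict tape-safe files, oldest first, until usage drops below LOW_WATER."""
    used = sum(f.size_bytes for f in files)
    if used < HIGH_WATER * capacity_bytes:
        return files                        # below the high-water mark: do nothing
    # In a D0T1 cache only files already migrated to tape may be deleted.
    candidates = sorted((f for f in files if f.on_tape),
                        key=lambda f: f.last_access)
    kept = [f for f in files if not f.on_tape]
    for f in candidates:
        if used <= LOW_WATER * capacity_bytes:
            kept.append(f)                  # enough space reclaimed, keep the rest
        else:
            used -= f.size_bytes            # evict: only the tape copy remains
    return kept

With the old cache (11 × 8 TiB = 88 TiB) and a recall larger than that, a loop like this starts deleting recalled files before the user has read them, which is exactly the failure described earlier; sizing the dedicated recall cache above the largest anticipated recall keeps conventional GC safe.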

Changes: Tier 1
…not a lot, actually
–Run 2 well underway
–Availability generally good
– Real Soon Now™

DB problem turned out to be due to a missing DB link.
–…and now the test instance (mostly) works!
xroot & rfio access all OK
SRM access not working… but lcg-del does.
Have not investigated in any detail due to lack of time this week…

~]$ /usr/bin/lcg-cp --vo dteam --defaultsetype srmv2 --nobdii -S PreprodDiskPool srm://lcgsrm08.gridpp.rl.ac.uk:8443/srm/managerv2?SFN=/castor/preprod.ral/preprodDisk/rob/junk file:/home/tier1/rvv47345/recall
[SE][StatusOfGetRequest][ETIMEDOUT] httpg://lcgsrm08.gridpp.rl.ac.uk:8443/srm/managerv2: User timeout over
lcg_cp: Connection timed out
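As a quick sanity check on the timeout above, it can help to separate “nothing is answering on the SRM port” from “the request is accepted but never completes” (the latter pointing back at the stager/DB side). Below is a minimal, stand-alone Python sketch of such a probe; the host and port are taken from the error message, everything else is illustrative.

import socket
import sys

HOST = "lcgsrm08.gridpp.rl.ac.uk"   # endpoint from the error message above
PORT = 8443

def probe(host, port, timeout=10):
    """Return True if a plain TCP connection to the SRM endpoint succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except socket.error as exc:
        sys.stderr.write("connection failed: %s\n" % exc)
        return False

if __name__ == "__main__":
    ok = probe(HOST, PORT)
    print("endpoint reachable" if ok else "endpoint unreachable")
    sys.exit(0 if ok else 1)

If the port accepts connections but lcg-cp still times out while lcg-del succeeds, that is more consistent with the get/prepare path on the server side (e.g. the missing DB link) than with a network problem.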

Tape Servers
on all production tape servers
Some issues, Tim to report by .
Major hardware issues with one library
Software issues with ACSLS
Roadmap: RH7-based tape servers

Hardware
Most new hardware allocated to the Echo project
–2011-generation nodes still in tape-backed service classes are feeling a bit creaky
–New hardware acquired to fill the gaps
–Helps us keep up with LHC production

Tape Robot Problems
Two periods of difficult running – early May & early June. Consult with Tim for the full story.
Both libraries (Tier 1 & ‘Facilities’) offline at some point, Tier 1 for longer.
Early May: both elevators in the Tier 1 robot failed
–Moved drives into the Facilities robot to ensure migration continued
Early June: engineer addressing the previous problem received an electric shock from the robot – robot turned off until confirmed safe

Future Plans
SL7 tape servers
…and Ceph gateways
Echo migration…
–More on this later.
–Outline:
1) Progressively migrate disk-only CASTOR storage to Echo in co-ordination with VOs
2) Keep D0T1 CASTOR going.
3) See talk from last time for further detail (‘CASTOR 2017’)

Assorted questions from RAL
1. Understood that rfio is being removed. Any estimate of when this will happen?
2. Is there any possibility of running a non-Ceph object store (DDN/Panasas) beneath CASTOR?
–Question from a curious RAL user, motivation unclear.
3. What access protocols will CASTOR support when running on top of Ceph storage?