CASTOR at RAL in 2016
Rob Appleyard
Contents Current Status Staffing Upgrade plans Questions Conclusion
Current Status
RAL:
–Tier 1: four instances, 13 disk pools, 12 PB disk, 14 PB tape
  ATLAS > LHCb > CMS > ALICE/'Gen'
–Local facilities: small D0T1 disk instance with large (8 PB) tape backend.
–Condor batch farm.
–Running pretty well. Good availability for the last year.
Changes: Staffing
–Shaun & Juan have left
–Meet George, the new CASTOR person
–Andrey now sole DBA
Changes: Local Facilities
Facilities setup
–Disk cache is many small nodes (11*8TiB)
  Old hardware, but good performance
–User wanted to stage large quantities of data…
  …but it was getting GC-ed before the user got around to retrieving it. Sad user.
–Too expensive to scale up with many small nodes
–Mixing 8TiB old nodes with new big nodes seems like asking for trouble
Changes: Local Facilities
Small, high-performance disk cache is great for migration…
–…but not for users who want to stage large amounts of data.
We don't want to throw away our migration-optimised cache, so we need to find a way to accommodate recalls.
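The staging problem above can be illustrated with a toy model. This is a minimal sketch, assuming a simple oldest-first garbage collector (the real CASTOR GC policy is configurable and may weigh files differently) and a hypothetical recall of 200 files of 1 TiB each. Only the 11*8TiB cache size comes from the slides.

```python
from collections import OrderedDict


def simulate_recall(cache_capacity_tib, file_sizes_tib):
    """Stage files in order into a cache that evicts oldest-first when full.

    Returns the indices of the files still on disk once staging finishes.
    This is a simplified stand-in for a D0T1 garbage collector, not the
    actual CASTOR algorithm.
    """
    cache = OrderedDict()  # file index -> size in TiB, oldest entry first
    used = 0
    for idx, size in enumerate(file_sizes_tib):
        if size > cache_capacity_tib:
            continue  # a file larger than the whole cache can never be staged
        # Evict the oldest staged files until the new one fits.
        while used + size > cache_capacity_tib:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[idx] = size
        used += size
    return list(cache)


small_cache_tib = 11 * 8   # 88 TiB, the existing cache from the slides
recall = [1] * 200         # hypothetical: 200 files of 1 TiB each
survivors = simulate_recall(small_cache_tib, recall)
print(f"{len(survivors)} of {len(recall)} files still on disk; "
      f"oldest survivor is file {survivors[0]}")
```

In this model, every file staged before the oldest survivor has already been collected by the time the recall finishes, which is the "sad user" case: the user comes back for the early files and finds them gone.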
Changes: Local Facilities
The solution: dedicated recall cache.
–Few large nodes, total capacity bigger than max anticipated user recall.
–Conventional D0T1 GC
Now have generic migration cache and 2 recall caches for specific users.
Possibly not necessary to have 2. Works OK for now.
–Is there anything we should be aware of?
User wants to run D1T0 and manage his own deletion.
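The sizing rule for the recall cache ("total capacity bigger than max anticipated user recall") reduces to a ceiling division. The sketch below is illustrative only: the 200 TiB recall, the 100 TiB node size and the 20% headroom are hypothetical figures, not RAL's actual numbers. Only the 8 TiB node size comes from the slides.

```python
def nodes_needed(max_recall_tib, node_size_tib, headroom_pct=20):
    """Number of disk nodes needed so that total capacity exceeds the
    largest anticipated recall plus some headroom.

    Integer arithmetic is used throughout to avoid floating-point rounding.
    """
    required = max_recall_tib * (100 + headroom_pct)  # in units of TiB * %
    per_node = node_size_tib * 100
    return -(-required // per_node)  # ceiling division


# Hypothetical 200 TiB recall with 20% headroom (240 TiB required):
print(nodes_needed(200, 100))  # large nodes (hypothetical 100 TiB each)
print(nodes_needed(200, 8))    # old-style 8 TiB nodes, for comparison
```

With these assumed figures, a handful of large nodes covers what would otherwise take dozens of small ones, which is the cost argument made on the earlier slide.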
Changes: Tier 1
…not a lot, actually
–Run 2 well underway
–Availability generally good
– Real Soon Now™
DB problem turned out to be due to a missing DB link.
–…and now the test instance (mostly) works!
xroot & rfio access all OK
SRM access not working… but lcg-del does.
Have not investigated in any detail due to lack of time this week…
~]$ /usr/bin/lcg-cp --vo dteam --defaultsetype srmv2 --nobdii -S PreprodDiskPool srm:// /rob/junk file:/home/tier1/rvv47345/recall
[SE][StatusOfGetRequest][ETIMEDOUT] httpg:// User timeout over
lcg_cp: Connection timed out
Tape Servers
on all production tape servers
Some issues, Tim to report by .
Major hardware issues with one library
Software issues with ACSLS
Roadmap: RH7-based tape servers
Hardware
Most new hardware allocated to Echo project
–2011-generation nodes still in tape-backed service classes, feeling a bit creaky
–New hardware acquired to fill the gaps
–Helps us keep up with LHC production
Tape Robot Problems
Two periods of difficult running: early May & early June. Consult Tim for the full story.
Both libraries (Tier 1 & 'Facilities') offline at some point, Tier 1 for longer.
Early May: both elevators in Tier 1 robot failed
–Moved drives into Facilities robot to ensure migration
Early June: engineer addressing previous problem received electric shock from robot; robot turned off until confirmed safe
Future Plans
SL7 tape servers
…and Ceph gateways
Echo migration…
–More on this later.
–Outline:
  1) Progressively migrate disk-only CASTOR storage to Echo in co-ordination with VOs
  2) Keep D0T1 CASTOR going.
  3) See talk from last time for further detail ('CASTOR 2017')
Assorted questions from RAL
1. Understood that rfio is being removed. Any estimate of when this will happen?
2. Is there any possibility of running a non-Ceph object store (DDN/Panasas) beneath CASTOR?
–Question from a curious RAL user, motivation unclear.
3. What access protocols will CASTOR support when running on top of Ceph storage?