RAL Site Report
Martin Bly
SLAC – October 2005
Overview
– Intro
– Hardware
– OS/Software
– Services
– Issues
RAL T1
Rutherford Appleton Lab hosts the UK LCG Tier-1
– Funded via the GridPP project from PPARC
– Supports LCG and UK Particle Physics users
VOs:
– LCG: Atlas, CMS, LHCb, (Alice), dteam
– Babar
– CDF, D0, H1, Zeus
– Bio, Pheno
Experiments:
– Minos, Mice, SNO, UKQCD
Theory users …
Tier 1 Hardware
~950 CPUs in batch service
– 1.4GHz, 2.66GHz and 2.8GHz P3 and P4/Xeon (HT off)
– 1.0GHz systems retiring as they fail; phase-out end Oct '05
– New procurement:
  Aiming for a target SPECint2000/CPU rating
  Systems in for testing as part of the tender evaluation
  First delivery early '06, second delivery April/May '06
~40 systems for services (FEs, RB, CE, LCG servers, loggers etc)
60+ disk servers
– Mostly SCSI-attached IDE or SATA, ~220TB unformatted
– New procurement: probably a PCI/SATA solution
Tape robot
– 6K slots, 1.2PB, 10 drives
Tape Robot / Data Store
Current data: 300TB, of which PP accounts for 200+TB (110TB Babar)
Castor 1 system trials
– Many CERN-specifics
HSM (Hierarchical Storage Manager)
– 500TB, DMF (Data Management Facility), SCSI/FC
– A real file system; data migrates to tape after inactivity
– Not for PP data
– Due November '05
Procurement for a new robot underway
– 3PB, ~10 tape drives
– Expect to order end Oct '05
– Delivery December '05
– In service by March '06 (for SC4)
– Castor system
Networking
Tier-1 backbone at 4x1Gb/s
– Upgrading some links to 10Gb/s
– Multi-port 10Gb/s layer-2 switch stack as hub when available
1Gb/s production link from the Tier-1 to the RAL site
1Gb/s link to SJ4 (internet)
– 1Gb/s HW firewall
Upgrade of the site backbone to 10Gb/s expected late '05 / early '06
– Tier-1 to site link at 10Gb/s – possible mid-2006
– Site link to 10Gb/s – mid '06
Site firewall remains an issue – limited to 4Gb/s
2x1Gb/s link to UKLight
– Separate development network in the UK
– Links to 2Gb/s, 1Gb/s (pending)
– Managed ~90MB/s during SC2, less since
  Problems with small packet loss causing traffic limitations
– Tier-1 to UKLight upgrade to 4x1Gb/s pending; 10Gb/s possible
– UKLight link to CERN at 4Gb/s for early '06
– Over-running hardware upgrade (4 days expanded to 7 weeks)
Tier1 Network Core – SC3
[Network diagram: Tier-1 core switches (7i-1, 7i-3) and Router A, with dCache pools, gridftp servers, ADS caches and non-SC hosts attached at 4x1Gb/s and Nx1Gb/s; 2x1Gb/s to the UKLight router, which carries 2x1Gb/s to CERN and 290Mb/s to Lancaster; 1Gb/s to SJ4 via the firewall from the RAL site.]
OS/Software
Main services:
– Batch, FEs, CE, RB…: SL3 (3.0.3, 3.0.4, 3.0.5)
  LCG 2_6_0
  Torque/MAUI, 1 job per CPU (see the sketch below)
– Disk servers: custom RH7.2 and RH7.3 builds
– Some internal services on SL4 (loggers)
– Project to move disk servers to SL4.n underway
Solaris disk servers decommissioned
– Most hardware sold
AFS on AIX (Transarc)
– Project to move to Linux (SL3/4)
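In a standard Torque setup, the "1 job per CPU" policy is usually expressed by the number of np slots declared per worker in the server's nodes file. A minimal sketch assuming that convention; the hostnames, CPU counts and file path are illustrative, not the RAL configuration.

```python
#!/usr/bin/env python
"""Hypothetical sketch: generate a Torque nodes file giving each batch worker
one job slot per physical CPU (HT off). Hostnames and CPU counts below are
invented examples, not the real RAL worker inventory."""

WORKERS = {
    # hostname: number of physical CPUs (illustrative)
    "wn0001.example.rl.ac.uk": 2,
    "wn0002.example.rl.ac.uk": 2,
    "wn0003.example.rl.ac.uk": 1,
}

def nodes_file(workers):
    """Return Torque nodes-file lines of the form '<hostname> np=<slots>'."""
    return "".join(f"{host} np={cpus}\n" for host, cpus in sorted(workers.items()))

if __name__ == "__main__":
    # Typically written to the server's server_priv/nodes file (path varies by install).
    print(nodes_file(WORKERS), end="")
```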
Services (1) – Objyserv
Objyserv database service (Babar)
– Old service on a traditional NFS server
  Custom NFS, heavily loaded; unable to cope with increased activity on the batch farm due to threading issues in the server
  Adding another server of the same technology was not tenable
– New service:
  Twin ams-based servers: 2 CPUs, HT on, 2GB RAM
  SL3, RAID1 data disks
  4 server processes per host system
  Internal redirection using iptables to different server ports, depending on which of the 4 IP addresses is used to make the connection (see the sketch below)
  Able to cope with some ease: 600+ clients
Contact: Chris Brew
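A minimal sketch of the iptables-based redirection described above, assuming four alias IP addresses on a host running four ams server processes on different local ports; all addresses and port numbers are invented for illustration, not the actual Objyserv configuration.

```python
#!/usr/bin/env python
"""Illustrative sketch of per-IP port redirection with iptables.

Assumes one host carries four alias IP addresses and runs four server
processes on distinct local ports. Clients all connect to the same well-known
port; a NAT rule redirects each alias address to 'its' server instance.
All addresses and ports below are placeholders."""
import subprocess

CLIENT_PORT = 6000          # port the clients connect to (illustrative)
SERVERS = {
    "192.168.10.1": 6001,   # alias IP -> local port of server instance 1
    "192.168.10.2": 6002,
    "192.168.10.3": 6003,
    "192.168.10.4": 6004,
}

def add_redirect(dest_ip, local_port):
    """Add a NAT PREROUTING rule: TCP traffic to dest_ip:CLIENT_PORT is
    redirected to local_port on this host."""
    subprocess.run(
        ["iptables", "-t", "nat", "-A", "PREROUTING",
         "-d", dest_ip, "-p", "tcp", "--dport", str(CLIENT_PORT),
         "-j", "REDIRECT", "--to-ports", str(local_port)],
        check=True,
    )

if __name__ == "__main__":
    for ip, port in SERVERS.items():
        add_redirect(ip, port)
```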
Services (2) – Home file system
Home file system migration
– Old system: ~85GB on an A1000 RAID array
  Sun Ultra10, Solaris 2.6, 100Mb/s NIC
  Failed to cope with some forms of pathological use
– New system: ~270GB SCSI RAID5, 6-disk chassis
  2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
  SL3, ext3
  Stable under I/O and quota testing, and during backup
– Migration: 3 weeks of planning
  1 week of nightly rsync followed by checksumming, to convince ourselves the rsync works (see the sketch below)
  1 day farm shutdown to migrate
  A single file detected with a checksum error
  Quotas for users unchanged…
  Old system kept on standby to restore its backups
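A minimal sketch of the kind of post-rsync checksum verification described above: walk the old and new home areas, compute a checksum for each file, and report mismatches or missing files. The paths and the choice of MD5 are assumptions, not the scripts actually used.

```python
#!/usr/bin/env python
"""Compare two directory trees file-by-file using MD5 checksums.

Illustrative only: the verification run after the nightly rsync may have
worked differently. Paths are supplied on the command line."""
import hashlib
import os
import sys

def md5sum(path, blocksize=1 << 20):
    """Return the hex MD5 digest of a file, read in 1MB blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def compare_trees(old_root, new_root):
    """Yield (relative path, reason) for files that differ or are missing."""
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if not os.path.isfile(new_path):
                yield rel, "missing on new server"
            elif md5sum(old_path) != md5sum(new_path):
                yield rel, "checksum mismatch"

if __name__ == "__main__":
    old_root, new_root = sys.argv[1], sys.argv[2]   # e.g. old and new /home mounts (placeholders)
    problems = list(compare_trees(old_root, new_root))
    for rel, reason in problems:
        print(f"{rel}: {reason}")
    sys.exit(1 if problems else 0)
```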
Services (3) – Batch Server
Catastrophic disk failure late on a Saturday evening over a holiday weekend
– Staff not expected back until 8:30am Wednesday
– Problem noted Tuesday morning
– Initial inspection: the disk was a total failure
– No easy access to backups
  Backup tape numbers were in logs on the failed disk!
– No easy recovery solution with no other systems staff available
– Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper etc.
  But no accounting data, and no new jobs started
Wednesday:
– Hardware 'revised' with two disks, software RAID1, clean install of SL3
– Backups located; batch/scheduling configs recovered from the tape store
– System restarted with MAUI off to allow Torque to sort itself out (see the sketch below)
  Queues came up closed
– MAUI restarted
– Service picked up smoothly
Lessons:
– Know where the backups are and how to identify which tapes are the right ones
– Unmodified batch workers are not good enough for system services
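A hedged sketch of the recovery ordering described on this slide, using standard Torque/MAUI commands: bring pbs_server up with the scheduler off, reopen the queues that came up closed via qmgr, then start MAUI. Queue names and daemon paths are placeholders, not the RAL setup.

```python
#!/usr/bin/env python
"""Illustrative recovery ordering for a Torque/MAUI batch server.

Mirrors the sequence above: start Torque first with the scheduler off,
reopen queues that came up closed, then start MAUI so dispatching resumes.
Queue names and daemon paths are placeholders."""
import subprocess

QUEUES = ["prod", "short", "long"]   # illustrative queue names

def run(cmd):
    """Echo and execute a command, aborting on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # 1. Start pbs_server alone so Torque can recover its job and queue state.
    run(["/usr/sbin/pbs_server"])

    # 2. Queues came up closed: enable and start each one via qmgr.
    for q in QUEUES:
        run(["qmgr", "-c", f"set queue {q} enabled = True"])
        run(["qmgr", "-c", f"set queue {q} started = True"])

    # 3. Only then start the MAUI scheduler so it can begin dispatching jobs.
    run(["/usr/sbin/maui"])
```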
Issues
How to run resilient services on non-resilient hardware?
– Committed to run 24x365 at 98%+ uptime
– Modified batch workers with extra disks and hot-swap caddies as servers
– Investigating HA-Linux
  Batch server and scheduling experiments positive
  RB, CE, BDII, R-GMA …
  Databases
Building services maintenance
– Aircon, power
– Already two substantial shutdowns planned for 2006
– New building
UKLight is a development project network
– There have been problems managing expectations for production services on a development network
– Unresolved packet loss in CERN-RAL transfers – under investigation
10Gb/s kit is expensive
– Components we would like are not yet affordable/available
– Pushing against the LCG turn-on date
Questions?