
RAL Site Report, HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October. Martin Bly, STFC-RAL.


1 RAL Site Report
HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October
Martin Bly, STFC-RAL

2 Overview
General
Hardware
Storage
Networking
…
02/05/2011 RAL Site Report - HEPiX Spring 2011

3 General
New CEO for STFC
  – John Womersley takes over from Keith Mason on 1st November
  – To 31st March 2015
Staffing @ Tier1
  – 5 staff posts open due to staff moving
  – Replacements agreed despite restrictions
  – Recruitments underway
Power
  – ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
  – Isolated to the join between two bus segments (bus-coupler)
  – Loose bolt in bus bar identified and tightened up – fixed

4 Hardware changes
Summary of previous report:
  – 13 x Dell R610 tape servers (10GbE) for T10KC drives
  – 14 x T10KC tape drives
  – Arista 7124S 24-port 10GbE switch + twinax copper interconnects
  – 5 x Avaya 5650 switches + various 10/100/1000 switches
New since May
  – Various Dell R510s for small data servers for Facilities Data Service, providing interfaces into Castor for RAL site facilities and others
  – 68 x 40TB 4U servers ordered for capacity storage – two suppliers
      10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
      Note that disks may be hard to get
  – 15,000 HEP-SPEC tender completed evaluation, result just announced
To come
  – 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays,...
Gone: 22 x 10TB servers – 2005 generation
To go: 86 x 6TB servers – 2006 generation
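
The capacity figures on the hardware slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and the reading of "2.66PB" as a binary-prefix figure (TB divided by 1024) is an inference from the numbers, not something the slide states.

```python
# Capacity arithmetic for the storage servers listed on the hardware slide.
# Server counts and per-server sizes come from the slide; the names and the
# decimal/binary interpretation are assumptions made for illustration.

def fleet_capacity_tb(server_count: int, tb_per_server: int) -> int:
    """Nominal capacity of a batch of identical servers, in TB."""
    return server_count * tb_per_server

new_order_tb = fleet_capacity_tb(68, 40)   # new capacity storage order
retired_tb = fleet_capacity_tb(22, 10)     # 2005 generation, already gone
to_retire_tb = fleet_capacity_tb(86, 6)    # 2006 generation, still to go

new_order_pb_decimal = new_order_tb / 1000  # 1 PB = 1000 TB
new_order_pb_binary = new_order_tb / 1024   # 1 PB = 1024 TB

print(f"New order: {new_order_tb} TB")
print(f"  decimal prefix: {new_order_pb_decimal:.2f} PB")
print(f"  binary prefix:  {new_order_pb_binary:.2f} PB")
print(f"Retired: {retired_tb} TB, to retire: {to_retire_tb} TB")
```

Under these assumptions the new order is 2720 TB, which is 2.72 PB with a decimal prefix and 2.66 PB with a binary one, so the slide's total matches the binary reading. The two older generations together account for 736 TB.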

5 Storage Issues
Issue with some 3ware controllers throwing perfectly healthy WD drives
  – Due to firmware not recognising and handling failure mode on newer WD drives of the same model
  – Firmware update has fixed this, rollout completed
Issue with Adaptec controllers and StorageManager software
  – SM reports many SMART errors when drives are healthy (reports unhealthy ones too)
  – Firmware update has fixed this, rolling out shortly
Problem with T10KC drives
  – Early production batch issue
  – Firmware fix
  – No recurrence
Production storage now using most recent sets of hardware, with older (smaller capacity) hardware as ‘spinning reserve’

6 Castor Status
Castor manages disk and tape storage
  – 18 million files (at Oct 2011)
Recent news:
  – Moved to T10KC tape media in production in September (Atlas, LHCb)
  – New (non-Tier1) production instance for Diamond synchrotron
      Part of a new complete Facilities Data Service which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data
Coming up (Jan-Mar):
  – Move to new database hardware and a more resilient architecture (using Data Guard) over next 6 months
  – Major upgrade of CASTOR with a new optimized scheduler and new tape functionality – better for small files
  – New service ‘head nodes’ in test: Dell R410 and Transtec

7 Networking
WAN
  – UK NREN JANET now has a 100Gb/s backbone
  – Funding for the next upgrade of the NREN, SuperJANET6, has recently been approved
Site
  – Sporadic packet loss in site core networking (few %)
      Still present to a very small degree – intermittent problems with access to LFC dropping for remote users (T2s). May be load related.
Asymmetric data transfer rates in/out of Tier1
  – Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network (LAN & WAN performance)
  – Have modified FTS settings with some success
  – Looking at Tier1-UK Tier2 transfers
LAN
  – Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
  – Three subnets in use for Tier1
  – Lots of packet discards into stacks, investigating...
Developments
  – Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement linked at multiple 40Gb/s with storage connectivity at 10Gb/s
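
TCP/IP tuning is one of the possible causes the networking slide lists for asymmetric transfer rates. As a rough illustration of why it can matter, the sketch below computes the bandwidth-delay product and the throughput ceiling a fixed TCP window imposes. The round-trip time and window size are hypothetical values chosen for the example, not measurements from RAL, and the function names are invented here; the slides do not say which cause applied.

```python
# Bandwidth-delay product (BDP) sketch using integer arithmetic.
# Only the 10Gb/s link speed comes from the slides; the 10 ms round-trip
# time and 64 KiB window are ASSUMED values for illustration.

def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)

def window_limited_bits_per_s(window_bytes: int, rtt_ms: int) -> int:
    """Upper bound on single-stream throughput for a fixed TCP window."""
    return window_bytes * 8 * 1000 // rtt_ms

LINK_BITS_PER_S = 10_000_000_000  # 10Gb/s storage connectivity
ASSUMED_RTT_MS = 10               # hypothetical round-trip time
ASSUMED_WINDOW = 64 * 1024        # hypothetical untuned 64 KiB window

needed = bdp_bytes(LINK_BITS_PER_S, ASSUMED_RTT_MS)
ceiling = window_limited_bits_per_s(ASSUMED_WINDOW, ASSUMED_RTT_MS)

print(f"Window needed to fill the link: {needed} bytes")
print(f"Ceiling with a 64 KiB window:   {ceiling} bit/s")
```

With these assumed values a single stream would need about 12.5 MB in flight to fill the link, while a 64 KiB window caps it at roughly 52 Mb/s. Different window settings at the two ends of a transfer are one way rates could differ by direction, though load, FTS settings and disk server settings remain equally plausible explanations.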

8 Databases
Small but significant Oracle installation
  – Castor, 3D, LFC, FTS
Castor database server hardware to be replaced
  – Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
  – New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend arrays
  – Different ASM architecture – single volumes rather than paired
  – Data Guard from Production RAC to Standby RAC for resilience
  – Standby RACs in different building
  – Backups off the Standby set
LFC/FTS
  – Standby set to be added to the existing setup, Data Guard and backup as per Castor, single volume data, ASM volume architecture changes
3D
  – ASM volume architecture changes

9 Virtualisation
Evaluated MS Hyper-V for services virtualisation platform
  – Beginning to roll out local-storage virtualisation for services that don’t need fast failover
Struggled for a long time with iSCSI storage arrays (and poor support)
  – New iSCSI arrays ordered
  – To support fast-failover etc
Cloud project
  – Department initiative looking at cloud use
  – Talk by Ian Collier

10 Projects
Quattor
  – Batch and Storage systems under Quattor management
      ~6200 cores, 700+ systems (batch), 500+ systems (storage)
      Significant time saving
  – Significant rollout on Grid services node types
CernVM-FS
  – Major deployment at RAL to cope with software distribution issues
  – More news in talk by Ian Collier later this week

11 Questions?
