Slide 1: RAL Site Report
Martin Bly, HEPiX Fall 2009, 26-30 October 2009, LBL, Berkeley CA
Slide 2: Overview
- New Building
- Tier1 move
- Hardware
- Networking
- Developments
Slide 5: New Building + Tier1 Move
- New building handed over in April
  - Half the department moved into R89 at the start of May
  - Tier1 staff and the rest of the department moved in on 6 June
- Tier1 2008 procurements were delivered directly to the new building
  - Including the new SL8500 tape silo (commissioned, then mothballed)
  - New hardware entered testing as soon as practicable
- Non-Tier1 kit, including the HPC clusters, moved starting in early June
- Tier1 moved 22 June - 6 July
  - Complete success, to schedule
  - 4 contractor firms, all Tier1 staff
  - 43 racks, a C300 switch and 1 tape silo
  - Shortest practical service downtimes
Slide 6: Building Issues and Developments
- Building generally working well, but teething troubles are usual in new buildings:
  - Two air-con failures; machine room air temperature reached >40 ºC in 30 minutes
  - Moisture where it shouldn't be
- The original building plan included a Combined Heat and Power (CHP) unit, so only enough chilled-water capacity was installed to cover the period until the CHP was installed and working
  - Plan changed to remove the CHP => shortfall in chilled-water capacity
  - Two extra 750 kW chillers ordered for installation early in 2010
  - These provide the planned cooling until 2012/13
  - Timely: planning is now underway for the first water-cooled racks (for non-Tier1 HPC facilities)
Slide 8: Recent New Hardware
- CPU
  - ~3000 kSI2K (~1850 cores) in Supermicro 'twin' systems
  - E5420/San Clemente & L5420/Seaburg: 2GB/core, 500GB HDD
  - Now running SL5/x86_64 in production
- Disk
  - ~2PB in 4U 24-bay chassis: 22 data disks in RAID6, 2 system disks in RAID1
  - Two vendors: 50 servers with a single Areca controller and 1TB WD data drives; 60 deployed with dual LSI/3ware/AMCC controllers and 1TB Seagate data drives
- Second SL8500 silo, 10K slots, 10PB (1TB tapes)
  - Delivered to the new machine room; pass-through to the existing robot
  - Tier1 use: GridPP tape drives have been transferred
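The RAID layout above fixes the usable capacity per chassis. A quick sketch of the arithmetic (drive counts and sizes come from the slide; the helper function itself is just illustrative):

```python
def raid6_usable_tb(data_disks: int, disk_tb: float) -> float:
    """RAID6 dedicates two disks' worth of space to parity,
    so usable capacity is (n - 2) * disk size."""
    return (data_disks - 2) * disk_tb

# 24-bay chassis: 22 x 1TB data disks in RAID6 (plus 2 system disks in RAID1)
per_server = raid6_usable_tb(22, 1.0)
print(per_server)  # 20.0 TB usable per server

# ~110 such servers (50 + 60 from the two vendors) gives roughly the ~2PB quoted
print(per_server * 110 / 1000)  # 2.2 PB
```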
Slide 9: Recent / Next Hardware
- 'Services' nodes
  - 10 'twins' (20 systems), twin disks
  - 3 Dell PE 2950 III servers and 4 EMC AX4-5 array units for the Oracle RACs
  - Extra SAN hardware for resilience
- Procurements running
  - ~15000 HEP-SPEC06 for batch; 3GB RAM and 100GB disk per core => 24GB RAM and a 1TB drive for an 8-core system
  - ~3PB disk storage in two lots of two tranches, January and April
  - Additional tape drives: 9 x T10KB, initially for CMS; total 18 x T10KA and 9 x T10KB for PP use
- To come
  - More services nodes
Slide 10: Disk Storage
- ~350 servers
  - RAID6 on PCI-e SATA controllers, 1Gb/s NIC
  - SL4 32-bit with ext3
  - Capacity ~4.2PB in 6TB, 8TB, 10TB and 20TB servers
- Mostly deployed for the Castor service
  - Three partitions per server
- Some NFS (legacy data) and xrootd (BaBar)
  - Single/multiple partitions as required
- Array verification using controller tools
  - 20% of the capacity in any Castor service class is verified in a week
  - Runs Tuesday to Thursday, on the servers that have gone longest since their last verify
  - Fewer double throws; decrease in overall throw rates
- Also using CERN's fsprobe to look for silent data corruption
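The verification policy above (20% of a service class per week, oldest-verified servers first) amounts to a simple selection loop. A minimal sketch, assuming a per-server record of capacity and last-verify date; the server names and field names are hypothetical:

```python
from datetime import date

def pick_servers_to_verify(servers, total_capacity_tb, weekly_fraction=0.20):
    """Select servers longest since their last verify until roughly
    weekly_fraction of the service class capacity is covered."""
    target = total_capacity_tb * weekly_fraction
    picked, covered = [], 0.0
    for s in sorted(servers, key=lambda s: s["last_verified"]):
        if covered >= target:
            break
        picked.append(s["name"])
        covered += s["capacity_tb"]
    return picked

servers = [
    {"name": "gdss100", "capacity_tb": 6,  "last_verified": date(2009, 9, 1)},
    {"name": "gdss101", "capacity_tb": 8,  "last_verified": date(2009, 10, 10)},
    {"name": "gdss102", "capacity_tb": 10, "last_verified": date(2009, 8, 15)},
]
# Target is 20% of 24TB = 4.8TB; the oldest server alone covers it.
print(pick_servers_to_verify(servers, total_capacity_tb=24))  # ['gdss102']
```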
Slide 11: Hardware Issues I
- Problem during acceptance testing of part of the 2008 storage procurement
  - 22 x 1TB SATA drives on a PCI-e RAID controller
  - Drive timeouts, arrays inaccessible
- Working with the supplier to resolve the issue
  - Supplier is working hard on our behalf; regular phone conferences
  - Engaged with the HDD and controller OEMs
- There appear to be two separate issues: one with the HDDs and one with the controller
- Possible that a resolution of both issues is in sight
Slide 12: Hardware Issues II - Oracle Databases
- New resilient hardware configuration for the Oracle databases: a SAN using EMC AX4 array sets
  - Used in 'mirror' pairs at the Oracle ASM level
- Operated well for Castor pre-move and for non-Castor post-move, but increasing instances of controller dropout on the Castor kit
  - Eventual crash of one Castor array, followed some time later by the second array
  - The non-Castor array pair was also unstable; eventually both crashed together
  - Data loss from the Castor databases as a side effect of the arrays crashing at different times and therefore being out of sync; no unique files 'lost'
- Investigations continuing to find the cause; possibly electrical
Slide 13: Networking
- Force10 C300 in use as the core switch since Autumn 2008
  - Up to 64 x 10GbE at wire speed (32 ports fitted)
- Not implementing routing on the C300
  - It turns out the C300 doesn't support policy-based routing...
  - ...but policy-based routing is on the roadmap for the C300 software, sometime next year
- Investigating possibilities for added resilience with an additional C300
- Doubled up the link to the OPN gateway to alleviate a bottleneck caused by routing UK T2 traffic around the site firewall
  - Working on doubling the links to the edge stacks
- Procuring a fallback link for the OPN to CERN using 4 x 1GbE, for added resilience
Slide 14: Developments I - Batch Services
- Production service:
  - SL5.2/64-bit with residual SL4.7/32-bit (2%)
  - ~4000 cores, ~32000 HEP-SPEC06: Opteron 270, Woodcrest E5130, Harpertown E5410, E5420, L5420 and E5440
  - All with 2GB RAM/core
  - Torque/Maui on an SL5/64-bit host with a 64-bit Torque server
  - Deployed with Quattor in September
  - Running 50% over-commit on RAM to improve occupancy
- Previous service:
  - 32-bit Torque/Maui server (SL3) and 32-bit CPU workers all retired
  - Hosts used for testing etc.
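The effect of the 50% RAM over-commit can be sketched numerically. The slide does not say how the over-commit is configured in Maui, so this is only the arithmetic behind it, with an assumed per-job RAM request:

```python
def job_slots(physical_ram_gb: float, ram_per_job_gb: float,
              overcommit: float = 0.50) -> int:
    """Slots available when the scheduler may promise jobs
    (1 + overcommit) x the physical RAM of a worker node."""
    return int(physical_ram_gb * (1 + overcommit) // ram_per_job_gb)

# An 8-core worker at 2GB/core has 16GB physical RAM; with 50%
# over-commit it can host 12 jobs each requesting 2GB (vs 8 without).
print(job_slots(16, 2))           # 12
print(job_slots(16, 2, 0.0))      # 8
```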
Slide 15: Developments II - Dashboard
- A new dashboard to provide an operational overview of services and the Tier1 'state' for operations staff, VOs...
- Constantly evolving
  - Components can be added/updated/removed
  - Pulls data from many sources
- Present components
  - SAM tests: latest test results for the critical services, cached locally for 10 minutes to reduce load
  - Downtimes
  - Notices: latest information on Tier1 operations; only Tier1 staff can post
  - Ganglia plots of key components from the Tier1 farm
- Available at http://www.gridpp.rl.ac.uk/status
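The 10-minute local cache of SAM results mentioned above is a standard time-to-live pattern. A minimal sketch; the URL and the `fetch_fn` stand-in for the real HTTP call are hypothetical:

```python
import time

CACHE_TTL = 600  # 10 minutes, as on the slide
_cache = {}      # url -> (fetched_at, payload)

def fetch_sam_results(url, fetch_fn):
    """Return cached SAM test results if still fresh; otherwise re-fetch."""
    now = time.time()
    if url in _cache and now - _cache[url][0] < CACHE_TTL:
        return _cache[url][1]
    payload = fetch_fn(url)
    _cache[url] = (now, payload)
    return payload

# Demonstrate that a second request within the TTL hits the cache.
calls = []
def fake_fetch(url):
    calls.append(url)
    return {"ce": "ok", "srm": "ok"}

fetch_sam_results("https://sam.example/ral", fake_fetch)
fetch_sam_results("https://sam.example/ral", fake_fetch)
print(len(calls))  # 1 - the second call was served from the cache
```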
Slide 17: Developments III - Quattor
- Fabric management using Quattor
  - Will replace the existing hand-crafted PXE/kickstart and payload scripting
  - Successful trial of Quattor using virtual systems
  - Production deployment of SL5/x86_64 WNs and Torque/Maui for the 64-bit batch service in mid September
  - Now have additional node types under Quattor management
  - Working on the disk servers for Castor
- See Ian Collier's talk on our Quattor experiences: http://indico.cern.ch/contributionDisplay.py?contribId=52&sessionId=21&confId=61917
Slide 18: Towards Data Taking
- Lots of work in the last 12 months to make services more resilient, taking advantage of the LHC delays
- Freeze on service updates
  - No 'fiddling' with services
  - Increased stability, reduced downtimes
  - Non-intrusive changes only
- But some things, such as security updates, still need doing
  - These need to be managed so as to avoid service downtime