RAL Site Report
Martin Bly
HEPiX Fall 2009, LBL, Berkeley CA

Overview
– New Building
– Tier1 move
– Hardware
– Networking
– Developments

New Building + Tier1 move
New building handed over in April
– Half the department moved in to R89 at the start of May
– Tier1 staff and the rest of the department moved in 6 June
Tier1 procurements delivered direct to the new building
– Including the new SL8500 tape silo (commissioned then moth-balled)
– New hardware entered testing as soon as practicable
Non-Tier1 kit including HPC clusters moved starting early June
Tier1 moved 22 June – 6 July
– Complete success, to schedule
– 4 contractor firms, all T1 staff
– 43 racks, a C300 switch and 1 tape silo
– Shortest practical service downtimes

Building issues and developments
Building generally working well, but it is usual to have teething troubles in new buildings…
– Two air-con failures: machine room air temperature reached >40 ºC in 30 minutes
– Moisture where it shouldn’t be
The original building plan included a Combined Heat and Power (CHP) unit, so only enough chilled water capacity was installed to cover the period until the CHP was in place and working
– Plan changed to remove the CHP => shortfall in chilled water capacity
– Two extra 750kW chillers ordered for installation early in 2010
– Will provide the planned cooling capacity until 2012/13
– Timely: planning now underway for the first water-cooled racks (for non-Tier1 HPC facilities)

Recent New Hardware
CPU
– ~3000 kSI2K (~1850 cores) in Supermicro ‘twin’ systems
  E5420/San Clemente & L5420/Seaburg: 2GB/core, 500GB HDD
  Now running SL5/x86_64 in production
Disk
– ~2PB in 4U 24-bay chassis: 22 data disks in RAID6, 2 system disks in RAID1 (see the capacity check below)
– 2 vendors: 50 with a single Areca controller and 1TB WD data drives
– Deployed 60 with dual LSI/3ware/AMCC controllers and 1TB Seagate data drives
Second SL8500 silo, 10K slots, 10PB (1TB tapes)
– Delivered to the new machine room – pass-through to the existing robot
– Tier1 use – GridPP tape drives have been transferred
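
As a rough sanity check of the quoted ~2PB figure, a minimal Python sketch of the RAID6 arithmetic; the drive and server counts are taken from the bullets above, and decimal terabytes are assumed:

    def raid6_usable_tb(disks_in_set, disk_tb):
        """A RAID6 set keeps the equivalent of (n - 2) disks of data capacity."""
        return (disks_in_set - 2) * disk_tb

    # 22-disk RAID6 sets of 1TB drives, across 50 + 60 servers from two vendors
    per_server_tb = raid6_usable_tb(22, 1.0)        # 20 TB usable per server
    total_pb = (50 + 60) * per_server_tb / 1000.0   # ~2.2 PB, consistent with "~2PB"
    print(per_server_tb, total_pb)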

Recent / Next Hardware
‘Services’ nodes
– 10 ‘twins’ (20 systems), twin disks
– 3 Dell PE 2950 III servers and 4 EMC AX4-5 array units for Oracle RACs
– Extra SAN hardware for resilience
Procurements running
– ~15000 HEP-SPEC06 for batch; 3GB RAM and 100GB disk per core => 24GB RAM and a 1TB drive for an 8-core system
– ~3PB disk storage in two lots of two tranches, January and April
– Additional tape drives: 9 x T10KB
  Initially for CMS
  Total 18 x T10KA and 9 x T10KB for PP use
To come
– More services nodes

Disk Storage
~350 servers
– RAID6 on PCI-e SATA controllers, 1Gb/s NIC
– SL4 32bit with ext3
– Capacity ~4.2PB in 6TB, 8TB, 10TB, 20TB servers
– Mostly deployed for the Castor service
  Three partitions per server
– Some NFS (legacy data), xrootd (BaBar)
  Single/multiple partitions as required
Array verification using controller tools
– 20% of the capacity in any Castor service class done in a week
– Tuesday to Thursday, starting with the servers that have gone longest since their last verify (see the sketch below)
– Fewer double throws, decrease in overall throw rates
– Also using CERN's fsprobe to look for silent data corruption
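
The verification scheduling described above (up to ~20% of a service class's capacity per week, oldest-verified servers first) can be illustrated with a short Python sketch; the server names, capacities and dates are made-up placeholders, not real Tier1 data:

    from datetime import datetime

    # Illustrative records: (server name, capacity in TB, date of last verify)
    servers = [
        ("gdss123", 20, datetime(2009, 8, 1)),
        ("gdss124", 10, datetime(2009, 9, 15)),
        ("gdss125", 20, datetime(2009, 7, 20)),
        # ... one entry per server in the service class
    ]

    def weekly_verify_batch(servers, fraction=0.20):
        """Pick the servers longest since last verify, up to ~20% of class capacity."""
        budget = fraction * sum(cap for _, cap, _ in servers)
        batch, used = [], 0
        for name, cap, last in sorted(servers, key=lambda s: s[2]):  # oldest verify first
            if used + cap > budget:
                break
            batch.append(name)
            used += cap
        return batch

    print(weekly_verify_batch(servers))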

Hardware Issues I
Problem during acceptance testing of part of the 2008 storage procurement
– 22 x 1TB SATA drives on a PCI-e RAID controller
– Drive timeouts, arrays inaccessible
Working with the supplier to resolve the issue
– Supplier is working hard on our behalf
– Regular phone conferences
– Engaged with the HDD and controller OEMs
Appears to be two separate issues
– HDD
– Controller
Possible that resolution of both issues is in sight

Hardware Issues II – Oracle databases
New resilient hardware configuration for Oracle databases: SAN using EMC AX4 array sets
– Used in ‘mirror’ pairs at Oracle ASM level
Operated well for Castor pre-move and for non-Castor post-move, but increasing instances of controller dropout on the Castor kit
– Eventual crash of one Castor array, followed some time later by the second array
– Non-Castor array pair also unstable; eventually both crashed together
– Data loss from the Castor databases was a side effect of the arrays crashing at different times and therefore being out of sync. No unique files ‘lost’.
Investigations continuing to find the cause – possibly electrical

Networking
Force10 C300 in use as core switch since Autumn 2008
– Up to 64 x 10GbE at wire speed (32 ports fitted)
Not implementing routing on the C300
– Turns out the C300 doesn’t support policy-based routing…
– … but policy-based routing is on the roadmap for the C300 software
  Next year sometime
Investigating possibilities for added resilience with an additional C300
Doubled up the link to the OPN gateway to alleviate the bottleneck caused by routing UK Tier2 traffic around the site firewall
– Working on doubling links to the edge stacks
Procuring a fallback link for the OPN to CERN using 4 x 1GbE
– Added resilience

Developments I – Batch Services
Production service:
– SL5.2/64bit with residual SL4.7/32bit (2%)
– ~4000 cores, ~32000 HEP-SPEC06
  Opteron 270, Woodcrest E5130, Harpertown E5410, E5420, L5420 and E5440
– All with 2GB RAM/core
– Torque/Maui on an SL5/64bit host with a 64bit Torque server
– Deployed with Quattor in September
– Running 50% over-commit on RAM to improve occupancy (see the sketch below)
Previous service:
– 32bit Torque/Maui server (SL3) and 32bit CPU workers all retired
– Hosts used for testing etc.
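
A minimal sketch of what the 50% RAM over-commit buys in occupancy, using the 2GB/core figure from this slide and the 3GB-per-job request quoted on the procurement slide; the function and the over-commit mechanics shown are illustrative, not the actual Torque/Maui configuration:

    def schedulable_jobs(cores, ram_per_core_gb, job_ram_request_gb, overcommit=1.5):
        """Jobs that fit on a node if RAM is advertised at overcommit x physical."""
        physical_ram = cores * ram_per_core_gb
        advertised_ram = physical_ram * overcommit
        # Limited by cores and by the (over-committed) memory
        return min(cores, int(advertised_ram // job_ram_request_gb))

    # 8-core worker with 2GB/core physical RAM, jobs requesting 3GB each
    print(schedulable_jobs(8, 2, 3, overcommit=1.0))  # 5 jobs -> cores left idle
    print(schedulable_jobs(8, 2, 3, overcommit=1.5))  # 8 jobs -> full occupancy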

Developments II – Dashboard
A new dashboard to provide an operational overview of services and the Tier1 ‘state’ for operations staff, VOs…
Constantly evolving
– Components can be added/updated/removed
– Pulls data from lots of sources
Present components
– SAM Tests
  Latest test results for critical services
  Locally cached for 10 minutes to reduce load (see the sketch below)
– Downtimes
– Notices
  Latest information on Tier1 operations
  Only Tier1 staff can post
– Ganglia plots of key components from the Tier1 farm
Available at
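
A minimal Python sketch of the 10-minute local caching of SAM results mentioned above; the function names are hypothetical and the fetch is a dummy stand-in for whatever source the real dashboard queries:

    import time

    CACHE_TTL = 600  # seconds -- the 10-minute local cache
    _cache = {}      # service name -> (timestamp, result)

    def fetch_sam_results(service):
        # Stand-in for the real SAM test query; returns a dummy record here.
        return {"service": service, "status": "ok", "checked": time.time()}

    def sam_results(service):
        """Return SAM results for a service, refreshing at most every 10 minutes."""
        now = time.time()
        cached = _cache.get(service)
        if cached and now - cached[0] < CACHE_TTL:
            return cached[1]
        result = fetch_sam_results(service)
        _cache[service] = (now, result)
        return result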

Developments III – Quattor
Fabric management using Quattor
– Will replace the existing hand-crafted PXE/kickstart and payload scripting
– Successful trial of Quattor using virtual systems
– Production deployment of SL5/x86_64 WNs and Torque/Maui for the 64bit batch service in mid September
– Now have additional node types under Quattor management
– Working on disk servers for Castor
See Ian Collier’s talk on our Quattor experiences:

Towards data taking
Lots of work in the last 12 months to make services more resilient
– Taking advantage of LHC delays
Freeze on service updates
– No ‘fiddling’ with services
– Increased stability
– Reduced downtimes
– Non-intrusive changes
But we still need to do some things, such as security updates
– Need to manage these to avoid service downtime