RAL Tier1 Report - Martin Bly - HEPSysMan, RAL, 30 June 2009


RAL Tier1 Report
Martin Bly
HEPSysMan, RAL, 30 June 2009

Overview
- New building
- Tier1 move
- Hardware
- Services
- Castor
- Storage
- Networking
- Developments

New Building I
- New computer building now ready
  – Originally due 1 Sept 08, with several interim due dates
  – Accepted Feb 09
  – Cleared for use March 09
- Delay mainly caused by building overruns
  – Building projects generally take longer than scheduled, unless you have lots of money
- Planning blight
  – When can we install the new procurements?
  – When can we move the Tier1? Originally during the LHC shutdown in Jan-Mar 09

New Building II
- Oct 08: the LHC broke...
  – ...which threw all the scheduling up in the air
  – Plan: forge ahead with Castor upgrades, continue with procurements, possible handover of the building in December, move as late as possible within the original LHC downtime schedule
- New LHC schedule announced
  – Long run through 2009/10, starting early Autumn 09
  – Move still scheduled for Feb 09
- Move delayed indefinitely in mid-Jan, when it became clear no firm date could be predicted for acceptance of the building
- Decision: delay the Tier1 move to the last possible window before the experiments require stable computing services ahead of the announced LHC data-taking schedule, leaving enough time for the Tier1 to regain stability after the move
- Meanwhile, concentrate on the upgrades needed for the long run, because there may be no opportunity to do them after data taking starts
  – Not much time...

Tier1 Move
- RAL Tier1 moving 22 June - 6 July
  – Subject to a final go-ahead on June 16th
  – Required agreement of all team leads, the service manager, and the machine room operations group (not Tier1 staff)
- Sequence is complicated by the need to keep downtime of the data transfer services to a minimum
  – RAL hosts several services required for UK-wide data access (FTS, LFC, PROXY, ...)
  – Regarded as unacceptable to have these down for extended periods
- Order is batch, services nodes and databases, storage, silos
  – Downtimes published in GOCDB (which is being replicated off site!)
- Status: all CPU and central services moved, disk servers still being moved, first silo migration started
  – Expect to meet the schedule for resumption of service

New Hardware
- CPU: ~3000 kSI2K (~1850 cores) in Supermicro 'twin' systems
  – E5420 / San Clemente, L5420 / Seaburg
  – 2GB/core, 500GB HDD
- Storage: ~2PB (110 servers) in 4U 24-bay chassis, from 2 vendors, mix of:
  – Single Areca and dual 3ware/AMCC controllers
  – Seagate and WD drives; 22 data disks in RAID6, 2 system disks in RAID1
- Tape: second SL8500 silo, 10K slots, 10PB
  – In the new machine room; pass-through to the existing robot when relocated
  – For Tier1 use; GridPP tape drives will be transferred
- Services nodes etc.
  – 10 'twins' (20 systems), twin disks
  – 3 Dell PE 2950 III servers and 2 array units for the Oracle RACs
  – Extra SAN hardware for resilience

Old Hardware
- Decommissioning programme:
  – 199x: old AFS server replaced
  – 2002: 156 x dual PIII 1.4GHz CPUs (finally!); 44TB SCSI array storage out of use
  – 2003: 80 x dual P4 Xeon 2.66GHz CPUs, now used for testing; 40TB SCSI array storage
  – 2004 hardware to be decommissioned soon: 256 x dual P4 2.8GHz CPUs; 20 x 2 x 3.5TB (~140TB) SCSI array storage
- None of it being moved to the new building

Service Changes
- Original AIX/Transarc AFS server replaced by three new Linux/OpenAFS servers
  – RAID10, 1.4TB each
  – Project to use AFS to provide gLite software for WNs (sketched below)
- VO software areas
  – LHC VOs moved to individual servers
  – CMS now on a 64-bit OS due to the size of its RPM database
- Additional CEs to increase resilience
  – Service for individual VOs spread over more than one system
- RBs decommissioned, WMS/LBs commissioned (3+2)
- Additional hosts for LFC and FTS front-end services
  – Oracle RAC for the FTS/LFC backend: 64-bit, mirrored ASM
  – Additional faster arrays to be added for resilience
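
As a rough illustration of the AFS idea above: on gLite worker nodes the VO software area is located via a VO_<NAME>_SW_DIR environment variable, so serving it from AFS is largely a matter of pointing that variable at an AFS path. The cell, path, VO name and profile script name below are hypothetical, not the actual RAL setup.

    # /etc/profile.d/vo-sw-afs.sh  (hypothetical file name)
    # Point a VO's gLite software area at a read-only AFS volume instead of an NFS mount
    export VO_LHCB_SW_DIR=/afs/example.ac.uk/grid/sw/lhcb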

Batch Services
- Main service:
  – SL4.7/32-bit
  – ~2200 cores, from 2.8GHz Xeons to 5440 Harpertowns
  – Torque/Maui on an SL3/32-bit server
- Next service:
  – SL5(.3)/64-bit with SL4 and 32-bit compatibility libraries
  – Torque/Maui on an SL5/64-bit server (illustrative queue setup sketched below)
  – In test for VOs
  – Rollout and resources as required by the VOs; timetable not fixed
- Would like to roll out before data taking, particularly for the new hardware
  – Swing all 64-bit cores to SL5
  – Probably retire 32-bit-only nodes; no firm decision or commitments
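
One common way to run the SL4 and SL5 services side by side under a single Torque/Maui instance is to tag worker nodes with a property and steer a queue to it. A minimal sketch, with hypothetical hostnames, core counts and queue name, not the actual RAL configuration:

    # server_priv/nodes: tag each worker node with an OS property
    wn0001 np=8 sl5
    wn0002 np=4 sl4

    # qmgr: an execution queue whose jobs land only on nodes carrying the sl5 property
    qmgr -c "create queue gridsl5 queue_type=execution"
    qmgr -c "set queue gridsl5 resources_default.neednodes = sl5"
    qmgr -c "set queue gridsl5 enabled = true"
    qmgr -c "set queue gridsl5 started = true"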

Castor
- Problematic upgrade in late summer 08
  – Bugs and stability issues; long-running issues only now being understood
  – Huge drain on the Castor and database teams at RAL, as well as on the folk at CERN offering support
  – Much more stable since Christmas
- As a result, very careful consideration given to subsequent upgrades
  – Enhanced testing at RAL
  – More conservative downtimes considered for deployment before data taking
  – Extensive consultation with the VOs: no overriding need for data taking
- Question: do it and recover stability before the Tier1 move, or do it after the move and before data taking?
  – No: large-scale deployment of the new release at RAL postponed
  – Testing will continue, to be ready if necessary

Storage
- ECC issues
  – BIOS configurations
- Elevated HDD failure rates
  – ~6%; looks as if we may be running too cool!
- Firmware issues
  – Failure to alarm on certain modes of disk failure
  – Failure to start rebuilds
- Firmware on 3ware cards 'tuned' to ext3 file systems?
- Aim to deploy new hardware with SL5/64-bit/XFS (sketched below)
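
For the SL5/64-bit/XFS aim, the main tuning step on these RAID6 arrays is aligning the filesystem to the controller's stripe geometry. A minimal sketch, assuming a 22-disk RAID6 (20 data spindles) with a 64 KiB stripe unit; the device name, geometry and mount point are assumptions rather than the actual settings:

    # Align XFS to the RAID6 stripe: su = controller stripe unit, sw = number of data spindles
    mkfs.xfs -d su=64k,sw=20 /dev/sdb

    # Typical mount options for a large disk server: 64-bit inodes, no atime updates
    mount -o inode64,noatime /dev/sdb /data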

Networking
- Force10 C300 in use as core switch since Autumn 08
  – Up to 64 x 10GbE at wire speed
- Implementing routing on the C300 (concept sketched below)
  – Easier routing to the LHCOPN, and a bypass of the site firewall for T2 data traffic
  – Attempts to upgrade to routing have been unsuccessful so far
- Nortel 55xx series stacks at the edge
  – CPU farm and storage attached to each stack
  – Experimenting with bigger stacks and trunked 10GbE uplinks to the C300
  – Open question: is combined uplink traffic less than twice the traffic of two single uplinks, because data is more likely to be on a server in the same unit or stack?
- Relocation of the Tier1 to the new building has split the Tier1 network until the remaining legacy kit is retired
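
To illustrate what "routing on the C300" amounts to, a generic Cisco-IOS-style concept sketch of a routed VLAN interface plus a static route that keeps bulk data traffic off the site firewall path. Addresses, VLAN number and prefixes are made up, and the C300's FTOS syntax differs in detail, so treat this purely as an illustration:

    ! Routed interface for the Tier1 data VLAN (addresses are examples only)
    interface Vlan100
     description Tier1-data
     ip address 192.0.2.1 255.255.255.0
     no shutdown
    !
    ! Static route for an (example) remote prefix via a dedicated next hop,
    ! so that data traffic bypasses the site firewall
    ip route 198.51.100.0 255.255.255.0 203.0.113.1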

Developments
- Fabric management
  – Current system: hand-crafted PXE/Kickstart scripts
  – Worked well for several years, but now showing the strain of complex deployments
  – Castor team have been using Puppet for some Castor-related extras: gridmap files, config files (sketched below)
- Time for a comprehensive review of potential replacements
  – Quattor top of the (short) candidate list:
      – Does the necessary
      – Lots of support in the community
      – Recipes for grid-type deployments available
- Trial Quattor system using virtual machines
  – Plan to have a usable production deployment by late August
  – Deploy the new procurements with it
  – Spread to services nodes and existing systems over time
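
For flavour, the "Puppet for Castor extras" approach boils down to resources like the one sketched below; the module name, file path and ownership are assumptions, not the actual Castor-team manifests:

    # Hypothetical Puppet resource: keep a grid-mapfile in step across Castor nodes
    file { '/etc/grid-security/grid-mapfile':
      ensure => file,
      owner  => 'root',
      group  => 'root',
      mode   => '0644',
      source => 'puppet:///modules/castor/grid-mapfile',
    }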