Tier1 Site Report – HEPSysMan, RAL, 22 May 2007 – Martin Bly

Overview
– RAL / Tier-1
– Hardware
– Services
– Monitoring
– Networking

RAL / Tier-1
Change in UK science funding structure:
– CCLRC and PPARC have merged to form a new Research Council, the Science and Technology Facilities Council (STFC), with a combined remit covering large facilities, grants, etc.
– RAL is one of several STFC institutes
– Some internal restructuring and name changes in Business Units; new corporate styles, etc.
RAL hosts the UK WLCG Tier-1
– Funded by STFC via the GridPP2 project
– Supports WLCG and UK Particle Physics users and collaborators: atlas, cms, lhcb, alice, dteam, ops, babar, cdf, d0, h1, zeus, bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, mice, sno, ukqcd, harp, theory users …
– No operational change expected as a result of STFC 'ownership'

Finance & Staff
GridPP3 project funding approved
– "From Production to Exploitation"
– Provides for the UK Tier-1, the Tier-2s and some software activity
– Runs April 2008 to March 2011
– Tier-1: increase in staff: 17 FTE (+3.4 FTE from ESC); hardware resources for WLCG: ~£7.2M; a tight funding settlement, with contingencies for hardware and power
Additional Tier1 staff now in post
– 2 x systems administrators: James Thorne, Lex Holt
– 1 x hardware technician: James Adams
– 1 x PPS admin: Marian Klein

New Computing Building
Funding for a new computer centre building
– Funded by RAL/STFC as part of the site infrastructure
– Shared with HPC and other STFC computing facilities
– Design complete: ~300 racks and tape silos
– Planning permission granted
– Tender running for construction and fitting out
– Construction starts in July; planned to be ready for occupation by mid August 2008

Tape Silo
Sun SL8500 tape silo
– Expanded from 6,000 to 10,000 slots
– 8 robot trucks
– 18 x T10K and 10 x 9940B drives; 8 x T10K tape drives for CASTOR
– Second silo planned this FY: an SL8500 with 6,000 slots; tape passing between silos may be possible

Capacity Hardware FY06/07
CPU
– 64 x 1U twin dual-core Woodcrest 5130 units: ~550 kSI2K; 4GB RAM, 250GB data HDD, dual 1Gb NIC; commissioned January 07
– Total capacity ~1550 kSI2K, ~1275 job slots
Disk
– 86 x 3U 16-bay servers: 516TB (10^12) data capacity (see the check below)
– 3Ware 9550SX, 14 x 500GB data drives, 2 x 250GB system drives; twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
– Commissioned March 07, brought into production service as required
– Total disk storage ~900TB; ~40TB being phased out at end of life (~5 years)
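A quick consistency check of the 516TB figure, assuming 12 of the 14 data drives per server hold data and the other two are taken by parity or hot spare (the slide does not state the RAID layout):

    servers = 86
    data_drives_per_server = 12   # assumption: 14 drives minus two for parity/hot spare
    drive_tb = 0.5                # 500 GB drives, decimal TB (10^12 bytes)
    print(servers * data_drives_per_server * drive_tb)   # -> 516.0 TB, matching the slide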

Storage commissioning
Problems with recent storage systems now solved!
– Issue: WD5000YS (500GB) drives randomly thrown (ejected) from RAID units
– No host logs of problems, and testing the drive offline shows no drive issues
– Common to two completely different hardware configurations
Problem isolated:
– Non-return loop in the drive firmware
– The drive head needs to move occasionally to avoid ploughing a furrow in the platter lubricant; due to timeout issues in some circumstances the drive would just sit there stuck, communication with the controller would time out, and the drive would be ejected
– Yanking the drive resets the electronics and no problem is evident (or logged)
WD patched the firmware once the problem was isolated
Subsequent reports of the same or similar problem from non-HEP sites

Operating systems
Grid services, batch workers, service machines
– Mainly SL3.0.3, SL3.0.5, SL3.0.8; some SL4.2, SL4.4; all ix86
– Planning for x86_64 WNs and SL4 batch services
Disk storage
– New servers using SL4/i386/ext3, some x86_64: CASTOR, dCache, NFS, Xrootd
– Older servers: SL4 migration in progress
Tape systems
– AIX: ADS tape caches
– Solaris: silo/library controllers
– SL3/4: CASTOR caches, SRMs, tape servers
Oracle systems
– RHEL3/4
Batch system
– Torque/Maui
– Problems with jobs 'failing to launch', reduced by running Torque with RPP disabled (see the probe sketch below)
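Launch failures of this kind are typically spotted with simple probe jobs. A minimal sketch of such a probe in Python, assuming a Torque client host with qsub/qstat on the PATH (probe_job.sh, the polling interval, and reliance on the "job_state = " line of qstat -f output are assumptions here):

    import subprocess, time

    # Submit a trivial test job and watch whether it ever starts running.
    job_id = subprocess.run(
        ["qsub", "probe_job.sh"],               # probe_job.sh: hypothetical trivial script
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    for _ in range(60):                          # poll for up to ~10 minutes
        out = subprocess.run(["qstat", "-f", job_id],
                             capture_output=True, text=True).stdout
        if "job_state = R" in out or "job_state = C" in out:
            print(job_id, "launched OK")
            break
        time.sleep(10)
    else:
        print(job_id, "never launched - possible launch failure")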

Services
UK National BDII
– The single system was overloaded: dropping connections, failing to reply to queries, failing SAM tests, detrimental to UK Grid services (and reliability stats!)
– Replaced the single unit with a DNS-'balanced' pair in Feb 07 (see the sketch below)
– Extended to a triplet in March
UIs
– Migration to gLite flavour in May 07
CE
– Overloaded system moved to a twin dual-core (AMD) node with a faster SATA drive
– A second (gLite) CE planned to split the load
RB
– Second RB added to ease the load
PPS
– Service now in production
– Testing gLite-flavour middleware
AFS
– Hardware upgrade postponed, pending a review of service needs
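The DNS-'balanced' pair/triplet is round-robin DNS: several A records are published under one service alias and clients rotate across them. A minimal sketch of checking what an alias resolves to (the hostname is a hypothetical placeholder):

    import socket

    # Round-robin DNS returns the full set of A records for the alias;
    # resolvers hand them out in rotating order, spreading client load.
    name, aliases, addresses = socket.gethostbyname_ex("bdii.example.ac.uk")  # hypothetical alias
    print(f"{name} resolves to {len(addresses)} hosts: {addresses}")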

Storage Resource Management
dCache
– Performance issues: LAN performance very good; WAN performance and tuning problematic
– Stability issues
– Now better: increased the number of open file descriptors and the number of logins allowed (see the sketch below); Java 1.4 -> 1.5
ADS
– In-house system, many years old
– Will remain for some legacy services, but not planned for PP
CASTOR
– Replacing both the dCache disk and tape SRMs for the major data services
– Replacing Tier-1 access to the existing tape services
– Production services for ATLAS, CMS, LHCb
– CSA06 to CASTOR went OK
– Support issues
– 'Plan B'
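Raising the open-file-descriptor limit is an OS-level change on the dCache nodes; a minimal sketch of inspecting and raising the per-process limit from Python (the 65536 target is illustrative; in practice the change goes into the system or service configuration, not a script):

    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file descriptors: soft={soft} hard={hard}")

    # Raise the soft limit towards the hard limit for this process only.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))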

CASTOR Issues
Lots of issues causing stability problems
Scheduling of transfer jobs to servers in the wrong service class
Problems upgrading to the latest version
– Tier-1 running older versions, not in use at CERN
– A struggle to get new versions running on the test instance
– Support patchy
Performance on disk servers with a single file system is poor compared to servers with multiple file systems:
– CASTOR schedules transfers per file system whereas LSF applies limits per disk server
– A new LSF plug-in should resolve this but needs the latest LSF and CASTOR
WAN tuning not good for LAN transfers
Problem with 'Reserved Space'
Lots of other niggles and unwarranted assumptions
– Short hostnames

Monitoring
Nagios
– Production service implemented
– Replaces SURE for alarm and exception handling
– 3 servers (1 master + 2 slaves)
– Almost all systems covered (800+)
– Some stability issues with the server (memory use)
– Call-out facilities to be added
Ganglia
– Updating to the latest version (more stable)
CACTI
– Network monitoring
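Nagios raises alarms by running small check plugins whose exit code encodes the state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of such a plugin in Python, checking free space on an illustrative mount point with illustrative thresholds:

    import os, sys

    MOUNT = "/data"         # illustrative mount point
    WARN, CRIT = 15, 5      # illustrative thresholds: percent free space

    try:
        st = os.statvfs(MOUNT)
        pct_free = 100.0 * st.f_bavail / st.f_blocks
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        sys.exit(3)

    if pct_free < CRIT:
        print(f"DISK CRITICAL - {pct_free:.1f}% free on {MOUNT}")
        sys.exit(2)
    if pct_free < WARN:
        print(f"DISK WARNING - {pct_free:.1f}% free on {MOUNT}")
        sys.exit(1)
    print(f"DISK OK - {pct_free:.1f}% free on {MOUNT}")
    sys.exit(0)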

Networking
All systems have 1Gb/s connections
– Except the oldest fraction of the batch farm
10Gb/s interlinks everywhere
– 10Gb/s backbone complete within the Tier-1 (Nortel 5530/5510 stacks)
– Reviewing the Tier-1 internal topology: will it meet the intra-farm transfer rates? (rough estimate below)
– 10Gb/s link to the RAL site backbone; 10Gb/s link to the RAL Tier-2
– 10Gb/s link to the UK academic network SuperJanet5 (SJ5): a direct link to SJ5 rather than via the local MAN, active 10 April 2007
– Link to the firewall now at 2Gb/s; a 10Gb/s bypass planned for T1-T2 data traffic
– 10Gb/s OPN link to CERN; T1-T1 routing via the OPN being implemented
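A rough sense of the intra-farm question, under purely illustrative assumptions (the slides give no per-job I/O figure; the 2.5 MB/s value below is an assumption, not a measurement):

    job_slots = 1275          # from the capacity slide
    per_job_mb_s = 2.5        # assumed average sustained I/O per job (illustrative)
    backbone_gb_s = 10        # single 10Gb/s backbone link

    demand_gb_s = job_slots * per_job_mb_s * 8 / 1000   # MB/s -> Gb/s
    print(f"aggregate farm demand ~{demand_gb_s:.1f} Gb/s vs a {backbone_gb_s} Gb/s link")
    # ~25.5 Gb/s under these assumptions, which is why the internal topology
    # (how traffic spreads across the switch stacks) matters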

Testing developments
Viglen HX2220i 'Twin' system
– Intel Clovertown quad-core CPUs
– Benchmarking; running in the batch system
Viglen HS216a storage
– 3U 16-bay with 3ware 9650SX-16 controller
– Similar to the recent servers, but the controller is PCI-E and supports RAID6
Data Direct Networks storage
– 'RAID'-style controller with disk shelves attached via FC, FC-attached to the servers
– Aim is to test performance under various load types and SRM clients
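The storage evaluations boil down to throughput measurements under different access patterns; a minimal sketch of the simplest case, a sequential-write probe (path and size are illustrative):

    import os, time

    PATH = "/mnt/test/probe.dat"     # illustrative path on the array under test
    SIZE_MB = 1024                   # write 1 GiB in 1 MiB blocks
    block = b"\0" * (1024 * 1024)

    start = time.time()
    with open(PATH, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())         # ensure data reaches the array, not just the page cache
    elapsed = time.time() - start
    print(f"sequential write: {SIZE_MB / elapsed:.0f} MB/s")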

Comments, Questions?