Tier1 Report Cambridge 23rd October 2006 Martin Bly.

Overview
– Tier-1
– Hardware changes
– Services

RAL Tier-1
RAL hosts the UK WLCG Tier-1
– Funded via the GridPP2 project from PPARC
– Supports WLCG and UK Particle Physics users and collaborators
VOs:
– LHC: ATLAS, CMS, LHCb, ALICE (plus dteam, ops)
– BaBar, CDF, D0, H1, ZEUS
– bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
Other experiments:
– MICE, SNO, UKQCD
Theory users …

Staff / Finance
Bid to PPARC for ‘GridPP3’ project
– For the exploitation phase of the LHC
– September 2007 to March 2011
– Increase in staff and hardware resources
– Result early 2007
Tier-1 is recruiting
– 2 x systems admins, 1 x hardware technician
– 1 x grid deployment
– Replacement for Steve Traylen to head the grid deployment and user support group
CCLRC internal reorganisation into Business Units
– The Tier-1 service is run by the E-Science department, which is now part of the Facilities Business Unit (FBU)

New building
Funding approved for a new computer centre building
– 3 floors: computer rooms on the ground floor, offices above
– 240 m² low power density room
  Tape robots, disk servers etc
  Minimum heat density 1.0 kW/m², rising to 1.6 kW/m² by 2012
– 490 m² high power density room
  Servers, CPU farms, HPC clusters
  Minimum heat density 1.8 kW/m², rising to 2.8 kW/m² by 2012
– UPS computer room
  8 racks + 3 telecoms racks
  UPS system to provide continuous power of 400 A / 92 kVA three-phase for equipment, plus power to air conditioning (total approx 800 A / 184 kVA)
– Overall
  Space for 300 racks (+ robots, telecoms)
  Power: 2700 kVA initially, max 5000 kVA by 2012 (inc. air-con)
  UPS capacity to meet an estimated 1000 A / 250 kVA for minutes, for specific hardware, for clean shutdown / surviving short breaks
– Shared with HPC and other CCLRC computing facilities
– Planned to be ready by summer 2008
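As a quick sanity check of the quoted room sizes and heat densities, the totals can be worked out directly; a minimal sketch using only the figures on the slide (the totals and the closing comparison are my own arithmetic, not from the slide):

```python
# Rough sanity check of the quoted heat loads (illustrative only; room
# areas and kW/m2 densities are taken from the slide, the totals are
# back-of-the-envelope arithmetic).

rooms = {
    # name: (area_m2, density_initial_kW_per_m2, density_2012_kW_per_m2)
    "low density (tape/disk)": (240, 1.0, 1.6),
    "high density (CPU/HPC)":  (490, 1.8, 2.8),
}

for name, (area, d_now, d_2012) in rooms.items():
    print(f"{name}: {area * d_now:.0f} kW initially, "
          f"{area * d_2012:.0f} kW by 2012")

total_2012 = sum(area * d for _, (area, _, d) in rooms.items())
print(f"Combined IT heat load by 2012: ~{total_2012:.0f} kW")
# Roughly 1.8 MW of IT heat by 2012, before air-conditioning overhead.
```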

Hardware changes
FY05/06 capacity procurement, March 06
– 52 x 1U twin dual-core AMD Opteron 270 units
  Tyan 2882 motherboard
  4GB RAM, 250GB SATA HDD, dual 1Gb NIC
  208 job slots, 200 kSI2K
  Commissioned May 06, running well
– 21 x 5U 24-bay disk servers: 168TB (210TB) data capacity
  Areca 1170 PCI-X 24-port controller
  22 x 400GB (500GB) SATA data drives, RAID 6
  2 x 250GB SATA system drives, RAID 1
  4GB RAM, dual 1Gb NIC
  Commissioning delayed (more…)
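The quoted disk capacities follow directly from the drive counts and RAID level; a minimal sketch of that arithmetic (server and drive counts are from the slide; the assumption that RAID 6 gives up two drives' worth of capacity to parity is standard, and decimal TB are used):

```python
# Usable capacity of the March 06 disk procurement, assuming one
# 22-drive RAID 6 array per server (two drives' capacity lost to parity).
servers = 21
data_drives = 22
parity_drives = 2           # RAID 6

for drive_gb in (400, 500): # original drives and the 500GB WD swap-ins
    per_server_tb = (data_drives - parity_drives) * drive_gb / 1000
    print(f"{drive_gb}GB drives: {per_server_tb:.0f}TB/server, "
          f"{servers * per_server_tb:.0f}TB total")
# 400GB drives ->  8TB/server, 168TB total
# 500GB drives -> 10TB/server, 210TB total
```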

Hardware changes (2)
FY06/07 capacity procurements
– 47 x 3U 16-bay disk servers: 282TB data capacity
  3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller
  14 x 500GB SATA data drives, RAID 5
  2 x 250GB SATA system drives, RAID 1
  Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
  Delivery expected October 06
– 64 x 1U twin dual-core Intel Woodcrest 5130 units (550 kSI2K)
  4GB RAM, 250GB SATA HDD, dual 1Gb NIC
  Delivery expected November 06
Upcoming in FY06/07:
– Further 210TB disk capacity expected December 06 (same spec as above)
– High-availability systems with UPS: redundant PSUs, hot-swap paired HDDs, etc
– AFS replacement
– Enhancement to Oracle services (disk arrays or RAC servers)

Hardware changes (3)
SL8500 tape robot
– Expanded from 6,000 to 10,000 slots
– 10 drives shared between all users of the service
– Additional 3 x T10K tape drives for PP
– More when the CASTOR service is working
STK Powderhorn
– Decommissioned and removed

Storage commissioning
Problems with the March 06 procurement:
– WD4000YR drives on Areca 1170 controllers, RAID 6
  Many instances of multiple drive dropouts
  Unwarranted drive dropouts, followed by re-integration of the same drive
– Drive electronics (ASIC) on the 4000YR (400GB) units changed with no change of model designation
  We got the updated units
– Firmware updates to the Areca cards did not solve the issues
– WD5000YS (500GB) units swapped in by WD
  Fixes most issues, but…
– Status data and logs from the drives show several additional problems
  Testing under high load to gather statistics
– Production further delayed
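The slide does not say how the drive status data were collected; a sketch of one plausible approach, periodically snapshotting SMART attributes with smartctl during a load test so that dropout-prone drives can be compared afterwards (the device list, log path and interval are invented for the example):

```python
#!/usr/bin/env python3
# Illustrative sketch only: snapshot SMART attributes from a set of
# drives at intervals while a load test runs. smartctl, the device
# names and the output path are assumptions, not the RAL setup.
import subprocess, time, datetime

DEVICES = [f"/dev/sd{c}" for c in "bcdefgh"]   # hypothetical data drives
LOGFILE = "/var/tmp/smart-snapshots.log"       # hypothetical location

def snapshot(dev):
    """Return the raw SMART attribute table for one device."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True)
    return out.stdout

with open(LOGFILE, "a") as log:
    for _ in range(144):                       # ~24 hours of snapshots
        stamp = datetime.datetime.now().isoformat()
        for dev in DEVICES:
            log.write(f"=== {stamp} {dev} ===\n{snapshot(dev)}\n")
        log.flush()
        time.sleep(600)                        # every 10 minutes
```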

Air-con issues
Setup
– 13 x 80kW units in the lower machine room; several paired units work together
Several ‘hot’ days (for the UK) in July
– Sunday: dumped ~70 jobs
  Alarm system failed to notify operators
  Pre-emptive automatic shutdown not triggered
  Ambient air temperature reached >35C, machine exhaust temperature >50C!
  HPC services not so lucky
– Mid week 1: problems over two days
  Attempts to cut load by suspending batch services to protect data services
  Forced to dump 270 jobs
– Mid week 2: two hot days predicted
  Pre-emptive shutdown of batch services in the lower machine room
  No jobs lost, data services remained available
Problem
– High ambient air temperature tripped high-pressure cut-outs in the refrigerant gas circuits
– Cascade failure as individual air-con units work harder
– Loss of control of machine room temperature
Solutions
– Sprinklers under units: successful, but banned due to Health and Safety concerns
– Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature

Operating systems
Grid services, batch workers, service machines
– SL3: mainly 3.0.3, 3.0.5, 4.2, all ix86
– SL4 before Xmas; considering x86_64
Disk storage
– SL4 migration in progress
Tape systems
– AIX: caches
– Solaris: controller
– SL3/4: CASTOR systems, newer caches
Oracle systems
– RHEL3/4
Batch system
– Torque/MAUI
– Fair-shares, allocation by the User Board
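As a rough illustration of what fair-shares mean in practice (this is not the RAL Torque/MAUI configuration, which the slide does not show): each VO's decayed recent usage is compared with its target share and the difference biases scheduling priority. The targets, usage history and decay factor below are invented for the example.

```python
# Toy illustration of fair-share scheduling priority (not the real
# Torque/MAUI configuration; targets, usage history and the decay
# factor are invented).

DECAY = 0.7                      # weight applied to each older usage window
targets = {"atlas": 0.40, "cms": 0.30, "lhcb": 0.20, "babar": 0.10}

# CPU usage per VO in recent windows, most recent first (arbitrary units)
usage_history = {
    "atlas": [500, 300, 200],
    "cms":   [100, 400, 300],
    "lhcb":  [300, 100, 100],
    "babar": [ 50,  20,  10],
}

def decayed_usage(history):
    return sum(u * DECAY**i for i, u in enumerate(history))

total = sum(decayed_usage(h) for h in usage_history.values())
for vo, hist in usage_history.items():
    share = decayed_usage(hist) / total
    # Positive delta boosts VOs below target, negative penalises those above.
    print(f"{vo:6s} used {share:5.1%} of recent CPU "
          f"(target {targets[vo]:.0%}) -> priority delta {targets[vo] - share:+.2f}")
```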

Databases
3D project
– Participating since the early days
  Single Oracle server for testing; successful
– Production service: 2 x Oracle RAC clusters
  Two servers per RAC
  » Redundant PSUs, hot-swap RAID1 system drives
  Single SATA/FC data array
  Some transfer rate issues
  UPS to come

Storage Resource Management
dCache
– Performance issues
  LAN performance very good
  WAN performance and tuning problems
– Stability issues
– Now better: increased the number of open file descriptors and the number of logins allowed
ADS
– In-house system, many years old
– Will remain for some legacy services
CASTOR2
– Will replace both the dCache disk and tape SRMs for major data services
– Will replace Tier-1 access to existing ADS services
– Pre-production service for CMS
– LSF for transfer scheduling
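Raising the open file descriptor limit, one of the dCache fixes mentioned, is ordinary OS tuning; a minimal sketch of inspecting and raising the per-process limit from Python (the target value is an assumption, not the value used at RAL):

```python
# Minimal sketch: inspect and, where permitted, raise the per-process
# open file descriptor limit, the kind of tuning referred to on the
# slide. The target value is an assumption.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current open-file limit: soft={soft}, hard={hard}")

TARGET = 16384   # hypothetical value for a busy dCache pool/door node
if soft < TARGET:
    # Without root, the soft limit can only be raised up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(TARGET, hard), hard))
    print("soft limit now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```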

Monitoring
Nagios
– Production service implemented
– 3 servers (1 master + 2 slaves)
– Almost all systems covered (600+)
– Replacing SURE
– Call-out facilities to be added
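For context, a Nagios check is just a small program that prints a status line and exits with a conventional code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN); a minimal sketch of such a check, with what it checks and the thresholds invented for the example:

```python
#!/usr/bin/env python3
# Minimal sketch of a Nagios-style check plugin: one status line plus
# the conventional exit code. The checked filesystem and thresholds are
# invented; the real RAL checks are not described on the slide.
import os, sys

WARN, CRIT = 0.80, 0.95          # hypothetical usage thresholds
PATH = "/"

try:
    st = os.statvfs(PATH)
    used = 1 - st.f_bavail / st.f_blocks
except OSError as err:
    print(f"DISK UNKNOWN - {err}")
    sys.exit(3)

if used >= CRIT:
    print(f"DISK CRITICAL - {PATH} {used:.0%} used")
    sys.exit(2)
elif used >= WARN:
    print(f"DISK WARNING - {PATH} {used:.0%} used")
    sys.exit(1)
print(f"DISK OK - {PATH} {used:.0%} used")
sys.exit(0)
```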

Networking
All systems have 1Gb/s connections
– Except the oldest fraction of the batch farm
10Gb/s links almost everywhere
– 10Gb/s backbone within the Tier-1
  Complete November 06
  Nortel 5530/5510 stacks
– 10Gb/s link to the RAL site backbone
  10Gb/s backbone links at RAL expected end November 06
  10Gb/s link to RAL Tier-2
– 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
  Expected in production by end of November 06
Firewall still an issue
– Planned bypass for Tier-1 data traffic as part of the RAL SJ5 and RAL backbone connectivity developments
– 10Gb/s OPN link to CERN active September 06
  Using a pre-production SJ5 circuit
  Production status at SJ5 handover

Security
Notified of an intrusion at Imperial College London
Searched logs
– Unauthorised use of an account from a suspect source
– Evidence of harvesting of password maps
– No attempt to conceal activity
– Unauthorised access to other sites
– No evidence of root compromise
Notified the sites concerned
– Incident widespread
Passwords changed
– All inactive accounts disabled
Cleanup
– Changed NIS to use a shadow password map
– Reinstalled all interactive systems
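The log search described amounts to scanning authentication logs for activity tied to the suspect account or source; a minimal sketch, with the log path, account name and address invented for the example:

```python
#!/usr/bin/env python3
# Minimal sketch of the kind of log search described on the slide:
# scan an SSH auth log for activity tied to a suspect account or source
# address. The log path, account name and IP are invented.
import re

LOG = "/var/log/secure"              # hypothetical syslog auth log
SUSPECT_USER = "someaccount"         # hypothetical compromised account
SUSPECT_ADDR = "192.0.2.10"          # hypothetical suspect source (TEST-NET)

pattern = re.compile(r"(Accepted|Failed) \S+ for (?P<user>\S+) from (?P<addr>\S+)")

with open(LOG, errors="replace") as fh:
    for line in fh:
        m = pattern.search(line)
        if m and (m.group("user") == SUSPECT_USER or m.group("addr") == SUSPECT_ADDR):
            print(line.rstrip())
```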

Questions?