RAL Tier 1 Site Report
HEPSysMan – RAL – May 2006
Martin Bly

Overview
– Intro
– Storage
– Batch
– Oracle
– Tape
– Network
– T1–T2 challenges

RAL Tier1/A
Rutherford Appleton Lab hosts the UK LCG Tier-1
– Funded via the GridPP project from PPARC
– Supports LCG and UK Particle Physics users and collaborators
VOs:
– LCG: ATLAS, CMS, LHCb, ALICE, (dteam)
– BaBar
– CDF, D0, H1, ZEUS
– bio, esr, geant4, ilc, magic, minos, pheno, t2k, fusion, cedar
Experiments:
– MICE, SNO, UKQCD
Theory users …

Storage I
New: 21 servers
– 5U, 24-disk chassis
– 24-port Areca PCI-X SATA II RAID controller
– 22 x 400GB Western Digital RE-series SATA II HDDs
– 8TB/server after RAID overheads (RAID 6)
– 168TB (10^12) total usable, ~151TB (2^40) after filesystem overheads (see the sketch below)
– Opteron server: Supermicro motherboard, 2 x Opteron … GHz, 4GB RAM, 2 x 250GB RAID 1 system disks, 2 x 1Gb/s NICs, redundant PSUs
– Delivered March 10
Now in commissioning
– Issues with some RE drives and Areca firmware: the February batch of 400GB units was 'different enough'
– Expected in service late July
– Running SL4.2, possibly SL4.3, with ext3; issues with XFS under i386 and with middleware
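A back-of-envelope check of the capacity figures above, as a minimal Python sketch. All inputs are from the slide; the decimal-to-binary conversion is the only arithmetic added, and the final ext3 overhead is noted in a comment rather than modelled.

```python
# Capacity arithmetic for the new storage: 21 servers, each with 22 x 400GB
# data disks in RAID 6 (two disks' worth of parity per array).
SERVERS = 21
DATA_DISKS = 22
DISK_GB = 400
RAID6_PARITY = 2

per_server_tb = (DATA_DISKS - RAID6_PARITY) * DISK_GB / 1000   # decimal TB
total_tb = SERVERS * per_server_tb                             # 10^12 bytes
total_tib = total_tb * 1e12 / 2**40                            # 2^40 bytes
print(f"{per_server_tb:.0f} TB/server, {total_tb:.0f} TB total, "
      f"~{total_tib:.0f} TiB raw")
# -> 8 TB/server and 168 TB total; ~153 TiB, which ext3/filesystem
#    overheads reduce to roughly the ~151 TB (2^40) quoted above.
```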

Storage II
Existing: SCSI/PATA and SCSI/SATA
– ~35TB of 1st-year storage, now 4 years old
  – Spares difficult to obtain
  – Array PSUs now considered a safety hazard
  – To be decommissioned when the new capacity is ready – unless the power fails first!
– ~40TB of 2nd-year storage out of maintenance
  – Obtaining spares for continued operation
– ~160TB of 3rd-year storage
  – Stable operation, ~20 months old
– Migration of 2nd- and 3rd-year servers to SL4 in May/June

Batch capacity
New: 266 kSI2K, delivered March 10 (see the sizing sketch below)
– Tyan 1U chassis/motherboard (S2882)
– Twin dual-core Opteron 270s
– 1GB RAM/core (4GB/chassis)
– 250GB SATA HDD
– Dual 1Gb NICs
– Commissioning tests 1st April to 3rd May; 1 failure (a motherboard) before commissioning started
– Entered service 4th May
– Noisy!
Existing: ~800 kSI2K
– Some now 4 years old and still doing well; occasional disk and RAM failures
– 2nd-year units more prone to failures
– All running SL3.0.3/i386 with security patches
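How a kSI2K figure maps onto boxes, as a hedged sketch: the 266 kSI2K total and the 4 cores per chassis are from the slide, but the per-core SPECint2000 rating below is an assumed, illustrative value for a ~2GHz Opteron core, not a number from the report.

```python
# Rough sizing of the new batch purchase. KSI2K_PER_CORE is an assumption
# for illustration only; the slide quotes just the 266 kSI2K total.
TOTAL_KSI2K = 266
CORES_PER_NODE = 4        # twin dual-core Opteron 270s per 1U chassis
KSI2K_PER_CORE = 1.4      # assumed per-core SPECint2000/1000 rating

nodes = TOTAL_KSI2K / (CORES_PER_NODE * KSI2K_PER_CORE)
print(f"~{nodes:.0f} chassis at {CORES_PER_NODE * KSI2K_PER_CORE:.1f} "
      f"kSI2K each, 1GB RAM per core")
# -> roughly 48 x 1U chassis under this assumed rating
```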

Oracle Services
2004/5:
– Test3D project database machine: a re-tasked batch worker, the lone Oracle instance within the Tier 1 (RHEL 3)
– FTS backend database added
2005:
– Two further machines for SRB/Castor testing (RHEL 3)
2006:
– Load on the Test3D system from FTS very high during transfer throughput tests to T2s, impacting both FTS and Test3D; the FTS database was migrated to a dedicated machine (RHEL 4 U3) – see the probe sketch below
– New specialist hardware for the 3D production systems: 4 servers (2 x RAC) + FC/SATA array (RAID 10) and a SAN switch; RHEL 4 U3, ASM
– Commissioning under way – problems with the Oracle install
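The FTS migration above was driven by database load; below is a minimal latency probe of the kind one might run against the shared and the dedicated hosts to compare them. The account and DSNs are hypothetical, and it assumes the cx_Oracle client library is installed.

```python
# Hypothetical probe: mean round-trip time of a trivial query, useful for
# comparing a loaded shared instance with a dedicated one.
import time
import cx_Oracle  # Oracle client bindings for Python

def probe(dsn, user="fts_mon", password="secret", n=100):
    """Return the mean execution time of a no-op query, in seconds."""
    conn = cx_Oracle.connect(user, password, dsn)   # hypothetical account
    cur = conn.cursor()
    t0 = time.time()
    for _ in range(n):
        cur.execute("SELECT 1 FROM dual")
        cur.fetchone()
    conn.close()
    return (time.time() - t0) / n

for dsn in ("test3d.example.ac.uk/T3D", "ftsdb.example.ac.uk/FTS"):
    print(f"{dsn}: {probe(dsn) * 1e3:.1f} ms per query")
```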

Services
LCG UI and traditional front-end services migrated to 'new' racked hosts
– Faster: 4GB RAM, 1Gb/s NIC
– DNS quintet with a short TTL (see the lookup sketch below)
– Additional system specifically for CMS: needs a service certificate for PhEDEx
Migration of NIS, mail etc. from old tower systems to rackmount units
Nagios monitoring project
– Replaces SURE
– Rollout to the batch systems complete
– Some other service systems covered too
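A short sketch of what the "DNS quintet with a short TTL" gives clients: five A records behind one alias, rotated between lookups so new sessions spread across the hosts. The hostname here is hypothetical.

```python
# Enumerate the A records behind a round-robin alias. With a short TTL,
# repeated lookups cycle through the record set, balancing new sessions.
import socket

ALIAS = "ui.example.ac.uk"   # hypothetical round-robin front-end alias

addrs = {info[4][0] for info in
         socket.getaddrinfo(ALIAS, 22, socket.AF_INET, socket.SOCK_STREAM)}
print(f"{ALIAS} -> {len(addrs)} A records: {sorted(addrs)}")
```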

Tape Store
New:
– 6K-slot StorageTek SL8500 robot, delivered December 2005; entered production service 28th March 2006
– 10 x 9940B drives in service + 5 on loan
– 10 x T10000 drives + 2 EPE drives on evaluation
– 1000 x T10K media, 500GB each native capacity (see the arithmetic below)
– Expand to 10K slots in summer 2006
– ADS caches: 4 servers, 20TB total
– Castor2 caches: 4 servers, 20TB total
– New ADS file catalogue server: more room for expansion, and load sharing if the Castor2 implementation is delayed
Existing:
– 6K-slot StorageTek Powderhorn silo, to be decommissioned after expansion of the new robot to 10K slots
– All tapes are now in the new robot
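Back-of-envelope media capacity for the new robot, using only figures from the slide.

```python
# Native (uncompressed) capacity of the initial T10K media order, and the
# headroom left in the library before the planned slot expansion.
CARTRIDGES = 1000
GB_PER_CARTRIDGE = 500       # native capacity per T10K cartridge
SLOTS_NOW, SLOTS_PLANNED = 6000, 10000

print(f"{CARTRIDGES * GB_PER_CARTRIDGE / 1000:.0f} TB native on T10K media")
print(f"{SLOTS_NOW - CARTRIDGES} free slots now, "
      f"{SLOTS_PLANNED - CARTRIDGES} after the summer 2006 expansion")
# -> 500 TB native; ~5000 free slots now, less the 9940 media also housed
#    since all tapes moved into the new robot.
```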


Castor2
– 4 disk servers
– 10 tape servers
– 6 services machines
– Test units

Schedule for CASTOR2 deployment (evolving)
Mar–Apr 06:
– Testing: functionality, interoperability and database stressing
May–Sep 06:
– Spec and deploy hardware for the production database
– May: internal throughput testing with Tier 1 disk servers
– Jun: CERN Service Challenge throughput testing
– Jul–Sep: create the full production infrastructure; full deployment on the Tier 1
Sep–Nov 06:
– Spec and deploy second-phase production hardware to provide the full required capacity
Apr 07:
– Startup of the LHC, using CASTOR at RAL

Network developments
10GE backbone for the Tier 1 LAN
– Nortel 5530s with SR XFPs
– Partially complete, some 4 x 1Gb/s links remaining; more in May/June
– Looking at potential central 10GE switch solutions; currently stacked 5530s
10GE link to UKLight
– 10GE link to the UKLight router
– UK–CERN link 4 x 1Gb/s; Lancaster Tier 2 at 1Gb/s
– Expect CERN at 10Gb/s in summer 2006
10GE link, Tier 1 to RAL site backbone
– Installed 9th May
– Expect a 10GE site backbone late spring 2006
RAL link to SJ4: 1Gb/s
– Expect the link to go to 10Gb/s during autumn 2006; high priority in the SJ5 rollout programme
– 10Gb/s will be a problem!

Tier 1 Network Connectivity, May 2006
[Diagram: stacked Nortel 5530s link the CPU and disk racks, ADS caches, Oracle RACs and gftp/dCache servers; 10Gb/s to the UKLight router, with 4 x 1Gb/s onward to CERN and 1Gb/s to Lancaster; 1Gb/s through the site firewall to SJ4; 10Gb/s from the Tier 1 to the RAL site backbone and RAL Tier 2.]

GridPP service challenges
A programme of tests of the UK GridPP infrastructure
– Aim: stress-test the components by exercising both the T1 and T2 hardware and software with extended data-throughput runs
– 3 x 48-hour tests
– An opportunity to demonstrate UK T2s and the T1 working together
First test: trial at a high data rate (100MB/s) to multiple T2s using the production SJ4 link (1Gb/s) – see the bandwidth arithmetic below
– Severe stress on the RAL site network link to SJ4, with the T1 using 95% of the bandwidth
– Reports of site services dropping out: video conferencing, the link to Daresbury Lab (corporate Exchange systems etc.)
– The firewall was unable to sustain the load, with multiple dropouts at unpredictable intervals; the test was abandoned
– Site combined traffic is now throttled to 800Mb/s at the firewall; the firewall vendors are working on the issue, but a retest is not possible before the Tier 1 starts SC4 work
Second test: sustained 180MB/s from the T1 out to the T2s
– 100MB/s on UKLight to Lancaster
– 70–80MB/s combined on SJ4 to the other sites
Third test: sustained 180MB/s combined from multiple T2s in to the T1
– Problems with FTS and the Oracle database backend limited the rates achieved
Overall: a success
– The T1 and several T2s worked in coordination to ship data around the UK
– Uncovered several weak spots in hardware and software
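Why 100MB/s was enough to saturate the site link: a small sketch converting disk-style MB/s into network Gb/s, using only the figures quoted above.

```python
# Disk-style throughput (MB/s) vs network link capacity (Gb/s).
def utilisation(mb_per_s, link_gbit):
    """Fraction of a link consumed by a given payload rate."""
    return mb_per_s * 8 / (link_gbit * 1000)

print(f"100 MB/s on the 1 Gb/s SJ4 link: {utilisation(100, 1):.0%}")
# -> 80% payload; protocol overheads plus normal site traffic account for
#    the ~95% utilisation (and firewall dropouts) reported above.
print(f"180 MB/s needs {180 * 8 / 1000:.2f} Gb/s")
# -> 1.44 Gb/s, hence the 10GE UKLight path for the later tests.
```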

GridPP challenges – throughput plots
[Throughput plots not reproduced in the transcript.]

Miscellany
Noise
– The lower machine room is now a 'hearing protection zone'
– Use of ear defenders is mandatory
– Tours are difficult
Building
– A new computer building is being designed
– Available sometime in 2009?