RAL Site Report HEPiX - Rome 3-5 April 2006 Martin Bly

3-5 April 2006 RAL Site Report - Rome
Overview: Intro, Storage, Batch, Oracle, Tape, Network, T1-T2 challenges

3-5 April 2006 RAL Site Report - Rome
RAL Tier1/A
Rutherford Appleton Laboratory hosts the UK LCG Tier-1
– Funded via the GridPP project from PPARC
– Supports LCG and UK Particle Physics users and collaborators
VOs:
– LCG: Atlas, CMS, LHCb, Alice, (dteam)
– Babar
– CDF, D0, H1, Zeus
– bio, esr, geant4, ilc, magic, minos, pheno, t2k
Experiments:
– Mice, SNO, UKQCD
Theory users …

3-5 April 2006 RAL Site Report - Rome
Storage I
New: 21 servers
– 5U, 24-disk chassis
– 24-port Areca PCI-X SATA II RAID controller
– 22 x 400GB Western Digital RE-series SATA II HDDs
– 8TB/server after RAID overheads (RAID 6)
– 168TB (10^12 bytes) total usable (arithmetic sketch below)
– Opteron server: Supermicro motherboard, 2 x Opteron CPUs, 4GB RAM, 2 x 250GB RAID 1 system disks, 2 x 1Gb/s NICs, redundant PSUs
– Delivered March 10
  Now in commissioning; issues with some RE drives
  Expected in service late May
– Running SL4.2, possibly SL4.3, with ext3 (issues with XFS tests)
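The quoted per-server and total capacities follow from the disk count and RAID level. A minimal sketch of that arithmetic, assuming the 22 drives per server form a single RAID 6 set (two drives' worth of parity) and using decimal TB as on the slide:

# Capacity arithmetic for the new storage generation (decimal units, 1 TB = 10^12 bytes).
# Assumption: the 22 drives per server form one RAID 6 set (2 drives' worth of parity).
drives_per_server = 22
drive_tb = 0.4                  # 400 GB Western Digital RE SATA II
raid6_parity_drives = 2
servers = 21

usable_per_server_tb = (drives_per_server - raid6_parity_drives) * drive_tb
total_usable_tb = servers * usable_per_server_tb

print(f"Usable per server: {usable_per_server_tb:.1f} TB")   # ~8.0 TB, as quoted
print(f"Total usable:      {total_usable_tb:.0f} TB")        # ~168 TB, as quoted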

3-5 April 2006 RAL Site Report - Rome
Storage II
Existing: SCSI/PATA and SCSI/SATA
– 35TB of 1st-year storage, now 4 years old
  Spares difficult to obtain
  Array PSUs now considered a safety hazard
  To be decommissioned when new capacity is ready (unless the power fails first!)
– ~40TB of 2nd-year storage out of maintenance
  Obtaining spares for continued operation
– ~160TB of 3rd-year storage
  Stable operation, ~20 months old
– Migration of 2nd- and 3rd-year servers to SL4 in May/June

3-5 April 2006 RAL Site Report - Rome
Batch capacity
New: 200+ kSI2K delivered March 10 (rough per-node sketch below)
– Tyan 1U chassis/motherboard (S2882)
– Twin dual-core Opteron 270s
– 1GB RAM/core (4GB/chassis)
– 250GB SATA HDD
– Dual 1Gb NIC
– In commissioning: 4-week load test
  Expected to enter service late April / early May
Existing: 800 kSI2K
– Some now 4 years old and still doing well
  Occasional disk and RAM failures; 2nd-year units more prone to failures
– All running SL3.0.3/i386 with security patches
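A rough sketch of the per-node numbers behind the new purchase. Cores and RAM per chassis come from the bullets above; the per-node kSI2K rating is purely illustrative, since the slide only quotes the 200+ kSI2K aggregate:

# Per-node resources of the new batch nodes and a rough node-count estimate.
# ASSUMPTION: the ~5 kSI2K/node figure is illustrative only; the slide quotes
# just the 200+ kSI2K aggregate for the whole delivery.
sockets_per_node = 2            # twin Opteron 270
cores_per_socket = 2            # dual-core
ram_per_core_gb = 1

cores_per_node = sockets_per_node * cores_per_socket      # 4 cores/chassis
ram_per_node_gb = cores_per_node * ram_per_core_gb        # 4 GB/chassis, as quoted

assumed_ksi2k_per_node = 5.0    # hypothetical rating, NOT from the slide
estimated_nodes = 200 / assumed_ksi2k_per_node            # ~40 nodes under this assumption

print(cores_per_node, ram_per_node_gb, estimated_nodes)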

3-5 April 2006 RAL Site Report - Rome
Oracle Services
2004/5:
– Test3D project database machine: a re-tasked batch worker, the lone Oracle instance within the Tier 1 (RHEL 3)
– FTS backend database added
– Two further machines for SRB testing (RHEL 3)
2006:
– Load on the Test3D system from FTS very high during transfer throughput tests to T2s
  Impacted both FTS and Test3D
  Migrate the FTS database to a dedicated machine (RHEL 4 U3)
– New specialist hardware for 3D production systems (see the connect-descriptor sketch below):
  4 servers (2 x RAC) + FC/SATA array (RAID 10)
  RHEL 4 U3, ASM
  In commissioning
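For a two-node RAC like the new 3D systems, clients typically reach the database through an Oracle Net connect descriptor that load-balances and fails over across the instances. A minimal sketch, built as a plain string so no client library is assumed; the hostnames and service name are hypothetical, not taken from the slide:

# Hypothetical Oracle Net connect descriptor for a two-node RAC.
# Hostnames and service name are placeholders, not real RAL systems.
def rac_descriptor(hosts, service, port=1521):
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={h})(PORT={port}))" for h in hosts
    )
    return (
        "(DESCRIPTION="
        "(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on)" + addresses + ")"
        f"(CONNECT_DATA=(SERVICE_NAME={service})"
        "(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))))"
    )

# Example: two RAC nodes serving a 3D replication service (names are illustrative).
print(rac_descriptor(["rac-node1.example.ac.uk", "rac-node2.example.ac.uk"], "lcg3d"))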

3-5 April 2006 RAL Site Report - Rome
Services
LCG UI and traditional front-end services migrated to ‘new’ racked hosts
– Faster, 4GB RAM, 1Gb/s NIC
– DNS quintet with short TTL (round-robin sketch below)
– Additional system specifically for CMS; needs a service certificate for PhEDEx
Migration of NIS, mail etc. from older tower systems
Nagios monitoring project
– Replace SURE
– Rollout April/May
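A "DNS quintet with short TTL" means the front-end alias resolves to several hosts and clients re-resolve frequently, so a failed host can be removed from the set quickly. A minimal sketch of what a client sees; the alias name is hypothetical:

# Resolve a load-balanced front-end alias and show the set of A records behind it.
# 'ui.example.ac.uk' is a placeholder; the real alias name is not on the slide.
import socket

def frontend_hosts(alias="ui.example.ac.uk"):
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list);
    # with a round-robin "quintet", ip_address_list would contain five addresses.
    name, aliases, addresses = socket.gethostbyname_ex(alias)
    return addresses

# A short TTL on these records lets operators drop a broken host from the set
# and have clients pick up the change within minutes.
print(frontend_hosts())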

3-5 April 2006 RAL Site Report - Rome
Tape Store
New:
– 6K-slot StorageTek SL8500 robot
  Delivered December 2005; entered production service 28th March 2006
– 10 x 9940B drives in service + 5 on loan
– 10 x T10000 drives + 2 EPE drives on evaluation
– 1000 x T10K media, 500GB each (native capacity); capacity sketch below
– Expand to 10K slots in summer 2006
– ADS caches: 4 servers, 20TB total
– Castor2 caches: 4 servers, 20TB total
– New ADS file catalogue server: more room for expansion and load sharing if the Castor2 implementation is delayed
Existing:
– 6K-slot StorageTek Powderhorn silo to be decommissioned after expansion of the new robot to 10K slots
– All tapes now in the new robot
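The headline numbers above imply the following raw capacities. A quick sketch; the 9940B cartridge capacity is not quoted on the slide, so only the new T10K media and the disk caches are totalled:

# Native tape and cache capacity implied by the Tape Store slide (decimal units).
t10k_cartridges = 1000
t10k_native_tb = 0.5            # 500 GB native per T10K cartridge, as quoted

tape_native_tb = t10k_cartridges * t10k_native_tb    # 500 TB of new T10K media

ads_cache_tb = 20               # 4 servers
castor2_cache_tb = 20           # 4 servers
disk_cache_tb = ads_cache_tb + castor2_cache_tb      # 40 TB of front-end disk cache

print(f"T10K native capacity: {tape_native_tb:.0f} TB")
print(f"Disk cache in front of the robot: {disk_cache_tb} TB")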

3-5 April 2006 RAL Site Report - Rome
Castor2
– 4 disk servers
– 10 tape servers
– 6 services machines
– Test units

3-5 April 2006 RAL Site Report - Rome
Schedule for CASTOR 2 deployment (evolving)
Mar-Apr 06
– Testing: functionality, interoperability and database stressing
May-Sep
– Spec and deploy hardware for the production database
– May: internal throughput testing with Tier 1 disk servers
– Jun: CERN Service Challenge throughput testing
– Jul-Sep: create full production infrastructure; full deployment on the Tier 1
Sep-Nov
– Spec and deploy second-phase production hardware to provide the full required capacity
Apr 07
– Startup of LHC using CASTOR at RAL

3-5 April 2006 RAL Site Report - Rome
Network developments
10GE backbone for the Tier 1 LAN
– Nortel 5530s with SR XFPs
  Partially complete, some 4 x 1Gb/s remaining; full backbone in May
– Looking at potential central 10GE switch solutions
10GE link to UKLight
– CERN 4 x 1Gb/s, Lancaster (Tier 2) 1Gb/s
– Expect CERN at 10Gb/s in summer 2006
Tier 1 link to the RAL site backbone: 2 x 1Gb/s
– Expect a 10GE site backbone late spring 2006; the Tier 1 will get a 10GE link thereafter
RAL link to SJ4: 1Gb/s
– Expect upgrade to 10Gb/s during autumn 2006
  High priority in the SJ5 rollout programme
  10Gb/s will be a problem!

3-5 April 2006
RAL Tier1 Network Connectivity, March 2006
[Network diagram: UKLight router with 4 x 1Gb/s to CERN and 1Gb/s to Lancaster; Tier 1 router and 5530 switch stacks serving CPUs + disks, ADS caches and Oracle RACs over 10Gb/s and 4 x 1Gb/s links; 2 x 1Gb/s to the RAL site network, firewall and 1Gb/s link to SJ4; 10Gb/s to the RAL Tier 2]

3-5 April 2006 RAL Site Report - Rome
GridPP service challenges
Programme of tests of the UK GridPP infrastructure
– Aim: stress-test the components by exercising both the T1 and T2 hardware and software with extended data-throughput runs
– 3 x 48-hour tests
– Opportunity to demonstrate the UK T2s and the T1 working together
First test: trial at high data rate (100MB/s) to multiple T2s using the production SJ4 link (1Gb/s)
– Severe stress on the RAL site network link to SJ4, with the T1 using ~95% of the bandwidth (see the unit-conversion sketch below)
  Reports of site services dropping out: VC (video conferencing), link to Daresbury Lab (corporate Exchange systems etc.)
  Firewall unable to sustain the load: multiple dropouts at unpredictable intervals
  Test abandoned
  Site combined traffic now throttled to 800Mb/s at the firewall; firewall vendors are working on the issue, but a retest is not possible before the Tier 1 starts SC4 work
Second test: sustained 180MB/s from the T1 out to T2s
– 100MB/s on UKLight to Lancaster
– 70-80MB/s combined on SJ4 to other sites
Third test: sustained 180MB/s combined from multiple T2s in to the T1
– Problems with FTS and the Oracle database backend limited the rates achieved
Overall: success
– The T1 and several T2s worked in coordination to ship data around the UK
– Uncovered several weak spots in hardware and software
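The figures above mix MB/s (data rates) and Mb/s (link speeds). A small conversion sketch showing why 100MB/s of transfers plus normal site traffic saturates a 1Gb/s link, and what the 800Mb/s firewall throttle allows; the protocol-overhead factor is an assumption, not a slide figure:

# Convert the challenge data rates (MB/s) into utilisation of the 1 Gb/s SJ4 link.
# ASSUMPTION: ~5% protocol overhead for TCP/IP/Ethernet framing.
def link_utilisation(rate_mb_per_s, link_gb_per_s=1.0, overhead=1.05):
    bits_per_s = rate_mb_per_s * 8e6 * overhead     # payload MB/s -> bits/s on the wire
    return bits_per_s / (link_gb_per_s * 1e9)

print(f"100 MB/s on a 1 Gb/s link: {link_utilisation(100):.0%}")   # ~84% before other site traffic
print(f"180 MB/s needs at least:   {180 * 8 / 1000:.2f} Gb/s")     # ~1.44 Gb/s, hence multiple links
print(f"800 Mb/s throttle allows:  {800 / 8:.0f} MB/s of raw traffic at most")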

3-5 April 2006 RAL Site Report - Rome
GridPP challenges: throughput plots