RAL Tier1/A Report
HepSysMan - July 2004
Martin Bly / Andrew Sansum

Overview
– Hardware
– Network
– Experiences / Challenges
– Management issues

Tier1 in GRIDPP2 ( )
The Tier-1 Centre will provide GRIDPP2 with a large computing resource of a scale and quality that can be categorised as an LCG Regional Computing Centre.
January 2004 – GRIDPP2 confirmed RAL as the host for the Tier1 Service
– GRIDPP2 to commence September 2004
Tier1 hardware budget:
– £2.3M over 3 years
Staff:
– Increase from 12.1 to 16.5 by September

Current Tier1 Hardware
CPU
– 350 dual-processor Intel PIII and Xeon servers, mainly rack mounts
– About 400 KSI2K
– RedHat 7.3
– P2/450 tower units decommissioned April 2004
– RH7.2 and Solaris batch services to be phased out this year
Disk service – mainly "standard" configuration:
– Dual-processor server
– Dual-channel SCSI interconnect
– External IDE/SCSI RAID arrays (Accusys and Infortrend)
– ATA drives (mainly Maxtor)
– About 80TB disk
– Cheap and (fairly) cheerful
Tape service
– STK Powderhorn 9310 silo with 9940B drives

New Hardware
256 x dual Xeon
– 2GB RAM (32 with 4GB RAM), 120GB HDD, 1Gb NIC: 8 racks
20 disk servers, each with two 4TB IDE/SCSI arrays: 5 racks
– Infortrend EonStore A16U-G1A units, each with 16 x WD 250GB SATA HDD – 4TB/array raw capacity
– Servers: dual Xeon, 2GB RAM, dual 120GB SATA system disks, dual 1Gb/s NIC
– 160TB raw, ~140TB available (RAID5)
Delivered June 15th, now running commissioning tests
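The capacity figures hang together; a quick sanity check in Python (a sketch – the RAID5 overhead of one parity drive per 16-drive array is inferred, not stated on the slide):

```python
# Sanity-check the new disk farm capacity figures from the slide.
servers = 20            # disk servers
arrays_per_server = 2   # Infortrend EonStore A16U-G1A arrays per server
drives_per_array = 16   # WD 250GB SATA drives per array
drive_tb = 0.25         # 250GB per drive

raw_tb = servers * arrays_per_server * drives_per_array * drive_tb
print(f"raw: {raw_tb:.0f} TB")  # 160 TB, as stated

# ~140TB available is consistent with RAID5 losing one drive per
# 16-drive array to parity, plus some filesystem overhead.
usable_tb = raw_tb * (drives_per_array - 1) / drives_per_array
print(f"after RAID5 parity: {usable_tb:.0f} TB")  # 150 TB -> ~140 TB usable
```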

Next Procurement
Need in production by January 2005
– Original schedule of December delivery seems late
– Will have to start very soon
– Less chance for testing / new technology
Exact proportions not agreed, but …
– 400 KSI2K ( CPUs)
– 160TB disk
– 120TB tape??
– Network infrastructure?
– Core servers (H/A??)
– RedHat?
Long-range plan needs reviewing – also need long-range experiment requirements so as to plan environment updates.

CPU Capacity (chart)

Tier1 Disk Capacity, TB (chart)

High Impact Systems
Looking at replacement hardware for high-impact systems:
– /home/csf, /rutherford file systems
– MySQL servers
– AFS cell
– Front end / UI hosts
– Data movers
– NIS master, mail server
Replacing a mix of Solaris, Tru64 Unix and AIX servers with Linux – consolidation of expertise.
Migrate AFS to OpenAFS and then K5.

Network (diagram): site firewall and site router link the production subnet and test subnet to SuperJanet; servers and workers sit on a production VLAN, with a test VLAN for the test network (e.g. MBNG), alongside the site routable network and the rest of the site.

Network (diagram): the Tier1 network behind the site firewall and router, connected to SuperJanet; servers and workers on the production VLAN, the test network (e.g. MBNG) on the test VLAN, separate from the rest of the site.
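The production/test split in both diagrams rests on 802.1Q VLAN tagging. On the Linux hosts of that era it might be set up along these lines (a sketch only: the interface name, VLAN IDs and addresses are hypothetical, and the slides give no actual configuration):

```python
# Sketch: put a host on tagged production and test VLANs using the
# 2.4-kernel-era 'vconfig' tool. All values below are hypothetical
# examples, not the actual Tier1 configuration.
import subprocess

def add_vlan(iface: str, vlan_id: int, address: str) -> None:
    """Create an 802.1Q sub-interface (e.g. eth0.10) and bring it up."""
    subprocess.run(["vconfig", "add", iface, str(vlan_id)], check=True)
    subprocess.run(["ifconfig", f"{iface}.{vlan_id}", address, "up"], check=True)

add_vlan("eth0", 10, "192.168.10.5")   # production VLAN (hypothetical)
add_vlan("eth0", 20, "192.168.20.5")   # test VLAN (hypothetical)
```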

UKlight
Connection to RAL in September
Funded to end 2005, after which it probably merges with SuperJanet 5
2.5Gb/s now → 10Gb/s from 2006
Effectively a dedicated light path to CERN
Probably not for Tier1 production, but suitable for LCG data challenges etc, building experience for the SuperJanet upgrade.
UKLight -> Starlight
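For a sense of scale (back-of-envelope arithmetic, not from the slides), the headline rates translate to roughly:

```python
# Back-of-envelope: daily volume a fully used light path could move.
seconds_per_day = 86400
for gbps in (2.5, 10):
    tb_per_day = gbps * 1e9 / 8 * seconds_per_day / 1e12
    print(f"{gbps} Gb/s ~ {tb_per_day:.0f} TB/day")
# 2.5 Gb/s ~ 27 TB/day; 10 Gb/s ~ 108 TB/day
```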

Forthcoming Challenges
– Simplify service – less "duplication"
– Improve storage management
– Deploy new fabric management
– RedHat Enterprise 3 upgrade
– Network upgrade/reconfigure????
– Another procurement/install
– Meet challenge of LCG – professionalism
– LCG data challenges …

Clean up Spaghetti Diagram
How to phase out the "Classic" service..
Simplify interfaces: fewer GRIDs
"More is not always better"

Storage: Plus and Minus
– ATA and SATA drives: 2.5% failure per annum – OK
– External RAID arrays: good architecture, choose well
– SCSI interconnect: surprisingly unreliable – change
– Ext2 file system: OK, but need a journal – XFS?
– Linux O/S: move to Enterprise 3
– NFS/Xrootd/http/gridftp/bbftp/srb/…: must have SRM
– No SAN: need SAN (Fibre or iSCSI …)
– No management layer: need virtualisation/dCache..
– No HSM: ????
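The 2.5% annual failure rate is easier to appreciate against the size of the farm. A rough estimate, using the drive counts from the new-hardware slide (illustrative only, and ignoring the older ~80TB estate):

```python
# Rough estimate of expected drive failures at 2.5% per annum.
annual_failure_rate = 0.025
# New disk farm alone: 20 servers x 2 arrays x 16 drives
drives = 20 * 2 * 16
expected_failures = drives * annual_failure_rate
print(f"{drives} drives -> ~{expected_failures:.0f} failures/year")  # ~16
```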

Benchmarking
Work by George Prassas on various systems, including a 3ware/SATA RAID5 system
Tuning gains extra performance on RH variants
Performance of RHEL3 NFS servers and disk I/O not special despite tuning, compared with RH7.3
Considering buying the SPEC suite to benchmark everything.
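The kind of quick sequential-write check such RH7.3 vs RHEL3 comparisons start from might look like this (a minimal sketch, not the benchmark used in the study; the test file path is arbitrary):

```python
# Minimal sequential-write throughput check: write 1GB in 1MB blocks,
# fsync, and time it. Illustrative only.
import os, time

path = "/tmp/io_test.dat"   # arbitrary test location
block = b"\0" * (1 << 20)   # 1MB block
blocks = 1024               # 1GB total

start = time.time()
with open(path, "wb") as f:
    for _ in range(blocks):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())    # force data to disk so the timing is honest
elapsed = time.time() - start
print(f"{blocks / elapsed:.0f} MB/s sequential write")
os.remove(path)
```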

Fabric Management
Currently run:
– Kickstart – cascading config files, implementing PXE
– SURE exception monitoring
– Automate – automatic interventions
Running out of steam with old systems …
– "Only" 800 systems – but many, many flavours
– Evaluating Quattor – no obvious alternatives – probably deploy
– Less convinced by Lemon – bit early – running Nagios in parallel
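The "cascading config files" idea – site-wide defaults successively overridden by flavour- and host-specific settings – can be sketched in a few lines (the layer contents and the NTP hostname are hypothetical; the actual Kickstart setup is not shown in the slides):

```python
# Sketch of cascading configuration: later (more specific) layers
# override earlier (more general) ones. All values are hypothetical.
site_defaults = {"os": "RedHat 7.3", "ntp": "ntp.rl.ac.uk", "monitoring": "SURE"}
flavour_batch = {"os": "RHEL 3", "scheduler": "pbs"}
host_override = {"monitoring": "nagios"}

def cascade(*layers):
    """Merge config layers; later layers win on conflicting keys."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

print(cascade(site_defaults, flavour_batch, host_override))
# {'os': 'RHEL 3', 'ntp': 'ntp.rl.ac.uk', 'monitoring': 'nagios', 'scheduler': 'pbs'}
```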

Yum / Yumit
Kickstart scripts now use Yum to bootstrap systems to latest updates
Post-install config now uses Yum wherever possible for local additions
Yumit:
– Nodes use Yum to check their status every night and report to a central database
– Web interface to show farm status
– Easy to see which nodes need updating.
Machine ownership tagging, port monitoring project
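A minimal sketch of the Yumit idea: each node asks yum which packages are outdated and ships the list to a central collector. The endpoint URL and report fields below are hypothetical; Yumit's actual schema is not described in the slides.

```python
# Yumit-style nightly check (sketch): report pending updates centrally.
import json, socket, subprocess, urllib.request

# 'yum check-update' lists available updates, one package per line (-q
# suppresses the header); exit code 100 means updates are available.
result = subprocess.run(["yum", "-q", "check-update"],
                        capture_output=True, text=True)
pending = [line.split()[0] for line in result.stdout.splitlines() if line.strip()]

report = json.dumps({"host": socket.gethostname(), "pending": pending}).encode()
req = urllib.request.Request("http://yumit.example/report",  # hypothetical URL
                             data=report,
                             headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
```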

Futures
Storage architectures
– iSCSI, Fibre, dCache
– Need to be more sophisticated to allow reallocation of available space
CPUs
– Xeon, Opteron, Itanium, Intel 64-bit x86 architecture
Network
– Higher-speed interconnect, iSCSI

Conclusions
After several years of relative stability, we must start re-engineering many Tier1 components.
Must start to rationalise – support a limited set of interfaces, operating systems, testbeds … simplify so we can do less, better.
LCG becoming a big driver:
– Service commitments
– Increase resilience and availability
– Data challenges and move to steady state
Major reality check in 2007!