RAL Site Report
Martin Bly
SLAC – 11-13 October 2005

Overview
– Intro
– Hardware
– OS/Software
– Services
– Issues

RAL T1
Rutherford Appleton Lab hosts the UK LCG Tier-1
– Funded via the GridPP project from PPARC
– Supports LCG and UK Particle Physics users
VOs:
– LCG: Atlas, CMS, LHCb, (Alice), dteam
– Babar
– CDF, D0, H1, Zeus
– Bio, Pheno
Expts:
– Minos, Mice, SNO, UKQCD
Theory users
…

Tier 1 Hardware
~950 CPUs in batch service
– 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
– 1.0GHz systems retiring as they fail, phase out end Oct '05
– New procurement
  Aiming for SPECint2000/CPU
  Systems for testing as part of evaluation of tender
  First delivery early '06, second delivery in April/May '06
~40 systems for services (FEs, RB, CE, LCG servers, loggers etc)
60+ disk servers
– Mostly SCSI-attached IDE or SATA, ~220TB unformatted
– New procurement: probably PCI/SATA solution
Tape robot
– 6K slots, 1.2PB, 10 drives

Tape Robot / Data Store
Current data: 300TB, PP -> 200+TB (110TB Babar)
Castor 1 system trials
– Many CERN-specifics
HSM (Hierarchical Storage Manager)
– 500TB, DMF (Data Management Facility)
  SCSI/FC
  Real file system
  Data migrates to tape after inactivity
  Not for PP data
– Due November '05
Procurement for a new robot underway
– 3PB, ~10 tape drives
– Expect to order end Oct '05
– Delivery December '05
– In service by March '06 (for SC4)
– Castor system

Networking
Tier-1 backbone at 4x1Gb/s
– Upgrading some links to 10Gb/s
  Multi-port 10Gb/s layer-2 switch stack as hub when available
1Gb/s production link Tier-1 to RAL site
1Gb/s link to SJ4 (internet)
– 1Gb/s HW firewall
Upgrade of site backbone to 10Gb/s expected late '05 / early '06
– Link Tier-1 to site at 10Gb/s – possible mid-2006
– Link site to 10Gb/s – mid '06
Site firewall remains an issue – limit 4Gb/s
2x1Gb/s link to UKLight
– Separate development network in UK
– Links to 2Gb/s, 1Gb/s (pending)
– Managed ~90MB/s during SC2, less since (see the note below)
  Problems with small packet loss causing traffic limitations
– Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
– UKLight link to CERN 4Gb/s for early '06
– Over-running hardware upgrade (4 days expanded to 7 weeks)
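As a back-of-the-envelope check on the SC2 figure above (a sketch only; the 2x1Gb/s capacity is taken from the slide, the unit conversion is standard and ignores protocol overheads):

```python
# Rough check: what fraction of the 2 x 1Gb/s UKLight capacity does the
# quoted ~90 MB/s SC2 rate correspond to?  (Decimal units assumed.)
rate_MBps = 90                      # observed SC2 transfer rate, MB/s
rate_Gbps = rate_MBps * 8 / 1000    # -> 0.72 Gb/s
capacity_Gbps = 2 * 1.0             # 2 x 1Gb/s UKLight links

print(f"{rate_Gbps:.2f} Gb/s of {capacity_Gbps:.0f} Gb/s "
      f"({rate_Gbps / capacity_Gbps:.0%} of aggregate capacity)")
# -> 0.72 Gb/s of 2 Gb/s (36% of aggregate capacity)
```

So even at the best observed rate, the dedicated links were running well below capacity, consistent with the packet-loss limitations noted above.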

Tier1 Network Core – SC3
[Network diagram: SC3 core switches 7i-1 and 7i-3, Router A and the UKLight Router, with ADS caches, dCache pools, GridFTP servers and non-SC hosts attached; link capacities shown include 4x1Gb/s internal trunks, 2x1Gb/s to CERN via UKLight, 290Mb/s to Lancaster, N x 1Gb/s to hosts, and 1Gb/s to SJ4 through the site firewall.]

OS/Software
Main services:
– Batch, FEs, CE, RB…: SL3 (3.0.3, 3.0.4, 3.0.5)
  LCG 2_6_0
  Torque/MAUI, 1 job/CPU
– Disk: RH72 custom, RH73 custom
– Some internal services on SL4 (loggers)
– Project to use SL4.n for disk servers underway
Solaris disk servers decommissioned
– Most hardware sold
AFS on AIX
– Transarc
– Project to move to Linux (SL3/4)

Services (1) – Objyserv
Objyserv database service (Babar)
– Old service on a traditional NFS server
  Custom NFS, heavily loaded, unable to cope with increased activity on the batch farm due to threading issues in the server
  An additional server using the same technology was not a tenable solution
– New service:
  Twin ams-based servers, 2 CPUs, HT on, 2GB RAM
  SL3, RAID1 data disks
  4 servers per host system
  – Internal redirection using iptables to different server ports depending on which of the 4 IP addresses is used to make the connection (see the sketch below)
  Able to cope with some ease: 600+ clients
Contact: Chris Brew
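The slide does not show the actual rules, but the redirection scheme described above can be illustrated with a minimal sketch like the one below; the IP aliases, client port and per-instance ports are all hypothetical placeholders, not the real RAL values.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: map each of four service IP aliases on one host
to a different local server port using iptables REDIRECT rules.
All addresses and ports below are hypothetical."""

# Clients all connect to the same well-known port (1234 here) on one of
# four IP aliases; iptables redirects each alias to a different local
# server instance listening on its own port.
ALIASES = {
    "192.0.2.11": 5001,
    "192.0.2.12": 5002,
    "192.0.2.13": 5003,
    "192.0.2.14": 5004,
}
CLIENT_PORT = 1234  # port the clients believe they are connecting to


def iptables_rules(aliases, client_port):
    """Return the iptables commands that implement the redirection."""
    rules = []
    for ip, local_port in sorted(aliases.items()):
        rules.append(
            "iptables -t nat -A PREROUTING -p tcp "
            f"-d {ip} --dport {client_port} "
            f"-j REDIRECT --to-ports {local_port}"
        )
    return rules


if __name__ == "__main__":
    # Print the rules rather than applying them, so the sketch is safe to run.
    for rule in iptables_rules(ALIASES, CLIENT_PORT):
        print(rule)
```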

Services (2) – Home file system
Home file system migration
– Old system: ~85GB on an A1000 RAID array
  Sun Ultra10, Solaris 2.6, 100Mb/s NIC
  Failed to cope with some forms of pathological use
– New system: ~270GB SCSI RAID5, 6-disk chassis
  2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
  SL3, ext3
  Stable under I/O and quota testing, and during backup
– Migration: 3 weeks of planning
  1 week of nightly rsync followed by checksumming (a verification sketch follows this slide)
  – Convince ourselves the rsync works
  1 day farm shutdown to migrate
  A single file detected with a checksum error
– Quotas for users unchanged…
– Old system kept on standby to restore its backups
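The checksumming step is not spelled out on the slide; a minimal sketch of the kind of verification implied, comparing MD5 sums of every file in the old and new trees, could look like the following (the OLD/NEW mount points are hypothetical placeholders):

```python
#!/usr/bin/env python3
"""Minimal sketch of post-rsync verification: walk the old home file system
and check that every file has an identical MD5 checksum in the new copy.
The OLD/NEW paths are hypothetical, not the real RAL mount points."""

import hashlib
import os

OLD = "/mnt/old_home"   # hypothetical old Solaris-served home area
NEW = "/mnt/new_home"   # hypothetical new SL3/ext3 home area


def md5sum(path, blocksize=1 << 20):
    """MD5 of a file, read in 1MB blocks to keep memory use small."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(old_root, new_root):
    """Yield (relative_path, reason) for every file that fails verification."""
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if not os.path.isfile(new_path):
                yield rel, "missing in new copy"
            elif md5sum(old_path) != md5sum(new_path):
                yield rel, "checksum mismatch"


if __name__ == "__main__":
    problems = list(verify(OLD, NEW))
    for rel, reason in problems:
        print(f"{rel}: {reason}")
    print(f"{len(problems)} problem file(s) found")
```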

Services (3) – Batch Server
Catastrophic disk failure late on a Saturday evening over a holiday weekend
– Staff not expected back until 8:30am Wednesday
Problem noted Tuesday morning
– Initial inspection: the disk was a total failure
– No easy access to backups
  The backup tape numbers were in logs on the failed disk!
– No easy recovery solution with no other system staff available
– Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper etc.
  But no accounting data, and no new jobs started
Wednesday:
– Hardware 'revised' with two disks, software RAID1, clean install of SL3
– Backups located, batch/scheduling configs recovered from the tape store
– System restarted with MAUI off to allow Torque to sort itself out
  Queues came up closed
– MAUI restarted
– Service picked up smoothly
Lessons:
– Know where the backups are and how to identify which tapes are the right ones
– Unmodified batch workers are not good enough for system services

Issues
How to run resilient services on non-resilient hardware?
– Committed to run 24x365, 98%+ uptime
– Modified batch workers with extra disks and HS caddies as servers
– Investigating HA-Linux
  Batch server and scheduling experiments positive
  RB, CE, BDII, R-GMA…
– Databases
Building services maintenance
– Aircon, power
  Already two substantial shutdowns in 2006
  New building
UKLight is a development project network
– There have been problems with managing expectations for production services on a development network
Unresolved packet loss in CERN-RAL transfers
– Under investigation
10Gb/s kit expensive
– Components we would like are not yet affordable/available
– Pushing against the LCG turn-on date

Questions?