Oxford STEP09 Report, Ewan MacMahon / Pete Gronbech, HEPSYSMAN, RAL, 2nd July 2009.

Oxford STEP09 Report Ewan MacMahon / Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report 2 Storage access. Our main lesson from STEP09 was the big change in ATLAS' use of storage, from lots of small files to fewer large ones. This moved the SE bottleneck from authentication on the head node to bandwidth on the disk pools. We implemented network channel bonding on the pool servers, to immediate effect. We had five disk pools taking most of the load; with more servers we should get higher overall bandwidth.
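For illustration, a minimal sketch of the kind of channel-bonding setup used on Scientific Linux era disk servers; the interface names, bonding mode and address below are assumptions for the example, not a record of the Oxford configuration.

    # /etc/modprobe.conf (assumed SL4/SL5-style setup)
    alias bond0 bonding
    options bond0 mode=balance-alb miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.0.10      # placeholder address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly for ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

With two gigabit slaves per pool server, a mode like balance-alb can roughly double the aggregate bandwidth seen by many worker nodes reading at once, which is consistent with the immediate effect described above.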

Oxford STEP09 Report 3 Some pool nodes not stressed. Some smaller pool nodes were marked read-only and hence held no STEP09 data.
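As a hedged aside on how pools get into that state: with the standard DPM admin tools a filesystem's status can be inspected and changed roughly as below. The server and filesystem names are hypothetical, and the exact option syntax should be checked against the dpm-modifyfs man page for the DPM version in use.

    # list pools, their filesystems and status
    dpm-qryconf
    # assumed syntax: put a read-only filesystem back into normal (read/write) service
    dpm-modifyfs --server pool03.example.ac.uk --fs /storage --st 0
    # or take it read-only again before maintenance
    dpm-modifyfs --server pool03.example.ac.uk --fs /storage --st RDONLY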

Oxford STEP09 Report 4 Earlier tests stressed the head node's network more. STEP09 put little load on the head node, with only sporadic network activity.

Oxford STEP09 Report 5 Data Transfer – small but successful. Oxford is on an 18% share of AODs. MCDISK had over 10 TB of old recon data. Total: files in 548 datasets, size GB; cf. Glasgow, where we have 8 recon datasets.

Oxford STEP09 Report 6 Observations - ATLAS. Apparently ATLAS were keeping an eLog; we weren't aware of it (or forgot about it) until the test was essentially over. ATLAS pilot jobs were coming in with Role=Production and Role=Pilot; it makes no sense and it breaks things, so please stop. ATLAS' pilot factory got wedged by a dead CE: we have a dual-redundant pair of CEs feeding the main SL4 system, t2ce02 and t2ce04, one of which died. Initially the factory was only using t2ce02; then, when reconfigured, it refused to use t2ce04 because of the backlog of (dead) pilots that had been sent to t2ce02. Direct access (i.e. not using the rfcp file stager) kills us; we can't run many jobs like that at once.
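To make the last point concrete, a hedged sketch of the two access patterns as seen from a worker node; the file path is a placeholder and this is not the actual ATLAS job wrapper.

    # File-stager pattern: copy the input to local scratch with rfcp, then read it locally.
    # Each job costs the disk server one sequential copy.
    rfcp /dpm/example.ac.uk/home/atlas/some/dataset/file.root $TMPDIR/file.root
    # ... run the job against $TMPDIR/file.root ...

    # Direct-access pattern: the job opens the file in place over rfio, e.g. from ROOT:
    #   TFile::Open("rfio:/dpm/example.ac.uk/home/atlas/some/dataset/file.root")
    # Many jobs doing sparse reads like this at once is what overloads the pools.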

Oxford STEP09 Report 7 Observations - LHCb. We were sent very few jobs; as far as we know they ran fine. Er, that's it.

Oxford STEP09 Report 8 What next? It would be useful to run more tests:
– on the individual access methods, one at a time,
– then a repeat of the mixture, but with different DNs for each method (see the sketch below).
Likely upgrades:
– More pools online. We've been waiting for DPM 1.7 to do some necessary maintenance; once that's done we should have about twelve active pool nodes rather than five.
– 10Gb Ethernet? (Not soon, and probably only on new kit.)
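On the different-DNs point, a sketch of one way to label each access method: run each set of test jobs under its own VOMS role so the site can tell them apart in the SE and CE logs. voms-proxy-init is the real command; whether these particular roles are the appropriate ones is an assumption.

    # proxy for the production-style tests
    voms-proxy-init -voms atlas:/atlas/Role=production
    # proxy for the pilot-style tests
    voms-proxy-init -voms atlas:/atlas/Role=pilot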

Oxford STEP09 Report 9 Conclusions. Oxford has no local ATLAS expert, so we suffered from a lack of awareness of the three different submission methods. We have been working with Peter Love to understand why the pilot-job submission method was failing. We would like a post-mortem meeting with Brian to make sure we understand what caused our various other problems.