UKI-SouthGrid Overview GridPP27 Pete Gronbech SouthGrid Technical Coordinator CERN September 2011.

UK Tier 2 reported CPU: historical view to present

SouthGrid Sites Accounting as reported by APEL

Status at March
[Table: storage (TB) and HS06 by year (2011, 2012) for bham, bris, cam, ox and RALPPD. The cell values were fused in extraction; surviving fragments: bham14502, bris6611, cam11481, ox20342.]

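Resources vs GridPP3 h/w generated MoU for 2011
[Table: MoU storage (TB) and HS06 by year for bham, bris, cam, ox, RALPPD and Total. The cell values were fused in extraction; surviving fragments: bham95124, bris2735 (TB); bham2, bris1, cam1, ox2 (HS06, truncated).]
[Table: HEPSPEC06 and Storage (TB) by site: EFDA-JET, Birmingham, Bristol, Cambridge, Oxford, RALPPD, Totals. The values did not survive extraction.]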

Sites: OX, Bris, Cam, Bham, RalPP
Cream: glite 3.2; creamce; creamce; 1 glite 3.2 creamce + 2 UMD creamces
LCG-ce: Gone; One driving condor; Still in production; Decommissioning two weeks ago, availability reduced due to anding of LCG-ce with cream while in draining mode
Glexec: Yes; No; No; Installed, failing some tests; Yes
Argus: Yes; No; EGI Argus; Installed but some issues; Yes
cvmfs: Installed, working for LHCb, waiting for ATLAS; No; Not yet, new h/w here to help reorganise the service nodes; Deployed, not used yet
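The LCG-ce remark above, that availability fell because a draining LCG-ce was ANDed with the CREAM CEs, can be illustrated with a small sketch. It assumes, purely for illustration, that a site counts as available in a given hour only if every registered CE passes its tests. The hourly pass/fail samples and the `availability` helper are hypothetical and are not taken from the real availability calculation.

```python
# Hypothetical hourly test results (True = the CE passed its tests).
# These samples are invented for illustration only.
cream_ce = [True, True, True, True, True, True]
lcg_ce_draining = [True, False, False, True, False, False]


def availability(*services):
    """Fraction of hours in which ALL given services passed (logical AND)."""
    hours = list(zip(*services))
    up = sum(1 for hour in hours if all(hour))
    return up / len(hours)


cream_only = availability(cream_ce)
anded = availability(cream_ce, lcg_ce_draining)

print(f"CREAM alone: {cream_only:.0%}")
print(f"CREAM AND draining LCG-ce: {anded:.0%}")
```

With these made-up samples the healthy CREAM CE is fully available on its own, but the combined figure drops to the hours in which the draining LCG-ce also passed.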

JET
Since the last meeting the site has been less well utilised, partly due to downtime associated with upgrades.
Essentially a pure CPU site
– 1772 HepSPEC06
– 10.5 TB of storage
All service nodes have been upgraded to glite 3.2, with CREAM CEs.
SE is now 10.5 TB.
The aim is to enable the site for ATLAS production work, but the ATLAS s/w will be easier to manage if we set up CVMFS. Oxford will help JET do this.

Birmingham Tier 2 Site
Not much has changed since the last meeting!
Our hardware is still:
– 24 8-core machines (192 job slots) @ 9.61 HEP-SPEC06 (Local)
– 48 4-core machines (192 job slots) @ 7.93 HEP-SPEC06 (Shared)
– T of DPM storage across 4 pool nodes
As for service nodes, we have:
– 4 CEs (2 CREAM and 2 LCG), serving the two clusters
– CREAM CE for the local cluster also runs Torque
– 2 ALICE VO boxes, 1 for each cluster
– An ARGUS server for the local cluster
– Usual BDII, APEL and DPM MySQL server nodes
All these are running gLite 3.2 on SL5, with the exception of the LCG CEs.
The main change from last time is that we have deployed glexec on the local cluster; we are still waiting on a tarball install for the shared cluster.
Have just taken delivery of 2 new 8-core systems to replace the 4 quad-core service machines.
Our future plans include:
– Decommission the LCG CEs
– Consolidate service nodes onto new machines
– Split the Torque server and CREAM CE
– Deploy CVMFS
– Turn the older service machines into workers (maybe!)
Hopefully most of this can be done in one go in the next month or so!
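As a rough cross-check of the Birmingham hardware figures, the sketch below multiplies job slots by the quoted HEP-SPEC06 rating for each cluster. It assumes that the 9.61 and 7.93 figures are per job slot and that there is one job slot per core; the slide text does not state either explicitly, and the variable names are illustrative.

```python
# Assumption: the HEP-SPEC06 figures are per job slot (not stated on the slide).
clusters = {
    "local": {"machines": 24, "cores_per_machine": 8, "hs06_per_slot": 9.61},
    "shared": {"machines": 48, "cores_per_machine": 4, "hs06_per_slot": 7.93},
}

totals = {}
for name, c in clusters.items():
    slots = c["machines"] * c["cores_per_machine"]  # assumes one slot per core
    totals[name] = {"slots": slots, "hs06": slots * c["hs06_per_slot"]}

site_hs06 = sum(t["hs06"] for t in totals.values())

for name, t in totals.items():
    print(f"{name}: {t['slots']} slots, {t['hs06']:.2f} HS06")
print(f"total: {site_hs06:.2f} HS06")
```

Under those assumptions the two clusters come to roughly 1845 and 1523 HS06, about 3368 HS06 in total.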

Bristol
Status
– StoRM SE with GPFS, 102TB, almost completely full of CMS data
– Currently running StoRM 1.3 on SL4; plan to upgrade as soon as there is a stable new release. So far 1.6 and 1.7 have not been.
– Bristol has two clusters, both controlled by Physics. Neither of the university HPC clusters is currently being used.
– New Dell VM hosting node bought to run service VMs on, with help from Oxford.
Recent changes
– New CREAM CEs front each cluster, one glite 3.2 and one using the new UMD release (installed by Kashif).
– Glexec and Argus have not yet been installed.

Cambridge
Status
– CPU: 246 job slots
– 2445 HS06
– Storage: 201TB [si] online, plus 38TB exclusively used by Camont
Most services are glite 3.2; the exceptions are the DPM head node and the LCG-CE for the Condor cluster.
DPM v1.8.0 on of the DPM disk servers, SL5
XFS file system for the storage
Batch System: Condor 7.4.4, Torque
Supported VOs: mainly ATLAS, LHCb and Camont
Recent Changes
– CREAM CE with PBS installed
– Also working on CREAM-Condor in parallel
APEL issues
– Problems with the existing APEL implementation for Condor

RALPP
2056 CPU cores, HS06
980TB disk
We now run purely CREAM CEs: 1 * glite 3.2 on a VM (soon to be retired) and 2 * UMD (though at the time of writing, one doesn't seem to be publishing properly).
Lately there have been a lot of problems with CE stability, as per discussions on the various mailing lists.
Batch system is still Torque from glite 3.1, but we will soon bring up an EMI/UMD Torque to replace it (currently installed for test).
SE is dCache; planning to upgrade in the near future.
Has been very busy over recent months.

Oxford
Oxford's workload is dominated by ATLAS analysis and production.
Installed kit
– Autumn 2010 upgrade added 256 cores based on dual 8-core AMD Opterons.
– These have dual disks striped with s/w RAID to improve I/O.
– Three new 36-bay disk servers took storage up to 290TB to meet MoU requirements.
Recent Upgrades
– Using Departmental money
– 14 Dell R510 disk servers, faster and smaller chunks with 10Gbit networking
– Some Dell 6100 WNs installed.
– Two 10G network switches and new gigabit switches for the cluster
– Are in talks with University networking with an aim to convert our link from the computer centre to 10Gbit. The current plan is to use QoS to allow us to use idle bandwidth dependent on usage. The dual 10Gbit campus JANET link is currently running at ~3Gbit in and 1Gbit out, so there is spare capacity available.

Other Oxford Work
CMS Tier 3
– Supported by RALPPD's PhEDEx server
– Useful for CMS, and for us, keeping the site busy in quiet times
– However, it can block ATLAS jobs, which is not so desirable during the accounting period
ALICE Support
– There is a need to supplement the support given to ALICE by Birmingham.
– It made sense to keep this in SouthGrid, so Oxford have deployed an ALICE VO box.
– Site being configured by Kashif in conjunction with ALICE support
UK Regional Monitoring
– Kashif runs the Nagios-based WLCG monitoring on the servers at Oxford.
– These include the Nagios server itself and support nodes for it: SE, MyProxy and WMS/LB.
– The WMS is an addition to help the UK NGS migrate their testing.
– There are very regular software updates for the WLCG Nagios monitoring, ~6 so far this year.
Early Adopters
– Take part in the testing of CREAM, ARGUS and torque_utils. Have accepted and provided a report for every new version of CREAM this year.
SouthGrid Support
– Providing support for Bristol
– Landslides support at Oxford and Bristol
– Helping bring Sussex onto the Grid (been too busy in recent months though)

Sussex
Sussex has a significant local ATLAS group; their system is designed for the high IO bandwidth patterns that ATLAS analysis can generate.
Up and running as a Tier 3, with the Feynman sub-cluster for Particle Physics and the Apollo sub-cluster used by the rest of the University.
Feynman: 8 nodes, each with 2 Intel Xeon 2.67GHz CPUs measured at ~15.67 HepSpec06 per core, total of 96 cores. 48GB RAM per node.
Apollo currently has 38 nodes totalling 464 cores.
The plan is to merge the 2 sub-clusters in the next 6 months.
81T of Lustre storage shared by both sub-clusters.
Everything fully interconnected with InfiniBand.
Cluster is Dell hardware, using three R510 disk servers, each with two external disk shelves (each with its own RAID controller).
CVMFS installed and working, being used by the ATLAS group at Sussex.
In the process of installing and configuring grid services to become a Tier 2 site (UKI-SOUTHGRID-SUSX) for SouthGrid.
We have registered the service nodes and got grid certificates for them. 4 machines are set up ready for BDII, CreamCE, Apel and SE. BDII and Apel done, working on CE and SE.
Hoping to be fully up and running within 2 months.
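The Sussex figures can be turned into a few derived quantities. The sketch below is only back-of-envelope arithmetic on the numbers quoted for Feynman and Apollo; it treats the approximate ~15.67 HepSpec06 per core measurement as exact, and the variable names are illustrative.

```python
# Figures quoted for the Sussex sub-clusters.
feynman_nodes = 8
feynman_cores = 96
feynman_ram_gb_per_node = 48
hs06_per_core = 15.67  # quoted as approximate ("~15.67"); treated as exact here
apollo_cores = 464

cores_per_node = feynman_cores // feynman_nodes
ram_gb_per_core = feynman_ram_gb_per_node / cores_per_node
feynman_hs06 = feynman_cores * hs06_per_core
merged_cores = feynman_cores + apollo_cores  # if the two sub-clusters merge

print(f"Feynman: {cores_per_node} cores/node, {ram_gb_per_core:.1f} GB RAM/core")
print(f"Feynman capacity: {feynman_hs06:.2f} HS06")
print(f"Merged cluster: {merged_cores} cores")
```

This gives 12 cores and 4 GB of RAM per core on each Feynman node, about 1504 HS06 for Feynman, and 560 cores for a merged cluster.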

Conclusions
SouthGrid sites' utilisation is generally improving, but some sites are small compared with others.
Birmingham: supporting ATLAS, ALICE and LHCb.
Bristol: need to get a new version of StoRM working if they hope to be a CMS Tier 2 site.
Cambridge: only partly using PBS, so APEL still reports low. The Condor part does not report correctly into APEL. Accounting metrics come direct from ATLAS, so this is less critical for that.
JET: could be enabled for ATLAS production as they now have enough disk, but ATLAS say they would prefer them to use CVMFS, so we have to help them do that.
Oxford: upgraded to be optimised for ATLAS analysis, and is involved in many other areas.
RALPPD: at full strength, leading the way.
Sussex: needs some small effort/support to bring them online.