UKI-SouthGrid Update – HEPiX
Pete Gronbech, SouthGrid Technical Coordinator
April 2012

UK Particle Physics Computing Structure
- One Tier 1 centre: RAL.
- 20 University Particle Physics Departments; most are part of a distributed Grid Tier 2 centre.
- All have some local (a.k.a. Tier 3) computing.
- SouthGrid comprises all the non-London sites in the South of the UK: Birmingham, Bristol, Cambridge, JET, Oxford, RAL PPD, Sussex and Warwick.

UKI Tier-1 & Tier-2 contributions
The UK is a large contributor to the EGI. The Tier-1 accounts for ~27%; the Tier-2s' shares are as shown below.

UK Tier 2 reported CPU – Historical View to present

SouthGrid Sites Accounting as reported by APEL

Birmingham – GridPP4 and DRI Funds

Capacity upgrade:
- 4 Dell C6145 chassis, each with 96 AMD 6234 cores, and 60 TB of RAID storage.
- The C6145 servers will therefore contribute 384 cores at around 8 HS06 per core, doubling our number of job streams (see the sketch below).

Networking infrastructure upgrades:
- Connectivity to JANET has been improved by a dedicated 10GbE port on the West Midlands router, dedicated to the GridPP site.
- New Dell S4810 switches with 48 10GbE ports per switch to enhance the internal interconnection of the GridPP clusters, both in Physics and in the University centre.
- Single GbE links on existing workers in Physics to be enhanced to dual GbE.
- 3 S4810 switches being deployed in Physics for current and future grid storage nodes, servers, and existing GridPP workers at 2x1GbE and at 10GbE.
- 2 S4810 switches to be deployed in the University central GridPP clusters.
- The switches were purchased with sufficient fibre inserts and 1GbE RJ45 inserts, as well as many 10GbE direct-attach cables, to cover current and future needs for the next 4 or more years.

Current capacity: 3345 HS06, 195 TB
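As a back-of-the-envelope check of the Birmingham numbers quoted above, the sketch below works through the capacity arithmetic. The chassis, core and per-core HS06 figures come from the slide; the existing-slot estimate is only what the "doubling" claim implies, not an official number.

```python
# Illustrative sketch of the Birmingham capacity arithmetic quoted above.
# Chassis count, cores per chassis and ~8 HS06/core are taken from the slide;
# the existing-slot estimate is an assumption implied by "double our job streams".

chassis = 4
cores_per_chassis = 96          # AMD Opteron 6234 cores per C6145 chassis
hs06_per_core = 8               # approximate figure quoted on the slide

new_cores = chassis * cores_per_chassis
new_hs06 = new_cores * hs06_per_core

print(f"New cores: {new_cores}")                    # 384, matching the slide
print(f"Approx. new capacity: {new_hs06} HS06")     # ~3072 HS06

# 'Doubling the number of job streams' implies roughly the same number of
# job slots already existed before the upgrade.
existing_slots_estimate = new_cores
print(f"Implied existing job slots: ~{existing_slots_estimate}")
```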

Bristol Status
- StoRM SE with GPFS, 102 TB, "almost completely" full of CMS data.
- Currently running StoRM 1.3 on SL4; plan to upgrade to 1.8 soon.
- Bristol has two clusters, both controlled by Physics. The university HPC clusters are currently being used.
- New Dell VM hosting node bought to run service VMs on, with help from Oxford.

Recent changes:
- New CREAM CEs front each cluster, one gLite 3.2 and one using the new UMD release (installed by Kashif).
- gLExec and Argus have not yet been installed.
- Improved 10G connectivity from the cluster and across campus, plus improved 1G switching for WNs.

Current capacity: 2247 HS06, 110 TB

Cambridge Status (following recent upgrades)
- CPU: 268 job slots.
- Most services on gLite 3.2, except the LCG-CE for the Condor cluster.
- DPM v1.8.0 on the DPM disk servers, SL5.
- Batch systems: Condor 7.4.4 and Torque.
- Supported VOs: mainly ATLAS, LHCb and Camont.
- With GridPP4 funds, 40 TB of disk space and 2 dual six-core CPU servers (i.e. 24 cores in total) have been purchased.
- The DRI grant allowed us to upgrade our campus connection to 10 Gbps. We were also able to install dedicated 10 Gbps fibre to our GRID Room, 10GbE switches to support our disk servers and head nodes, and 10GbE interconnects to new 1GbE switches for the worker nodes. We also enhanced our UPS capability to increase the protection for our GRID network and central servers.

Current capacity: 2700 HS06, 295 TB

JET
- Essentially a pure CPU site.
- All service nodes have been upgraded to gLite 3.2, with CREAM CEs.
- The SE is now 10.5 TB.
- The aim is to enable the site for ATLAS production work; the ATLAS software will be easier to manage if we set up CVMFS.
- A new server has been purchased to be the virtual machine server for various nodes (including the squid required for CVMFS). Oxford will help JET do this; a sketch of the client-side configuration involved follows below.

Current capacity: 1772 HS06, 10.5 TB
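For context, the client side of a CVMFS deployment like the one planned above typically amounts to a few settings in /etc/cvmfs/default.local pointing worker nodes at the local squid. The sketch below only illustrates that; the squid hostname and cache size are hypothetical placeholders, not JET's actual values.

```python
# Minimal sketch of generating a CVMFS client configuration file.
# CVMFS_REPOSITORIES, CVMFS_HTTP_PROXY and CVMFS_QUOTA_LIMIT are standard
# CVMFS client settings; the hostname and cache size are placeholders only.

config = {
    "CVMFS_REPOSITORIES": "atlas.cern.ch,atlas-condb.cern.ch",
    "CVMFS_HTTP_PROXY": "http://squid.example.site:3128",  # local squid (placeholder)
    "CVMFS_QUOTA_LIMIT": "10000",                           # local cache in MB (example)
}

with open("default.local", "w") as f:   # normally written to /etc/cvmfs/default.local
    for key, value in config.items():
        f.write(f"{key}={value}\n")

print(open("default.local").read())
```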

RALPP
2056 CPU cores.
- We now run purely CREAM CEs: 1 x gLite 3.2 on a VM (soon to be retired), 2 x UMD (though at the time of writing one doesn't seem to be publishing properly).
- Lately a lot of problems with CE stability, as per discussions on the various mailing lists.
- The batch system is still Torque from gLite 3.1, but we will soon bring up an EMI/UMD Torque to replace it (currently installed for testing).
- The SE is dCache; planning to upgrade in the near future.

GridPP4 purchases:
- 9 x Viglen/Supermicro Twin^2 boxes (i.e. 36 nodes), each with 2 x Xeon E5645 CPUs, which will be configured to use some of the available hyperthreads (18 job slots per node) to provide a total of approximately 6207 HS06.
- 5 x Viglen/Supermicro storage nodes, each providing 40 TB of pool storage, for a total of 200 TB.

DRI funds enabled the purchase of 6 x Force10 S4810 switches (plus optics, etc.) and 10Gb network cards, which will allow us to bring our older storage nodes up to 10Gb networking. 2 of the new switches will form a routing layer above our core network.

After testing hyper-threading at various levels we are now increasing the number of job slots available on hyper-thread-capable worker nodes to use 50% of the available hyper-threads (see the sketch below).

Current capacity: 26409 HS06, 980 TB
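The hyper-threading figures above hang together as follows; this is just a sketch of the arithmetic implied by the slide (the 6-core count is the known E5645 specification, the rest are the slide's own numbers).

```python
# Sketch of the RALPP job-slot arithmetic quoted above.
# Xeon E5645 is a 6-core CPU; the 50%-of-hyperthreads policy, the
# 18 slots/node and the ~6207 HS06 total all come from the slide.

nodes = 36                      # 9 Twin^2 chassis x 4 nodes
cpus_per_node = 2
cores_per_cpu = 6               # Xeon E5645
physical_cores = cpus_per_node * cores_per_cpu     # 12 per node
extra_ht_threads = physical_cores                  # one extra thread per core

# Physical cores plus 50% of the additional hyperthreads:
slots_per_node = physical_cores + extra_ht_threads // 2
print(slots_per_node)                              # 18, as on the slide

total_slots = nodes * slots_per_node               # 648 job slots
total_hs06 = 6207                                  # figure quoted on the slide
print(f"{total_slots} slots, ~{total_hs06 / total_slots:.1f} HS06 per slot")
```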

Sussex
Sussex has a significant local ATLAS group; their system (Tier 3) is designed for the high IO-bandwidth patterns that ATLAS analysis can generate.

Recent spending at Sussex on the HPC infrastructure from GridPP, DRI and internal EPP budgets has been as follows:
- 4 R510 OSSs to expand Lustre by 63 TB.
- 18 InfiniBand cards for the OSSs and for the upcoming CPU spend.
- 4 36-port InfiniBand switches.
- Integrated all sub-clusters at Sussex into one unified whole.
- Set up a fat-tree InfiniBand topology using the extra switches: better IB routing throughout the cluster and more ports for expansion. Users are reporting that the cluster is running faster.

The total capacity of the cluster (used by the entire university) is now ~500 cores (equivalent to Intel X5650) with about 12 HEPSPEC06 per core (6000 HS06). We have 144 TB of Lustre filespace in total, again shared by the entire university.

Tier 2 progress:
- The Sussex site will become a Tier 2 shortly. The BDII, CREAM CE, StoRM, CVMFS and APEL services are now working correctly. We should be able to get online by early next week.
- Initially we are going to restrict grid jobs to a small part of the cluster: 24 job slots, with 50 TB of disk space allocated to the grid.
- Any future disk spend will all be allocated to the grid, as we now have enough available for internal needs for the foreseeable future. Also, most of the future EPP CPU spend will be allocated to the grid, and thanks to the unified cluster the grid will be able to backfill into any spare CPU capacity added by other departments.

Oxford
- Significant upgrades over the last year to enhance the performance and capacity of the storage: a move to smaller, faster Dell R510 servers (12 x 2 TB raw capacity). 14 were installed during Autumn 2011 and a further 5 this spring.
- A mixture of Intel (Dell C6100) and AMD CPU worker nodes have been installed: four AMD 6276 Interlagos 16-core CPUs on each of the two motherboards in the new Dell C6145s, so 4 servers provide 512 job slots (see the sketch below).
- New total capacity will be ~700 TB and 1360 cores, with a total of 11500 HS06.

Current capacity: 8961 HS06, 620 TB
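As a sanity check of the worker-node figures above, the sketch below multiplies out the socket and motherboard counts given on the slide; nothing here goes beyond the slide's own numbers.

```python
# Sketch of the Oxford C6145 job-slot arithmetic quoted above.
# 4 servers, 2 motherboards each, 4 x 16-core Opteron 6276 per motherboard,
# as described in the slide text.

servers = 4
motherboards_per_server = 2
cpus_per_motherboard = 4        # AMD Opteron 6276 "Interlagos"
cores_per_cpu = 16

slots = servers * motherboards_per_server * cpus_per_motherboard * cores_per_cpu
print(slots)   # 512 job slots, matching the slide
```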

Network Upgrades
- The current University backbone is 10 Gbit, so additional links are being installed to provide a 10 Gbit path from the Grid and Tier 3 clusters to the JANET router. In due course this will allow full 10 Gbit throughput without saturating the links used by the rest of the University.
- Storage servers and high-core-count WNs will be connected to Force10 S4810s, with older WNs connected at gigabit.
- The initial 1 Gbit link from the computer room was upgraded to 2 Gbit as an interim measure in the autumn. Current peak usage is 1.6 Gbit/s.

Other Oxford Work

CMS Tier 3:
- Supported by RALPPD's PhEDEx server.
- Useful for CMS, and for us, keeping the site busy in quiet times.
- However, it can block ATLAS jobs, which is less desirable during accounting periods.

ALICE support:
- There is a need to supplement the support given to ALICE by Birmingham.
- It made sense to keep this in SouthGrid, so Oxford have deployed an ALICE VO box.
- The site is being configured by Kashif in conjunction with ALICE support.

UK regional monitoring:
- Kashif runs the Nagios-based WLCG monitoring on the servers at Oxford.
- These include the Nagios server itself and support nodes for it: SE, MyProxy and WMS/LB.
- The WMS is an addition to help the UK NGS migrate their testing.
- There are very regular software updates for the WLCG Nagios monitoring, ~6 so far this year.

Early adopters:
- Take part in the testing of CREAM, ARGUS and torque_utils. Have accepted and provided a report for every new version of CREAM this year.

SouthGrid support:
- Providing support for Bristol and JET.
- Landslides support at Oxford and Bristol.
- Helping bring Sussex onto the Grid (though we have been too busy in recent months).

Conclusions
- SouthGrid sites' utilisation is generally improving, but some sites are small compared with others.
- Recent hardware purchases will provide both capacity and performance (infrastructure) improvements.
- Enabling CVMFS at JET should allow ATLAS production jobs to run there and soak up the spare CPU cycles.
- Sussex is in the process of being certified as a Grid site.