SouthGrid Status - Pete Gronbech, 2nd April 2009, GridPP22, UCL

UK Tier 2 reported CPU – Historical View to Q109

UK Tier 2 reported CPU – Q1 2009

SouthGrid Sites Accounting as reported by APEL

Job distribution

Site Upgrades since GridPP21
- RALPPD: increase of 640 cores (1568 kSI2K) + 380 TB
- Cambridge: 32 cores (83 kSI2K) + 20 TB
- Birmingham: 64 cores on the PP cluster and 128 cores on the HPC cluster, adding ~430 kSI2K
- Bristol: original cluster replaced by new quad-core systems (16 cores), plus an increased share of the HPC cluster; 53 kSI2K + 44 TB
- Oxford: extra 208 cores, 540 kSI2K + 60 TB
- JET: extra 120 cores, 240 kSI2K
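As a quick cross-check of the increments listed above, they can be tallied as in the sketch below (a rough sketch only: Birmingham's two additions are combined into the quoted ~430 kSI2K, and Bristol's 53 kSI2K is taken to cover both its new nodes and its larger HPC share).

    # Rough tally of the capacity added across SouthGrid since GridPP21,
    # using the per-site figures quoted on the slide above.
    additions = {
        # site: (cores, kSI2K, TB of disk)
        "RALPPD":     (640, 1568, 380),
        "Cambridge":  ( 32,   83,  20),
        "Birmingham": (192,  430,   0),  # 64 PP + 128 HPC cores, ~430 kSI2K combined
        "Bristol":    ( 16,   53,  44),  # new quad-core nodes + larger HPC share
        "Oxford":     (208,  540,  60),
        "JET":        (120,  240,   0),
    }

    cores = sum(c for c, _, _ in additions.values())
    ksi2k = sum(k for _, k, _ in additions.values())
    disk  = sum(t for _, _, t in additions.values())
    print(f"Added since GridPP21: {cores} cores, ~{ksi2k} kSI2K, {disk} TB")
    # -> Added since GridPP21: 1208 cores, ~2914 kSI2K, 504 TB

In round numbers that is about 1200 extra cores, ~2.9 MSI2K and ~500 TB of additional disk across the Tier 2.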

New Total Q109
[Table of Storage (TB) and CPU (kSI2K) per site - RALPPD, Oxford, Cambridge, Bristol, Birmingham, EDFA-JET - with SouthGrid totals and the GridPP figures; the numbers did not survive in the transcript.]

MoU
[Table comparing delivered capacity against MoU commitments, with columns "% of MoU CPU" and "% of MoU Disk" per site; the per-site pairing is garbled in the transcript. Surviving values: 142.86%, 96.77%, 343.75%, 230.77%, 363.64%, 374.56%, 314.31%.]

Where are other T2s benefiting compared to SouthGrid?

January 2009 so far: CMS only at Bristol and RAL; esr, hone, ilc, pheno, supernemo and southgrid contribute kSI2K-hours to Oxford.

Network rate capping
- Oxford recently had its network link rate capped to 100 Mb/s.
- This was a result of continuous high-rate traffic caused by CMS commissioning testing.
  - As it happens the test completed at the same time as we were capped, so we passed it, and current normal use is not expected to be this high.
- Oxford's JANET link is actually 2 x 1 Gbit links, which had become saturated.
- The short-term solution is to rate-cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s.
- The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.

SPEC benchmarking
- Purchased the SPEC CPU2006 benchmark suite.
- Ran it using the HEPiX scripts, i.e. in the HEP-SPEC06 way.
- Using the HEP-SPEC06 benchmark should provide a level playing field: in the past, sites could choose any one of the many published values on the SPEC benchmark site.
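The arithmetic behind a published capacity figure is simple; the sketch below is illustrative only (the node count and per-node score are invented, and the roughly 4 HS06 per kSI2K factor is the commonly quoted approximation rather than a number from this talk).

    # Illustrative only: scale a measured per-node HEP-SPEC06 score up to a
    # cluster capacity figure. The node count and per-node score are made up,
    # and ~4 HS06 per kSI2K is the commonly quoted approximate conversion.
    HS06_PER_KSI2K = 4.0

    def cluster_capacity(nodes, hs06_per_node):
        """Total HEP-SPEC06 and approximate kSI2K for a homogeneous batch of nodes."""
        total_hs06 = nodes * hs06_per_node
        return total_hs06, total_hs06 / HS06_PER_KSI2K

    hs06, ksi2k = cluster_capacity(nodes=26, hs06_per_node=70.0)  # e.g. 26 dual quad-core boxes
    print(f"~{hs06:.0f} HS06, roughly {ksi2k:.0f} kSI2K equivalent")

Running every box with the same HEPiX recipe is what makes such totals comparable between sites.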

Staff Changes
- Jon Waklin and Yves Coppens left in Feb 09.
- Kashif Mohammad started in Jan 09 as the deputy coordinator for SouthGrid.
- Chris Curtis will replace Yves, starting in May; he is currently doing his PhD on the ATLAS project.
- The Bristol post will be advertised; it is jointly funded by IS and GridPP.

gridppnagios

Resilience
- What do we mean by resilience? The ability to maintain high availability and reliability of our grid service.
- Guard against failures:
  - Hardware
  - Software

Availability / Reliability

Hardware Failures
The hardware:
- Critical servers:
  - Good quality equipment
  - Dual PSUs
  - Dual mirrored system disks, and RAID for the storage arrays
  - All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards)
  - Similar kit bought for servers, so hardware can be swapped between them
  - IPMI cards allow remote operation and control
The environment:
- UPS for critical servers
- Network-connected PDUs for monitoring and power switching
- Professional computer room / rooms
- Air conditioning: need to monitor the temperature
- Actions based on the above environmental monitoring (a sketch follows this slide):
  - Configure your UPS to shut down systems in the event of sustained power loss
  - Shut down the cluster in the event of high temperature
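As a minimal sketch of the "actions based on environmental monitoring" point: a script along these lines could poll a temperature reading and trigger an orderly shutdown if it stays high. The sensor path, threshold and shutdown command are generic Linux assumptions, not a description of what any SouthGrid site actually runs.

    # Poll a temperature sensor and shut the node down cleanly if it stays hot.
    # The sysfs path, 35 C threshold and shutdown command are assumptions for a
    # generic Linux node; real deployments would read the machine-room sensors.
    import subprocess
    import time

    SENSOR = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees C on many kernels
    LIMIT_C = 35.0           # hypothetical alarm threshold
    CONSECUTIVE_NEEDED = 3   # require several bad readings to ignore one-off spikes

    def read_temp_c():
        with open(SENSOR) as f:
            return int(f.read().strip()) / 1000.0

    bad = 0
    while True:
        if read_temp_c() > LIMIT_C:
            bad += 1
            if bad >= CONSECUTIVE_NEEDED:
                subprocess.run(["shutdown", "-h", "+5", "over-temperature"], check=False)
                break
        else:
            bad = 0
        time.sleep(60)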

Hardware continued
Having guarded against the hardware failing, if it does fail we need to ensure rapid replacement:
- Restore from backups or reinstall
- Automated installation system:
  - PXE, kickstart, cfengine
  - Good documentation
- Duplication of critical servers:
  - Multiple CEs
  - Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs)
- Less reliance on external services: could set up a local WMS and a top-level BDII

Software Failures
The main cause of loss of availability is software failure:
- Misconfiguration
- Fragility of the gLite middleware
- OS problems:
  - Disks filling up
  - Service failures (e.g. ntp)
Good communications can help solve problems quickly:
- Mailing lists, wikis, blogs, meetings
- Good monitoring and alerting (Nagios etc.); a simple check is sketched after this slide
- Learn from mistakes: update systems and procedures to prevent recurrence.
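Two of the failure modes above (a partition filling up, the ntp daemon dying) are easy to catch with a simple Nagios-style check. The sketch below follows the usual plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL); the partition, thresholds and process name are illustrative assumptions.

    #!/usr/bin/env python
    # Nagios-style check for two of the failure modes mentioned above:
    # a partition filling up and the ntp daemon not running.
    # Exit codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL.
    import shutil
    import subprocess
    import sys

    PARTITION = "/var"            # assumed mount point to watch
    WARN_PCT, CRIT_PCT = 80, 90   # assumed thresholds

    usage = shutil.disk_usage(PARTITION)
    pct_used = 100.0 * usage.used / usage.total

    ntpd_running = subprocess.run(
        ["pgrep", "-x", "ntpd"], stdout=subprocess.DEVNULL
    ).returncode == 0

    status, problems = 0, []
    if pct_used >= CRIT_PCT:
        status, problems = 2, [f"{PARTITION} {pct_used:.0f}% full"]
    elif pct_used >= WARN_PCT:
        status, problems = 1, [f"{PARTITION} {pct_used:.0f}% full"]
    if not ntpd_running:
        status = 2
        problems.append("ntpd not running")

    print("OK" if not problems else "; ".join(problems))
    sys.exit(status)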

Recent example
- Many SAM failures with occasional passes, yet all test jobs pass and almost all ATLAS jobs pass.
- Error logs revealed messages about the proxy not being valid yet!
- ntp on the SE head node had stopped, AND cfengine had been switched off on that node, so there was no automatic check and restart (a clock-skew check is sketched below).
- A SAM test always gets a new proxy, so if it got through the WMS and onto our cluster into a reserved express-queue slot within 4 minutes, it would fail.
- In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem.
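The underlying fault here was clock drift once ntpd had stopped, and that is cheap to check independently of SAM. The sketch below assumes the third-party ntplib package (pip install ntplib) and a reachable public pool server; the 5-second threshold is arbitrary.

    # Independent check of local clock skew, the fault behind the
    # "proxy not valid yet" errors above. Assumes the third-party ntplib
    # package (pip install ntplib) and a reachable public NTP server.
    import sys
    import ntplib

    MAX_SKEW_S = 5.0  # arbitrary threshold; the incident above involved minutes of drift

    try:
        resp = ntplib.NTPClient().request("pool.ntp.org", version=3, timeout=5)
    except Exception as exc:
        print(f"UNKNOWN: could not query NTP server ({exc})")
        sys.exit(3)

    if abs(resp.offset) > MAX_SKEW_S:
        print(f"CRITICAL: clock off by {resp.offset:.1f}s - check ntpd and cfengine")
        sys.exit(2)

    print(f"OK: clock offset {resp.offset:.3f}s")
    sys.exit(0)

Had something like this been alerting, the stopped ntpd would have shown up well before SAM proxies started arriving "from the future".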

Conclusions
- These systems are extremely complex.
- Automatic configuration and good monitoring can help, but systems need careful tending.
- Sites should adopt best practice and learn from others.
- We are improving, but it's an ongoing task.