
1 SouthGrid Status Pete Gronbech: 2nd April 2009 GridPP22 UCL

2 UK Tier 2 reported CPU – Historical View to Q109

3 UK Tier 2 reported CPU – Q1 2009

4 SouthGrid Sites Accounting as reported by APEL

5 Job distribution

6 Site Upgrades since GridPP21
RALPPD: increase of 640 cores (1568 KSI2K) + 380 TB
Cambridge: 32 cores (83 KSI2K) + 20 TB
Birmingham: 64 cores on the PP cluster and 128 cores on the HPC cluster, adding ~430 KSI2K
Bristol: original cluster replaced by new quad-core systems, 16 cores + an increased share of the HPC cluster, 53 KSI2K + 44 TB
Oxford: extra 208 cores, 540 KSI2K + 60 TB
JET: extra 120 cores, 240 KSI2K

7 New Total Q1 09 SouthGrid
Site         Storage (TB)   CPU (kSI2K)
EDFA-JET     1.5            483
Birmingham   90             700
Bristol      55             120
Cambridge    60             455
Oxford       160            972
RALPPD       633            2815
Totals       999.5          5545

8 MoU
Site         % of MoU CPU   % of MoU Disk
Birmingham   304.35%        142.86%
Bristol      96.77%         343.75%
Cambridge    469.07%        230.77%
Oxford       592.68%        363.64%
RALPPD       329.63%        374.56%
SouthGrid    377.47%        314.31%

9 Where are other T2s benefiting compared to SouthGrid?

10 January 2009 so far: CMS runs only at Bristol and RAL; esr, hone, ilc, pheno, supernemo and southgrid contribute 17755 KSI2K hours to Oxford

11 Network rate capping
Oxford recently had its network link rate capped to 100 Mb/s
This was the result of continuous 300-350 Mb/s traffic caused by CMS commissioning testing
–As it happens, the test completed at the same time as we were capped, so we passed the test, and current normal use is not expected to be this high
Oxford's JANET link is actually 2 x 1 Gb/s links, which had become saturated
The short-term solution is to rate cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s
The long-term plan is to upgrade the JANET link to 10 Gb/s within the year

12 SPEC benchmarking
Purchased the SPEC CPU2006 benchmark suite
Ran it using the HEPiX scripts, the HEP-SPEC06 way
Using the HEP-SPEC06 benchmark should provide a level playing field; in the past sites could choose any one of the many published values on the SPEC benchmark site
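A minimal sketch of how per-benchmark results can be combined into a single per-host score, assuming the score is a geometric mean of the individual ratios summed over parallel runs; the exact HEP-SPEC06 recipe is defined by the HEPiX scripts and the SPEC CPU2006 tooling, not by this example.

# Hypothetical aggregation of per-benchmark ratios into one machine score.
# The figures and the number of runs are invented for illustration.
from math import prod

def geometric_mean(ratios):
    # Geometric mean of the individual benchmark ratios for one run
    return prod(ratios) ** (1.0 / len(ratios))

def machine_score(per_run_ratios):
    # One simultaneous run per core; the host score is the sum over runs
    return sum(geometric_mean(run) for run in per_run_ratios)

runs = [
    [10.1, 11.3, 9.8],
    [10.0, 11.1, 9.9],
    [10.2, 11.0, 9.7],
    [9.9, 11.2, 9.6],
]
print(f"score ~ {machine_score(runs):.1f}")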

13 Staff Changes
Jon Wakelin and Yves Coppens left in Feb 09
Kashif Mohammad started in Jan 09 as the deputy coordinator for SouthGrid
Chris Curtis will replace Yves, starting in May; he is currently doing his PhD on the ATLAS project
The Bristol post will be advertised; it is jointly funded by IS and GridPP

14 gridppnagios

15 Resilience
What do we mean by resilience?
The ability to maintain high availability and reliability of our grid service
Guard against failures:
–Hardware
–Software

16 Availability / Reliability

17 Hardware Failures
The hardware
–Critical servers
Good quality equipment
Dual PSUs
Dual mirrored system disks and RAID for storage arrays
All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards)
Similar kit is bought for servers so hardware can be swapped
IPMI cards allow remote operation and control
–The environment
UPS for critical servers
Network-connected PDUs for monitoring and power switching
Professional computer room / rooms
Air conditioning: need to monitor the temperature
Actions based on the above environmental monitoring (see the sketch after this slide):
–Configure your UPS to shut down systems in the event of sustained power loss
–Shut down the cluster in the event of high temperature
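As a sketch of the last two actions, assuming a hypothetical read_room_temperature() probe and a site-specific shutdown script; this illustrates the idea only and is not the monitoring actually deployed at any SouthGrid site.

# Shut the cluster down if the machine-room temperature stays above a
# threshold for a sustained period. The probe and the shutdown command
# below are placeholders.
import subprocess
import time

TEMP_LIMIT_C = 30.0        # assumed threshold
SUSTAINED_SECONDS = 600    # how long it must stay hot before acting

def read_room_temperature():
    # Placeholder: in practice this would query an IPMI sensor or a
    # network-connected environment monitor.
    raise NotImplementedError

def monitor():
    hot_since = None
    while True:
        if read_room_temperature() > TEMP_LIMIT_C:
            hot_since = hot_since or time.time()
            if time.time() - hot_since > SUSTAINED_SECONDS:
                # Site-specific: drain the batch system, then power off worker nodes
                subprocess.run(["/usr/local/sbin/cluster-shutdown"], check=False)
                return
        else:
            hot_since = None
        time.sleep(60)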

18 Hardware continued
Having guarded against the hardware failing, if it does fail we need to ensure rapid replacement
Restore from backups or reinstall
Automated installation system:
–PXE, kickstart, cfengine
–Good documentation
Duplication of critical servers (see the sketch after this slide):
–Multiple CEs
–Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs)
–Less reliance on external services: could set up a local WMS and top-level BDII
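A minimal sketch of checking that at least one replica of each duplicated service is reachable; the hostnames are invented examples, and a real site would use its own service list and proper probes rather than bare TCP connects.

# Alert only if every replica of a duplicated critical service is down.
# Hostnames below are invented; 2119 and 2170 are the usual gatekeeper
# and BDII ports.
import socket

SERVICES = {
    "ce":   [("ce1.example.ac.uk", 2119), ("ce2.example.ac.uk", 2119)],
    "bdii": [("bdii1.example.ac.uk", 2170), ("bdii2.example.ac.uk", 2170)],
}

def reachable(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, replicas in SERVICES.items():
    if not any(reachable(h, p) for h, p in replicas):
        print(f"ALERT: no working replica of {name}")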

19 Software Failures
The main cause of loss of availability is software failure:
–Misconfiguration
–Fragility of the gLite middleware
–OS problems
Disks filling up (see the check sketched after this slide)
Service failures (e.g. ntp)
Good communications can help solve problems quickly:
–Mailing lists, wikis, blogs, meetings
–Good monitoring and alerting (Nagios etc.)
–Learn from mistakes; update systems and procedures to prevent recurrence
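A sketch of the "disks filling up" check written as a Nagios-style plugin (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL); the monitored path and thresholds are assumptions for illustration.

# Minimal Nagios-style disk usage check; the 0/1/2 exit codes follow the
# standard plugin convention. Path and thresholds are illustrative.
import shutil
import sys

PATH, WARN, CRIT = "/var/spool", 80.0, 95.0   # percent used

total, used, _free = shutil.disk_usage(PATH)
pct = 100.0 * used / total

if pct >= CRIT:
    print(f"CRITICAL - {PATH} {pct:.1f}% full")
    sys.exit(2)
if pct >= WARN:
    print(f"WARNING - {PATH} {pct:.1f}% full")
    sys.exit(1)
print(f"OK - {PATH} {pct:.1f}% full")
sys.exit(0)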

20 Recent example
Many SAM failures, occasional passes
All test jobs pass
Almost all ATLAS jobs pass
Error logs revealed messages about the proxy not being valid yet! (see the sketch after this slide)
ntp on the SE head node had stopped AND cfengine had been switched off on that node (so no automatic check and restart)
A SAM test always gets a new proxy, and if it got through the WMS and onto our cluster into a reserved express queue slot within 4 minutes it would fail
In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem
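The underlying failure was clock skew: a fresh proxy carries a start time stamped by the issuing host, and a node whose clock has drifted behind (because ntp had stopped) sees that proxy as not yet valid. A minimal sketch of the logic, with made-up timestamps.

# Why a stopped ntp daemon makes brand-new proxies look "not valid yet":
# the SE head node's clock lags the proxy's start time.
from datetime import datetime, timedelta

issuer_now = datetime(2009, 3, 1, 12, 0, 0)     # clock on the issuing host
proxy_not_before = issuer_now                    # a fresh proxy starts "now"
node_now = issuer_now - timedelta(minutes=5)     # head node clock drifted behind

if node_now < proxy_not_before:
    print("proxy not yet valid on this node (clock skew)")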

21 Conclusions
These systems are extremely complex
Automatic configuration and good monitoring can help, but systems need careful tending
Sites should adopt best practice and learn from others
We are improving, but it's an ongoing task

