UKI-SouthGrid Overview and Oxford Status Report
Pete Gronbech, SouthGrid Technical Coordinator
GridPP 24 – RHUL, 15th April 2010
UK Tier 2 reported CPU – Historical View to present
SouthGrid Sites Accounting as reported by APEL
– Sites upgrading to SL5 and recalibration of published SI2K values
– RALPP seems low, even after my compensation for publishing 1000 instead of 2500 SI2K
Site Resources: HEPSPEC06, CPU (kSI2K, converted from the HEPSPEC06 benchmarks) and storage (TB) for EDFA-JET, Birmingham, Bristol, Cambridge, Oxford and RALPPD, with totals (table of per-site values).
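For reference, a worked example of the conversion behind the kSI2K column, assuming the standard WLCG factor of 4 HS06 per kSI2K:

    kSI2K = HS06 / 4    (e.g. JET's 1772 HS06 / 4 ≈ 443 kSI2K)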
GridPP3 hardware-generated MoU for 2012: TB, HS06 and kSI2K per site (Birmingham, Bristol, Cambridge, Oxford, RALPPD, EDFA-JET), with totals quoted both with and without JET (table of per-site values).
JET
– Stable operation (SL5 WNs); quiet period over Christmas
– Could handle more opportunistic LHC work
– 1772 HS06, 1.5 TB
Birmingham
– WNs have been SL5 since Christmas
– Have an ARGUS server (no SCAS)
– glexec_wn installed on the local worker nodes (not the shared cluster), so we *should* be able to run multi-user pilot jobs, but this has yet to be tested (see the sketch below)
– Also have a CREAM CE in production, but it does not use the ARGUS server yet
– Some problems on the HPC: GPFS has been unreliable
– Recent increases in CPU to 3344 HS06 and disk to 114 TB
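As a rough illustration of the untested multi-user pilot step, a minimal glexec smoke test that could be run from a pilot account on one of the glexec-enabled WNs; the proxy paths are hypothetical, and GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY are the standard environment variables glexec reads to decide which user to switch to:

    # run as the pilot account on a glexec-enabled WN (proxy paths are illustrative)
    export X509_USER_PROXY=/tmp/pilot_proxy        # pilot's own credential, used for authorisation
    export GLEXEC_CLIENT_CERT=/tmp/payload_proxy   # credential of the payload user
    export GLEXEC_SOURCE_PROXY=/tmp/payload_proxy  # proxy to copy into the payload account
    /opt/glite/sbin/glexec /usr/bin/id             # should report the mapped payload (pool) account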
Bristol
– SL4 WNs & ce01 retired; VMs ce03 and ce04 brought online, giving 132 SL5 job slots
– Metson cluster also being used to evaluate Hadoop
– More than doubled in CPU size, to 1836 HS06; disk at 110 TB
– StoRM will be upgraded to 1.5 once the SL5 version is stable
Cambridge
– 32 new CPU cores added, bringing the total up to 1772 HS06
– 40 TB of disk storage recently added, bringing the total to 140 TB
– All WNs upgraded to SL5; now investigating glexec
– ATLAS production making good use of this site
RALPP
– Largest SouthGrid site
– APEL accounting discrepancy now seems to be sorted: a very hot GGUS ticket resulted in a Savannah bug and some corrections, and the accounting should now have been corrected
– Air-conditioning woes caused a load of emergency downtime; more downtimes are expected (due to further a/c, power and BMS issues) but these will be more tightly planned and managed
– Currently running on <50% CPU due to a/c issues, some suspect disks on WNs, and some WNs awaiting rehoming in another machine room
– Memory upgrades for 40 WNs, giving either 320 job slots at 4 GB/slot or a smaller number of slots with higher memory; these are primarily intended for ATLAS simulation work
– A (fairly) modest amount of extra CPU and disk, purchased at the end of the year, coming online soonish
Oxford
– AC failure on 23rd December, at the worst possible time
  – System more robust now
  – Better monitoring
  – Auto-shutdown scripts (based on IPMI system temperature monitoring) – see the sketch below
– Following the SL5 upgrade in Autumn 09 the cluster is running very well
– A faulty network switch had been causing various timeouts; now replaced
– DPM reorganisation completed quickly once the network was fixed
– ATLAS Squid server installed
– Preparing a tender to purchase h/w with the 2nd tranche of GridPP3 money
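A minimal sketch of the kind of auto-shutdown check mentioned above, assuming ipmitool is available on the node; the sensor name ("Ambient") and the 40 C threshold are illustrative, not Oxford's actual values:

    #!/bin/bash
    # run every few minutes from cron; halt the node if the ambient/inlet
    # temperature reported by IPMI exceeds a safe threshold
    THRESHOLD=40
    TEMP=$(ipmitool sdr type Temperature | awk -F'|' '/Ambient/ {print $5}' | awk '{print $1}' | head -1)
    if [ -n "$TEMP" ] && [ "$TEMP" -gt "$THRESHOLD" ]; then
        logger "Ambient temperature ${TEMP}C above ${THRESHOLD}C - shutting down"
        /sbin/shutdown -h now
    fi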
Grid Cluster setup – SL5 worker nodes (Oxford): t2ce04 and t2ce05 LCG CEs with the t2torque02 batch server in front of worker nodes t2wn40–t2wn85, all gLite 3.2 on SL5 (diagram).
Grid Cluster setup – CREAM CE & pilot setup (Oxford): t2ce02 and t2ce06 CREAM CEs (gLite 3.2, SL5); t2wn41 is glexec-enabled and authorises against t2scas01 (diagram).
Grid Cluster setup – NGS integration setup (Oxford): ngsce-test.oerc.ox.ac.uk fronting the ngs.oerc.ox.ac.uk cluster (wn40–wn8x) (diagram).
– ngsce-test is a virtual machine with the gLite CE software installed
– The gLite WN software is installed via a tarball in an NFS shared area visible to all the WNs
– PBSpro logs are rsync'ed to ngsce-test to allow the APEL accounting to match which PBS jobs were grid jobs (see the sketch below)
– Contributed 1.2% of Oxford's total work during Q1
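A minimal sketch of the log-shipping step described above, assuming the PBSpro accounting records live in the usual server_priv/accounting directory and that ngsce-test accepts rsync over ssh (both the source path and the destination directory are assumptions):

    # run nightly from cron on the PBSpro server: ship accounting logs to the gLite CE
    # so the APEL parser can match grid jobs against the PBS job records
    rsync -az /var/spool/pbs/server_priv/accounting/ \
          ngsce-test.oerc.ox.ac.uk:/var/spool/pbs/server_priv/accounting/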
Operations Dashboard used by ROD is now Nagios based
gridppnagios
– Oxford runs the UKI regional Nagios monitoring site
– The Operations Dashboard will take information from this in due course
Oxford Tier-2 Cluster – Jan 2009, located at Begbroke; tendering for upgrade (photos)
– Original cluster (installed April …): decommissioned January 2009, saving approx. 6.6 kW
– November 2008 upgrade
– 26 servers = 208 job slots, 60 TB disk
– 22 servers = 176 job slots, 100 TB disk storage
Grid Cluster network setup (Oxford) (diagram)
– 3com 5500 switches with backplane stacking cables, 96 Gbps full duplex
– t2se0n 20 TB disk pool nodes attached via dual channel-bonded 1 Gbps links
– Worker nodes on the same 3com 5500 stack
– 10 Gigabit too expensive, so will maintain the ~1 Gbit per 10 TB ratio with channel bonding in the new tender (see the sketch below)
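For the dual channel-bonded 1 Gbps links, a minimal sketch of the runtime steps on an SL5-era storage node; interface names, the IP address and the 802.3ad (LACP) mode are illustrative and depend on what the 3com stack is configured for (the persistent version would live in the ifcfg-bond0/eth* files):

    # load the bonding driver (mode and link-monitoring interval are illustrative)
    modprobe bonding mode=802.3ad miimon=100
    # bring up the bond and enslave the two gigabit NICs
    ifconfig bond0 10.1.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1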
Production Mode – sites have to be more vigilant than ever:
– Closer monitoring
– Faster response to problems
– Proactive attitude to fixing problems before GGUS tickets arrive
– Closer interaction with the main experimental users
– Use the monitoring tools available:
ATLAS Monitoring
PBSWEBMON
Ganglia
Command Line
– showq | more
– pbsnodes -l
– qstat -an
– on the t2wns: df -hl
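The per-WN disk check can be wrapped in a short loop; a sketch assuming passwordless ssh from the head node and the t2wn40–t2wn85 naming shown earlier:

    # report local filesystem usage on each worker node
    for n in $(seq 40 85); do
        echo "== t2wn${n} =="
        ssh -o ConnectTimeout=5 t2wn${n} df -hl
    done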
Local Campus Network Monitoring
Gridmon
Patch levels – Pakiti v1 vs v2
Monitoring tools etc.

Site   | Pakiti                  | Ganglia | Pbswebmon | SCAS, glexec, ARGUS
JET    | No                      | Yes     | No        |
Bham   | No                      | Yes     | No        | No, yes, yes
Brist  | Yes, v1                 | Yes     | No        |
Cam    | No                      | Yes     | No        |
Ox     | v1 production, v2 test  | Yes     | Yes       | Yes, yes, no
RALPP  | v1                      | Yes     | No        | No (but started on SCAS)
Conclusions
– SouthGrid sites' utilisation is improving
– Many have had recent upgrades; others are putting out tenders
– Will be purchasing new hardware in the GridPP3 second tranche
– Monitoring for production running is improving