
1 UKI-SouthGrid Overview and Oxford Status Report. Pete Gronbech, SouthGrid Technical Coordinator. GridPP 24 - RHUL, 15th April 2010

2 UK Tier 2 reported CPU – Historical View to present

3 SouthGrid Sites Accounting as reported by APEL
Sites have been upgrading to SL5 and recalibrating their published SI2K values. RALPP seems low, even after my compensating for the site publishing 1000 instead of 2500.

4 Site Resources

Site         HEPSPEC06   CPU (kSI2K)*   Storage (TB)
EDFA-JET          1772         442            1.5
Birmingham        3344         836          114
Bristol           1836         459          110
Cambridge         1772         443          140
Oxford            3332         833          160
RALPPD           12928        3232          633
Totals           24984        6246         1158.5

*kSI2K converted from HEPSPEC06 benchmarks
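The kSI2K column is consistent with the usual WLCG conversion of 4 HEP-SPEC06 per kSI2K (an inference from the figures, not stated on the slide): for example, RALPPD's 12928 HS06 / 4 = 3232 kSI2K.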

5 GridPP3 h/w generated MoU for 2012

Site         TB (+JET)   HS06 (+JET)   KSI2K (+JET)
Birmingham        124          2724           681
Bristol            35          1429           357
Cambridge         174          1738           434
Oxford            328          2974           744
RALPPD            583         16515          4129
EDFA-JET            0             0             0
Totals           1244         25380          6345

6 JET
Stable operation (SL5 WNs); a quiet period over Christmas. Could handle more opportunistic LHC work. 1772 HS06, 1.5 TB.

7 Birmingham
WNs have been on SL5 since Christmas. We have an ARGUS server (no SCAS). glexec_wn is installed on the local worker nodes (not the shared cluster), so we *should* be able to run multiuser pilot jobs, but this has yet to be tested (see the sketch below). We also have a CREAM CE in production, but it does not use the ARGUS server yet. Some problems on the HPC: GPFS has been unreliable. Recent increases in CPU to 3344 HS06 and disk to 114 TB.
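A quick way to exercise that untested pilot path would be a glexec smoke test run as the pilot account on one of the glexec_wn nodes. This is a minimal sketch assuming the standard gLite glexec environment variables; the proxy path is hypothetical:

    # Run as the pilot account on a worker node with glexec_wn installed.
    export GLEXEC_CLIENT_CERT=/tmp/payload_proxy.pem   # payload user's proxy (hypothetical path)
    export GLEXEC_SOURCE_PROXY=/tmp/payload_proxy.pem  # proxy handed over to the payload account
    /opt/glite/sbin/glexec /usr/bin/id
    # Success: id reports the mapped pool account, not the pilot account.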

8 Bristol
SL4 WNs and ce01 retired; VMs ce03 and ce04 brought online, giving 132 SL5 job slots. The Metson cluster is also being used to evaluate Hadoop. CPU capacity has more than doubled to 1836 HS06; disk stands at 110 TB. StoRM will be upgraded to 1.5 once the SL5 version is stable.

9 Cambridge
32 new CPU cores added, bringing the total to 1772 HS06. 40 TB of disk storage recently added, bringing the total to 140 TB. All WNs upgraded to SL5; now investigating glexec. Atlas production is making good use of the site.

10 RALPP
Largest SouthGrid site. The APEL accounting discrepancy now seems to be sorted: a very hot GGUS ticket resulted in a Savannah bug, some corrections were made, and the accounting should now be correct. Air conditioning woes caused a lot of emergency downtime; we're expecting some more downtimes (due to further a/c, power and BMS issues), but these will be more tightly planned and managed. Currently running at <50% CPU due to a/c issues, some suspect disks on WNs, and some WNs awaiting rehoming in the other machine room. Memory upgrades for 40 WNs will give either 320 job slots @ 4 GB/slot or a smaller number of slots with more memory; these are primarily intended for Atlas simulation work. A (fairly) modest amount of extra CPU and disk purchased at the end of the year is coming online soonish.

11 Oxford
A/C failure on 23rd December, at the worst possible time.
– System is more robust now
– Better monitoring
– Auto shutdown scripts (based on IPMI system temperature monitoring; sketched below)
Following the SL5 upgrade in autumn 09 the cluster has been running very well. A faulty network switch had been causing various timeouts; it has been replaced. The DPM reorganisation completed quickly once the network was fixed. An Atlas Squid server has been installed. Preparing a tender to purchase h/w with the 2nd tranche of GridPP3 money.
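Such an auto-shutdown check might look like the following, run periodically from root's cron on each node. This is an illustrative sketch: the sensor name "Ambient" and the 40C threshold are assumptions, not Oxford's actual values.

    #!/bin/bash
    # Temperature-triggered emergency shutdown (illustrative sketch).
    THRESHOLD=40
    TEMP=$(ipmitool sdr type temperature | awk -F'|' '/Ambient/ {print $5}' \
           | grep -o '[0-9]\+' | head -1)
    if [ -n "$TEMP" ] && [ "$TEMP" -ge "$THRESHOLD" ]; then
        logger "Room temperature ${TEMP}C >= ${THRESHOLD}C - shutting down"
        /sbin/shutdown -h now
    fi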

12 Grid Cluster setup – SL5 worker nodes (Oxford)
LCG CEs t2ce04 and t2ce05 submit through the t2torque02 batch server to worker nodes t2wn40–t2wn85, all running gLite 3.2 on SL5.

13 Grid Cluster setup – CREAM CE & pilot setup (Oxford)
t2ce02 (CREAM, gLite 3.2 on SL5) fronts the glexec-enabled t2wn41, which authorises against the t2scas01 SCAS server; t2ce06 (CREAM, gLite 3.2 on SL5) fronts worker nodes t2wn40–87.

14 Grid Cluster setup – NGS integration (Oxford)
ngsce-test.oerc.ox.ac.uk is a virtual machine with the gLite CE software installed, sitting alongside ngs.oerc.ox.ac.uk in front of worker nodes wn40–wn8x. The gLite WN software is installed via a tarball in an NFS-shared area visible to all the WNs. PBSpro logs are rsync'ed to ngsce-test so that APEL accounting can match which PBS jobs were grid jobs (see the sketch below). NGS work contributed 1.2% of Oxford's total during Q1.
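The log-shipping step could be as simple as an hourly rsync from the PBSpro server; this is a sketch, and the accounting-log path shown is an assumed PBSpro default rather than the actual Oxford layout:

    # Run hourly from cron on the PBSpro server; assumes passwordless ssh
    # keys for the account doing the copy. Paths are illustrative defaults.
    rsync -a /var/spool/pbs/server_priv/accounting/ \
          ngsce-test.oerc.ox.ac.uk:/var/spool/pbs/server_priv/accounting/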

15 The Operations Dashboard used by ROD is now Nagios-based.

16 gridppnagios
Oxford runs the UKI Regional Nagios monitoring site; the Operations dashboard will take information from this in due course.
https://gridppnagios.physics.ox.ac.uk/nagios/
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo

17 Oxford Tier-2 Cluster – Jan 2009
Located at Begbroke; tendering for an upgrade. The original kit, installed April 04, was decommissioned in January 2009, saving approx 6.6 kW. The 17th November 2008 upgrade provided 26 servers = 208 job slots with 60 TB disk, alongside 22 servers = 176 job slots with 100 TB disk storage.

18 Grid Cluster Network setup (Oxford)
3Com 5500 switches are joined by backplane stacking cables (96 Gbps full duplex), with the worker nodes and the t2se0n 20 TB disk pool nodes hanging off the stack; each storage node is attached by dual channel-bonded 1 Gbps links. 10 gigabit is too expensive, so the new tender will maintain a ~1 gigabit per 10 TB ratio using channel bonding (a configuration sketch follows).
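For reference, dual-link channel bonding on a Scientific Linux storage node is normally configured along the following lines. The interface names, address, and bonding mode are illustrative assumptions (802.3ad also needs link aggregation configured on the switch side):

    # /etc/sysconfig/network-scripts/ifcfg-bond0  (values illustrative)
    DEVICE=bond0
    IPADDR=192.168.10.20
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=yes
    BONDING_OPTS="mode=802.3ad miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0  (and the same for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes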

19 Production Mode
Sites have to be more vigilant than ever:
– Closer monitoring
– Faster response to problems
– Proactive attitude to fixing problems before GGUS tickets arrive
– Closer interaction with the main experimental users
Use the monitoring tools available:

20 Atlas Monitoring

21 PBSWEBMON

22 Ganglia

23 Command Line
showq | more     # Maui/Moab scheduler queue overview
pbsnodes -l      # list worker nodes that are down or offline
qstat -an        # detailed Torque/PBS job listing with allocated nodes
ont2wns df -hl   # site-local helper (appears to run df -hl across the T2 WNs)

24 Local Campus Network Monitoring

25 Gridmon

26 Patch levels – Pakiti v1 vs v2

27 Monitoring tools etc.

Site    Pakiti                   Ganglia   Pbswebmon   SCAS, glexec, ARGUS
JET     No                       Yes       No
Bham    No                       Yes       No          No, yes, yes
Brist   Yes, v1                  Yes       No
Cam     No                       Yes       No
Ox      V1 production, v2 test   Yes       Yes         Yes, yes, no
RALPP   V1                       Yes       No          No (but started on SCAS)

28 Conclusions
– SouthGrid sites' utilisation is improving
– Many sites have had recent upgrades; others are putting out tenders
– New hardware will be purchased with the second tranche of GridPP3 money
– Monitoring for production running is improving

