1
UKI-SouthGrid Overview and Oxford Status Report
Pete Gronbech, SouthGrid Technical Coordinator
GridPP 24 - RHUL, 15th April 2010
2
UK Tier 2 reported CPU – Historical View to present
3
SouthGrid Sites Accounting as reported by APEL
Sites upgrading to SL5 and recalibration of published SI2K values
RALPP seems low, even after my compensation for publishing 1000 instead of 2500
4
Site Resources

Site          HEPSPEC06   CPU (kSI2K, converted from HEPSPEC06)   Storage (TB)
EDFA-JET           1772     442                                       1.5
Birmingham         3344     836                                     114
Bristol            1836     459                                     110
Cambridge          1772     443                                     140
Oxford             3332     833                                     160
RALPPD            12928    3232                                     633
Totals            24984    6246                                    1158.5
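As the numbers in the table imply, the kSI2K column is derived from the HEPSPEC06 benchmarks with a factor of 4 HS06 per kSI2K. A minimal shell sketch of that conversion; the script and the list of inputs are purely illustrative:

    #!/bin/bash
    # Convert HEPSPEC06 benchmark figures to published kSI2K values,
    # using the factor of 4 HS06 per kSI2K implied by the table above.
    for hs06 in 1772 3344 1836 1772 3332 12928; do
        printf "%6s HS06 -> %8s kSI2K\n" "$hs06" "$(echo "scale=1; $hs06 / 4" | bc)"
    done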
5
GridPP3 h/w generated MoU for 2012

Site          TB+JET   HS06+JET   kSI2K+JET
Birmingham       124       2724        681
Bristol           35       1429        357
Cambridge        174       1738        434
Oxford           328       2974        744
RALPPD           583      16515       4129
EDFA-JET           0          0          0
Totals          1244      25380       6345
6
JET
Stable operation (SL5 WNs); quiet period over Christmas
Could handle more opportunistic LHC work
1772 HS06, 1.5 TB
7
Birmingham
WNs have been SL5 since Christmas.
Have an ARGUS server (no SCAS). glexec_wn installed on the local worker nodes (not the shared cluster), so we *should* be able to run multi-user pilot jobs, but this has yet to be tested.
We also have a CREAM CE in production, but this does not use the ARGUS server yet.
Some problems on the HPC: GPFS has been unreliable.
Recent increases in CPU to 3344 HS06 and disk to 114 TB.
8
Bristol
SL4 WNs & ce01 retired; VMs ce03 and ce04 brought online, giving 132 SL5 job slots
Metson Cluster also being used to evaluate Hadoop
More than doubled in CPU size to 1836 HS06; disk at 110 TB
StoRM will be upgraded to 1.5 once the SL5 version is stable
9
Cambridge
32 new CPU cores added to bring the total up to 1772 HS06
40 TB of disk storage recently added, bringing the total to 140 TB
All WNs upgraded to SL5; now investigating glexec
Atlas production making good use of this site
10
RALPP
Largest SouthGrid site
APEL accounting discrepancy now seems to be sorted. There was a very hot GGUS ticket which resulted in a Savannah bug; some corrections have been made and the accounting should now be corrected.
Air-conditioning woes caused a load of emergency downtime. We're expecting some more downtimes (due to further a/c, power and BMS issues) but these will be more tightly planned and managed.
Currently running on <50% CPU due to a/c issues, some suspect disks on WNs, and some WNs awaiting rehoming in the other machine room.
Memory upgrades for 40 WNs, so we will have either 320 job slots @ 4 GB/slot or a smaller number of slots with higher memory. These are primarily intended for Atlas simulation work.
A (fairly) modest amount of extra CPU and disk purchased at the end of the year, coming online soonish.
11
Oxford
AC failure on 23rd December, at the worst possible time
–System more robust now
–Better monitoring
–Auto shutdown scripts (based on IPMI system temperature monitoring) – see the sketch below
Following the SL5 upgrade in Autumn 09 the cluster is running very well.
A faulty network switch had been causing various timeouts; it has been replaced.
DPM reorganisation completed quickly once the network was fixed.
Atlas Squid server installed
Preparing tender to purchase h/w with the 2nd tranche of gridpp3 money
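A minimal sketch of the kind of auto-shutdown script mentioned above, assuming ipmitool is available on the node; the sensor name and trip temperature are illustrative assumptions, not the values actually used at Oxford:

    #!/bin/bash
    # Hypothetical auto-shutdown check, run from cron: read a system temperature
    # over IPMI and power the node off cleanly if the machine room is overheating.
    THRESHOLD=45   # degrees C - assumed trip point
    TEMP=$(ipmitool sdr get "Ambient Temp" | awk -F': *' '/Sensor Reading/ {print int($2)}')
    if [ -n "$TEMP" ] && [ "$TEMP" -ge "$THRESHOLD" ]; then
        logger "IPMI temperature ${TEMP}C >= ${THRESHOLD}C, shutting down"
        /sbin/shutdown -h now
    fi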
12
Oxford Grid Cluster setup – SL5 Worker Nodes
[Diagram: gLite 3.2 SL5 LCG-CEs t2ce04 and t2ce05, with batch server t2torque02, serving worker nodes t2wn40–t2wn85]
13
Oxford Grid Cluster setup – CREAM CE & pilot setup
[Diagram: gLite 3.2 SL5 CREAM CEs t2ce02 and t2ce06; worker nodes t2wn40–87, with t2wn41 glexec-enabled and calling out to the SCAS server t2scas01]
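As a rough illustration of what "glexec-enabled" means in the diagram, a smoke test of this shape could be run on t2wn41; the proxy path is an assumption and the glexec location is only what a typical gLite 3.2 install would suggest:

    #!/bin/bash
    # Hypothetical glexec smoke test: the pilot points glexec at the payload
    # user's proxy, glexec authorises it (via SCAS here, ARGUS elsewhere)
    # and runs the command under the mapped local account.
    export GLEXEC_CLIENT_CERT=/tmp/x509up_payload    # assumed proxy location
    export GLEXEC_SOURCE_PROXY=/tmp/x509up_payload
    /opt/glite/sbin/glexec /usr/bin/id

If the identity switch works, the id command reports the mapped pool account rather than the pilot account.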
14
Oxford Grid Cluster setup – NGS integration setup
[Diagram: ngsce-test.oerc.ox.ac.uk and ngs.oerc.ox.ac.uk front-ending worker nodes wn40–wn8x]
ngsce-test is a virtual machine which has the gLite CE software installed. The gLite WN software is installed via a tarball in an NFS-shared area visible to all the WNs. PBSpro logs are rsync'ed to ngsce-test to allow the APEL accounting to match which PBS jobs were grid jobs (see the sketch below). Contributed 1.2% of Oxford's total work during Q1.
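A minimal sketch of the log-shipping step described above; the accounting directory is the default PBSpro location, and the exact paths and schedule are assumptions:

    #!/bin/bash
    # Hypothetical cron job: mirror the PBSpro accounting logs to ngsce-test so
    # the APEL parser there can match grid jobs against local PBS job records.
    rsync -az /var/spool/pbs/server_priv/accounting/ \
        ngsce-test.oerc.ox.ac.uk:/var/spool/pbs/server_priv/accounting/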
15
Operations
Dashboard used by ROD is now Nagios-based
16
gridppnagios
Oxford runs the UKI Regional Nagios monitoring site. The Operations dashboard will take information from this in due course.
https://gridppnagios.physics.ox.ac.uk/nagios/
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
17
Oxford Tier-2 Cluster – Jan 2009, located at Begbroke. Tendering for upgrade.
Originally installed April 04; decommissioned January 2009, saving approx 6.6 kW
17th November 2008 upgrade: 26 servers = 208 job slots, 60 TB disk
22 servers = 176 job slots, 100 TB disk storage
18
Oxford Grid Cluster network setup
[Diagram: t2se0n 20 TB disk pool nodes and worker nodes attached to stacked 3com 5500 switches (backplane stacking cables, 96 Gbps full duplex), with dual channel-bonded 1 Gbps links to the storage nodes]
10 gigabit too expensive, so we will maintain a 1 gigabit per ~10 TB ratio with channel bonding in the new tender (see the bonding sketch below).
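A sketch of the dual-channel bonding mentioned above, in the RHEL/SL5 network-scripts style (these ifcfg files are shell variable assignments); the bonding mode, address and interface names are illustrative assumptions:

    # /etc/sysconfig/network-scripts/ifcfg-bond0   (on a t2se0n disk pool node)
    DEVICE=bond0
    IPADDR=10.1.0.21                          # assumed address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad miimon=100"    # assumed LACP mode against the 3com stack

    # /etc/sysconfig/network-scripts/ifcfg-eth0   (repeat for ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

With a setup of this shape the switch stack sees each pair of 1 Gbps links as a single ~2 Gbps trunk to a 20 TB pool node, keeping roughly the 1 gigabit per ~10 TB ratio mentioned above.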
19
Production Mode
Sites have to be more vigilant than ever:
–Closer monitoring
–Faster response to problems
–Proactive attitude to fixing problems before GGUS tickets arrive
–Closer interaction with the main experimental users
Use the monitoring tools available:
20
Atlas Monitoring
21
PBSWEBMON
22
Ganglia
23
Command Line
showq | more
pbsnodes -l
qstat -an
ont2wns df -hl
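The first three commands are standard Maui/Torque queries; ont2wns looks like a local wrapper that runs a command on every t2wn worker node. A minimal sketch of such a wrapper, assuming password-less ssh to the WNs and the t2wn40–87 naming from the earlier slides:

    #!/bin/bash
    # Hypothetical "ont2wns": run the given command on each t2wn worker node in turn.
    # Usage: ont2wns df -hl
    for n in $(seq 40 87); do
        host="t2wn${n}"
        echo "=== ${host} ==="
        ssh -o ConnectTimeout=5 "${host}" "$@"
    done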
24
Local Campus Network Monitoring
25
Gridmon
26
Patch levels – Pakiti v1 vs v2
27
Monitoring tools etc

Site    Pakiti                    Ganglia   Pbswebmon   SCAS, glexec, ARGUS
JET     No                        Yes       No          –
Bham    No                        Yes       No          No, yes, yes
Brist   Yes, v1                   Yes       No          –
Cam     No                        Yes       No          –
Ox      v1 production, v2 test    Yes       Yes         Yes, yes, no
RALPP   v1                        Yes       No          No (but started on SCAS)
28
Conclusions
SouthGrid sites' utilisation improving
Many had recent upgrades; others are putting out tenders
Will be purchasing new hardware in the gridpp3 second tranche
Monitoring for production running is improving