Edinburgh (ECDF) Update
Wahid Bhimji, on behalf of the ECDF Team
HepSysMan, 30th June 2011

Outline:
- Edinburgh Setup
- Hardware
- Last year running
- Current Issues

June-11 Hepsysman Wahid Bhimji - ECDF

Edinburgh Setup
Group computing:
- Managed centrally within physics dept.
- Desktops: SL5 (rest of physics moving to SL6)
- Shared storage ~ O(10) TB
- Now using ManageTier3SW / AtlasLocalROOTBase
  https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ManageTier3SW
  to provide gLite, ATLAS releases, ROOT, pathena and ganga. VERY easy.
Grid computing:
- ECDF: Edinburgh Compute and Data Facility

What is ECDF?
- Edinburgh Compute and Data Facility
- University-wide shared resource
- GridPP has a fairshare but runs much more due to quiet periods
- Cluster maintained by central systems team
- Griddy extras maintained by:
  - In Physics:
    - Wahid Bhimji - Storage Support (~0.2 FTE)
    - Andrew Washbrook - Middleware Support (~0.5++)
  - In IS Dept.:
    - ECDF Systems Team (~0.3)
    - Steve Thorn - Middleware Support (~0.1)

ECDF Resources: Compute
Upgrades: Eddie Mk2
- Two upgrade phases: last June and last month
- Last deployment 8th - 16th June
  - Main GPFS filesystem offline for 8 days
  - Alternative running mode for GridPP middleware without GPFS dependencies enabled us to have exclusive use of the ECDF cluster during this
- Previous nodes now switched off to save power

           Previous   Phase 1 (last June)   Phase 2 (in now)
Nodes      246        128                   156
Cores      1456       1024                  1872
SpecInt    19515      27008                 42558

New compute nodes: IBM iDataplex DX360 M3
- CPU: 2 x Intel Westmere, E5620 quad-core (phase 1) or E5645 six-core (phase 2)
- 68 connected with QDR Infiniband
- Memory: 24GB DDR3
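
As a quick consistency check on the upgrade table, the core totals can be recomputed from the node counts and the CPU models listed for the new nodes. This is only a sketch: it assumes every node in a phase has two CPUs of the stated model, and the function names are illustrative.

```python
# Figures taken from the table on this slide.
# Assumption: each new node has 2 CPUs (quad-core in phase 1, six-core in phase 2).
SOCKETS_PER_NODE = 2

PHASES = {
    "Phase 1": {"nodes": 128, "cores_per_cpu": 4, "reported_cores": 1024, "specint": 27008},
    "Phase 2": {"nodes": 156, "cores_per_cpu": 6, "reported_cores": 1872, "specint": 42558},
}


def expected_cores(nodes, cores_per_cpu, sockets=SOCKETS_PER_NODE):
    """Total cores for a phase: nodes x sockets x cores per CPU."""
    return nodes * sockets * cores_per_cpu


def specint_per_core(specint, cores):
    """Average SpecInt rating per core for a phase."""
    return specint / cores


if __name__ == "__main__":
    for name, p in PHASES.items():
        cores = expected_cores(p["nodes"], p["cores_per_cpu"])
        status = "matches" if cores == p["reported_cores"] else "differs from"
        print(f"{name}: computed {cores} cores, {status} the table")
```

Both phases reproduce the core counts in the table (1024 and 1872), which supports the two-socket reading of the node specification.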

Network (shared)
- Cluster on own subnet
- WAN:
  - 10Gb uplink from Eddie to SRIF
  - 20Gb from SRIF@Bush to SRIF@KB
  - 4Gb from SRIF@KB to SRIF@AT
  - 1Gb from SRIF@AT to SPOP2: weakest link but dedicated; not saturating and could be upgraded
  - 10Gb from SPOP2 to SJ5
- Monitoring (SRIF@AT to SPOP2): http://skiddaw.ucs.ed.ac.uk/
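
The "weakest link" remark can be made concrete by taking the minimum capacity along the WAN path. The sketch below assumes the listed links form a single serial path from Eddie to SJ5; the hop labels are paraphrased from the slide.

```python
# Link capacities in Gb/s, in path order, as listed on the slide.
# Assumption: traffic crosses these hops in series.
PATH = [
    ("Eddie -> SRIF", 10),
    ("SRIF@Bush -> SRIF@KB", 20),
    ("SRIF@KB -> SRIF@AT", 4),
    ("SRIF@AT -> SPOP2", 1),
    ("SPOP2 -> SJ5", 10),
]


def bottleneck(path):
    """Return the (hop, capacity) pair with the smallest capacity."""
    return min(path, key=lambda hop: hop[1])


if __name__ == "__main__":
    hop, gbps = bottleneck(PATH)
    print(f"End-to-end capacity is bounded by {hop} at {gbps} Gb/s")
```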

ECDF Resources: special services
- HIGH MEM: 24-core, 4-socket Intel X7542 system with 512GB
- GPUs:
  - One Tesla S1070 box hosting four C1060 Tesla GPUs
  - One M2050 box hosting eight C2050 Tesla GPUs
- WHOLE NODE Q: UKI-SCOTGRID-ECDF_8CORE ATLAS queue enabled and receiving "wholenode" AthenaMP jobs for some time

ECDF Storage
Cluster storage:
- 176 TB, GPFS
- Some available to GridPP via StoRM (not used)
- Metadata held on SSDs, and some 15k drives for frequently accessed data
GridPP Bulk Storage:
- 3 * (Dell R610 + 3 * MD1200) =~ 200 TB
- DPM (1.8.0x)
- Test EMI 1.0 DPM (currently on separate server)

Last Year's Performance
Availability:
- Q4 2010: 100%
- Q1 2011: middleware machine room refurbishment
- Q2 2011: recent downtime for power work at main cluster site
Production:
- Now routinely run >1500 ATLAS prod jobs (if they choose to send them)
- Well over our designated ECDF share
- LHCb production jobs also running better in 2011
Analysis:
- Hammerclouds validated site
- Accepted into PD2P
- Running up to 500 analysis jobs

Previous year's issues
- GPFS grief (on software and home dirs)
  - All SW areas moved to NFS
  - SSD metadata seems to have also helped with spurious errors
  - Moved homedirs to NFS for GPFS outage this month (to provision new nodes); may not move back
- CA SAM test timeouts: don't happen anymore (maybe GPFS metadata)
- Older lcg-CE load goes mental from time to time
  - CMS jobs? globus-gatekeeper and forks?
  - Not problematic anymore; moving to CREAM anyway

Current issues
- Occasional ATLAS SW install nightmares
  - Have to back out base release and reinstall
  - Sometimes permissions are not group writable and the job is mapped to another sgmuser
  - Sometimes an "install db issue", not really explained
  - CVMFS is obviously an answer but involves persuading the systems team
- Orphaned processes: gcc -v (all VOs)
- Are there any instances of CVMFS causing WN performance problems/outages?
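
One way to catch the group-writable problem before an install job hits it is to scan the software area for entries missing the group-write bit. The following is a minimal sketch, not the site's actual procedure; the function name is illustrative and the path passed in is whatever the local software area happens to be.

```python
import os
import stat


def not_group_writable(root):
    """Return sorted paths under root (root included) lacking the group-write bit.

    Symbolic links are skipped, since their own mode bits are not meaningful.
    """
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue
            if not mode & stat.S_IWGRP:
                missing.append(path)
    return sorted(missing)
```

Running this against the software directory after each install would list the files and directories that a job mapped to a different sgmuser account could not modify.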

Current issues: CREAM
- Now have 2 CREAM CEs (one on a VM)
- Poor performance seen on the VM instance; issues observed are similar to Lancaster's
- JobDBAdminPurger.sh has to be run frequently to remove cancelled jobs
  http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB
  - Too many "leases" being created by the ATLAS pilot factories
  - bupdater_loop_interval set in our blah.config was too low
  - Recommended to reduce purge_interval parameter in blah.config
- BUpdaterSGE consumes a lot of CPU, caused by the large number of qstat operations through the sge_helper process (fix currently being developed)
- High memory usage from 50 blahpd processes causes the system to swap (see https://savannah.cern.ch/bugs/?75854)

CREAM: more issues
- glexec authZ errors observed for a limited number of grid users (including a SAM test user, which affects our perceived availability)
  - Aside: why is the ATLAS availability metric measured by WMS and delivery measured by panda jobs?
  - Debugging still ongoing, but assuming configuration error for now
- service gLite restart does not work
  - tomcat processes do not get killed in time; has to be done manually
  - Could force a 10 second sleep in the service restart script (suggested by Kashif)
- Q: Is it only us having these issues? CREAM tuning recommendations? Hardware?
- Need to resolve these before we can retire the lcg-CE (we want that as much as anyone)
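
The suggested sleep in the restart script amounts to a "stop, wait, kill if still alive, start" sequence. The sketch below shows that logic in isolation. It is not the gLite init script: the four callables are hypothetical hooks that a site would wire to its own stop, process-check, kill and start commands, and the 10 second default simply mirrors the suggestion on the slide.

```python
import time


def restart_with_grace(stop, is_running, force_kill, start,
                       grace_seconds=10, poll_interval=1.0, sleep=time.sleep):
    """Stop a service, wait up to grace_seconds for it to exit, then start it.

    If the service is still running after the grace period, force_kill() is
    called before start(). Returns True if a forced kill was needed.
    """
    stop()
    waited = 0.0
    while is_running() and waited < grace_seconds:
        sleep(poll_interval)
        waited += poll_interval

    forced = False
    if is_running():
        force_kill()
        forced = True

    start()
    return forced
```

Polling rather than sleeping for a fixed 10 seconds means a clean shutdown restarts as soon as the old processes are gone, while a stuck one is still killed before the new instance starts.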

Glexec
- Systems team could be sympathetic to installing something on the WNs for our use only
- But it cannot be a buggy, beta-version, suid exe
- Once glexec is stable and has no significant bugs, we can consider approaching them
  - Orphaned processes not acceptable
  - Relocatable install preferable
  - Presumably not going to be available by today….
- Provisioning ARGUS anyway in preparation

Conclusions
- ECDF running well and good VfM for GridPP
- Significant increases in compute resources
- Could do with some more disk (and switches)
- A few CREAM issues: working through them
- Glexec……