
Edinburgh (ECDF) Update


1 Edinburgh (ECDF) Update
Wahid Bhimji, on behalf of the ECDF team
HepSysMan, 30th June 2011

Outline: Edinburgh setup; hardware; last year's running; current issues.

2 Edinburgh Setup
Group computing:
- Managed centrally within the physics dept.
- Desktops: SL5 (the rest of physics is moving to SL6).
- Shared storage: ~ O(10) TB.
- Now using ManageTier3SW / AtlasLocalROOTBase to provide gLite, ATLAS releases, ROOT, pathena and ganga: VERY easy (a user-side sketch follows).
Grid computing:
- ECDF - the Edinburgh Compute and Data Facility.
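For flavour, this is roughly what the AtlasLocalROOTBase-style setup looks like from the user side; a minimal sketch, where the install path is a placeholder for the local one and the helper names follow the ATLASLocalRootBase conventions:

    # site install location: an assumption, not our actual path
    export ATLAS_LOCAL_ROOT_BASE=/opt/ATLASLocalRootBase
    alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
    setupATLAS
    localSetupROOT          # ROOT
    localSetupGanga         # ganga
    localSetupPandaClient   # pathena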

3 What is ECDF? Edinburgh Compute and Data Facility
- University-wide shared resource.
- GridPP have a fairshare but run much more thanks to quiet periods.
- Cluster maintained by the central systems team; the griddy extras are maintained by (effort split between Physics and the IS dept.):
  - Wahid Bhimji - storage support (~0.2 FTE)
  - Andrew Washbrook - middleware support (~0.5++)
  - ECDF systems team (~0.3)
  - Steve Thorn - middleware support (~0.1)

4 ECDF Resources – Compute Upgrades
Eddie Mk2:
- Two upgrade phases: last June and last month.
- Last deployment 8th-16th June: the main GPFS filesystem was offline for 8 days. An alternative running mode for the GridPP middleware, without GPFS dependencies, enabled us to have exclusive use of the ECDF cluster during this period.
- Previous nodes now switched off to save power.

               Previous   Phase 1 (last June)   Phase 2 (in now)
    Nodes      246        128                   156
    Cores      1456       1024                  1872
    SpecInt    19515      27008                 42558

New compute nodes:
- IBM iDataplex DX360 M3.
- CPU: 2 x Intel Westmere quad-core E5620 (phase 1), six-core E5645 (phase 2).
- 68 nodes connected with QDR Infiniband.
- Memory: 24GB DDR3.

5 Network
(Shared) cluster on its own subnet.
WAN:
- 10Gb uplink from Eddie to SRIF.
- Links step down through 20Gb and 4Gb to a 1Gb hop to SPOP2: the weakest link, but dedicated; not saturating, and it could be upgraded (a quick throughput-check sketch follows).
- 10Gb from SPOP2 to SJ5.
Monitoring: [traffic graph to SPOP2 not transcribed]
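Since the 1Gb hop is the flagged bottleneck, a quick end-to-end check that it really is not saturating can be done with iperf; a minimal sketch, where the hostname is hypothetical:

    # on a test host on the far side of the 1Gb hop
    iperf -s
    # from a host at the Eddie end: 30-second TCP throughput test
    iperf -c testhost.example.ac.uk -t 30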

6 ECDF Resources: special services
HIGH MEM:
- 24-core, 4-socket Intel X7542 system with 512GB.
GPUs:
- One Tesla s1070 box hosting four C1060 Tesla GPUs.
- One M2050 box hosting eight C2050 Tesla GPUs.
WHOLE NODE QUEUE:
- UKI-SCOTGRID-ECDF_8CORE ATLAS queue enabled and receiving "wholenode" AthenaMP jobs for some time (a batch-system sketch follows).
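For context, a whole-node queue on SGE (the batch system in use here, per the BUpdaterSGE discussion later) is typically built around a parallel environment that keeps all requested slots on one host; a minimal sketch, with the PE name and slot count being assumptions:

    # hypothetical parallel environment definition (edited via: qconf -ap wholenode)
    pe_name            wholenode
    slots              9999
    allocation_rule    $pe_slots     # all requested slots land on a single host
    control_slaves     FALSE
    job_is_first_task  TRUE

    # an 8-core AthenaMP-style job then requests the whole node with:
    qsub -pe wholenode 8 run_job.sh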

7 ECDF Storage
Cluster storage:
- 176 TB, GPFS.
- Some available to GridPP via StoRM (not used).
- Metadata held on SSDs, plus some 15k drives for frequently accessed data.
GridPP bulk storage:
- 3 * (Dell R… * MD1200) =~ 200 TB.
- DPM (1.8.0x) (a quick pool-inspection sketch follows).
- Test EMI 1.0 DPM (currently on a separate server).
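On the DPM side, a quick way to inspect pool layout and remaining capacity is the standard DPM query tools; a minimal sketch, where the namespace path is a site-specific assumption:

    # on the DPM head node: show pools, filesystems and used/free capacity
    dpm-qryconf
    # browse the namespace (the /dpm path shown is an assumption)
    dpns-ls -l /dpm/ed.ac.uk/home/atlas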

8 Last Year's Performance
Availability:
- Q4 2010: 100%.
- Q1 2011: hit by downtime for the middleware machine room refurbishment.
- Q2 2011: recent downtime for power work at the main cluster site.
Production:
- Now routinely run >1500 ATLAS production jobs (if they choose to send them), well over our designated ECDF share.
- LHCb production jobs also running better in 2011.
Analysis:
- Hammercloud-validated site; accepted into PD2P.
- Running up to 500 analysis jobs.

9 Previous year’s issues
GPFS grief (on software and home dirs):
- All SW areas moved to NFS (a sketch of such a mount follows).
- SSD metadata seems to have also helped with spurious errors.
- Moved home dirs to NFS for the GPFS outage this month (to provision new nodes); may not move back.
CA SAM test timeouts:
- Don't happen any more (maybe thanks to the GPFS metadata change).
Older lcg-CE load goes mental from time to time:
- CMS jobs? globus-gatekeeper and forks?
- Not problematic any more; moving to CREAM anyway.
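For illustration, an NFS software area on a worker node might be mounted like this; a minimal sketch in which the server, export path, mount point and options are all assumptions rather than our actual configuration:

    # /etc/fstab entry on a worker node (hypothetical names and options)
    nfshost.example.ac.uk:/export/gridsw  /exp/software  nfs  rw,hard,intr,nfsvers=3,rsize=32768,wsize=32768,noatime  0 0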

10 Current issues
Occasional ATLAS SW install nightmares:
- Have to back out the base release and reinstall.
- Sometimes permissions are not group-writable and the job is mapped to another sgm user.
- Sometimes an "install db issue", not really explained.
- CVMFS is obviously an answer, but involves persuading the systems team.
Orphaned processes: gcc -v, all VOs (a clean-up sketch follows).
Q: Are there any instances of CVMFS causing WN performance problems/outages?
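For the orphaned gcc -v processes, a minimal sketch of the sort of check one might cron on a worker node; the pool-account name pattern is an assumption:

    # list processes reparented to init (PPID 1) owned by grid pool accounts
    ps -eo pid,ppid,user,etime,comm --no-headers | awk '$2 == 1 && $3 ~ /^(atl|lhcb|ops)/ {print}'
    # once the output looks sane, the first column can be fed to xargs kill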

11 Current issues: CREAM
Now have 2 CREAM CEs (one on a VM). Poor performance seen on the VM instance; the issues observed are similar to Lancaster's:
- JobDBAdminPurger.sh has to be run frequently to remove cancelled jobs.
- Too many "leases" being created by the ATLAS pilot factories.
- bupdater_loop_interval set in our blah.config was too low.
- Recommended to reduce the purge_interval parameter in blah.config (a config sketch follows).
- BUpdaterSGE consumes a lot of CPU, caused by a large number of qstat operations through the sge_helper process (a fix is currently being developed).
- High memory usage for ~50 blahpd processes causes the system to swap (see ...).
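A minimal sketch of the relevant blah.config tuning (it is a shell-style key=value file; the values here are illustrative, not the ones we actually deployed):

    # poll the batch system less aggressively, so qstat is hit less often
    bupdater_loop_interval=60
    # purge finished/cancelled jobs from the BLAH job registry sooner
    purge_interval=1000000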

12 CREAM - more issues
- glexec authZ errors observed for a limited number of grid users (including a SAM test user, which affects our perceived availability).
  - Aside: why is the ATLAS availability metric measured by WMS while delivery is measured by panda jobs?
  - Debugging still ongoing, but assuming a configuration error for now.
- service gLite restart does not work: tomcat processes do not get killed in time, so it has to be done manually. Could force a 10-second sleep into the service restart script (suggested by Kashif; a sketch follows).
- Q: Is it only us having these issues? CREAM tuning recommendations? Hardware?
- Need to resolve these before we can retire the lcg-CE (we want that as much as anyone).
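A minimal sketch of that workaround: stop, give tomcat time to exit, force-kill any stragglers, then start. The wrapper itself is an assumption; service gLite and the tomcat main class are the standard gLite/SL5 ones:

    #!/bin/bash
    service gLite stop
    sleep 10
    # fallback: kill any tomcat instance that is still hanging around
    pkill -9 -f org.apache.catalina.startup.Bootstrap
    service gLite start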

13 Glexec
- The systems team could be sympathetic to installing something on the WNs for our use only, but it cannot be a buggy, beta-version, suid exe.
- Once glexec is stable and has no significant bugs we can consider approaching them:
  - Orphaned processes are not acceptable.
  - A relocatable install is preferable.
- Presumably not going to be available by today....
- Provisioning ARGUS anyway, in preparation.

14 Conclusions
- ECDF running well and good VfM (value for money) for GridPP.
- Significant increases in compute resources.
- Could do with some more disk (and switches).
- A few CREAM issues: working through them.
- Glexec......

