Edinburgh (ECDF) Update
Wahid Bhimji, on behalf of the ECDF Team
HepSysMan, 10th June 2010

– Edinburgh Setup
– Hardware upgrades
– Progress in last year
– Current Issues

Edinburgh Setup

Group computing:
– Managed centrally within the physics department
– ~30 desktops (SL5) + ~15 laptops
– New(ish) storage servers: ~20 TB shared
– gLite tarball UI + local ATLAS kits etc.
– Local Condor pool: >300 cores

Grid computing:
– ECDF – the Edinburgh Compute and Data Facility

What is ECDF?
– Edinburgh Compute and Data Facility: a university-wide shared resource
– ~7% (AER) GridPP fairshare use
– Cluster (and most middleware hardware) maintained by the central systems team
– Griddy extras maintained by (staff split between Physics and the IS Dept.):
  – Wahid Bhimji – Storage support (~0.2 FTE)
  – Andrew Washbrook – Middleware support (~0.8)
  – ECDF Systems Team (~0.3)
  – Steve Thorn – Middleware support (~0.1)

ECDF Total Resources - Compute

Current:
– 128 dual (x2) quad-core nodes

Future (Eddie Mk2):
– Two upgrade phases; old quad-cores retained
– Phase 1 acceptance tests now!

ECDF Storage - Current

Cluster storage: 160 TB, GPFS
– Previously not used by GridPP, except for home directories and software areas (for which it is not the best fit anyway)
– Now 10 TB available to GridPP via StoRM

Main GridPP storage: 30 TB
– "Standard" DPM + pool servers

Storage - Future

Cluster:
– Integrating with the existing GPFS setup
– IBM DS5100 storage platform, 15k RPM 300 GB FC drives
– Metadata on SSDs
– 4x IBM x3650 M3 servers, 48 GB RAM, 10GbE

GridPP bulk storage (arriving today):
– 3 * (Dell R * MD1200) =~ 150 TB
– Probably also GPFS, served through StoRM

General Status: Improvements since last year

Last year's talk (by Steve T) reported:
– "Problems: Staffing..." (Sam and Greig had left; Andy and I had just started that month)
– Middleware: many middleware nodes still on SL3 (1 of 2 CEs, MON, UI, sBDII)
– "GridPP share reduced (no more funding)" -> very few jobs running

Now

Staffing: Andy and I are now (somewhat) established.

Middleware services:
– New lcg-CE, StoRM SE, and SL5 MON and BDII in place
– CREAM CE: SGE compatibility being validated; will replace the older lcg-CE host
– Good reliability

ECDF utilisation:
– Guaranteed fairshare for four years (a fixed share, not based on usage)
– Responds well to demand: e.g. soaked up free cycles over Christmas to deliver ~half the cluster

So delivery improved

We're still small, but we get >100% "wall clock time" CPU utilisation: the fairshare of a big cluster lets us recover at busy times the under-utilisation of quiet ones.
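As a back-of-the-envelope illustration of how that figure can exceed 100% (not ECDF accounting: every number below except the ~7% share is made up), a short Python sketch:

```python
# Purely illustrative numbers: how a fairshare site can report >100%
# "wall clock" utilisation of its nominal share over a period.
CLUSTER_CORES = 1000          # hypothetical total cores in the shared cluster
SHARE = 0.07                  # ~7% GridPP fairshare (from the slides)
HOURS = 24 * 30               # one month

nominal_share_hours = CLUSTER_CORES * SHARE * HOURS   # what a 7% share "owns"
delivered_hours = 60000       # hypothetical: a busy month that soaked up idle cycles

utilisation = delivered_hours / float(nominal_share_hours)
print("Utilisation of nominal share: %.0f%%" % (100 * utilisation))
# Prints a value above 100% whenever busy periods reclaim the under-use of quiet ones.
```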

And…

SL5 migration successful:
– ECDF moved nodes slowly to SL5, starting July '09 and ending ~March this year
– GridPP switch to SL5 performed in October '09 – very smooth, but then some issues with package (and CMT) dependencies for LHCb and ATLAS

Non-LHC VO support:
– Providing support to integrate UKQCD software (DiGS) with SRM (tested on both StoRM and DPM)

ATLAS analysis testing:
– Series of HammerCloud tests completed in Jan '09 on the current DPM setup
– Site is analysis-ready, though slots/space are limited
– Expect to increase scope with the new storage

StoRM/GPFS

New StoRM 1.5 node:
– Currently mounts the existing cluster GPFS space over NFS (using a NAS cluster); the systems team don't want us to mount the whole shared GPFS filesystem
– WNs mount this GPFS space "normally"

Initial ACL issue:
– StoRM sets then immediately checks an ACL, so it (intermittently) failed due to NFS client attribute caching; mounting with the noac option "fixes" it (see the sketch below)

Validation tests completed: SAM tests / lcg-cp etc.
– Single ATLAS analysis jobs run well on GPFS (>90% CPU efficiency compared to ~70% for rfio)
– Planning HammerCloud tests for this setup, though ultimately we will be using the new storage servers
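As a quick sanity check for that workaround, a minimal Python sketch (the mount point path is hypothetical, not from the slides) that verifies the NFS mount backing the StoRM storage area carries the noac option:

```python
#!/usr/bin/env python
"""Check that the NFS mount backing the StoRM storage area is mounted with
'noac', so attribute caching cannot hide a freshly-set ACL from StoRM."""

MOUNT_POINT = "/exports/gridpp"   # hypothetical: wherever the GPFS-over-NFS share is mounted

def nfs_mount_options(mount_point, mounts_file="/proc/mounts"):
    """Return the option list for an NFS mount, or None if it is not found."""
    with open(mounts_file) as f:
        for line in f:
            device, mpoint, fstype, options = line.split()[:4]
            if mpoint == mount_point and fstype.startswith("nfs"):
                return options.split(",")
    return None

if __name__ == "__main__":
    opts = nfs_mount_options(MOUNT_POINT)
    if opts is None:
        print("%s is not an NFS mount" % MOUNT_POINT)
    elif "noac" in opts:
        print("OK: attribute caching disabled (noac) on %s" % MOUNT_POINT)
    else:
        print("WARNING: %s mounted without noac - StoRM ACL checks may fail" % MOUNT_POINT)
```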

Not all there yet – issues

GPFS grief (on software and home dirs) – ongoing, though an easy "fix":
– Shared resource, so limited ability to tune it for our uses
– LHCb SAM test recursively lists files in all previous versions of ROOT
– LHCb software install recursively chmods its many, many directories
– CMS jobs accessing multiple shared libraries in the SW area put strain on the WNs
– ATLAS SW area already moved to NFS – will need to move the others too

CA SAM test timeouts (going on forever):
– Timeouts while listing CAs after the RPM check
– GPFS? – but /etc/grid-security is now local on the WN and it works interactively

MON box didn't publish during the SL4/SL5 switchover (resolved): it couldn't deal with dashes in the queue name.

Clock skew in virtual instances causing authentication problems (resolved): VMware fix is in.

CE issues

Some issues come from our shared environment, e.g.:
– Can't have YAIM do whatever it wants to an SGE master
– Have to use the batch system config / queues that already exist… etc.

Current issues include:

Older lcg-CE: load goes mental from time to time. CMS jobs? globus-gatekeeper and its forks? Hopefully CREAM will be better.

New lcg-CE, WMS submission to the SGE batch system: jobs are terminated prematurely before the output is collected – even the ATLAS CE-sft-job SAM test fails (sometimes). No obvious pattern to the job failures observed.

CREAM and SGE: possible issues from the requirement that an SGE stagein script be put in the main SGE prolog and epilog scripts. We don't want this script to run for every job submitted to the batch system; looking at alternatives, e.g. conditional execution based on GridPP (a sketch follows below).
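One possible shape for that conditional execution: a hypothetical Python prolog wrapper. The stagein script path and the pool-account prefixes below are assumptions (not from the slides), and it is assumed the prolog runs as the job owner so grid jobs can be recognised by their account name:

```python
#!/usr/bin/env python
"""Hypothetical SGE prolog wrapper: run the CREAM stagein step only for grid
jobs, identified (as an assumption) by pool-account name prefixes; local
users' jobs skip the grid-specific step entirely."""

import getpass
import os
import subprocess
import sys

# Assumption: grid pool accounts at this site start with these prefixes.
GRID_ACCOUNT_PREFIXES = ("atlas", "lhcb", "cms", "dteam", "ops")

# Assumption: path to the stagein script shipped with the CREAM/SGE integration.
STAGEIN_SCRIPT = "/opt/glite/bin/sge_stagein.sh"

def is_grid_job(user):
    """Treat the job as a grid job if its owner looks like a pool account."""
    return user.startswith(GRID_ACCOUNT_PREFIXES)

def main():
    user = getpass.getuser()              # the prolog normally runs as the job owner
    if is_grid_job(user) and os.access(STAGEIN_SCRIPT, os.X_OK):
        # Grid job: run the stagein step, forwarding any arguments SGE passed us.
        rc = subprocess.call([STAGEIN_SCRIPT] + sys.argv[1:])
        sys.exit(rc)
    # Local job: nothing grid-specific to do.
    sys.exit(0)

if __name__ == "__main__":
    main()
```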

Conclusions
– Many improvements in middleware, reliability and delivery since we were here in '09
– New hardware available soon – significant increases in resource
– The shared service is working here: but it's not always easy