Edinburgh (ECDF) Update

Edinburgh (ECDF) Update
Wahid Bhimji, on behalf of the ECDF Team
HepSysMan, 30th June 2011

Outline: Edinburgh setup; hardware; last year's running; current issues

Edinburgh Setup
- Group computing (managed centrally within the physics dept.):
  - Desktops: SL5 (rest of physics moving to SL6)
  - Shared storage ~O(10) TB
  - Now using ManageTier3SW / AtlasLocalROOTBase (https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ManageTier3SW) to provide gLite, ATLAS releases, ROOT, pathena and ganga; VERY easy (a setup sketch follows below)
- Grid computing: ECDF, the Edinburgh Compute and Data Facility
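For reference, a minimal sketch of the kind of user environment ManageTier3SW / AtlasLocalROOTBase provides, run in an interactive shell; the install path is a hypothetical placeholder and the helper commands (setupATLAS, localSetupROOT, localSetupGanga, localSetupPandaClient) are the standard AtlasLocalRootBase wrappers rather than a record of the exact ECDF configuration.

```bash
# Minimal sketch, assuming a local ManageTier3SW install of AtlasLocalRootBase.
# The path below is a hypothetical placeholder; use the site's actual location.
export ATLAS_LOCAL_ROOT_BASE=/exports/work/atlas/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'

setupATLAS             # set up the Tier-3 environment wrappers
localSetupROOT         # a ROOT build provided through ManageTier3SW
localSetupGanga        # ganga for job submission
localSetupPandaClient  # pathena / prun clients
```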

What is ECDF?
- Edinburgh Compute and Data Facility: a university-wide shared resource
- GridPP has a fairshare but runs much more than that, thanks to quiet periods
- Cluster maintained by the central systems team; the "griddy" extras are maintained by:
  - Wahid Bhimji - storage support (~0.2 FTE) [in Physics]
  - Andrew Washbrook - middleware support (~0.5++ FTE) [in Physics]
  - ECDF systems team (~0.3 FTE) [in the IS dept.]
  - Steve Thorn - middleware support (~0.1 FTE) [in the IS dept.]

ECDF Resources – Compute Upgrades
- Eddie Mk2: two upgrade phases, last June and last month
- Last deployment 8th-16th June:
  - Main GPFS filesystem offline for 8 days
  - An alternative running mode for the GridPP middleware, without GPFS dependencies, gave us exclusive use of the ECDF cluster during this period
- Previous nodes now switched off to save power

            Previous   Phase 1 (last June)   Phase 2 (in now)
  Nodes     246        128                   156
  Cores     1456       1024                  1872
  SpecInt   19515      27008                 42558

- New compute nodes: IBM iDataPlex dx360 M3
  - CPU: 2 x Intel Westmere; E5620 quad-core (phase 1), E5645 six-core (phase 2)
  - 68 nodes connected with QDR InfiniBand
  - Memory: 24 GB DDR3

Network
- Shared cluster, on its own subnet
- WAN path:
  - 10 Gb uplink from Eddie to SRIF
  - 20 Gb SRIF@Bush to SRIF@KB
  - 4 Gb SRIF@KB to SRIF@AT
  - 1 Gb SRIF@AT to SPOP2 - the weakest link, but dedicated; not saturating and could be upgraded
  - 10 Gb SPOP2 to SJ5
- Monitoring (SRIF@AT to SPOP2): http://skiddaw.ucs.ed.ac.uk/

ECDF Resources: special services
- High memory: a 24-core, 4-socket Intel X7542 system with 512 GB of memory
- GPUs:
  - One Tesla S1070 box hosting four C1060 Tesla GPUs
  - One M2050 box hosting eight C2050 Tesla GPUs
- Whole-node queue: the UKI-SCOTGRID-ECDF_8CORE ATLAS queue has been enabled and receiving "wholenode" AthenaMP jobs for some time (a batch-system sketch follows below)
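For illustration, a minimal sketch of how a whole-node (8-slot) job could be requested under Grid Engine, which ECDF's batch system uses; the parallel environment name "wholenode" and the memory value are assumptions chosen to show the shape of the request, not the actual ECDF queue configuration.

```bash
#!/bin/bash
# Sketch of a whole-node request under (Sun) Grid Engine.
# The PE name "wholenode" is hypothetical; a site defines its own PE that
# hands out all 8 slots of a single host to one job.
#$ -pe wholenode 8     # request 8 slots on one node (PE name is an assumption)
#$ -l h_vmem=3G        # per-slot memory limit, illustrative value only
#$ -cwd

# An AthenaMP-style payload would then fork workers across the 8 cores.
echo "Running on $(hostname) with ${NSLOTS} slots"
```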

ECDF Storage
- Cluster storage: 176 TB, GPFS
  - Some available to GridPP via StoRM (not used)
  - Metadata held on SSDs, with some 15k drives for frequently accessed data
- GridPP bulk storage: 3 x (Dell R610 + 3 x MD1200) ≈ 200 TB
  - DPM (1.8.0x)
  - Test EMI 1.0 DPM (currently on a separate server); a quick health-check sketch follows below
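As an aside, a minimal sketch of the kind of quick checks one can run against a DPM head node of this vintage; dpm-qryconf and dpns-ls are the standard DPM/DPNS client tools, while the host name and namespace path shown are illustrative assumptions rather than the ECDF values.

```bash
# Minimal sketch, run against the DPM head node; host and path are assumptions.
export DPM_HOST=dpm.example.ac.uk
export DPNS_HOST=dpm.example.ac.uk

dpm-qryconf                                 # list pools, filesystems, used/free space
dpns-ls -l /dpm/example.ac.uk/home/atlas    # browse the namespace for one VO
```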

Last Year's Performance
- Availability:
  - Q4 2010: 100%
  - Q1 2011: middleware machine-room refurbishment
  - Q2 2011: recent downtime for power work at the main cluster site
- Production:
  - Now routinely run >1500 ATLAS production jobs (if they choose to send them), well over our designated ECDF share
  - LHCb production jobs also running better in 2011
- Analysis:
  - HammerCloud-validated site; accepted into PD2P
  - Running up to 500 analysis jobs

Previous year's issues
- GPFS grief (on software and home directories):
  - All SW areas moved to NFS
  - SSD metadata also seems to have helped with spurious errors
  - Moved home directories to NFS for this month's GPFS outage (to provision new nodes); may not move them back
- CA SAM test timeouts: no longer happening (maybe thanks to the GPFS metadata changes)
- Older lcg-CE load goes mental from time to time: CMS jobs? globus-gatekeeper and forks? Not problematic any more; moving to CREAM anyway

Current issues
- Occasional ATLAS SW install nightmares; have to back out the base release and reinstall:
  - Sometimes permissions are not group-writable and the job is mapped to a different sgm user
  - Sometimes an "install db issue" that is never really explained
  - CVMFS is obviously an answer, but involves persuading the systems team
- Orphaned processes (gcc -v), from all VOs; a sketch for spotting these is given below
- Are there any instances of CVMFS causing WN performance problems/outages?
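A minimal sketch of the kind of check a site might run on a worker node to spot orphaned payload processes, i.e. processes reparented to init (PID 1); the pool-account username pattern is an illustrative assumption, not the ECDF mapping scheme.

```bash
# List long-lived processes on a WN that have been reparented to init (PPID 1),
# restricted to grid pool accounts. The "atlas|lhcb|cms" username pattern is an
# illustrative assumption, not the actual ECDF account naming.
ps -eo pid,ppid,user,etime,args | awk '$2 == 1 && $3 ~ /^(atlas|lhcb|cms)/ {print}'
```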

Current issues: CREAM
- Now have two CREAM CEs (one on a VM); poor performance seen on the VM instance
- Issues observed are similar to Lancaster's:
  - JobDBAdminPurger.sh has to be run frequently to remove cancelled jobs (http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB)
  - Too many "leases" being created by the ATLAS pilot factories
  - The bupdater_loop_interval set in our blah.config was too low
  - Recommended to reduce the purge_interval parameter in blah.config (an illustrative snippet follows below)
  - BUpdaterSGE consumes a lot of CPU, caused by a large number of qstat operations through the sge_helper process (a fix is currently being developed)
  - High memory usage from 50 blahpd processes causes the system to swap (see https://savannah.cern.ch/bugs/?75854)
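For illustration, the shape of the blah.config tuning mentioned above; both parameters are named on this slide, but the values here are assumptions chosen only to show the form of the change, not the settings ECDF actually adopted.

```bash
# Excerpt of /etc/blah.config -- illustrative values only (assumptions).
bupdater_loop_interval=60    # seconds between BUpdater passes; was previously set too low
purge_interval=600000        # reduce so that finished jobs are purged from the registry sooner
```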

CREAM – more issues
- glexec authZ errors observed for a limited number of grid users (including a SAM test user, which affects our perceived availability)
  - Aside: why is the ATLAS availability metric measured by WMS while delivery is measured by panda jobs?
  - Debugging still ongoing, but assuming a configuration error for now
- "service gLite restart" does not work:
  - tomcat processes do not get killed in time; this has to be done manually
  - Could force a 10-second sleep into the service restart script (suggested by Kashif); a sketch of such a wrapper follows below
- Q: is it only us having these issues? Any CREAM tuning recommendations? Hardware?
- Need to resolve these before we can retire the lcg-CE (we want that as much as anyone)
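A minimal sketch of the workaround suggested above: a restart wrapper that stops the gLite services, pauses, kills any tomcat still running, and only then starts the services again. The service name and the tomcat process pattern are assumptions about a typical gLite-era CREAM CE, not the script ECDF deployed.

```bash
#!/bin/bash
# Sketch of a "restart with a pause" wrapper for a gLite CREAM CE.
# "service gLite stop/start" and the catalina process pattern are assumptions.
service gLite stop

# Give tomcat time to shut down; the 10 s figure follows the suggestion above.
sleep 10

# If tomcat is still running, kill it by hand before restarting services.
if pgrep -f org.apache.catalina > /dev/null; then
    pkill -9 -f org.apache.catalina
    sleep 2
fi

service gLite start
```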

Glexec
- The systems team could be sympathetic to installing something on the WNs for our use only, but it cannot be a buggy, beta-version, suid executable
- Once glexec is stable and has no significant bugs we can consider approaching them:
  - Orphaned processes are not acceptable
  - A relocatable install is preferable
  - Presumably not going to be available by today...
- Provisioning ARGUS anyway, in preparation

Conclusions
- ECDF running well and good value for money for GridPP
- Significant increases in compute resources
- Could do with some more disk (and switches)
- A few CREAM issues: working through them
- Glexec...