CCRC’08 Weekly Update ~~~ WLCG Management Board, 27th May 2008

Introduction
This is the last week of the May CCRC’08 exercise – and the one which should see most activity (“The Full Monty”?).
Generally, things continue to run rather smoothly, with problems being addressed quickly.
This report focuses on the basic infrastructure – the WLCG “added value” – and not on the experiments’ (impressive) results; these are covered e.g. in the experiments’ wikis – see the CMS example linked to the MB agenda.

Sites
NIKHEF: increased installed disk by 120 TB; 88 TB of space is now allocated to ATLAS. A cooling problem last week meant powering off the WNs. A DPM problem (lifetime reset to 1.7 days after a resize of the space reservation) – upgraded to the latest production release. This release was discussed prior to the start of CCRC’08, but the timing was felt to be too tight (insufficient time for adequate testing…).
RAL: closed for the public holiday on Monday 26th (came up in relation to the planned LCG OPN intervention). Any problems will be dealt with by the on-call system. AFAIK, there were none…

Network (& Communications)
LCG OPN: upgrade of the software of the two CERN routers that connect to the Tier1s – a bug affects routing for backup paths in some(?) cases. The fix needs a reboot, hence a downtime of ~5 minutes per router. Agreed to go ahead with the upgrade (first one router, then the second on Wednesday if no problems are seen). The upgrade of the second router is scheduled for tomorrow.
Problems were seen with the Alcatel phone conferencing system because “the SIP service crashed on one of the servers”. Hopefully this can be monitored / alarmed: it is not only extremely annoying but also highly disruptive to remote collaboration.

Core Services (1/2)
Several problems and actions on data services in relation to the ongoing CCRC tests.
Castor and SRM services suffered from instabilities, which are being actively followed up with the developers. Several of these issues are understood and fixed in software, and a number of bug-fix releases were deployed during the week:
- Castorcms and Castoratlas now run the latest release, addressing the slow stager_rm and putDone commands.
- The Castor nameserver daemon has been upgraded to work around the service becoming overloaded during backup runs of the databases.
- The Castor SRM has been upgraded, addressing some deadlock issues.
All these upgrades were transparent.
The long-standing (since the beginning of CCRC) CASTORCMS/t1transfer GC problem was finally understood by the Castor development team yesterday: the SRM v2.2 prepareToGet request was giving too high a weight to files already resident in the pool, while files being recalled from tape (as in early May) or copied from other disk pools (as in the last two weeks) were given too low a weight. This caused a very high rate of internal traffic, and it is quite impressive that the system managed to keep up given the load it generated on itself… (An illustrative sketch of the weighting issue follows below.)
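The following is a minimal, hypothetical sketch of the garbage-collection weighting issue just described. It is not the actual CASTOR code: the file categories, weight values and function are assumptions made purely to illustrate why freshly recalled or copied files must not be evicted too eagerly.

```python
# Hypothetical sketch of the GC-weight assignment discussed above.
# Not the actual CASTOR implementation; weights and categories are illustrative.

from dataclasses import dataclass
import time

@dataclass
class PoolFile:
    name: str
    origin: str          # "resident", "tape_recall" or "diskpool_copy"
    last_access: float   # epoch seconds

def gc_weight(f: PoolFile, now: float) -> float:
    """Higher weight = keep longer; lower weight = evict sooner."""
    age_hours = (now - f.last_access) / 3600.0
    base = 1.0 / (1.0 + age_hours)            # recently used files are kept
    if f.origin in ("tape_recall", "diskpool_copy"):
        # Freshly staged replicas must not be evicted before the client reads
        # them; otherwise the file is immediately re-recalled or re-copied and
        # the pool generates traffic against itself (the behaviour seen on
        # CASTORCMS/t1transfer).
        return base + 10.0
    return base

if __name__ == "__main__":
    now = time.time()
    files = [
        PoolFile("resident.root", "resident", now - 86400),
        PoolFile("just_recalled.root", "tape_recall", now - 60),
    ]
    for f in sorted(files, key=lambda x: gc_weight(x, now)):
        print(f"{f.name}: weight={gc_weight(f, now):.2f}")
```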

Core Services (2/2)
Additional SRM problems were seen over the weekend. The list was used to report these problems, but further clarification of the follow-up is required (the problem was solved, but seemingly independently of this mail – SMOD follow-up).
We also used the emergency site contacts phone number (TRIUMF), which highlighted two problems:
- The printed version in B28 is out of date.
- The people in the TRIUMF control room did not seem to be aware of the procedure(s) for informing the (Grid) contacts at the site.

DB Services
The capture process on the online database for the ATLAS Streams environment aborted on Tuesday 20th around 02:23 due to a memory problem with the LogMiner component. The system was back in sync within several minutes of the restart.
High load has been seen from the CMS dashboard application and traced to a specific query; follow-up / tuning is planned.
Various problems are affecting the SRM DB – a high number of connections, threads, locking issues, etc. These are (still) being investigated…
Some configuration / tuning is being discussed – the implications and actions will be discussed with the experiments in the immediate future (the initial aim is to ensure the DB stays alive, possibly requiring different retry mechanisms at the middleware and experiment-ware levels; a minimal retry sketch follows below).
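As a rough illustration of the kind of retry mechanism mentioned above, here is a minimal sketch of a generic client-side wrapper. The function names, exception type and delays are invented and do not correspond to the real SRM or database client APIs.

```python
# Illustrative retry-with-backoff wrapper for a flaky DB/SRM call.
# Hypothetical sketch only; real middleware would use its own client API.

import random
import time

class TransientServiceError(Exception):
    """Stand-in for a recoverable DB/SRM error (connection refused, lock timeout, ...)."""

def call_with_retries(operation, max_attempts=5, base_delay=2.0):
    """Retry `operation` with exponential backoff and jitter.

    The goal described in the slide: keep load off the struggling database by
    spacing out retries instead of hammering it with reconnection attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServiceError:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            time.sleep(delay)

# Example usage with a dummy operation that fails twice before succeeding.
if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_query():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientServiceError("simulated connection drop")
        return "query result"

    print(call_with_retries(flaky_query))
```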

Experiment Issues – LHCb
LHCb have had a long-standing issue at IN2P3 (a gsidcap door issue?) and at RAL (where jobs needing to access files from the WN also crash; still to be understood).
The “file access issue(s)” continue at some sites and, since yesterday, are also starting to be a problem at NIKHEF.
Encouraging (first) results from the procedure of first downloading the data onto the WN and then accessing it locally (a sketch of this pattern follows below).
WN /tmp issue (FZK, NIKHEF) with files still used / required by running jobs being cleaned up “too enthusiastically”.
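The sketch below illustrates the “download first, then access locally” pattern under discussion. It is only an assumption of how a job could stage its input: the transfer command (shown here as lcg-cp), the SURL and the paths are placeholders that a site or experiment would replace with its own tooling.

```python
# Sketch of copying an input file to the worker node's local scratch area
# before opening it, instead of reading it remotely via (gsi)dcap.
# The transfer command and paths are placeholders, not a prescribed recipe.

import os
import subprocess
import tempfile

def stage_in(source_url: str, copy_cmd: str = "lcg-cp") -> str:
    """Copy `source_url` to local scratch and return the local path."""
    scratch = tempfile.mkdtemp(dir=os.environ.get("TMPDIR", "/tmp"))
    local_path = os.path.join(scratch, os.path.basename(source_url))
    # e.g. "lcg-cp <surl> file://<local_path>" -- adapt to the site's tooling.
    subprocess.check_call([copy_cmd, source_url, f"file://{local_path}"])
    return local_path

if __name__ == "__main__":
    # Invented SURL for illustration only.
    surl = "srm://example-se.example.org/lhcb/data/some_input.dst"
    local_file = stage_in(surl)
    # The job then opens `local_file` with its normal I/O layer.
    print("staged to", local_file)
```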

Monitoring
Gridmap has been moved onto new hardware – it gets about 25K hits per day.
Agreed to do a systematic follow-up on reported problems – were they picked up by monitoring? See the next slides…

Monitoring / Logging / Reporting Follow-up

Issue | Comments / Actions
How to see T1-T1 transfers? | Sites to install FTM – CCRC’08 baseline components updated.
Failures accessing SRM_ATLAS | Detected by users → ticket.
ATLAS online-offline streams | Picked up by alarms – see the example mail (next slide).
NIKHEF worker nodes | Problem with the EGEE broadcast – should have been picked up on the ATLAS dashboard.
BNL low incoming data rate | Detected by “manual” monitoring (aka eye-balling).
Garbage collection on the (CMS) t1transfer pool | Detected by users – seen by PhEDEx.
NIKHEF dCache upgrade to the p3 patch level – space token definitions not visible | All FTS transfers stopped – manual monitoring.
srmServer ‘stuck’ (monitoring also stuck) | Actuator to “re-incarnate” it as a stop-gap.
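The last entry mentions an actuator to “re-incarnate” the stuck srmServer as a stop-gap. A rough sketch of what such a watchdog could look like is given below; the probe command, service name and restart command are placeholders, not the actual tooling used at CERN.

```python
# Hypothetical watchdog that "re-incarnates" a stuck daemon as a stop-gap.
# Probe and restart commands are placeholders, not the real srmServer tooling.

import subprocess
import time

PROBE_CMD = ["curl", "--max-time", "10", "https://srm.example.org:8443/ping"]  # placeholder
RESTART_CMD = ["sudo", "service", "srmserver", "restart"]                      # placeholder
CHECK_INTERVAL = 300  # seconds between probes

def service_responds() -> bool:
    """Return True if the probe completes successfully within its timeout."""
    try:
        subprocess.run(PROBE_CMD, check=True, timeout=15,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

def main():
    while True:
        if not service_responds():
            # Restart and make the action visible; a real actuator would also
            # raise an alarm so that the restart itself is tracked by operations.
            print(time.strftime("%F %T"), "service unresponsive - restarting")
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```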

Streams Monitoring
Example alarm mail:

From: Streams Monitor
Date: Sun, 11 May 2008
Subject: CERN-PROD: Process error report

Streams Monitor Error Report
Report date: …:51:31
Affected Site: CERN-PROD
Affected Database: ATLR.CERN.CH
Process Name: STRMADMIN_APPLY_ONLT
Error Time: …:51:21
Error Message: ORA-26714: User error encountered while applying
Current process status: ABORTED
See also: oms3d.cern.ch:4889/streams/streams

Jobs (not) running at NIKHEF
… this is clear from the ATLAS production monitoring interface which is used by the production shifters:
http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/overview?grouping=site&start-date=…&end-date=…&grouping=grid
For example, on the 14th and 15th of May there are no jobs at NIKHEF and the site is shown as ‘no activity’, though it is not turned red, since a site is only red when the failure rate is high. The few jobs which were submitted all failed:
http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/errors-detail?&status=failure&site=NIKHEF-ELPROD&end-date=…&start-date=…&maxN=4&grouping=task
But since ATLAS is using the PanDA pilot system, if the sanity checks of the pilot are not OK the site might not be turned red, because the pilot simply does not pull real jobs, so there is no evidence of failure of the application.
I’ll ask Benjamin whether there is any kind of alarm for the shifters when a site does not process jobs, or whether they just do not care and redistribute the load among the other sites.
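There was apparently no such “idle site” alarm at the time. A minimal sketch of what one could look like follows, assuming a hypothetical dashboard endpoint that returns per-site job counts as JSON; the URL and field names are invented and do not reflect the real prodsys dashboard API.

```python
# Hypothetical "idle site" check for production shifters: flag sites that have
# processed no jobs in the reporting window. The dashboard URL and JSON layout
# are invented for illustration; the real prodsys dashboard API may differ.

import json
import urllib.request

DASHBOARD_URL = "http://dashb-atlas-prodsys-test.cern.ch/jobs-per-site.json"  # invented

def idle_sites(url: str = DASHBOARD_URL):
    """Return the names of sites reporting zero completed and zero failed jobs."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        # Assumed format: {"SITE": {"completed": N, "failed": M}, ...}
        summary = json.load(resp)
    return [site for site, counts in summary.items()
            if counts.get("completed", 0) == 0 and counts.get("failed", 0) == 0]

if __name__ == "__main__":
    for site in idle_sites():
        # A real alarm would open a ticket / elog entry instead of printing.
        print(f"ALARM: {site} has processed no jobs in the current window")
```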

Experiments
As this report is based on the minutes of the daily con-call, it tends to focus on “infrastructure” (i.e. service) issues, rather than on the experiments’ achievements with respect to their metrics.
Grosso modo, large-scale, successful production is being carried out.
Whereas during specific challenges a timetable can be prepared and announced, how will this work during data taking? Re-processing – presumably – is at least to some extent scheduled(?)
Add at least the equivalent of “LEP page 1” to the dashboards(?) – even if Tier0 processing largely de-couples accelerator operation from first-pass processing and data export, is it entirely irrelevant?

CCRC’08 – How Does it Work?
Experiment “shifters” use dashboards and experiment-specific SAM tests (plus other monitoring, e.g. PhEDEx) to monitor the various production activities.
Problems spotted are reported through the agreed channels (ticket + elog entry).
Response is usually rather rapid – many problems are fixed in (well) under 1 hour!
A small number of problems are raised at the daily (15:00) WLCG operations meeting.
Basically, this works!
We review on a weekly basis whether any problems were not spotted by the above → fix [+ MB report].
With time: increase automation, decrease eye-balling.

CCRC’08 – Areas of Opportunity
Tier2s: MC production is well run in; distributed analysis still has to be scaled up to (much) larger numbers of users.
Tier1s: data transfers (T0-T1, T1-T1, T1-T2, T2-T1) are now well debugged and working sufficiently well (most of the time…); re-processing still needs to be fully demonstrated for ATLAS (this includes conditions!!!).
Tier0: data / storage management services still suffer from load / stability problems. These will have to be watched carefully during initial data taking. Only a very few (configuration?) changes are now feasible before data taking…

Summary
CCRC’08 post-mortem: 12/13 June; LHCC “mini-review”: 1 July.
Overall, CCRC’08 – both the February and May runs – has been largely successful in meeting its goals.
We must assume readiness for data taking in July, according to the LHC commissioning schedule.
This leaves only June for some final – minor – configuration changes (which by definition will not have been tested in CCRC’08…).
The number of such changes must be kept to the absolute minimum – integrated over all services – and must be fully motivated!
We will – inevitably – have to live with some limitations and “features”.
These should be well documented, together with any work-arounds.