LHCb: March/April Operational Report (NCB, 10th May 2010)

GGUS tickets (March/April/first week of May)
- 91 GGUS tickets in total:
  - 8 normal tickets
  - 6 ALARM tickets (3 of them tests)
  - 77 TEAM tickets
- 27 GGUS tickets in total for shared-area problems
- 29 (real) GGUS tickets opened against T0/T1:
  - ALARM (CERN): FTS found not working
  - ALARM (GridKA): no space left on the M-DST token
  - ALARM (CNAF): GPFS not working: all transfers failing
  - by site: NL-T1: 8, CERN: 8, CNAF: 5, GridKA: 4, PIC: 3, IN2P3: 1, RAL: 0
Roberto Santinelli
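The ticket counts above can be cross-checked with a quick tally. This is a hypothetical sketch (the dictionaries simply restate the numbers quoted on the slide), not anything queried from GGUS itself:

```python
# Hypothetical tally cross-checking the GGUS numbers quoted above.
by_type = {"normal": 8, "ALARM": 6, "TEAM": 77}
per_site = {"NL-T1": 8, "CERN": 8, "CNAF": 5,
            "GridKA": 4, "PIC": 3, "IN2P3": 1, "RAL": 0}

total = sum(by_type.values())       # overall ticket count
t1_total = sum(per_site.values())   # (real) tickets against T0/T1

print(total, t1_total)  # 91 29
```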

(Most) worrying issues
- The FTS issue at CERN revealed some weakness in effective communication:
  - the migration happened simultaneously with the Jamboree
  - procedures and announcements on the service side were also unclear (only an alias change was expected; CMS also complained)
- SVN service down, and performance degradation
- NIKHEF file access issue for some users: the file is available and dccp works fine, but ROOT cannot open it
  - middleware problem (see later), resolved by moving to dcap everywhere on dCache
- GridKA shared area: performance issue due to concurrent activity on the ATLAS software area; more NFS servers being added
- LHCb banned PIC for two weeks for a missing option in the installation script (March)
- CNAF and RAL: banned for a week over a new connection string, following the recent migration of their Oracle databases to new RACs
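The dCache protocol switch mentioned for NIKHEF amounts to using plain dcap TURLs instead of gsidcap ones. A minimal sketch of such a rewrite (the helper name, host and path are made up for illustration; real deployments also differ in the door port):

```python
# Hypothetical TURL rewrite: gsidcap -> dcap, as in the NIKHEF workaround above.
def to_dcap(turl: str) -> str:
    """Rewrite a gsidcap dCache TURL to its plain-dcap form; leave others alone."""
    prefix = "gsidcap://"
    if turl.startswith(prefix):
        return "dcap://" + turl[len(prefix):]
    return turl

print(to_dcap("gsidcap://dcache.example.org/pnfs/lhcb/file.dst"))
# dcap://dcache.example.org/pnfs/lhcb/file.dst
```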

April: production mask (figure; sites shown: CERN, CNAF, GridKA, IN2P3, NL-T1, PIC, RAL)

Site round robin
- CERN (March):
  - March 3rd: offline CondDB schema corrupted. Needed to restore the schema to the previous configuration, but the apply process failed against 5 out of 6 T1s (all but CNAF)
  - March 7th: merging jobs at CERN affected: input data not available (for 3 days over the weekend)
  - March 11th: migration of central DIRAC services
  - March 11th: LFC replication failing against all T1s according to SAM
  - a glitch with the AFS shared area prevented writing the lock files spawned by SAM
  - March 17th: SVN reported to be extremely slow
  - March 25th: started xrootd tests; found the server not properly set up (supporting only Kerberos)
  - March 29th: CASTOR: LHCb data written to the lhcbraw service class had not been migrated for several days
  - March 31st: FTS not working: wrong instance used

Site round robin
- CERN (April/May):
  - in April: recurrent issue with LFC-RO at CERN (due to the long history of the CORAL LFC interface). Finally received a patch, now part of GAUDI, but the workaround based on a local DB-lookup XML file is still needed (applications still based on the previous CORAL version)
  - 15th: CASTOR data access problem: the lhcbmdst pool ran out of the maximum number of allowed parallel transfers (mitigated by adding more nodes)
    - old issue of sizing pools in terms of number of servers (and not just TB provided)
  - 29th: LHCb downstream capture for conditions stuck for several hours
  - May 4th: lost 72 files on the lhcb default pool. A disk server was reinstalled and its data (not migrated since March 22nd) scrapped
    - in the end the loss was limited: only 10 files were unretrievable
  - May 5th: default pool had been overloaded/unavailable the previous day, due to an LXBATCH user putting too much load on it and to the pool's small size (5 disk servers, 200 transfers each)
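The pool-sizing point above (the number of disk servers, not the TB provided, sets the concurrent-transfer budget) can be sketched with the figures quoted for the lhcb default pool. The function is a hypothetical illustration, not CASTOR code:

```python
# Sketch of the pool-sizing remark: transfer capacity scales with server
# count, regardless of how many TB each server provides.
def pool_transfer_slots(n_servers: int, slots_per_server: int) -> int:
    """Maximum concurrent transfers a disk pool can sustain."""
    return n_servers * slots_per_server

# Figures quoted on the slide: 5 disk servers, 200 transfers each.
capacity = pool_transfer_slots(5, 200)
print(capacity)  # 1000 concurrent transfers, whatever the disk volume
```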

Site round robin
- CNAF: reported under-usage of their resources, due to the LSF batch system and its fair-share mechanism [..]. This will be fixed by site-fine-tuned agents that submit directly
  - 8th: problem with the LRMS: submission there failing systematically
  - 10th: StoRM upgraded; started to fail the (critical) unit test systematically: problem fixed with a new release of the unit-test code
  - 18th: CREAM CE direct submission: FQAN issue. A first prototype is now working against CNAF
  - 24th: too few pool accounts defined for the pilot role
  - 25th: StoRM problem with LCMAPS preventing data upload
  - 30th April: Oracle RAC intervention
  - 8-14 April: site banned because the changed CondDB connection string prevented access, as APPCONFIG had not been upgraded (CondDB person away)
  - 26th: CREAM CE failing all pilots: configuration issue
  - 30th: glitch on StoRM
  - May 5th: GPFS issue preventing writing data to StoRM: ALARM ticket, problem fixed in ~1 hour

Site round robin
- GridKA:
  - 3rd: SQLite problems due to the usual nfslock mechanism getting stuck. NFS server restarted
  - 5-14 April: shared-area performance: site banned during real data taking. Concurrent high load from ATLAS; more hardware added to their NFS servers
  - 26th April: M-DST space full (ALARM ticket sent after the automatic notification went unanswered over the weekend)
  - 28th April: PNFS to be restarted: 1 day off
- IN2P3:
  - in March, only instabilities of the SRM endpoint reported by the SAM unit test
  - April: LRMS database down
  - major AFS issue
  - SIGUSR1 signal sent instead of SIGTERM before sending SIGKILL
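The last IN2P3 item matters because jobs that only trap SIGTERM get no warning before the SIGKILL. A defensive sketch (not LHCb or DIRAC code) is to register the same handler for both signals, so cleanup runs whichever warning the batch system sends:

```python
import os
import signal

# Minimal sketch: treat SIGUSR1 like SIGTERM, so the job can still clean up
# when the batch system sends the "wrong" warning signal before SIGKILL.
shutting_down = False

def on_warning(signum, frame):
    global shutting_down
    shutting_down = True  # a real job would checkpoint / upload output here

for sig in (signal.SIGTERM, signal.SIGUSR1):
    signal.signal(sig, on_warning)

os.kill(os.getpid(), signal.SIGUSR1)  # simulate the batch system's warning
print(shutting_down)  # True: the cleanup path was reached
```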

Site round robin
- NL-T1 (March):
  - March 1st: issue with the critical file-access test (after the test moved to the right ROOT version, fixing the compatibility issue), due to a missing library (libgsitunnel) not deployed in the AA
  - March 3rd: issue with HOME not set on some WNs (NIKHEF)
  - March 18th: NIKHEF reported a problem uploading output data from production jobs, whatever the destination
  - March 20th: discovered that the LFC instance at SARA was wrongly configured (Oracle RAC issue)
  - March 30th: issue accessing data at SARA-NIKHEF: discovered (10 days later) to be a library incompatibility between Oracle and gsidcap, never spotted before because no activity had used CondDB (rather than just a local SQLDB) and gsidcap concurrently (data was downloaded first)

Site round robin
- NL-T1 (April):
  - 1-13 April: site banned over the issue of accessing data via gsidcap with concurrent ConditionDB access. Found to be a clash of libraries; now working exclusively with dcap
    - the issue had never been seen before because, with real data, this was the very first time LHCb accessed the ConditionDB and used a file-access protocol simultaneously (usually data is downloaded first)
  - April: NIKHEF CREAM CE issue killing the site. Received a patch to submit to $TMPDIR instead of $HOME
  - 29 April - 4 May: storage issue due to a variety of reasons (from hardware to network, from some head nodes to SRM overload)

Site round robin
- PIC: banned for more than 2 weeks for a problem with our own application
  - March 18th: problem with the installation of one of the LHCb application software packages for SL5 64-bit. Resolved by using the --force option in the installation. Site banned 2 weeks for that
  - March 29th: (announced) downtime causing major perturbation of user activities, because some of the critical DIRAC services are hosted there
  - April 7th: issue with lhcbweb restarted accidentally. Backup host of the web portal at Barcelona University
  - April 26-27: network intervention
  - May 6th and 7th: accounting system hosted at PIC down twice in 24 hours

Site round robin
- RAL:
  - March 1st: disk server issue
  - March 1st: issue with VOMS certificates
  - March 9th: Streams replication apply-process failure
  - 28th April: CASTOR Oracle DB upgrade
  - 5-6 April: network issue
  - 8-14 April: as at CNAF, the site was banned because the changed connection string prevented access to the conditions DB (upgraded APPCONFIG not available, CondDB responsible away)

Outcome of the T1 Jamboree: highlights
- Presentation of the computing model and the resources needed
  - first big change about to come: T2-LAC facilities
- Interesting overview of DIRAC
- Plans: reconstruction/reprocessing/simulation
  - activity must stay flexible depending on the LHC; sites should not have to ask each time to get their CPUs occupied
  - CREAM CE usage (direct submission about to come)
  - gLexec usage pattern in LHCb
- Most worrying issue at T1: file access, and possible solutions:
  - use of xroot taken into consideration; testing it
  - file download for production is the current solution
  - parameter tuning at dCache sites (WLCG working group on file-access optimization)
  - for production, file download proved to be the best approach (despite some sites claiming it would be better to access data through the LAN)
  - "hammer cloud style" test suite to probe sites: READY
  - POSIX file access (LUSTRE and NFS 4.1)

Outcome of the T1 Jamboree: highlights (cont.)
- CPU/wallclock: sites reporting some inefficiency, for multiple reasons:
  - too many pilots submitted now that we are running in filling mode (pilots commit suicide if no task is available, but only after a few minutes)
  - also problems with stuck connections (data upload, data streams with dcap/root servers, storage shortage, AFS outages, hanging jobs)
  - a very aggressive watchdog is in place that kills jobs which are stalled or no longer consuming CPU (i.e. <5% over a configurable number of minutes)
- Most worrying issue at T2 sites: shared area
  - this is a critical service for LHCb and as such must be treated by sites accordingly
- Tape protection discussions
- T1 LRMS fair shares:
  - quick turn-around when there is low activity
  - should never fall to zero
- Site round table on allocated resources and plans for 2010/2011
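The watchdog rule described above (kill a job whose CPU utilisation stays below a threshold for a whole configurable window) can be sketched as follows. This is an assumed illustration of the logic, not the actual DIRAC watchdog; the 5% threshold and per-minute sampling come from the slide, the window length is made up:

```python
# Sketch of the watchdog decision: kill if CPU stays under `threshold`
# for the last `window` consecutive samples (one sample per minute).
def should_kill(cpu_samples, threshold=0.05, window=3):
    """cpu_samples: most-recent-last fractions of CPU used per minute."""
    if len(cpu_samples) < window:
        return False  # not enough history to judge yet
    return all(s < threshold for s in cpu_samples[-window:])

print(should_kill([0.9, 0.8, 0.01, 0.02, 0.03]))  # True: stalled for 3 samples
print(should_kill([0.01, 0.9, 0.02, 0.03]))       # False: recent CPU activity
```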

Outcome of the T1 Jamboree: 2010 resources

  Site    CPU (HEPSPEC06)   Disk (TB)     Tape (TB)
  CERN    15000/            /             /1800
  CNAF    2700/             /             /442
  FZK     7480/             /             /408
  IN2P3   8370/             /             /531
  NL-T1   8992/             /             /1012
  PIC     1560/             /197 (+50)    ~200/189
  RAL     8184/             /             /446

- CERN: full 2010 allocation by June or earlier; full 2011 and 2012 allocations by April 1st of each year. 12 April: all resources declared to be allocated
- CNAF: CPU to be ordered. Disk and tape: delivery in March
- FZK: CPU assumed fully allocated by the beginning of April. Disk and tape entirely allocated in May
- IN2P3: full 2010 allocation by 20/05/2010, plus T2 resources ( HEPSPEC06, 479 TB disk)
- NL-T1: disk and tape fully available by the end of spring (<20th June)
- PIC: plan to allocate the 2010 pledge by the end of April. Agreed to host 6% of the MC data (an extra 50 TB)
- RAL: full allocation in June; no problem foreseen in meeting it