WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 19th August 2008

Introduction

This ‘weekly’ report covers two weeks (MB summer schedule): last week (4 to 10 August) and this week (11 to 17 August). Notes from the daily meetings can be found from: (some additional information comes from the CERN C5 reports and other sources). There has been systematic remote participation by BNL, RAL and PIC (and increasingly NL-T1); we will encourage more as the holiday season ends and startup approaches.

Site Reports (1/2)

CERN: Following the migration of the VOMS registration service to new hardware on 4 August, new registrations were not synchronised properly until late on 5 August due to a configuration error (the synchronisation was pointing to a test instance); however, only one dteam user was affected. CASTOR upgrades to the new release level were completed for all the LHC experiments. Installation of the Oracle July critical patch update has been completed on most production databases with only minor problems (transparent to the end users).

RAL: Many separate problems/activities have led to periods of non-participation in experiment testing (migration from dCache to CASTOR, upgrade to CASTOR 2.1.7). Some delays were due to experts being on vacation (SRM). There were several disk-full problems (stager LSF and a back-end database).

BNL: The long-standing problem of unsuccessful automatic failover of the primary network links from BNL to TRIUMF and PIC onto the secondary route over the OPN is thought to be understood and resolved; CERN will participate in some further testing. On 14 August it was found that user analysis jobs doing excessive metadata lookups caused a pnfs slowdown, which in turn caused SRM put/get timeouts and hence low transfer efficiencies. This could be a potential problem at many dCache sites.
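To make the metadata-lookup point above concrete, here is a minimal back-of-envelope sketch in Python. The job count, per-job lookup rate and namespace-server capacity are placeholder assumptions for illustration only, not measurements from BNL.

    # Back-of-envelope: many analysis jobs each doing frequent metadata lookups
    # can offer far more load than a single pnfs namespace server can absorb,
    # at which point even unrelated SRM put/get calls (which also touch the
    # namespace) start to time out. All three numbers below are assumptions.
    JOBS = 2000              # assumed concurrent analysis jobs
    LOOKUPS_PER_JOB = 5.0    # assumed metadata lookups per job per second
    PNFS_CAPACITY = 500.0    # assumed namespace operations/second the server sustains

    offered_load = JOBS * LOOKUPS_PER_JOB        # lookups per second arriving
    overload = offered_load / PNFS_CAPACITY      # >1 means requests queue up

    print("offered load   : %.0f ops/s" % offered_load)
    print("overload factor: %.1fx capacity" % overload)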

Site Reports (2/2)

PIC: On Friday 8 August PIC reported overload problems on their PBS master node, with timing-out errors affecting the stability of their batch system. The node had been replaced on the Tuesday, so they decided to revert to the previous master node (done Thursday evening), but this showed the same behaviour. They suspect issues in the configuration of the CE that forwards jobs to PBS and have scheduled a two-hour downtime on the following Monday morning to investigate.

FZK: There was a problem with LFC replication to GridKa for LHCb from 31 July until 6 August. Replication stopped shortly after the Oracle upgrade; propagation from CERN to GridKa would end with an error (TNS error: connection dropped). At the same time the GridKa DBAs reported several software crashes on cluster node 1, and since 6 August we are using GridKa cluster node 2 as the destination. Further investigations are being performed by the GridKa DBAs to see whether the network/communication issues with node 1 are still present.

General: The new LFC release (which fixes the periodic LFC crashes) has passed through the PPS and will be released to production today or tomorrow. We have received a partial fix for the VDT limitation of 10 proxy delegations, which makes WMS proxy renewal usable again for LHCb and should also help ALICE.

Experiment reports (1/3)

ALICE: Nothing special to report.

LHCb: Have published to EGEE the requirement on sites to support multi-user pilot jobs, essentially by supporting a new Role=pilot. Several CERN VOboxes suffered occasional (every few days) kernel-panic crashes from a known kernel issue; this is now fixed with this week's kernel upgrades. An issue has arisen (ongoing at NL-T1) over the use of pool accounts for the software group management (sgm) function: these require the VO software directory to be group-writable, and if the sgm pool accounts are in a different unix group from the other LHCb accounts the directory would also have to be world-readable. This is currently affecting some SAM tests.
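As a minimal illustration of the permission question above, the following Python sketch checks whether a software directory is group-writable and whether it is world-readable/searchable; the path is a hypothetical placeholder, not taken from the report.

    # Sketch: inspect the permission bits that the sgm pool-account discussion
    # hinges on. SW_DIR is a hypothetical example path.
    import os
    import stat

    SW_DIR = "/opt/exp_soft/lhcb"   # hypothetical VO software area

    def check_permissions(path):
        mode = os.stat(path).st_mode
        group_writable = bool(mode & stat.S_IWGRP)
        world_readable = bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)
        print("group-writable           :", group_writable)
        print("world-readable/searchable:", world_readable)
        return group_writable, world_readable

    if __name__ == "__main__":
        check_permissions(SW_DIR)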

Experiment reports (2/3)

CMS: During the last two weeks CMS continued the pattern of a 1.5-day global cosmics run from Wednesday to Thursday. The failure of a network switch to building 613 on 12 August stopped the AFS-cron service, which in turn stopped some SAM test submission and the flow of monitoring information into SLS; this was not fixed until the next day. CMS have been preparing for their CRUZET4 cosmics run, which in fact is not at ZEro Tesla but with the magnet on; the run started on 18 August and will last a week. A new CERN elog instance was requested for this, which exposed a dependency of that service on a single staff member currently on holiday. CMS are preparing 21-hour computing support shifts coordinated across CERN and FNAL, with the remaining period to be covered by ASGC.

Experiment reports (3/3)

ATLAS: Over the weekend of 9/10 August ATLAS ran a 12-hour cosmics run with only one terminating luminosity block. This splits into 16 streams, of which 5-6 were particularly big, resulting in unusually high rates to the receiving sites (handled successfully). The AFS-cron service failure on 12 August stopped functional tests until the next day (an unexpected dependency). During the weekend of 16/17 August the embedded cosmics data-type name in raw data files was changed to magnet-on without warning sites, resulting in some data going into the wrong directories. ATLAS are now performing 12-hour throughput tests (at full nominal rate) each Thursday; typically a few hours are needed to ramp up and down to/from the full rates. Outside of this they run functional tests at 10% of nominal rate, plus cosmics at weekends and as/when scheduled by the detector teams. 24x7 computing support shifts will start at the end of August.
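For a sense of scale, here is a hedged back-of-envelope sketch in Python of the data volumes involved; the nominal export rate used is an assumed placeholder, since the report does not quote one.

    # Back-of-envelope: volume moved by a 12-hour test at full nominal rate
    # versus a day of functional tests at 10% of nominal. The rate is assumed.
    NOMINAL_RATE_MB_S = 1000.0   # assumed aggregate export rate in MB/s (placeholder)
    HOURS = 12

    full_test_tb = NOMINAL_RATE_MB_S * 3600 * HOURS / 1e6        # TB in the 12-hour test
    functional_tb = 0.10 * NOMINAL_RATE_MB_S * 3600 * 24 / 1e6   # TB per day at 10% rate

    print("12-hour full-rate test : ~%.0f TB" % full_test_tb)
    print("24 hours at 10%% nominal: ~%.0f TB" % functional_tb)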

Summary

Detector-driven stress tests are increasing. There have been many miscellaneous failures, but we seem to have enough ‘elastic’ in the system to recover from them.