WLCG Service Report ~~~ WLCG Management Board, 9th February 2010



Introduction
This report covers the two-week period since the last MB on January 26th 2010.
During this period all experiments consistently used their own web pages for providing detailed daily (and, in the case of CMS, weekly) reports. ATLAS and ALICE have now set up such web pages; CMS and LHCb have had them for a long time.
We have also started pre-filling the daily reports based on the highlights of these pages and projecting them during the daily meeting. People outside have access to the same information.
This has helped greatly in streamlining the production of the daily minutes and has helped focus the meetings. The summaries include the main points inline to avoid too much "clicking through", which also simplifies preparation of these MB reports!
FNAL now also joins consistently – this is valuable to all!
Kick-off of the Tier1 Service Coordination meeting held – next call is this Thursday.

Service Incidents and Upgrades
1. SIR received for the ASGC 1-second power surge that had an impact lasting 2 days.
2. RAL: SIR on DB / storage issues impacting CASTOR services, leading on from issues last year and covered in the RAL Tier1 review.
3. CERN: CASTOR ATLAS – all name server calls slowed down and "migrator problems" due to a looping xrootd daemon.
Tier0 services: introduction of a pre-intervention Risk Analysis, following discussions at the January GDB.
Successful migrations to Chimera at several dCache sites by now.
FTS delegation problem: solution in certification (2.2.3), already deployed at CERN; recommendation to sites through the Tier1 Service Coordination meeting of Feb 11.
DB replication to BNL: some loss of conditions data for some weeks [need to understand the specifics of this…].
Change of personnel at ASGC.

ASGC Power Surge
Incident summary: power surge at the local power station about UTC on Monday 18 Jan; all computing services in the data centre were affected.
Most of the grid services recovered within 30 minutes, and critical database services about two hours later. Some services took longer, e.g. SRM took > 6h to restore.
On-going problems affecting file transfer efficiency (< 65% at times), plus other more technical issues affecting the DB cluster behind CASTOR: extreme load on the system, load-balancing problems, wrong kernel loaded, etc.
The problem was finally understood and resolved on Wednesday 20 Jan.
Message: a very short glitch can result in major service degradations or downtimes.

RAL DB H/W Migration
Incident summary: a scheduled outage to migrate the CASTOR databases back to their original disk arrays encountered significant problems, resulting in an extended outage. Incident duration: 5 days overrun on a scheduled two-day outage.
Future mitigation:
The RAL Tier1 has recently introduced a formal change control procedure. However, this database migration, which had been planned for some time, pre-dated that process and was not reviewed by it. One component of the change, relating to the configuration for mounting disk areas, had been reviewed; despite this, some aspects of this part of the change had not been sufficiently resolved ahead of the intervention. It is essential that all changes are effectively reviewed by the change control process.
A significant amount of resilience testing had taken place ahead of the intervention, driven by the problems last October. Those tests did show that the systems had the expected resilience. The ability of those tests to replicate issues in the production system needs to be reviewed.
The problems experienced at RAL that led up to this event were reviewed at a GridPP review at RAL in December.
The timeline of this event is much longer than that of these bi-weekly reports to the MB and will be addressed both at next week's LHCC review and in the next WLCG Quarterly Report. [Above a certain threshold, something beyond an "internal SIR" is clearly required.]

CERN CASTOR ATLAS
ATLAS migration backlog building up: the xrootd daemon was looping on the castoratlas name server because of a bug. This slowed down all normal name server calls, which caused the migrator policy to fail.
Timeline of the incident:
  When         What
  30-Jan :09   ATLAS Tier-0 opens Remedy ticket: tape migration queue on t0atlas
  01-Feb h30   Investigations start
  01-Feb h30   Problem found; workaround put in place
  02-Feb h00   Backlog confirmed as cleared through Lemon

Meeting Attendance Summary
  Site     Attendance Mon–Fri (as recorded)
  CERN     Y/Y
  ASGC     Y/Y
  BNL      Y/Y
  CNAF     N/N, Y/N, N/N, Y/N
  FNAL     Y/Y
  KIT      Y/Y
  IN2P3    Y/Y
  NDGF     N/N, N/Y, N/N, Y/N
  NL-T1    N/N, Y/Y, Y/Y/Y
  PIC      Y/N, N/Y, N/Y/Y
  RAL      Y/Y
  TRIUMF

[Availability plots: ATLAS, ALICE, CMS, LHCb]

Analysis of the availability plots
Common for all experiments:
0.1 IN2P3: downtime of the batch system on the 25th and 26th Jan to replace database servers.
0.2 RAL: on scheduled downtime [significant overrun: 3 days extra on the 2 days foreseen!]; see the Service Incident Report.
0.3 PIC: some problems with the SRM server on Monday morning; suffering from overload and some transfer timeouts 09: :30 on Monday.
0.4 IN2P3: issue with SRM during the weekend; service restarted on Saturday evening.
ATLAS:
1.1 NDGF: SRM test failures over the week.
1.2 NIKHEF: SRM test failures (11–15 on Tuesday; 15–17 on Wednesday).
ALICE:
2.1 SARA: scheduled maintenance on the SARA SRM to activate a new kernel.
2.2 FZK: small glitches with the SAM test "VOBOX-user-proxy-registration"; these did not affect ALICE production at this site.
CMS:
3.1 FNAL: some temporary SRM problems.
LHCb:
4.1 PIC: SAM tests were aborted by the Grid with the Maradona error; some temporary misconfiguration in all the CEs at PIC.

[Availability plots: ATLAS, ALICE, CMS, LHCb]

Analysis of the availability plots
Common for all experiments:
0.1 RAL: scheduled downtime.
0.2 SARA CE, CREAM CE: scheduled maintenance – moved to different h/w.
0.3 KIT: downtime for ATLAS dCache for the migration from pnfs to Chimera.
0.4 IN2P3: power cut; the problem stopped many WNs and several jobs crashed.
0.5 CNAF: lcgadmin SAM test jobs not running / timing out on some CEs due to long-running jobs in the queue (behind some ATLAS jobs!). Still 1 CE OK, so overall availability OK.
ATLAS:
1.1 RAL: glitch on the castoratlas DB – quickly fixed.
1.2 NDGF: some of the SRM tests failed over the week (SRMv2-ATLAS-lcg-cp timeouts).
ALICE: nothing to report.
CMS:
3.1 IN2P3: batch system issues; reverting to the previous BQS version and leaving the upgrade to the newer version until end of February. Several days are needed to understand the issue – not enough time to reproduce it. In parallel, the new version will be left on a test system to try to understand and fix the problem.
3.2 ASGC: CMS SAM CE installation tests failed 00: :00; the CMS SW team was installing packages during this period. Ticket closed.
LHCb:
4.1 IN2P3: the SRM endpoint became unresponsive; both SAM tests and normal activity from the data manager were failing with the same error. The suspicion is that some CA certificate (CERN CA) is not properly updated on the remote SRM.
4.2 IN2P3: intervention on the BQS backend.
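The availability numbers behind these plots are derived from SAM test results. As a minimal sketch of the idea, assuming a simplified model in which availability is just the fraction of samples whose critical tests passed (the real WLCG computation also accounts for scheduled downtime and per-service aggregation; the function name is illustrative):

```python
# Simplified availability sketch: fraction of sampled SAM results
# that are 'OK'. This is an illustration only -- the production
# WLCG algorithm also handles scheduled downtime and aggregates
# over services and test criticality.

def availability(statuses):
    """statuses: list of per-sample SAM results, e.g. ['OK', 'CRITICAL', ...]."""
    if not statuses:
        return 0.0
    ok = sum(1 for s in statuses if s == "OK")
    return ok / len(statuses)

# Example: a 2-hour SRM glitch out of 24 hourly samples in one day.
day = ["OK"] * 22 + ["CRITICAL"] * 2
print(round(availability(day), 3))  # prints 0.917
```

This makes it easy to see why a short SRM outage (items 1.2 and 4.1 above) shows up as only a modest dip in the daily availability, while a full-day scheduled downtime drives it to zero.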

GGUS summary (2 weeks)
  VO       User  Team  Alarm  Total
  ALICE    1     0     1      2
  ATLAS
  CMS      5     3     1      9
  LHCb
  Totals
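The Total column above is simply user + team + alarm tickets per VO. A small sketch for cross-checking such a summary, using only the rows whose counts survived in this transcript (the helper name is illustrative):

```python
# Cross-check GGUS per-VO totals: Total should equal User + Team + Alarm.
# The counts below are the ones recoverable from the summary table
# (ALICE and CMS rows); check_totals is an illustrative helper name.

def check_totals(rows):
    """Return the VOs whose reported total does not match the sum of counts."""
    bad = []
    for vo, user, team, alarm, total in rows:
        if user + team + alarm != total:
            bad.append(vo)
    return bad

rows = [
    # (VO, user, team, alarm, total)
    ("ALICE", 1, 0, 1, 2),
    ("CMS", 5, 3, 1, 9),
]

print(check_totals(rows))  # prints [] -- all recorded rows are consistent
```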

Alarm tickets
The ALARM tickets were mostly tests following the GGUS release of Feb 3rd. This will happen every month, right at the moment of release completion and at a time reasonable for the timezones of the relevant Tier1s.

WLCG Collaboration Workshop
Target dates: 7–9 July 2010 at Imperial College, London.
Feedback is due on "jamborees" or other meetings to be held in series (or in parallel?).
Separate out the issues that are common from those that are more specific to individual VOs.
It is particularly important to address issues regarding Tier2s (and higher…), which are typically not represented at the more frequent meetings.
Can we confirm these dates?

Summary
Service incidents continue: many are "under control" in the sense that recovery is controlled and well announced and the impact "acceptable" [but not all…].
Risk analyses of major interventions need full transparency prior to the intervention, plus a post-intervention analysis: their purpose is to deliver improvements in the scheduling and execution of these necessary interventions.
Could some interventions be grouped to minimize user downtime?
Longer-term view: next week's LHCC review.

Overview
LHC status and plans:
Still planning for an injection test on Wednesday 17th February.
Start of beam commissioning on 22nd February.
Planning is dependent on nQPS and HWC progress.