WLCG Service Report - WLCG Management Board, 9th February 2010
Introduction
This report covers the two-week period since the last MB on January 26th 2010.
- During this period all experiments consistently used their own web pages to provide detailed daily (and, in the case of CMS, weekly) reports. ATLAS and ALICE have now set up such pages; CMS and LHCb have had them for a long time.
- We have also started pre-filling the daily reports from the highlights of these pages and projecting them during the daily meeting; people outside have access to the same information. This has greatly streamlined the production of the daily minutes and helped focus the meetings.
- The summaries include the main points inline to avoid too much "clicking through", which also simplifies the preparation of these MB reports!
- FNAL now also joins consistently - this is valuable to all!
- Kick-off of the Tier1 Service Coordination meeting held - the next call is this Thursday.
Service Incidents and Upgrades
1. SIR received for the ASGC 1-second power surge that had an impact lasting 2 days
2. RAL: SIR on DB / storage issues impacting CASTOR services - following on from issues last year and covered in the RAL Tier1 review
3. CERN: CASTORATLAS - all name server calls slowed down and "migrator problems" due to a looping xrootd daemon
- Tier0 services: introduction of a pre-intervention Risk Analysis, following discussions at the January GDB
- Successful migrations to Chimera at several dCache sites by now
- FTS delegation problem: solution in certification (2.2.3) - already deployed at CERN; recommendation to sites through the Tier1 Service Coordination meeting of Feb 11
- DB replication to BNL: some loss of conditions data over the past few weeks [ need to understand the specifics of this… ]
- Change of personnel at ASGC
ASGC Power Surge Incident
Summary: power surge at the local power station on Monday 18 Jan (UTC); all computing services in the data center were affected.
- Most of the grid services recovered within 30 minutes, and critical database services about two hours later.
- Some services took longer, e.g. SRM took > 6 h to restore.
- On-going problems affected file transfer efficiency (< 65% at times), plus other more technical issues affecting the DB cluster behind CASTOR: extreme load on the system, load-balancing problems, wrong kernel loaded, etc.
- Problem finally understood and resolved on Wednesday 20 Jan.
Message: a very short glitch can result in major service degradations or downtimes.
RAL DB H/W Migration Incident
Summary: a scheduled outage to migrate the CASTOR databases back to their original disk arrays encountered significant problems, resulting in an extended outage.
Incident duration: 5 days overrun on a scheduled two-day outage.
Future mitigation:
- The RAL Tier1 has recently introduced a formal change control procedure. However, this database migration, which had been planned for some time, pre-dated that process and was not reviewed by it. One component of the change, relating to the configuration for mounting disk areas, had been reviewed; despite this, some aspects of this part of the change had not been sufficiently resolved ahead of the intervention. It is essential that all changes are effectively reviewed by the change control process.
- A significant amount of resilience testing had taken place ahead of the intervention, driven by the problems last October. Those tests did show that the systems had the expected resilience; the ability of those tests to replicate issues in the production system needs to be reviewed.
- The problems experienced at RAL that led up to this event were reviewed at a GridPP review at RAL in December.
The timeline of this event is much longer than that of these bi-weekly reports to the MB and will be addressed both at next week's LHCC review and in the next WLCG Quarterly Report.
[ Above a certain threshold, something beyond an "internal SIR" is clearly required. ]
CERN CASTOR ATLAS
ATLAS migration backlog building up: the xroot daemon was looping on the castoratlas name server because of a bug. This slowed down all normal name server calls, which in turn caused the migrator policy to fail.
Timeline of the incident:
When      What
30-Jan    ATLAS Tier-0 opens a Remedy ticket: tape migration queue on t0atlas
01-Feb    Investigations start
01-Feb    Problem found; workaround put in place
02-Feb    Backlog confirmed as cleared through Lemon
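As an aside (not part of the report), the failure mode here was a daemon stuck in a busy loop starving the name server. The sketch below shows one generic way a site could flag such a runaway process; it assumes the Python psutil package, and the process-name filter and thresholds are purely hypothetical - it is not how the CASTOR team actually detected or worked around this problem.

```python
# Illustrative sketch only: flag processes that look like they are busy-looping
# (sustained very high CPU). Assumes the `psutil` package; the watched names and
# thresholds below are hypothetical examples, not taken from CASTOR operations.
import psutil

WATCH_NAMES = ("xrootd",)   # hypothetical: daemon names to watch on this host
CPU_THRESHOLD = 90.0        # percent of one core, sustained over the sample window
SAMPLE_SECONDS = 5          # how long to measure each candidate process


def find_runaway_processes():
    """Return (pid, name, cpu%) for watched processes above the CPU threshold."""
    suspects = []
    for proc in psutil.process_iter(attrs=["pid", "name"]):
        name = (proc.info["name"] or "").lower()
        if not any(w in name for w in WATCH_NAMES):
            continue
        try:
            # cpu_percent with an interval blocks and measures real usage
            usage = proc.cpu_percent(interval=SAMPLE_SECONDS)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        if usage >= CPU_THRESHOLD:
            suspects.append((proc.info["pid"], name, usage))
    return suspects


if __name__ == "__main__":
    for pid, name, usage in find_runaway_processes():
        print(f"WARNING: pid {pid} ({name}) at {usage:.0f}% CPU - possible busy loop")
```

Such a check could be run periodically from a monitoring framework so that a looping daemon raises an alert before downstream effects (like a tape migration backlog) accumulate.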
Meeting Attendance Summary (two-week period)
Site     M      T      W      T      F
CERN     Y/Y
ASGC     Y/Y
BNL      Y/Y
CNAF     N/N    Y/N    N/N    Y/N
FNAL     Y/Y
KIT      Y/Y
IN2P3    Y/Y
NDGF     N/N    N/Y    N/N    Y/N
NL-T1    N/N    Y/Y    Y/Y/Y
PIC      Y/N    N/Y    N/Y/Y
RAL      Y/Y
TRIUMF
[ Site availability plots, first week: ATLAS, ALICE, CMS, LHCb ]
Analysis of the availability plots
COMMON FOR ALL EXPERIMENTS
0.1 IN2P3: downtime of the batch system on 25 and 26 Jan to replace database servers
0.2 RAL: in scheduled downtime [ significant overrun: 3 days extra on the 2 days foreseen! ] - see Service Incident Report
0.3 PIC: problems with the SRM server on Monday morning - overload and some transfer timeouts
0.4 IN2P3: issue with SRM during the weekend; service restarted on Saturday evening
ATLAS
1.1 NDGF: SRM test failures over the week
1.2 NIKHEF: SRM test failures (11:00-15:00 on Tuesday; 15:00-17:00 on Wednesday)
ALICE
2.1 SARA: scheduled maintenance on the SARA SRM to activate a new kernel
2.2 FZK: small glitches with the SAM test "VOBOX-user-proxy-registration"; did not affect ALICE production at this site
CMS
3.1 FNAL: some temporary SRM problems
LHCb
4.1 PIC: SAM tests were aborted by the Grid with the "Maradona" error; a temporary misconfiguration of all the CEs at PIC
[ Site availability plots, second week: ATLAS, ALICE, CMS, LHCb ]
Analysis of the availability plots
COMMON FOR ALL EXPERIMENTS
0.1 RAL: scheduled downtime
0.2 SARA CE, CREAM CE: scheduled maintenance - moved to different h/w
0.3 KIT: downtime of the ATLAS dCache for the PNFS to Chimera migration
0.4 IN2P3: power cut stopped many WNs; several jobs crashed
0.5 CNAF: lcgadmin SAM test jobs not running / timing out on some CEs due to long-running jobs in the queue (behind some ATLAS jobs!). Still 1 CE OK, so overall availability OK
ATLAS
1.1 RAL: glitch on the castoratlas DB - quickly fixed
1.2 NDGF: some SRM tests failed over the week (SRMv2-ATLAS-lcg-cp timeouts)
ALICE
nothing to report
CMS
3.1 IN2P3: batch system issues; reverting to the previous BQS version and leaving the upgrade to the newer version until the end of February. Several days are needed to understand the issue - not enough time to reproduce. In parallel, the new version will be kept on a test system to understand and fix the problem
3.2 ASGC: CMS SAM CE installation tests failed for a period while the CMS SW team was installing packages. Ticket closed
LHCb
4.1 IN2P3: SRM endpoint became unresponsive: both SAM tests and normal activity from the data manager were failing with the same error. The suspicion is that a CA certificate (CERN CA) is not properly updated on the remote SRM
4.2 IN2P3: intervention on the BQS backend
GGUS summary (2 weeks)
VO       User   Team   Alarm   Total
ALICE    1      0      1       2
ATLAS
CMS      5      3      1       9
LHCb
Totals
Alarm tickets
The ALARM tickets were mostly tests following the GGUS release of Feb 3rd. This will happen every month, right at the moment of release completion and at a time of day reasonable for the timezones of the relevant Tier1s.
WLCG Collaboration Workshop
- Target dates: 7-9 July 2010 at Imperial College, London
- Feedback due on "jamborees" or other meetings to be held in series (or in parallel?)
- Separate out the issues that are common from those that are more specific to individual VOs
- Particularly important to address issues regarding Tier2s (and higher…), which are typically not represented at the more frequent meetings
- Can we confirm these dates?
Summary
- Service incidents continue: many are "under control" in the sense that recovery is controlled and well announced and the impact "acceptable" [ but not all… ]
- Risk analyses of major interventions need full transparency prior to the intervention plus a post-intervention analysis: their purpose is to deliver improvements in the scheduling and execution of these necessary interventions
- Could some interventions be grouped to minimize user downtime?
- Longer-term view: next week's LHCC review
Overview: LHC status and plans
- Still planning for an injection test on Wednesday 17th February
- Start of beam commissioning: 22nd February
- Planning dependent on nQPS and HWC progress