WLCG Service Report
Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010
WLCG Operations Report – Summary

KPI                        Status                 Comment
GGUS tickets               4 real alarm tickets   PIC dCache, NDGF SRM, 2 Castor CERN
Site Usability             Minor issues
SIRs & Change assessments  2 new SIRs             And 3 closed SIRs received

The response to alarms was (well) within targets.
[Figure: availability plots. The numeric labels on the plots (0.1, 1.1, 1.2, 3.1, 3.2, 4.1 to 4.7) refer to the items analysed below.]
Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 FZK-LCG2: Site offline for production.

ATLAS
1.1 SARA-MATRIX: Temporary test failure with timeout.
1.2 BNL: Old problem with the CE critical test for the OSG CE. The SAM ATLAS Computing Element critical test has been modified to take into account the different configuration of the OSG CE.

ALICE
NTR

CMS
3.1 KIT: Cooling problem since Saturday. Only 20% of WNs online.
3.2 IN2P3: Stage-out test failed temporarily.

LHCb
4.1 GRIDKA: Power failure issue continuing since Sunday.
4.2 GRIDKA: SAM tests failing, problems staging and accessing files. Problem with dCache: the database was only partially indexed. A small unscheduled downtime was taken.
4.3 IN2P3: Degradation in the IN2P3 shared area. A command on a subdirectory of the shared area took more than 150 seconds, while less than 60 seconds is expected.
4.4 CERN: Temporary test failure. A command on a subdirectory of the shared area took more than 150 seconds, while less than 60 seconds is expected.
4.5 PIC: Temporary test failure.
4.6 CNAF: SRM service unit test failed occasionally.
4.7 CERN: CERN SRM unit test failed with a communication error and an I/O error.
[Figure: availability plots. The numeric labels on the plots (0.1, 1.1, 3.1, 3.2, 4.1 to 4.3) refer to the items analysed below.]
Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 PIC: Scheduled downtime on Tuesday, the 20th of July, from 6am to 6pm.

ATLAS
1.1 TAIWAN: Stage-in/out job failures (GGUS:60231). The number of jobs accessing the disk servers was temporarily reduced to decrease the load.

ALICE
NTR

CMS
3.1 KIT: dCache headnode crash. Site in emergency shutdown from 9:00 to 13:00 on the 22nd of July.
3.2 KIT: SAM CE prod & sft test jobs expiring.

LHCb
4.1 GRIDKA: Occasional CE-sft-job test failures.
4.2 CNAF: Production jobs failed: the HOME directory was not set due to a problem concerning LDAP.
4.3 CNAF: LFC_L-check-streams test failures.
GGUS summary (2 weeks)

VO      User  Team  Alarm  Total
ALICE      4            1      5
ATLAS     23    92           119
CMS       10     3      2     15
LHCb            20            24
Totals    40   115      8    163
ALARM Tickets

- NDGF SRM-dCache outage: SIR below
- CASTOR CMS: Single user issuing 30k disk-to-disk copies. User notified and per-user limits in place.
- CASTOR ATLAS T0 merge: Disk server unstable after RAID controller firmware problems.
- PIC dCache: Wrong pool cost equation affected balancing between old and new pools.
Support-related events

Service incident report updates
- SIR received for NDGF SRM outage on 2010/07/14
- SIR received for GridKa cooling system failure incident of 2010/07/10
- SIR received for reduced availability caused by data corruption at NL-T1 on 2010/07/05
- SIR being prepared from GGUS/OSG about notification issues
- SIR being prepared for CERN vault cooling issues

10/5/2019 WLCG MB Report WLCG Service Report
NDGF SRM outage

Time (CEST)        What happened
2010/07/14 13:00   Scheduled downtime starts with dCache upgrade
2010/07/14 13:50   After upgrade and reboot to new firmware, problems restarting the service
2010/07/14 16:00   Scheduled downtime ends and is replaced by an unscheduled downtime
2010/07/15 10:00   Services working fine as far as we can tell; ATLAS SAM tests finally green too

https://wiki.ndgf.org/display/ndgfwiki/20100714+dCache+server+failure
KIT Cooling Failure

Time (UTC)         What happened
2010/07/10 14:30   KIT cooling system going down.
2010/07/10 22:10   FTS and LFC up. LHCb and ATLAS 3D DB up.
2010/07/12 13:00   3 out of 4 chillers working. Powering up compute nodes with best compute power per watt ratio.
2010/07/15         All chillers working. Powering up remaining compute nodes.

https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_cooling_failure_20100710.pdf
NL-T1 Data Corruption Issues

Time (CEST)        What happened
2010/07/05 18:35   ATLAS reported failed jobs due to checksum errors.
2010/07/05 22:23   dCache shut down on pool nodes which could possibly be affected by this issue. These nodes reside in 7 racks.
2010/07/13 15:31   4 racks put back into production after it was established that the nodes in those racks were not affected.
2010/07/15 11:52   The remaining racks are put back into production.

http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100705.pdf
Other Service news

- CNAF is doing a rolling upgrade of GPFS on worker nodes
  - ALICE is working with CNAF to establish the impact on their job rates
- LHCb is working with SARA on reducing the impact of their storage issues on their jobs
  - Masking hot files unavailable due to storage issues
  - Storage issue was solved by the site today
Summary

- Quiet week ending the technical stop
- No major issues