Published by Gyles O’Brien. Modified over 8 years ago.
1
WLCG Service Report
Jean-Philippe Baud
WLCG Management Board, 24th August 2010
2
WLCG Operations Report – Summary

KPI | Status | Comment
GGUS tickets | 5 real alarm tickets | IN2P3 SRM, CNAF LFC, ASGC, SARA SRM, CERN T0 export
Site Usability | Minor issues |
SIRs & Change assessments | 2 new SIRs | And 1 closed SIR received (PIC cooling)

VO | User | Team | Alarm | Total
ALICE | 3 | 0 | 0 | 3
ATLAS | 46 | 194 | 19 | 259
CMS | 14 | 2 | 0 | 16
LHCb | 2 | 34 | 0 | 36
Totals | 65 | 230 | 19 | 314

The response to alarms was within targets.
3
[Availability plots; annotated points: 0.1, 1.1, 4.1]
4
Analysis of the availability plots
COMMON FOR ALL THE EXPERIMENTS
0.1 SARA-MATRIX: Service upgrades not correctly reported to GOCDB. The site saw this quickly, but took an hour to update the GOCDB correctly and get the site back.
ATLAS
1.1 RAL: SRMv2 test failures. SRMv2-ATLAS-lcg-cr timeouts.
ALICE
NTR.
CMS
NTR.
LHCb
4.1 CNAF: Replication problem of LFC (GGUS:60458).
5
[Availability plots; annotated points: 2.1, 2.2, 4.1, 4.2, 4.3]
6
Analysis of the availability plots
ATLAS
NTR.
ALICE
2.1 FZK: CE test failures on a regular basis throughout the week.
2.2 IN2P3: CE test failures over the weekend.
CMS
NTR.
LHCb
4.1 CNAF: Problems with both SAM tests and real data transfer to StoRM (GGUS:60875).
4.2 GRIDKA: CE-sft-vo-swdir test failures and problem accessing data (GGUS:60821).
4.3 IN2P3: SRMv2 test failures and/or degradation of the shared area (GGUS:59880).
7
[Availability plots; annotated points: 2.1, 3.1, 3.2, 4.1, 4.2]
8
Analysis of the availability plots
ATLAS
NTR.
ALICE
2.1 FZK: Problems with the CE. Most of the SAM CE tests were failing.
CMS
3.1 KIT: CE-cms-prod & CE-sft-job test failures. Problem with the batch system and some disk-only pools (GGUS:61127).
3.2 CCIN2P3: Occasional CE-cms-mc test failures due to local stage-out failures.
LHCb
4.1 CNAF: LFC_L-check-streams test failures.
4.2 IN2P3: SRM problem (GGUS:61023). Problem with the shared area (GGUS:61045). Occasional SAM SRMv2 and CE test failures.
9
[Availability plots; annotated points: 0.1 (marked twice), 1.1, 1.2, 4.1]
10
Analysis of the availability plots
COMMON TO ALL EXPERIMENTS
0.1 NIKHEF: SRMv2 test failures (GGUS:61266). Issue related to GGUS:61265: the database behind the FTS service at SARA was down due to a corrupted Oracle database.
ATLAS
1.1 SARA-MATRIX: The database behind the FTS service at SARA was down due to a corrupted Oracle database. Unscheduled downtime (GGUS:61265).
1.2 TAIWAN: SRM performance degraded: one SRM server died, occasional timeouts (GGUS:61314).
ALICE
NTR.
CMS
NTR.
LHCb
4.1 IN2P3: SRM/dCache outage.
11
GGUS summary (4 weeks)

VO | User | Team | Alarm | Total
ALICE | 3 | 0 | 0 | 3
ATLAS | 46 | 194 | 19 | 259
CMS | 14 | 2 | 0 | 16
LHCb | 2 | 34 | 0 | 36
Totals | 65 | 230 | 19 | 314
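As an illustration only (this is not part of the original report), the row and column totals of the GGUS summary can be cross-checked with a few lines of Python. The counts are copied from the table; the variable names are hypothetical.

```python
# Per-VO ticket counts copied from the GGUS summary table,
# in the order (User, Team, Alarm). Names are illustrative only.
tickets = {
    "ALICE": (3, 0, 0),
    "ATLAS": (46, 194, 19),
    "CMS": (14, 2, 0),
    "LHCb": (2, 34, 0),
}

# Total per VO: sum across the three ticket types.
row_totals = {vo: sum(counts) for vo, counts in tickets.items()}

# Total per ticket type: sum down each column.
column_totals = tuple(sum(column) for column in zip(*tickets.values()))

# Grand total over all VOs and ticket types.
grand_total = sum(column_totals)

print(row_totals)
print(column_totals, grand_total)
```

The computed values agree with the "Total" column and the "Totals" row of the table (65 user, 230 team and 19 alarm tickets, 314 overall).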
12
3/18/2016 WLCG MB Report, WLCG Service Report
Support-related events since last MB
There were 19 ALARM tickets since the last MB (4 weeks), all raised by ATLAS.
14 of these ALARM tickets were tests, of which 8 were sent to FZK by the GGUS developers.
The rest of the tests were by ATLAS members testing critical services, i.e. 1 to ASGC, 1 to CERN (Castor), 1 to SARA (SRM), 3 to BNL (SRM).
Real ALARMs were issued to IN2P3, CNAF, ASGC, SARA and CERN.
The ALARM ticket to CERN https://gus.fzk.de/ws/ticket_info.php?ticket=60723 was a case of non-delivery of the email notification (reason being investigated).
13
ATLAS ALARM->IN2P3 SRM
https://gus.fzk.de/ws/ticket_info.php?ticket=61313

Time (UTC) | What happened
2010/08/19 22:30 | GGUS ALARM ticket opened, automatic email notification to lhc-alarm@cc.in2p3.fr AND automatic assignment to NGI_France.
2010/08/19 22:37 | Automatic acknowledgement of ALARM reception.
2010/08/19 22:46 | Feedback by site admin on dCache problem and unscheduled downtime published in GOCDB.
2010/08/20 07:25 | ‘solved’. It was a memory error.
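As an illustration only (not part of the original report), the elapsed times in the IN2P3 timeline above can be derived from its UTC timestamps. The sketch assumes the timestamps are exactly as listed; the event labels are shorthand chosen here.

```python
from datetime import datetime

# Timestamp layout used in the timeline (UTC, no seconds).
FMT = "%Y/%m/%d %H:%M"

# Events copied from the IN2P3 SRM ALARM timeline; labels are shorthand.
events = [
    ("2010/08/19 22:30", "opened"),
    ("2010/08/19 22:37", "automatic acknowledgement"),
    ("2010/08/19 22:46", "site admin feedback"),
    ("2010/08/20 07:25", "solved"),
]

opened = datetime.strptime(events[0][0], FMT)

# Minutes elapsed between ticket opening and each later event.
elapsed_minutes = {
    label: int((datetime.strptime(stamp, FMT) - opened).total_seconds() // 60)
    for stamp, label in events[1:]
}

for label, minutes in elapsed_minutes.items():
    print(f"{label}: {minutes} min after opening")
```

The same calculation applies to the other ALARM timelines by swapping in their timestamps.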
14
ATLAS ALARM->CNAF LFC
https://gus.fzk.de/ws/ticket_info.php?ticket=61305

Time (UTC) | What happened
2010/08/19 17:23 | GGUS ALARM ticket opened, automatic email notification to t1-alarms@cnaf.infn.it AND automatic assignment to ROC_Italy.
2010/08/19 21:33 | Site admin acknowledges LFC problem on single node. Investigating.
2010/08/20 00:41 | Site DB expert restarted LFC daemons.
2010/08/20 01:11 | Submitter acknowledges problem is solved. Ticket will be closed a.s.a.p.
15
ATLAS ALARM->ASGC EXPORT FAILURES
https://gus.fzk.de/ws/ticket_info.php?ticket=60983

Time (UTC) | What happened
2010/08/09 06:45 | GGUS ALARM ticket opened, automatic email notification to asgc-t1@lists.grid.sinica.edu.tw AND automatic assignment to ROC_Asia/Pacific.
2010/08/10 10:15 | Sys. admin redirects to TEAM ticket https://gus.fzk.de/ws/ticket_info.php?ticket=60740
2010/08/10 14:22 | This ticket closed; work continues via the TEAM ticket.
2010/08/16 06:00 | TEAM ticket put to ‘solved’ by site admin. Solution was a bandwidth upgrade to the specific disk servers.
16
ATLAS ALARM->SARA SRM CONTACT FAILURE
https://gus.fzk.de/ws/ticket_info.php?ticket=60642

Time (UTC) | What happened
2010/07/29 06:48 | GGUS ALARM ticket opened, automatic email notification to grid.support@sara.nl AND automatic assignment to NGI_NL.
2010/07/29 07:32 | Full partition found. Downtime published in GOCDB. Space emptied. Ticket set to ‘solved’ by sys. admin.
17
ATLAS ALARM->CERN FILE READ/EXPORT TO T1 FAILS
https://gus.fzk.de/ws/ticket_info.php?ticket=60723

Time (UTC) | What happened
2010/07/31 14:01 | GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN.
2010/07/31 16:34 | Submitter reports T0 data export problems. Same as GGUS:60720, GGUS:60722. Email NOT received by the operator!!??
2010/08/01 08:50 | Castor experts investigated. The RAID controller reset after the write, but before the file had been flushed to disk. Monitoring missed this too. Raw file was lost. Unmerged source files found. Experiment checking how to add the re-created raw file to the dataset. Ticket put to ‘solved’ by the submitter.
18
Service incident report updates
SIR received for CERN vault cooling issues
SIR being prepared for LHCb online DB problem after power cut
SIR requested for LFC/FTS DB problem at NLT1
19
Other Service problems
Slow transfers between CNAF and BNL
  Still being investigated
Several failures in DB servers at CERN requiring reboots
LFC alarms at CERN due to badly optimized LHCb applications
20
Summary
Quite a few major incidents, especially in the areas of:
  Databases: CERN and NLT1
  Cooling: 30% of WNs left at PIC for one day
  Catalogues: ATLAS LFC crashes at CNAF and PIC
CERN BDII not publishing OSG info
Some recurrent problems: SRM, CREAM-CE and Software Areas
Slow transfers between CNAF and BNL still not understood
ATLAS had 30% fewer resources last Thursday because of the problems at several sites