WLCG Service Report
Jamie.Shiers@cern.ch
~~~ WLCG Management Board, 10th March 2009

Introduction
This report covers the two-week period since the last WLCG MB.
Our run of “no major service incidents” has been broken by several incidents in the last two weeks.
One of these – the fire in Taipei – will take a long time to recover from fully (up to 2 months!). Recovery is underway – LFC is back and FTS soon(?). Update at tomorrow’s GDB…
Another – the CASTOR-related problems caused by a network intervention at CERN – also needs further analysis: human mistakes are probably inevitable, but IMHO this outage was COMPLETELY avoidable.
Action: IT-DES experts will be present whenever a further such intervention is performed.

Major Service Incidents

Site  | When   | What           | Report?
CNAF  | 21 Feb | Network outage | Promised…
ASGC  | 25 Feb | Fire           | E-mails 25/2 & 2/3
nl-t1 | 3 Mar  | Cooling        | E-mailed
CERN  | 4 Mar  | Human error    | Provided by IT-FIO (Olof) (FIO wiki of service incidents)

Wide disparity in reports – both in level of detail and in the delay in producing them (some are still pending…).
We agreed that they should be produced by the following MB – even if some issues are still not fully understood.
Would adopting a template – such as that used by IT-FIO or GridPP – help? (Discuss at pre-CHEP workshop…)
Is the MB content with the current situation?

CASTOR: switch intervention
Announcements: at the IT “CCSR” meeting of 25 Feb, an intervention on the private switches of the DBs for the CASTOR+ services was announced:
  Oracle DB services: “The firmware of the network switches in the LAN used for accessing the NAS filers, as well as the LAN used to implement the Oracle Cluster interconnect, will be upgraded. This intervention should be transparent for users since these LANs use a redundant switch configuration.”
Only the intervention on 2 Mar was put on the IT service status board and AFAIK there was no EGEE broadcast (“at risk” would have been appropriate) – but the intervention was done on 4 Mar anyway!
News regarding the problem and its eventual resolution was poorly handled – no update was made after 11:30 on 4 Mar, despite a “promise”: “We will update on the status and cause later today, sorry for inconvenience.”
The reports at the 4 Mar CCSR were inconsistent and incomplete: the service as seen by the users was down from around 09:45 for 3–4 hours.
At least some CASTOR daemons / components are not able to reconnect to the DB in case of problems – this is NOT CONSISTENT with WLCG service standards.
Cost of this intervention: IT – minimum several days; users – ?

CASTOR p.m. – Olof
https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304
Description: All CASTOR Oracle databases went down at the same time following a 'transparent' intervention on the private network switches between the NAS headnodes and the storage. This caused a general service outage of the stagers and the CASTOR name server (as well as other central services).
Impact: All CASTOR and SRM production instances were down for approximately 3 hours.
Time line of the incident:
09:43  Oracle databases went down
10:01  Users started to report problems accessing CASTOR
10:26  CASTOR service manager submitted a first incident message for posting on the service status board (http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/IncidentArchive/090304-CASTOR.htm)
10:40  Most of the databases are back; the srm-*-db databases are still down for now
11:30  Most databases back except srm-atlas-db and c2cmsdlfdb
11:34  CASTOR name server daemons restarted; this was required in order to re-establish the database sessions
11:36  Service status board updated with the information that most databases were back
11:45  castorcms recovered
11:49  All databases back
13:00  castoratlas and castorlhcb recovered
13:21  All SRM servers restarted
13:30  castorpublic recovered
16:48  castorcernt3 recovered
The network bandwidth plots for the various instances (see the bottom of the post-mortem page) give a good indication of the outage period for the 5 instances.
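For reference, the outage lengths implied by the timeline above can be checked directly; the short snippet below just does that arithmetic (times copied from the post-mortem, date taken to be 4 March 2009, instance list limited to those mentioned).

    from datetime import datetime

    DOWN = datetime(2009, 3, 4, 9, 43)          # Oracle databases went down
    RECOVERED = {
        "castorcms":    datetime(2009, 3, 4, 11, 45),
        "castoratlas":  datetime(2009, 3, 4, 13, 0),
        "castorlhcb":   datetime(2009, 3, 4, 13, 0),
        "castorpublic": datetime(2009, 3, 4, 13, 30),
        "castorcernt3": datetime(2009, 3, 4, 16, 48),
    }

    for name, back in sorted(RECOVERED.items()):
        print(f"{name:14s} down for {back - DOWN}")

castoratlas and castorlhcb come out at about 3h17m, consistent with the quoted “approximately 3 hours” for the production instances; castorcernt3 was down for over 7 hours.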

CASTOR – cont.
When the databases started to come back, the CASTOR2 stager daemons automatically reconnected, but this was not sufficient to recover the service. The CASTOR name servers were stuck in stale Oracle sessions; that problem was discovered by the CASTOR development team and the servers had to be restarted.
However, even after the name servers had been restarted, several CASTOR instances (castoratlas, castorlhcb, castorpublic and castorcernt3) were still seriously degraded. It is likely that the CASTOR2 stager daemons were stuck in name server client commands. A full restart of all CASTOR2 stager daemons on the affected instances finally recovered the production services by ~13:00. All the SRM daemons were restarted at 13:21 for the same reason.
The recovery of the less critical instance castorcernt3 was delayed because its deployment architecture is different. It was only in the late afternoon that it was finally understood that the instance was stuck because it runs an internal CASTOR name server daemon (this will become the standard architecture with the 2.1.8 production deployment). After the daemon (and the stager daemons) had been restarted, the instance rapidly recovered.
The existing procedure for recovering CASTOR from scratch (PowercutRecovery) needs to be reviewed.
The recovery of some of the CASTOR stager instances took longer than necessary; the likely reason is that although the database connections had been automatically re-established, most of the threads were stuck in CASTOR name server calls (this cannot be confirmed).
Next time a message should also be posted to the Service Status Board when the service has been fully recovered.
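The underlying pattern here – a daemon reconnects at the network level but keeps reusing a stale database session – is generic. The sketch below is purely illustrative (it is not CASTOR code; the DSN and credentials are placeholders, and cx_Oracle stands in for whatever DB client the daemons actually use): a daemon that validates the session with a trivial probe query before each unit of work, and only then reconnects, resumes service as soon as the database is back rather than waiting for a manual restart.

    import time
    import cx_Oracle  # assumed driver for the sketch; any DB-API driver would do

    DSN = "user/password@castor_ns_db"   # hypothetical credentials / TNS alias

    def get_live_connection(conn=None):
        """Return a connection that answers a trivial probe query, reconnecting if stale."""
        while True:
            if conn is not None:
                try:
                    cur = conn.cursor()
                    cur.execute("SELECT 1 FROM dual")   # probe: fails on a stale session
                    cur.close()
                    return conn                         # session is usable
                except cx_Oracle.DatabaseError:
                    try:
                        conn.close()                    # drop the stale session
                    except cx_Oracle.DatabaseError:
                        pass
                    conn = None
            try:
                conn = cx_Oracle.connect(DSN)           # (re)open a fresh session
            except cx_Oracle.DatabaseError:
                time.sleep(30)                          # DB still down: back off and retry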

CERN Network – Warning! (Postponed to April 1st?)
There will be an “Important Disruptive Network Intervention on March 18th”.
06:00 – 08:00 Geneva time: This will entail a ~15 min interruption, which will affect access to AFS, NICE, MAIL and all databases hosted in the GPN, among other services. Next, the switches in the General Purpose Network that have not previously been upgraded will be upgraded, resulting in a ~10 min interruption. All services requiring access to services hosted in the Computer Centre will see interruptions.
08:00 – 12:00 Geneva time: The routers of the LCG network will be upgraded at 08:00, mainly affecting the batch system and CASTOR services, including Grid-related services. The switches in the LCG network that have not previously been upgraded will be upgraded next.
(Recent network interventions have been aimed at reducing the amount of work done in this major intervention.)
See the FIO preparation page for this intervention and also the joint OPS meetings.

ASGC Fire
Report from the site:
“It has been a disaster. The whole data centre area is affected: damage to the UPS battery brought the entire power system down, and dust and smoke spread into the other computer room, in which all computing and storage facilities reside. A minor water leak was observed while the fire fighters were suppressing the fire in the power room. We left the DC an hour ago; right now the situation in the data centre is not acceptable for people to stay in for long.”
Full recovery might take up to 1.5 – 2 months.

GGUS Summaries

Week 9
VO concerned | USER | TEAM | ALARM | TOTAL
ALICE        |      |      |       | 2
ATLAS        | 25   | 24   | 0     | 49
CMS          |      |      |       | 5
LHCb         | 17   | 1    | 0     | 18
Totals       |      |      |       | 74

Week 10
VO concerned | USER | TEAM | ALARM | TOTAL
ALICE        | 0    | 0    | 9 (??)| 9
ATLAS        | 32   | 6    | 12    | 50
CMS          | 3    | 0    | 9     | 12
LHCb         | 13   | 2    | 0     | 15
Totals       | 48   | 8    | 30    | 86

Alarm tests were performed successfully against the Tier0 & Tier1s. There are still problems with the mail2SMS gateway at CERN (FITNR), and some VOs sent an “empty” alarm rather than a “sample scenario”.
Should we re-test soon or wait 3 months for the next scheduled test?
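As a quick sanity check on the (partly reconstructed) Week 10 numbers above, the row and column sums can be verified mechanically; the snippet below is just that arithmetic spelled out, with the table hard-coded as shown.

    # Week 10 GGUS counts as reconstructed above: (USER, TEAM, ALARM) per VO.
    week10 = {
        "ALICE": (0, 0, 9),
        "ATLAS": (32, 6, 12),
        "CMS":   (3, 0, 9),
        "LHCb":  (13, 2, 0),
    }

    per_vo_totals = {vo: sum(counts) for vo, counts in week10.items()}
    column_totals = [sum(col) for col in zip(*week10.values())]

    print(per_vo_totals)                 # {'ALICE': 9, 'ATLAS': 50, 'CMS': 12, 'LHCb': 15}
    print(column_totals)                 # [48, 8, 30]  -> matches the Totals row
    print(sum(per_vo_totals.values()))   # 86           -> matches the grand total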

Alarm Summary
Most of the alarm tickets are (successful) tests of the alarm flows (some small problems remain…).
GGUS Ticket-ID: 46821
Description: All transfers to BNL fail
Detailed description: Hello, since 21:00 (CET) all transfers to BNL fail. The main error message is:
  FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv1]]
Can you provide news? Stephane
Solution: The BNL GUMS service has been fixed. This, in turn, fixed the DDM problem in dCache. The DDM service is working normally at BNL and in the US. Hiro
This solution has been verified by the submitter.
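Failure reasons of the kind quoted in this ticket follow a regular pattern (side, transfer phase, error class), which makes them easy to tally when judging whether a problem is site-wide. The sketch below is purely illustrative – the hard-coded list of reason strings stands in for whatever log or monitoring source a shifter would actually query – and simply groups FTS “Reason” strings by that pattern.

    import re
    from collections import Counter

    # Tally FTS failure reasons by (side, phase, error class).
    REASON_RE = re.compile(r"(?P<side>\w+) error during (?P<phase>\w+) phase: "
                           r"\[(?P<err>[A-Z_]+)\]")

    def classify(reasons):
        """Count (side, phase, error) triples appearing in FTS 'Reason' strings."""
        counts = Counter()
        for reason in reasons:
            match = REASON_RE.search(reason)
            if match:
                counts[(match.group("side"), match.group("phase"), match.group("err"))] += 1
        return counts

    # Example input: the reason string from GGUS ticket 46821 above.
    sample = ["SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] "
              "failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv1]"]
    print(classify(sample))
    # Counter({('SOURCE', 'TRANSFER_PREPARATION', 'CONNECTION_ERROR'): 1})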

GGUS Alarm Tests cont.
LHCb (Roberto Santinelli) also performed tests yesterday – interim results are available at http://santinel.web.cern.ch/santinel/TestGGUS.pdf
These results are still being analysed – IMHO it is premature to draw concrete conclusions from them – but it would be interesting to understand why and how the ATLAS & CMS tests were globally successful, whereas for LHCb at least some sites – and possibly also the infrastructure – gave problems.
For this and other reasons I suggest that we prepare carefully for another test, to be executed and analysed PRIOR to next month’s F2F / GDB.

Service Summary – 23 Feb to 1 Mar

Service Summary – “Last Week”

Last 2 Weeks

Summary
Major service incidents at several sites in the last two weeks.
Prolonged outage of the entire ASGC site to be expected due to the fire.
Poorly announced & executed intervention at CERN affected all CASTOR services for several hours.
Other than ASGC, CNAF and RAL appear to regularly have problems with experiment tests (note that short-term glitches get smeared out in weekly views).
Otherwise things are OK!