WLCG Service Report ~~~ WLCG Management Board, 10th November 2009


Introduction
- Covers the weeks from 26th October to 6th November.
- Main issues arising or closed:
  - ASGC: ATLAS conditions database re-synchronization
  - ASGC: CASTOR service outage
  - SARA: FTS and LFC database inconsistency after h/w move
  - Large amount of batch queries at CERN from LHCb
  - New Linux exploits and associated patch campaigns
  - SIR for the IN2P3 cooling incident on 3rd November

Meeting Attendance Last Week

Site     Attended (one Y per day, Mon-Fri)
CERN     YYYYY
ASGC     YYYYY
BNL      YYYYY
CNAF     YYYYY
FNAL     -
FZK      YYYY
IN2P3    YYY
NDGF     -
NL-T1    YYYYY
PIC      YY
RAL      YYYYY
TRIUMF   -

GGUS summary (2 weeks)

VO      User  Team  Alarm  Total
ALICE   2     0     0      2
ATLAS
CMS
LHCb
Totals

No alarm tickets.


Security
Intervention to protect against new Linux exploits
- From the EGEE broadcast last Thursday: A severe vulnerability in the 2.4 and 2.6 Linux kernels (CVE ) was published, which leads to a kernel NULL pointer dereference, i.e. it falls into the same category as the previous vulnerabilities which have led to a number of successful root-level intrusions. A public exploit has been released, and all sites are asked to URGENTLY APPLY the relevant security patches.
- Large number of installations affected (e.g. also WNs)
- Rapid patching and reboot campaign started at CERN and other sites
- VO responsibles reacted quickly for VOboxes
- Intervention at CERN concluded last Friday
- BNL suggested that the security team might be able to give some certification information for urgent fixes together with incident announcements

ASGC DB Problems
- 2 major DB problems
- ATLAS Conditions DB:
  - After a long outage of the ATLAS conditions DB, a complete re-instantiation was performed using transportable tablespaces.
  - Thanks to BNL, who acted as the source DB.
- CASTOR DB:
  - Several dedicated phone meetings with experts from CERN and ASGC
  - All CASTOR DBs (apart from the monitoring DB) have been recovered
  - DB configuration review resulted in the setup of 2 DB clusters using ASM for CASTOR ns and other CASTOR services
- Lessons learned for the recovery procedure and setup will be part of the upcoming Distributed DB Operations workshop rd&confId=

ASGC CASTOR Service
- Despite the large efforts from ASGC and CERN experts to re-establish the CASTOR service:
  - Experiments still have problems getting a reliable service for data transfers
  - ASGC team is working hard to resolve the issues
- Need completion of the existing Service Incident Report from ASGC
- CERN experts offer help to provide technical input on the DB and CASTOR side
- A higher rate of emergency situations like this is not sustainable for site teams, nor for the CERN teams

SIR: Cooling IN2P3
Cooling outage at CC-IN2P3, Tuesday 3rd November 2009
Author: Marc Hausard
Description: Unexpected outage of the cooling service while some work was being performed on the heating system.
Timeline:
  15:15  Circuit breaker powers off the building control unit and the water pump.
  15:25  Abnormal rise of temperature is noticed.
  15:30  Staff powered off WNs. Actions taken to lower the room temperature by extracting warm air.
  15:50  Critical services back. Worker nodes are gradually brought back into production.
  16:24  Incident is reported to users through newsgroup.
  19:00  Batch is re-opened.
  [….]
Follow-up: The incident showed that the time available for reaction is very short in such a failure. It has been agreed to set up an automated mechanism to switch off WNs based on temperature rise.
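
The follow-up action for the IN2P3 incident (switching off WNs automatically on temperature rise) could be sketched roughly as below. This is only an illustration, not the mechanism IN2P3 agreed on: the Reading type, the should_power_off function and both thresholds are hypothetical, and a real setup would read the machine-room sensors and call the site's own shutdown tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Reading:
    minute: float   # minutes since monitoring started
    celsius: float  # machine-room temperature at that time


# Hypothetical limits, chosen for illustration only.
MAX_CELSIUS = 30.0          # absolute ceiling for the room
MAX_RISE_PER_MINUTE = 0.5   # average rise rate treated as abnormal


def should_power_off(readings: Sequence[Reading]) -> bool:
    """Return True when the worker nodes should be switched off.

    Triggers on either an absolute ceiling or a fast average rise
    over the window of readings passed in (oldest first).
    """
    if not readings:
        return False
    latest = readings[-1]
    if latest.celsius >= MAX_CELSIUS:
        return True
    if len(readings) < 2:
        return False
    elapsed = latest.minute - readings[0].minute
    if elapsed <= 0:
        return False
    rate = (latest.celsius - readings[0].celsius) / elapsed
    return rate >= MAX_RISE_PER_MINUTE
```

Checking the rate of rise as well as an absolute ceiling reflects the point made in the SIR: only about ten minutes passed between the circuit breaker tripping and the temperature rise being noticed, so waiting for a fixed ceiling alone may leave too little time to react.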

Miscellaneous Reports
- Tape migration problems under investigation at RAL
- FTS 2.2 being tested by ATLAS (main reason: checksum support), using patched CERN
- Unexpected FTS upgrades to 2.2 took place at BNL, FZK and NL-T1
- dCache upgrades at many sites; so far no major problems
- New VObox version tested by ALICE
- Interventions at PIC and BNL, planned with the experiments, affecting part of site storage

Summary/Conclusions
- Long-standing problems at ASGC (CASTOR and Conditions Database) hopefully resolved now. Conditions and CASTOR services reopened.
- NL-T1 had difficulties with moving a consistent DB version for LFC to the new DB cluster.
- A coordinated DB recovery validation exercise will take place on 26th November: all T1 sites are encouraged to participate.
- Additional documentation for application schema migration is in preparation.
- Urgent security issue triggered rapid reaction at CERN and other WLCG sites.
- Activity is rising, in some areas above the sustainable level for existing staffing.