WLCG Service Report ~~~ WLCG Management Board, 16th September 2008. Minutes from daily meetings.


Highlights
- ☺ LHC First Beam day!
- Service Load & Security Problems
- ☹ Post-Mortems: RAL CASTOR/Oracle problems still outstanding
  - IMHO we really need to be timely and precise with these: e-mails are still circulating referring to “Nilo’s suggested DB config changes”.
  - _kks_use_mutex_pin=false
- Preparation for WLCG session during EGEE’08
- Promised for today!

First Beam Day
- The time for asking “will we be ready?” is clearly over
- The pertinent question now is “were we ready?”
- ☹ It is clear that some of the services (even key ones) still need further hardening, and more besides…
- This is also true for some procedures, e.g. alarm mail handling, put in place too late to be fully debugged during CCRC’08
- None of the “First Beam day” photos that I have seen (apart from event displays) really shows the computing aspect: something to rebalance during the LCG Gridfest?

Service Load Problems
- Service load problems have been seen on a number of occasions, including the ATLAS conditions DB at various sites, as well as the ATLAS CASTOR instance at CERN (post-mortem, including discussion of GGUS alarm ticket follow-up):
  - 10’ for CASTOR expert to be called
  - 10’ for intervention to start
  - 18’ for problem to be identified
  - < 3 hours total from start of problem to confirmation of resolution
- These problems are likely to persist for at least weeks (months?). We should understand which usage patterns cause them, as well as why they were not included in the CCRC’08 tests

Post-Mortem
- 18:10 - problem started
- 19:34 - GGUS ALARM TICKET submitted by ATLAS shifter
- 19:35 - mail received by CERN Computer Centre operator:
    From: GGUS
    Sent: Wednesday 10 September :35
    To: atlas-operator-alarm;
    Subject: GGUS-Ticket-ID: #40726 ALARM CH-CERN Problems exporting ATLAS data from
- 19:45 - CASTOR expert called
- 19:55 - CASTOR expert starts investigating
- 20:13 - CASTOR expert identifies that the problem is due to a hotspot. The resolution is applied (see below) and ATLAS informed.
- 20:47 - CASTOR re-enabled the diskserver after having confirmed that the requests were better load-balanced over all servers in the pool.
- 20:57 - ATLAS confirms that the situation is back to normal
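
The response intervals quoted for this incident (10’, 10’, 18’, under 3 hours in total) can be cross-checked against the timestamps in the timeline. What follows is only a minimal sketch in Python, assuming all events fall on the same day; the names `TIMELINE`, `minutes_between` and `step_durations` are illustrative and not part of any WLCG or GGUS tooling.

```python
from datetime import datetime

# Timestamps copied from the post-mortem timeline (assumed to be on one day).
TIMELINE = [
    ("18:10", "problem started"),
    ("19:34", "GGUS alarm ticket submitted"),
    ("19:35", "mail received by operator"),
    ("19:45", "CASTOR expert called"),
    ("19:55", "CASTOR expert starts investigating"),
    ("20:13", "problem identified"),
    ("20:47", "diskserver re-enabled"),
    ("20:57", "ATLAS confirms back to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(timeline):
    """Pair each event (except the first) with the minutes since the previous one."""
    return [
        (timeline[i][1], minutes_between(timeline[i - 1][0], timeline[i][0]))
        for i in range(1, len(timeline))
    ]


if __name__ == "__main__":
    for label, minutes in step_durations(TIMELINE):
        print(f"{minutes:3d} min until: {label}")
    total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
    print(f"total: {total} min ({total // 60} h {total % 60} min)")
```

Run on these timestamps, the steps from operator mail to expert call, from call to start of investigation, and from investigation to diagnosis come out at 10, 10 and 18 minutes, and the total is 167 minutes (2 h 47 min). The longest single interval is the 84 minutes between the start of the problem and the submission of the alarm ticket.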

Network Problems
- BNL still “in the headlines” regarding network-related problems.
- Primary OPN link failed Thursday night when a fibre bundle was cut on Long Island. Manual failover to secondary link.
- Need to automate such failovers, and continue to follow up on the (relatively) high rate of problems seen with this link
- Network problem at CERN on Monday caused a 3.5 hour degradation affecting CASTOR

Database Service Enhancements
- Support (on best effort) for CMS and LHCb online databases added to the service team's responsibilities
- Oracle Data Guard stand-by databases in production for all the LHC experiments' production databases (using hardware going out of warranty by the end of the year). Additional protection against:
  - human errors
  - disasters (disaster recovery)
  - security attacks

WLCG Sessions during EGEE’08
The idea is to have a panel / discussion session with 3 main themes:
1. Lessons learned from the formal CCRC'08 exercise and from production activities;
2. Immediate needs and short-term goals: the LHC startup, first data taking and (re-)processing of 2008 data;
3. Preparation for 2009, including the CCRC'09 planning workshop.
- In each case the topic will be introduced with a few issues, followed by a wider discussion that also involves people from the floor.
- Not looking for ‘official statements’: the opinions and experience of all are valid and important.
- These panels have worked well at previous events (WLCG workshops, GridPP, INFN etc.) and do not require extensive preparation.
- It is probably useful to write down a few key points / issues in a slide or two (not a formal presentation!)
- It is also an opportunity to focus on some of the important issues that may not have been fully discussed in previous events.

Post-Mortems
- We are now pretty good at preparing timely and detailed post-mortems
- But what happens next?
- e.g. both the CASTOR/ATLAS and CASTOR/RAL “post-mortems” propose actions and other follow-up
- Without inventing a complex procedure, how do we ensure that this happens?
- A: add to MB action list? WLCG operations?

Conclusions
- ☺ The service is running; problems are responded to and resolved within the time windows that we have established as realistic
- ☹ Further service hardening clearly required; this will proceed in parallel with the ongoing LHC commissioning
- ☹ Consistent follow-up on post-mortem actions required
- ☺ AFAIK, the interest of “the world” in these activities is unprecedented!