LCG Tier1 Reliability John Gordon, STFC-RAL CCRC09 November 13th, 2008.

Presentation transcript:


The Problem
The MoU commitments do not permit many long breaks before availability drops below the required level.
The LHC experiment computing models are sensitive to breaks in service at any Tier1.
Anecdotally, the number of T1 breaks in service is felt to be too high and has not dropped recently.
Recovering from breaks in service soaks up staff effort for both sites and experiments.
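The first point can be made concrete with a little arithmetic. The sketch below is illustrative only: the 98% target and the 30-day month are assumptions chosen for the example, not figures taken from the MoU or from this talk, and the function names are hypothetical.

```python
from typing import Sequence


def downtime_budget_hours(target_availability: float, period_hours: float) -> float:
    """Hours a service may be down in a period while still meeting the target."""
    if not 0.0 <= target_availability <= 1.0:
        raise ValueError("target_availability must be between 0 and 1")
    return (1.0 - target_availability) * period_hours


def availability(period_hours: float, outage_hours: Sequence[float]) -> float:
    """Fraction of the period during which the service was up."""
    down = sum(outage_hours)
    if down > period_hours:
        raise ValueError("total outage time exceeds the period")
    return (period_hours - down) / period_hours


# Assumed values, for illustration only.
TARGET = 0.98           # hypothetical availability target
MONTH_HOURS = 30 * 24   # a 30-day month

budget = downtime_budget_hours(TARGET, MONTH_HOURS)
# The talk treats an outage of more than half a day as a serious incident.
half_day_breaks = int(budget // 12)

print(f"downtime budget: {budget:.1f} h per month")
print(f"half-day breaks that fit in the budget: {half_day_breaks}")
print(f"availability after two half-day breaks: {availability(MONTH_HOURS, [12, 12]):.3f}")
```

Under these assumed numbers a single half-day break uses most of the monthly budget, and a second one pushes the site below the target.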

Types of Problem
My questions
Few answers from T1. Summarise
How to improve?
Buy better hardware?
Redundancy
Best Practice

My Questions 1
In 2008, what has been your experience of different types of serious incident (e.g. >0.5 day down)?
1. Catastrophic failures which affected all your services, e.g. power failure, air-conditioning failure.
2. Hardware failures (disk crash, CPU died) which resulted in loss of service (i.e. no failover).
3. Middleware failures, where the service failed and needed non-trivial manual intervention to bring it back into service.

My Questions 1
Table columns: ASGC, BNL, CNAF, FZK, FNAL, IN2P3, NDGF, NL, PIC, RAL, TRIUMF
Table rows: 1 Cat, Har, Mid (cell values, column alignment lost: RSfew18)
FZK – storage slowdown, no middleware breaks
FNAL – FCC has generator, GCC vulnerable but nothing this year.
FNAL – local hardware problems but not for CMS
FNAL – PhEDEx (now improved) and FTS
IN2P3 – middleware services failed due to Oracle patch
RAL – 2 double disk RAID failures
RAL – a variety of different Castor issues

My Questions 2
1. Which services do you believe that you have hardened?
– i.e. redundancy, failover, UPS, whatever is relevant.
2. Have you identified any services which you plan to harden over the winter?
3. Have you identified any services which you cannot see how to harden sufficiently?

My Questions 2
Table columns: ASGC, BNL, CNAF, FZK, FNAL, IN2P3, NDGF, NL, PIC, RAL, TRIUMF
Done: X X X
Winter: X X X
Cannot: X X X

My Questions 2
Table columns: ASGC, BNL, CNAF, FZK, FNAL, IN2P3, NDGF, NL, PIC, RAL, TRIUMF
Done: CE, FTS, WMS Monitoring UPS, CE, LFC, WMS, dcache CE, FTS, WMS, SRM
Winter: -Power sep, db UPS pnfs SSD+ new db FTS Castor and Oracle
Cannot: SRM-Dcache core nodes, LFC
Other: BDII, sBDII, LFC

Best Practice
Redundancy
– not all services benefit
– Independent or round robin
UPS
Mirroring system disks,
– & isolating system disks from service
Well documented recovery procedures
– So that anyone called in can restart or replace a service
– Tested
– For individual services and full power cuts
Capacity Planning
– Plan to cope with the planned load plus a safety margin, not the load you see
– But what is the planned load?
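The capacity planning rule above can be written down as a small calculation. This is a minimal sketch with made-up numbers: the 25% margin, the load figures and the function names are all assumptions for illustration, not values from the talk.

```python
def required_capacity(planned_load: float, safety_margin: float) -> float:
    """Capacity needed to carry the planned load plus a fractional safety margin."""
    if planned_load < 0 or safety_margin < 0:
        raise ValueError("planned_load and safety_margin must be non-negative")
    return planned_load * (1.0 + safety_margin)


def capacity_shortfall(installed: float, planned_load: float, safety_margin: float) -> float:
    """Capacity still missing; 0.0 if the installation already suffices."""
    return max(0.0, required_capacity(planned_load, safety_margin) - installed)


# Hypothetical figures in arbitrary units (for example job slots).
observed_load = 600.0   # what the site sees today; deliberately not used for sizing
planned_load = 1000.0   # what the experiments say they will need
installed = 1100.0

shortfall = capacity_shortfall(installed, planned_load, safety_margin=0.25)
print(f"required: {required_capacity(planned_load, 0.25):.0f}")
print(f"shortfall against installed capacity: {shortfall:.0f}")
```

Sizing against `observed_load` instead would make the same installation look comfortable, which is the trap the slide warns about.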

But…
Are all sites doing all of these?

Other Issues
Middleware
On-call is not mandatory, and staff cannot work all night.
Often need many experts
Reduced capacity
– Running on half total load is usually simple: reduce batch work
– But what if one transformer went? Are there instances of critical services on another?
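The transformer question can be checked mechanically once a site records which power feed each service instance sits on. The sketch below assumes such an inventory exists; the placement data, the feed names and the function name are hypothetical.

```python
from typing import Dict, List, Set


def services_lost(placement: Dict[str, Set[str]], failed_feed: str) -> List[str]:
    """Services with no instance left once `failed_feed` goes down."""
    return sorted(
        service
        for service, feeds in placement.items()
        if not (feeds - {failed_feed})
    )


# Hypothetical inventory: service name -> power feeds hosting an instance.
placement = {
    "CE": {"transformer-A", "transformer-B"},
    "FTS": {"transformer-A"},
    "LFC": {"transformer-B"},
    "SRM": {"transformer-A", "transformer-B"},
}

for feed in ("transformer-A", "transformer-B"):
    print(feed, "->", services_lost(placement, feed))
```

Any service reported here has all of its instances behind a single feed and is a candidate for a second instance elsewhere.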

Outcomes?
Sharing best practice
Workshops
Documentation
Review each other
Top priority middleware improvements
– Bug fixes