WLCG Site Reviews. Ian Bird, LCG Project Leader. Prague, 21st March 2009.


The problem(s)...
- Site/service unreliability
- Site unavailability
- Instabilities of the above...
- Frequency of major incidents and consistent follow-up
- Problems in procuring/commissioning resources to the pledge level


[Table: resource ramp-up status for ASGC, BNL, NL-T1, CNAF, Total]
Many problems to ramp up resources:
- Delays in procurements
- Faulty equipment
- Lack of power & planning

Tier-2 Reliability (M.C. Vetterli, LHCC review, CERN, Feb. 09; Simon Fraser): [reliability charts for May 08, September 08 and January 09]

Tier-2 Reliability (M.C. Vetterli, LHCC review, CERN, Feb. 09; Simon Fraser)
- 41 of 62 sites are now green; 8 more are >80%. The average is now 90%.
- All but 1 site are reporting; in particular the situation in the US has been resolved.
- Still some one-off issues, such as a few sites with green reliability but yellow availability (i.e. significant declared downtime); a sketch of the distinction follows below.
- Tier-2 specific tests exist:
  - CMS has Tier-2 commissioning
  - ATLAS has Tier-2 specific functional tests
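The "green reliability, yellow availability" cases above come down to how declared (scheduled) downtime is counted. Below is a minimal sketch of the distinction, assuming the usual WLCG-style definitions (availability counts all downtime against the site, reliability excludes scheduled downtime); the numbers are purely illustrative, not real site data.

```python
# Minimal sketch: why a site can be "green" on reliability but "yellow" on
# availability. Assumed definitions:
#   availability = uptime / total_time
#   reliability  = uptime / (total_time - scheduled_downtime)

def availability(uptime_h: float, total_h: float) -> float:
    return uptime_h / total_h

def reliability(uptime_h: float, total_h: float, scheduled_down_h: float) -> float:
    return uptime_h / (total_h - scheduled_down_h)

if __name__ == "__main__":
    total = 30 * 24.0           # one month, in hours
    scheduled_down = 5 * 24.0   # five days of declared (scheduled) downtime
    uptime = total - scheduled_down  # no unscheduled outages at all

    print(f"availability = {availability(uptime, total):.0%}")                 # ~83% -> "yellow"
    print(f"reliability  = {reliability(uptime, total, scheduled_down):.0%}")  # 100% -> "green"
```

With five days of declared downtime and no unscheduled outages, the example site scores roughly 83% availability but 100% reliability, which is exactly the yellow/green split described on the slide.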

Tier-2 Reliability, continued (M.C. Vetterli, LHCC review, CERN, Feb. 09; Simon Fraser): [charts only]

Serious Incidents in the last six months:
- Castor: ASGC, CERN, CNAF, RAL
- dCache: FZK, IN2P3, NL-T1
- Oracle: ASGC, RAL
- Power: ASGC, PIC, NL-T1, CNAF
- Cooling: CERN, IN2P3
- Network: CNAF, PIC, BNL, CERN
- Other: CNAF, RAL, NL-T1
- Fire: ASGC
Tier1s will be down. Experiment models should cope.

Major Service Incidents

Site  | When   | What           | Report?
CNAF  | 21 Feb | Network outage | Promised…
ASGC  | 25 Feb | Fire           | …s 25/2 & 2/3
NL-T1 | 3 Mar  | Cooling        | …ed
CERN  | 3 Mar  | Human error    | Provided by IT-FIO (Olof) (FIO wiki of service incidents)

- Wide disparity in reports, both in level of detail and in delay in producing them (some others still pending…)
- We agreed that they should be produced by the following MB, even if some issues were still not fully understood
- Would adopting a template, such as that used by IT-FIO or GridPP, help? (Discuss at the pre-CHEP workshop…) A hypothetical illustration follows below.
- Is the MB content with the current situation?
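On the template question: the fields below are a hypothetical illustration of the minimum a common Service Incident Report template might capture. They are assumptions made for the sketch, not the actual IT-FIO or GridPP format.

```python
# Hypothetical SIR skeleton, only to illustrate what a common template might
# capture; field names and structure are assumed for this sketch.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceIncidentReport:
    site: str                     # e.g. "CNAF"
    service: str                  # affected service, e.g. "network", "CASTOR"
    start: str                    # when the incident started (date/time)
    end: str                      # when service was restored
    impact: str                   # who/what was affected and how badly
    timeline: List[str] = field(default_factory=list)   # chronological notes
    root_cause: str = ""          # what actually went wrong
    follow_up: List[str] = field(default_factory=list)  # actions to prevent recurrence

    def as_text(self) -> str:
        lines = [
            f"Site: {self.site}",
            f"Service: {self.service}",
            f"Duration: {self.start} to {self.end}",
            f"Impact: {self.impact}",
            "Timeline:",
            *[f"  - {t}" for t in self.timeline],
            f"Root cause: {self.root_cause}",
            "Follow-up actions:",
            *[f"  - {a}" for a in self.follow_up],
        ]
        return "\n".join(lines)
```

Whatever the exact format, a fixed minimum set of fields would make reports comparable across sites and easier to produce in time for the following MB.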

FZK DB SIR

At GridKa/DE-KIT the FTS/LFC Oracle RAC database backend was down from January 24 to 26 (Sat approx. 0:00 to Mon approx. 22:30 CET). On Saturday our on-call team immediately received Nagios alerts. From approx. 9:30 on Saturday our DBA worked on the issue and found that many Oracle backup archive logs had been filling up the disks. While trying to add an additional disk, ASM (the Oracle storage manager, i.e. the file system) became blocked; the reason was probably a mistake made by the DBA when preparing the disk to be added. Because the LFC data was on the affected RAC system and it was unclear whether the last daily backup had worked properly, the DBA decided not to try simple repair attempts such as rebooting nodes, but to involve Oracle support. At approx. 16:30 on Saturday she opened an Oracle Service Request. After exchanging information and files with an Oracle supporter (in timezone CET-8h) until late Saturday night, another supporter (in our CET zone) came back to us on Monday at approx. 11:00. With his aid the problem was finally solved.

Remarks: It is unclear to me why it took more than a day until we got an Oracle supporter in our timezone. It could be that the support request was not filled in correctly. I wanted to clarify this before sending a SIR, since it is not clear whether criticising Oracle is fair in this case. As soon as I get to talk to the DBA I will try to clarify on which side mistakes happened. My personal opinion: even though the disk to be added to ASM was not prepared correctly, the system should not block; the command issued to add the disk should throw an error message. From software costing thousands of Euros per licence, I would expect that.
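The underlying trigger here was an archive-log area quietly filling its disks. As an illustrative sketch only (the path and thresholds are hypothetical, not GridKa's actual configuration), a simple Nagios-style check on the archive destination would raise the alarm before an emergency storage change becomes necessary:

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style disk check for an Oracle archive-log destination.
# The path and thresholds are hypothetical placeholders for illustration.
import shutil
import sys

ARCHIVE_DEST = "/oracle/archivelogs"   # hypothetical archive-log destination
WARN_PCT, CRIT_PCT = 75.0, 90.0        # example thresholds

def main() -> int:
    usage = shutil.disk_usage(ARCHIVE_DEST)
    used_pct = 100.0 * usage.used / usage.total
    msg = f"{ARCHIVE_DEST} is {used_pct:.1f}% full"
    if used_pct >= CRIT_PCT:
        print(f"CRITICAL - {msg}")
        return 2                       # Nagios: CRITICAL
    if used_pct >= WARN_PCT:
        print(f"WARNING - {msg}")
        return 1                       # Nagios: WARNING
    print(f"OK - {msg}")
    return 0                           # Nagios: OK

if __name__ == "__main__":
    sys.exit(main())
```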

Service Summary – Experiment Tests: [dashboard charts]

How can we improve reliability? Discuss.
Simple actions:
- Ensure sites have sufficient local monitoring, now including the grid service tests/results from SAM and the experiments (a minimal sketch of pulling these results into local monitoring follows below)
- Ensure the response to alarms/tickets works and is appropriate – test it
- Follow up on SIRs – does your site potentially have the same problem???
- If you have a problem, be honest about what went wrong – so everyone can learn
Workshops:
- To share experience and knowledge on how to run reliable/fault-tolerant services
- WLCG, HEPiX, etc. Does this have any (big) effect?
Visits:
- Suggested that a team visits all Tier 1s (again!) to try and spread expertise... Who, when, what??? Also for some Tier 2s?
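On the first "simple action", here is a minimal sketch of pulling grid-service test results into a site's local monitoring. The endpoint URL and JSON layout are placeholders assumed for illustration; the real SAM/dashboard interfaces differ and would need to be substituted.

```python
# Sketch of feeding the latest grid-service test results into local monitoring.
# The URL and the JSON layout are hypothetical placeholders.
import json
import urllib.request

RESULTS_URL = "https://example.org/sam/results?site=MY-SITE"  # placeholder URL

def fetch_results(url: str) -> list:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)  # assumed: a list of {"test": ..., "status": ...}

def failing_tests(results: list) -> list:
    return [r for r in results if r.get("status") not in ("OK", "ok")]

if __name__ == "__main__":
    failures = failing_tests(fetch_results(RESULTS_URL))
    if failures:
        print(f"{len(failures)} grid-service test(s) failing:")
        for r in failures:
            print(f"  - {r.get('test')}: {r.get('status')}")
    else:
        print("All grid-service tests OK")
```

Run from cron or wrapped as a local alarm check, something like this puts the same view the experiments see in front of the site operators.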

Is communication adequate?
- Are the relevant people consistently informed? (NO!)
- There are sufficient communication opportunities, but they are not always used...
Is staffing of services adequate?
- E.g. we know Tier 1s need a ~full-time DBA for many daily operational actions (Castor, dCache, 3-D, etc.), but they don't all have one
- Large-scale MSS systems need fairly large teams to operate... Sometimes it does not seem enough...

Potential Actions
- Ensure the monitoring we have available is really fully used by everyone – there is a lot of information available now
- Ensure that communication of issues and problems improves: we must expose the problems and follow up on them
- Ensure that communication of best practices happens: workshops, documentation, visits
- Take a look at the staffing levels and the priorities of those staff (easy for me to say...)