WLCG Service Report ~~~ WLCG Management Board, 7th April 2009



Introduction

This report is primarily about the service since last week's MB, but I would like to start with a brief summary of the quarter, based on the draft contribution to the LCG QR. Although this is not an exhaustive list of all serious problems in the quarter, we can still draw some conclusions from it:

Site  When   Issue
CERN  08/01  Many user jobs killed on lxbatch due to memory problems
CERN  17/01  FTS transfer problems for ATLAS
CERN  23/01  FTS / SRM / CASTOR transfer problems for ATLAS
CERN  26/01  Backwards-incompatible change on SRM affected ATLAS / LHCb
CERN  27/02  Accidental deletion of RAID volumes in C2PUBLIC
CERN  04/03  General CASTOR outage for 3 hours
CERN  14/03  CASTOR ATLAS outage for 12 hours
CNAF  21/02  Network outage to Tier2s and some Tier1s
FZK   24/01  FTS & LFC down for 3 days
ASGC  25/02  Fire affecting the whole site; services temporarily relocated
RAL   24/03  Site down after power glitches, with knock-on effects for several days

QR – Conclusions (1/2)

Not all sites are yet reporting problems consistently – some appear 'only' in broadcast messages, which makes them very hard to track and (IMHO) impossible to learn from. If you don't learn, you are destined to repeat. For example, from a single joint operations meeting (Indico 55792):

- SARA: OUTAGE: From 02:00 4 April to 02:00 5 April. Service: dCache SE.
- SARA: OUTAGE: From 09:30 30 March to 21:00 30 March. Service: srm.grid.sara.nl.
- SARA: OUTAGE: From 15:13 27 March to 02:00 31 March. Service: celisa.grid.sara.nl. Fileserver malfunction.
- CERN: AT RISK: From 11:00 31 March to 12:00 31 March. Service: VOMS (lcg-voms.cern.ch).
- FZK: OUTAGE: From 14:21 30 March to 20:00 30 March. Service: fts-fzk.gridka.de.
- INFN-CNAF: OUTAGE: From 02:00 28 March to 19:00 3 April. Service: ENTIRE SITE.
- INFN-T1: OUTAGE: From 16:00 27 March to 17:00 3 April. Service: ENTIRE SITE.
- NDGF-T1: AT RISK: From 12:31 27 March to 16:31 30 March. Service: srm.ndgf.org (ATLAS).
- NDGF-T1: AT RISK: From 12:31 27 March to 13:27 31 March. Service: ce01.titan.uio.no.

As per previous estimates, one site outage per month (Tier0 + Tier1s) due to power and cooling is to be expected. It is very important to keep some track of these through the daily operations meetings and weekly summaries.

We must improve on this in the current (STEP'09) quarter – all significant service / site problems need to be reported, and some minimal analysis – as discussed at the WLCG Collaboration workshop – provided spontaneously. I believe that there should be some SERVICE metrics – as well as experiment metrics – for STEP'09, which should reflect the above. See GDB tomorrow – they are not new, by the way!
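The service-metrics point can be made concrete with a minimal sketch: assuming downtime broadcasts like those above were collected into simple (site, service, start, end) records, per-site outage hours follow directly. The record layout and helper names below are my own illustration, not an actual WLCG tool; the sample entries are transcribed from the list above.

```python
from datetime import datetime
from collections import defaultdict

# Sample downtime records transcribed from the broadcast list above;
# times are "YYYY-MM-DD HH:MM" (a hypothetical normalised format).
DOWNTIMES = [
    ("SARA", "dCache SE",         "2009-04-04 02:00", "2009-04-05 02:00"),
    ("SARA", "srm.grid.sara.nl",  "2009-03-30 09:30", "2009-03-30 21:00"),
    ("FZK",  "fts-fzk.gridka.de", "2009-03-30 14:21", "2009-03-30 20:00"),
]

FMT = "%Y-%m-%d %H:%M"

def hours(start: str, end: str) -> float:
    """Duration of one downtime window, in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600.0

def outage_hours_per_site(records) -> dict:
    """Sum declared outage hours per site (naive: overlaps are not merged)."""
    totals = defaultdict(float)
    for site, _service, start, end in records:
        totals[site] += hours(start, end)
    return dict(totals)
```

Here `outage_hours_per_site(DOWNTIMES)` yields 35.5 hours for SARA and about 5.65 for FZK; a real metric would also merge overlapping windows and distinguish scheduled from unscheduled downtime.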

QR – Conclusions (2/2)

CASTOR and other data-management-related issues at CERN are still too frequent – we need to monitor this closely and (IMHO again) pay particular attention to CASTOR-related interventions. The DM piquet (PK) is called about once per week, mainly for CASTOR, sometimes for the CASTOR DBs; more statistics are needed, but the frequency is painfully high.

ASGC fire – what are the lessons and implications for other sites? Do we need a more explicit (and tested) disaster recovery strategy for other sites, and for CERN in particular? [Status later.]

Otherwise the service is running smoothly, and improvements can clearly be seen on timescales of months / quarters. The report from this quarter is significantly shorter than those for previous quarters – this does not mean that all of the issues mentioned previously have gone away!

HEPiX Questionnaire – Extract

4. Disaster recovery:
   a) How do you ensure your backed-up files are recoverable?
      1. Do you have an off-site tape vault?
      2. Do you have a disaster recovery plan?
      3. Did you verify it by a simulation?
   b) Do you have full hardware redundancy for the main services?
      1. How many main services cannot be automatically switched?
      2. If possible, cite some of them.
   c) How is staff availability addressed to cope with a problem: obligation? Money stimulus? Vacation stimulus? Willingness? Other?
5. Use case 1: a fire in the computing room
   a) Is there any automatic fire or smoke detection mechanism in the room?
   b) Is there any automatic fire extinction mechanism? Which one?
   c) How much time before the fire department arrives on the scene?
   d) Do you have an evacuation plan at your disposal?
   e) Did you simulate such a plan? With what results?

Proposal: all WLCG T0 + T1 sites to complete!

WLCG Timeline

[Timeline figure: months from Jan 2009 to Feb 2011, marking STEP'09 (May + June), 2009 and 2010 capacity commissioned, switch to SL5/64-bit completed(?), deployment of glexec/SCAS, CREAM, SRM upgrades and SL5 WN, end of EGEE-III, transition to EGI (???), and a possible heavy-ion run.]

Workshops: a pre-CHEP 2010 (17-22 Oct) workshop probably does not make sense, nor does a more restricted A-P event (IMHO).

GGUS Summaries

Alarm testing was done last week: the goal was that alarms were issued and their analysis was complete well in advance of these meetings!

From Daniele's summary of the CMS results: "In general, overall results are very satisfactory, and in [the] 2nd round the reaction times and the appropriateness of the replies were even more prompt than in [the] 1st round."

Links to the detailed reports are available on the agenda page – or use the GGUS ticket search for the gory details!

Discussions during the daily meetings revealed a mismatch between the DM PK and expectations: there is no piquet coverage for services other than CASTOR, SRM, FTS and LFC. (For all other services, alarm tickets will be treated as best effort.*)

VO concerned  USER  TEAM  ALARM  TOTAL
ALICE            2     2      7     11
ATLAS
CMS
LHCb
Totals
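The totals column in these GGUS summaries is just a per-VO sum over ticket types plus a grand total; a minimal sketch of that bookkeeping is below. Only the ALICE row is legible in this transcript, so the data structure and function name are my own illustration, not the actual GGUS reporting code.

```python
# Per-VO GGUS ticket counts by type; only the ALICE row survives
# legibly in the transcript (user=2, team=2, alarm=7, total=11).
TICKETS = {
    "ALICE": {"user": 2, "team": 2, "alarm": 7},
}

def vo_totals(tickets: dict) -> dict:
    """Per-VO totals across ticket types, plus a grand 'Totals' row."""
    rows = {vo: sum(counts.values()) for vo, counts in tickets.items()}
    rows["Totals"] = sum(rows.values())  # evaluated before insertion
    return rows
```

With the full four-VO table, `vo_totals` would reproduce the Totals row shown in the drills slides.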

RAL-ATLAS Example

Public diary:

"Dear ggusmail: the alarm you sent was recognised as an urgent request from an accredited VO. Your report has been escalated appropriately, but do please keep in mind that there is no guaranteed response time associated with your alarm. However, we will do our best to address the issue you reported as soon as possible. Your mail has been forwarded to the alarms list and escalated in our Nagios support systems. Regards, The RAL-LCG2 alarms response team"

DE-T1 ATLAS Example

Description: ALARM SYSTEM TEST: fake ticket use case -> LFC down at your site

Detailed description: "Dear T1 site, this is another *TEST* GGUS alarm ticket, as required by WLCG of all ATLAS T1s. This is specifically a test executed by ATLAS. For this test, the ATLAS expert thinks that your LFC is down. Official instructions: sites should follow the procedure foreseen if this were a true alarm. The site will have to close the GGUS ticket, confirming that they understand what they would have done had this been a true alarm. Thanks in advance, Stephane"

Solution: "Dear ATLAS team, the communication between the database backend and the LFC frontends has been tested and is working fine. All daemons are running, and querying all frontends shows no errors. With kind regards, Angela"

This solution has been verified by the submitter.


ASGC Update [Jason]

After relocating the facilities from the ASGC data centre to the IDC, we took another week to resume the power-on trial before entering the IDC; in addition, the complex local management policy delayed the whole process by another week. Now all T1 services should have been restored. Full details are in the following slides.

Service / Date / Details:

- ASGC network – two days after the fire incident (Feb 27): relocated into the main building of the AS campus, where we have the main fibre connectivity from all other service providers. Services relocated into the IoP computer room: VOMS, CA, GSTAT, DNS, mail, and lists.
- ASGC BDII & core services – first week after the incident (Mar 7): services considered in the first relocation also included LFC, FTS (DB and web frontend), VOMRS, UI, and the T1 DPM.
- ASGC core services – relocated from IoP/4F to the IDC on Mar 18: we took around two weeks to resume the power-on trial outside the data-centre area, owing to concern that dust in the equipment might trigger the VESDA system in the data centre. The majority of the systems were relocated into rack space on Mar 25.
- T2 core services – Mar 29: this includes the CEs/SEs; an expired CRL and an outdated CA release caused instability of the SAM probes for the first few days. We later integrated the T1/T2 pools so that all submissions go to the same batch scheduler, to help utilise the resources of the new quad-core, x86-64 computing nodes.

[Full report in slide notes.]

Summary

GGUS alarm ticket test successful – repeat quarterly!

ALL sites should spontaneously report on service issues (w.r.t. MoU targets) and give regular updates on them: we agreed targets in this area back in Q.

The hard work that the support teams are doing at all sites is visible and recognised – it would be nice if we no longer had to talk about the above problems at the end of STEP'09, other than to confirm that they are [by then] history. [Maybe we can "celebrate" this and other STEP-related successes in a similar way to the Prague workshop?]

"The priority to improve site readiness is there NOW, and not only in May/June when ATLAS and the other VOs are actively scale-testing at the same time."