Analysis of Service Incident Reports
Maria Girone, WLCG Overview Board, 3rd December 2010, CERN



Outline
What Service Incident Reports (SIRs) are and when they are produced
How they are used to measure the quality of the service delivered
Focus on an analysis of ~100 major service problems over the last two years
– Based on CHEP 2010 talk

Service Incident Reports
Introduced for CCRC’08 to track significant service problems – their cause and resolution – with the goal of improving service quality
They are one of a small set of Key Performance Indicators used in the regular WLCG MB reports on WLCG Operations to monitor service delivery
– The others being ticket summaries (GGUS) and Site Usability plots
A table of SIRs is included in the WLCG Quarterly Reports in the Operations section

Key Performance Indicators
GGUS Summaries & Alarms
Site Usability based on experiment tests
Service Incident Reports
Maria Girone, CHEP 2010

GGUS Summaries
Drill-down provided at the WLCG MB for ALARM tickets to ensure timely & consistent follow-up
– Number of alarms remains low (blocking issues)
– Can only be used by authorized people
TEAM tickets are an essential tool for “shifters”, with escalation to ALARM status if warranted

Site Usability
[Plot: Site Usability for ATLAS]

WLCG Operations
Focuses around a daily conference call, well attended by experiment and Tier0 + Tier1 service representatives, covering issues with the WLCG services
Regular reports to the Management Board based on KPIs, Issues & Concerns
Medium-term issues + plans covered at a fortnightly Service Coordination meeting: this includes drill-down into new / open SIRs as well as major pending GGUS tickets

WLCG Operational Targets

Time interval   Critical Tier0 services (see MoU)             Target
30'             Operator response to alarm / call to x5011    99%
1 hour          Operator response to alarm / call to x5011
4 hours         Expert intervention in response to above      95%
8 hours         Problem resolved                              90%
24 hours        Problem resolved                              99%

99% of problems resolved in 24h
Targets approved by the WLCG Overview Board; targets discussed at the WLCG Grid Deployment Board
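As a hedged illustration (not part of the original slides), checking a batch of incidents against targets of this kind is a simple threshold count. The incident durations and helper names below are invented for the sketch:

```python
# Sketch: checking incident handling times against MoU-style targets.
# The incident data and function names are hypothetical, for illustration only.
from datetime import timedelta

# (interval, description, target fraction) as listed on the slide;
# the 1-hour row is omitted because its target is not given in the source.
TARGETS = [
    (timedelta(minutes=30), "operator response to alarm", 0.99),
    (timedelta(hours=8), "problem resolved", 0.90),
    (timedelta(hours=24), "problem resolved", 0.99),
]

# Hypothetical time-to-handle values for a batch of incidents, in hours
incident_hours = [0.5, 2, 6, 12, 30, 100]

def compliance(durations_h, limit):
    """Fraction of incidents handled within the time limit."""
    within = sum(1 for h in durations_h if timedelta(hours=h) <= limit)
    return within / len(durations_h)

for limit, what, target in TARGETS:
    frac = compliance(incident_hours, limit)
    status = "OK" if frac >= target else "MISSED"
    print(f"{what} within {limit}: {frac:.0%} (target {target:.0%}) {status}")
```

With real SIR records the same loop would run over the actual incident timestamps rather than a hard-coded list.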

WLCG Services
WLCG services can be broken down as follows:
1. Middleware services – generic services at the Grid middleware layer, typically operated by WLCG
2. Infrastructure services – fabric-oriented services operated by the sites
3. Storage services – at all sites and critical at Tier0 / Tier1s
4. Database services – mainly at Tier0 & Tier1s
5. Network – connecting the sites (OPN and GPN)
Also essential for the experiments’ operations are:
6. Experiment services – developed, maintained and operated by the collaborations themselves (typically run in “VO boxes”)

Service Incident Reports
SIRs are provided whenever there is an incident which is outside the MoU targets
– Variation in severity and duration
Reported here are those included in the WLCG Quarterly Reports
Correlation with activity; breakdown by Service Area / Quarter follows
[Plot: SIR rate over time, with STEP’09 and LHC running marked]

SIRs by Area & Quarter
N.B. variation in severity and duration (but above threshold)

Time to Resolution
Response time is within targets
Many problems resolved within 8 hours
– Too many (~30%) take > 24h
A significant number take well over 96h
– Higher than targets (95-99% for T1/T0)

Observations
Infrastructure Services:
– Rather constant number of problems, at least some of which are probably unavoidable
Middleware Services:
– Very few incidents resulting in a SIR
Network Services:
– Typically degradations; some actions underway to improve expert involvement and problem resolution
Storage & Database Services:
– Typically complex problems that sometimes cannot be resolved within a day or so
– The area on which to concentrate effort

Infrastructure & Middleware
Infrastructure services – includes basic fabric:
– Power and cooling, including human error
– A short spike or micro-cut can cause many hours of downtime – e.g. a 1 s power surge caused 48 h of downtime at ASGC in January
– Not responsible for downtimes > 96h
Middleware services:
– Instabilities still exist but no prolonged outages
– Experiments have worked around the problems seen

Network Problems
Often complex and lengthy (weeks) to debug – particularly in the case of degradations
A simple model for handling network problems has been discussed at the last LHC OPN meeting & presented to the November GDB
It applies not only to OPN but also to non-OPN links, and to all kinds of network problems
– Cut (“straightforward”), degradation (“complex”)
Regular GGUS ticket updates are also an important component of the model (next slide)
The model still has to be approved by the MB, including escalation for problems not resolved within target intervals (paper at Tuesday’s F2F)

Network Degradation
[Diagram: VO X transfers between Site A (Domain A) and Site B (Domain B), traversing an intermediate Domain C]
VO X observes high failure rates / low performance in transfers between sites A & B
After basic debugging this is declared a “network issue”
Site responsibles at both site A & B are informed (ticket)
They are responsible for updating it and for interactions with the network contacts at their respective sites
– Ticket ownership follows the FTS model – i.e. the destination site
All additional complexity – e.g. Domain C and possibly others (NRENs, GEANT, USLHCNET, etc.) – is transparent to VO X

Database Problems
Numerous prolonged service / site downtimes due to various database problems – quite often DB recovery
Services affected include detector conditions data and file catalogs (LFC)
– Sites affected recently: NL-T1, ASGC
Changes of strategy being discussed by ATLAS and LHCb – e.g. FroNTier/Squid caching – and / or simplification of DB deployment models
– Requirements & timescales to be understood
– Follow-up as a standing agenda item at the fortnightly Service Coordination meeting

Storage Problems
Some due to issues with backend DB services
Others: configuration issues or s/w bugs
Small reduction in the overall number in recent quarters, as well as in those lasting > 96h
A high fraction of GGUS ALARM and TEAM tickets are in this area, with good reaction times seen
Operations load & impact on the service are high

Recent Problems
Shared s/w area: very common – and repetitive – cause of problems for experiments
– Some alternatives being tested, e.g. CVMFS
Some instabilities seen by ATLAS at IN2P3 during the HI run – no longer seen
– These led ATLAS not to use the site for some activities: import of data from T0, analysis, MC production, import of data from other T1s and export to T2s
– Some reprocessing also moved to other “clouds”
– Good communication established between site and experiment (daily reports), which will also be useful at the daily operations call
Short CASTOR outage due to corruption of a DB file – SIR in preparation

Conclusions
An analysis of SIRs has shown that there are a number of problems (typically DB / storage) not resolved within 96h, and some take weeks to fix
– Expert response is always within targets
Improvements in these areas are likely to be slow – particularly if service load increases in 2011
– The simplified models / alternatives to DBs that are being investigated may help
Prolonged downtimes will continue: implement strategies for handling them systematically
– e.g. declaration of a site out of production with its workload transferred to other sites, plus a strategy for re-commissioning
However, data processing and analysis have been successful throughout the pp and HI runs of 2010

BACKUP SLIDES

MoU Tier0 Areas & Targets
Raw data recording; Data export (T0-T1) & transfers (global); Data (re-)processing; Analysis

MoU Tier1/2 Areas & Targets

Recent Problems (last MB)
Generally smooth operation on the experiment and service side
– Coped well with higher data rates during the HI run (CMS to CASTOR: 5 GB/s)
One Service Incident Report received:
– IN2P3 shared area problems for LHCb (interim SIR – GGUS:59880)
– Alternatives, such as use of CernVM FS, being investigated
Two more SIRs are pending:
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– GGUS unavailability on Tuesday November 17th
Three GGUS ALARMs:
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– ATLAS transfers to/from RAL (GGUS:64228)
– CNAF network problems affecting ATLAS DDM (GGUS:64459)
Other notable issues reported at the daily meetings:
– Security updates in progress (CVE )
– Slow transfers to Lyon for ATLAS (GGUS:63631, GGUS:64202)
– BDII timeouts for ATLAS at BNL due to network problems (GGUS:64039)
– Database problems for ATLAS Panda and PVSS at CERN (no GGUS ticket)

GGUS LHC VOs’ tickets, period 2009/10 – 2010/09/30
[Table: ALARM (total and real), TEAM, “Closed but Unsolved” and “Still open” ticket counts per VO (ALICE, ATLAS, CMS, LHCb)]
Maria Girone, CHEP 2010

GGUS LHC VOs’ tickets to T0/T1s, period 2009/10 – 2010/09/30
[Table: per-site ticket counts for CERN_PROD, TRIUMF, FZK, PIC, IN2P3-CC, INFN-T1, NDGF, NIKHEF, SARA, ASGC, RAL, USCMS_FNAL_W1 and BNL-ATLAS, some including test tickets]

GGUS Summaries

GGUS LHC VOs’ top 5 ‘Problem Types’, period 2009/10 – 2010/09/30

ATLAS                        ALICE                       CMS                        LHCb
other                 2355   other                 25    other               140    other              590
file transfer           71   local batch system    23    file transfer        17    3d databases         7
file access             60   author/authent.        6    author/authent.      13    vo spec software     7
vo specific software    59   workload mgmt          5    file access          13    storage systems      6
storage systems         41   data mgmt              4    vo spec software     11    file transfer        5
author/authent.         23   information system     4    workload mgmt        10    operations           5
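A summary like the one above is essentially a per-VO frequency count over the ticket stream. As a hedged sketch (the ticket data below is invented, and this is not the actual GGUS reporting code), it could be produced like this:

```python
# Sketch: building a "top 5 problem types per VO" summary from a flat
# list of (VO, problem type) tickets. Sample data is invented.
from collections import Counter, defaultdict

tickets = [
    ("ATLAS", "file transfer"), ("ATLAS", "file access"),
    ("ATLAS", "file transfer"), ("LHCb", "3d databases"),
    ("CMS", "author/authent."), ("CMS", "file transfer"),
    ("ALICE", "local batch system"),
]

# One Counter of problem types per VO
by_vo = defaultdict(Counter)
for vo, problem_type in tickets:
    by_vo[vo][problem_type] += 1

for vo in sorted(by_vo):
    # most_common(5) yields the top 5 problem types, as in the table
    print(vo, by_vo[vo].most_common(5))
```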

WLCG SIRs
Full list of SIRs can be found at: GServiceIncidents

[Backup slides: per-quarter SIR tables]