Analysis of Service Incident Reports
Maria Girone
WLCG Overview Board, 3rd December 2010, CERN


Outline
What are Service Incident Reports (SIRs) and when are they produced
How they are used to measure the quality of the service delivered
Focus on an analysis of ~100 major service problems over the last two years
– Based on CHEP 2010 talk
Maria.Girone@cern.ch

Service Incident Reports
Introduced for CCRC’08 to track significant service problems, their cause and resolution, with the goal of improving service quality
They are one of a small set of Key Performance Indicators used in the regular WLCG MB reports on WLCG Operations to monitor service delivery
– The others being ticket summaries (GGUS) and Site Usability plots
A table of SIRs is included in the Operations section of the WLCG Quarterly Reports

Key Performance Indicators
GGUS Summaries & Alarms
Site Usability based on experiment tests
Service Incident Reports
Maria Girone, CHEP 2010

GGUS Summaries
Drill-down provided at WLCG MB for ALARM tickets to ensure timely & consistent follow-up
– Number of alarms remains low (blocking issues)
– Can only be used by authorized people
TEAM tickets are an essential tool for “shifters”, with escalation to alarm status if warranted

Site Usability
[Figure: Site Usability for ATLAS]

WLCG Operations
Centres on a daily conference call, well attended by experiments and Tier0 + Tier1 service representatives, covering issues with the WLCG services
Regular reports to the Management Board based on KPIs, Issues & Concerns
Medium-term issues + plans covered at a fortnightly Service Coordination meeting: this includes drill-down into new / open SIRs as well as major pending GGUS tickets

WLCG Operational Targets
Time Interval   Critical Tier0 Services (see MoU)             Target
30 min          Operator response to alarm / call to x5011    99%
1 hour          Operator response to alarm / call to x5011    100%
4 hours         Expert intervention in response to above      95%
8 hours         Problem resolved                              90%
24 hours        Problem resolved                              99%
Targets approved by WLCG Overview Board
Targets discussed at WLCG Grid Deployment Board
99% of problems resolved in 24h
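As an illustration of how the targets on this slide could be checked against a set of incidents, here is a minimal Python sketch. The target values are transcribed from the slide; the record layout (a dict mapping a milestone name to the time elapsed since the alarm) and the names `TIER0_TARGETS` and `compliance` are assumptions made for this example, not part of any WLCG tool.

```python
from datetime import timedelta

# Targets transcribed from the slide: (time limit, milestone, target fraction).
# The milestone names are hypothetical labels chosen for this sketch.
TIER0_TARGETS = [
    (timedelta(minutes=30), "operator_response", 0.99),
    (timedelta(hours=1), "operator_response", 1.00),
    (timedelta(hours=4), "expert_intervention", 0.95),
    (timedelta(hours=8), "resolved", 0.90),
    (timedelta(hours=24), "resolved", 0.99),
]


def compliance(incidents):
    """Return one (limit, milestone, achieved fraction, target met) tuple per target.

    `incidents` is a non-empty list of dicts mapping a milestone name to the
    time elapsed since the alarm (assumed layout). An incident that lacks a
    milestone counts as having missed the corresponding target.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    results = []
    for limit, milestone, target in TIER0_TARGETS:
        met = sum(
            1 for inc in incidents
            if milestone in inc and inc[milestone] <= limit
        )
        fraction = met / len(incidents)
        results.append((limit, milestone, fraction, fraction >= target))
    return results
```

Each returned tuple can be compared directly with one row of the targets table; the achieved fraction is computed over all incidents supplied, so the choice of reporting period is left to the caller.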

WLCG Services
WLCG services can be broken down as follows:
1. Middleware services – generic services at Grid middleware layer, typically operated by WLCG
2. Infrastructure services – fabric-oriented services operated by the sites
3. Storage services – at all sites and critical at Tier0 / Tier1s
4. Database services – mainly at Tier0 & Tier1s
5. Network – connecting the sites (OPN and GPN)
Also essential for the experiments’ operations are:
6. Experiment services – developed, maintained and operated by the collaborations themselves (typically run in “VO boxes”)

Service Incident Reports
Service Incident Reports are provided whenever there is an incident which is outside the MoU targets
– Variation in severity and duration
Reported here are those included in the WLCG Quarterly Reports
Correlation with activity
Breakdown by Service Area / Quarter follows
[Figure labels: STEP’09, LHC]

SIRs by Area & Quarter
N.B. variation in severity and duration (but above threshold)

Time to Resolution
Response time is within targets
Many problems resolved within 8 hours
– Too many (~30%) take > 24h
A significant number take > 96h, some much longer
– More than the targets allow (95-99% resolved for T1/T0)
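The thresholds quoted on this slide (8h, 24h, 96h) suggest a simple way to summarise a list of resolution times. The sketch below assumes each incident is reduced to a single number of hours; the function name `resolution_profile` is hypothetical, and any input values used with it would be illustrative rather than taken from the SIRs.

```python
def resolution_profile(hours_to_resolve):
    """Return the fraction of incidents in each bucket discussed on the slide.

    `hours_to_resolve` is a list of resolution times in hours (assumed input).
    The over-24h bucket includes the incidents that also exceed 96h.
    """
    n = len(hours_to_resolve)
    if n == 0:
        return {"within_8h": 0.0, "over_24h": 0.0, "over_96h": 0.0}
    return {
        "within_8h": sum(h <= 8 for h in hours_to_resolve) / n,
        "over_24h": sum(h > 24 for h in hours_to_resolve) / n,
        "over_96h": sum(h > 96 for h in hours_to_resolve) / n,
    }
```

Comparing `1 - over_24h` with the 99% target gives a quick indication of how far the resolution times are from the MoU figures.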

Observations
Infrastructure Services:
– Rather constant number of problems, at least some of which are probably unavoidable
Middleware Services:
– Very few incidents resulting in a SIR
Network Services:
– Typically degradations: some actions underway to improve expert involvement and problem resolution
Storage & Database Services:
– Typically complex problems that sometimes cannot be resolved within a day or so
– The area on which to concentrate

Infrastructure & Middleware
Infrastructure services – includes basic fabric
– Power and cooling, including human error
– A short spike or micro-cut can cause many hours of downtime, e.g. a 1s power surge caused 48h downtime at ASGC in January
– Not responsible for downtimes > 96h
Middleware services
– Instabilities still exist but no prolonged outages
– Experiments have worked around problems seen

Network Problems
Often complex and lengthy (weeks) to debug, particularly in case of degradations
A simple model for handling network problems has been discussed at the last LHC OPN meeting & presented to the November GDB
It applies not only to OPN but also to non-OPN links and all kinds of network problems
– Cut (“straightforward”), degradation (“complex”)
Regular GGUS ticket updates are also an important component of the model (next slide)
The model still has to be approved by the MB, including escalation for problems not resolved within target intervals (paper at Tuesday’s F2F)

Network Degradation
[Figure: network path between Site A and Site B across Domain A, Domain B and Domain C. Labels: Site A: AS36391, 206.12.1.0/24, 206.12.9.64/28; Site B: AS1126, 145.100.32.0/22, 145.100.17.0/28, AS1104, 194.171.96.128/25]
VO X observes high failure rates / low performance in transfers between sites A & B
After basic debugging, declared a “network issue”
Site responsibles at both site A & B informed (ticket)
They are responsible for updating it and for interactions with network contacts at their respective sites
– Ticket ownership follows FTS model, i.e. destination site
All additional complexity, e.g. Domain C and possibly others, is transparent to VO X
– NRENs, GEANT, USLHCNET, etc.
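The ownership rule described on this slide (both sites are informed, the destination site owns the ticket) can be written down in a few lines. This is only a sketch of the rule as stated here; the `Transfer` class and both function names are hypothetical and do not correspond to any GGUS or FTS interface.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """A degraded transfer channel between two sites (hypothetical record)."""
    source_site: str
    destination_site: str


def ticket_owner(transfer: Transfer) -> str:
    # Ownership follows the FTS model described on the slide: the destination site.
    return transfer.destination_site


def sites_to_inform(transfer: Transfer) -> list[str]:
    # Site responsibles at both ends are informed via the ticket.
    return [transfer.source_site, transfer.destination_site]
```

Intermediate network domains do not appear in the sketch because the model keeps them transparent to the VO; the owning site handles those contacts.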

Database Problems
Numerous prolonged service / site downtimes due to various database problems, quite often DB recovery
Services affected include detector conditions data and file catalogs (LFC)
– Sites affected recently: NL-T1, ASGC
Changes of strategy being discussed by ATLAS and LHCb, e.g. FroNTier/Squid caching, and / or simplification of DB deployment models
– Requirements & timescales to be understood
– Follow-up as standing agenda item at fortnightly Service Coordination meeting

Storage Problems
Some due to issues with backend DB services
Others: configuration issues or s/w bugs
Small reduction in recent quarters in the overall number, as well as in those lasting > 96h
A high fraction of GGUS alarm and team tickets are in this area, with good reaction times seen
Operations load & impact on service high

Recent Problems
Shared s/w area: very common, and repetitive, cause of problems for experiments
– Some alternatives being tested, e.g. CVMFS
Some instabilities seen by ATLAS at IN2P3 during HI run, no longer seen
– These have led ATLAS not to use the site for some activities: import of data from T0, analysis, MC production, import of data from other T1s and export to T2s
– Some reprocessing also moved to other “clouds”
– Good communication established between site and experiment (daily reports), which will also be useful at the daily operations call
Short CASTOR outage due to corruption of DB file
– SIR in preparation

Conclusions
An analysis of SIRs has shown that there are a number of problems (typically DB / storage) not resolved within 96h, and some take weeks to fix
– Expert response always within targets
Improvements in these areas likely to be slow, particularly if service load increases in 2011
– Simplifying the model / alternatives to DBs that are being investigated may help
Prolonged downtimes will continue: implement strategies for handling them systematically
– e.g. declaration of site out of production with workload transferred to other sites, plus strategy for re-commissioning
However, data processing and analysis have been successful throughout the pp and HI runs of 2010

BACKUP SLIDES

MoU Tier0 Areas & Targets
Raw data recording; Data export (T0-T1) & transfers (global); Data (re-)processing; Analysis

MoU Tier1/2 Areas & Targets

Recent Problems (last MB)
Generally smooth operation on experiment and service side
– Coped well with higher data rates during the HI run (CMS to CASTOR: 5 GB/s)
One Service Incident Report received:
– IN2P3 shared area problems for LHCb (interim SIR, GGUS:59880)
– Alternatives, such as use of CernVM FS, being investigated
Two more SIRs are pending:
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– GGUS unavailability on Tuesday November 17th
Three GGUS ALARMS:
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– ATLAS transfers to/from RAL (GGUS:64228)
– CNAF network problems affecting ATLAS DDM (GGUS:64459)
Other notable issues reported at the daily meetings:
– Security updates in progress (CVE-2010-4170)
– Slow transfers to Lyon for ATLAS (GGUS:63631, GGUS:64202)
– BDII timeouts for ATLAS at BNL due to network problems (GGUS:64039)
– Database problems for ATLAS Panda and PVSS at CERN (no GGUS ticket)

GGUS LHC VOs’ tickets
Period: 2009/10/01-2010/09/30
WHAT\WHO                   ALICE, ATLAS, CMS, LHCB
Assigned To                71193722
Concerned VO               842854260659
ALARM                      23 (real: 1), 114 (real: 34), 32 (real: 4), 50 (real: 12)
TEAM                       1217154545
Closed but Unsolved        03107
Still open on 2010/10/06   023119

GGUS LHC VOs’ tickets to T0/T1s
Period: 2009/10/01-2010/09/30
WHAT\WHO        ALICE, ATLAS, CMS, LHCB
CERN_PROD       241605668
TRIUMF          05501(test)
FZK             41022124
PIC             0492120
IN2P3-CC        2100336
INFN-T1         8109844
NDGF            06401(test)
NIKHEF          15706
SARA            298025
ASGC            08740
RAL             2 (1 test)100218
USCMS_FNAL_W1   01(test)00
BNL-ATLAS       05900

GGUS Summaries

GGUS LHC VOs top 5 'Problem Types'
Period: 2009/10/01-2010/09/30
ATLAS                      ALICE                    CMS                     LHCB
total 2864                 total 87                 total 261               total 658
other 2355                 other 25                 other 140               other 590
file transfer 71           local batch system 23    file transfer 17        3d databases 7
file access 60             author/authent. 6        author/authent. 13      vo spec software 7
vo specific software 59    workload mgmt 5          file access 13          storage systems 6
storage systems 41         data mgmt 4              vo spec software 11     file transfer 5
author/authent. 23         information system 4     workload mgmt 10        operations 5

WLCG SIRs
Full list of SIRs can be found at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents

Q1 2010

Q2 2010 (cont.)

Q3 2010 (cont.)

Q4 2010
