Analysis of Service Incident Reports
Maria Girone
WLCG Overview Board, 3rd December 2010, CERN
Outline
What are Service Incident Reports (SIRs) and when are they produced
How they are used to measure the quality of the service delivered
Focus on an analysis of ~100 major service problems over the last two years
– Based on CHEP 2010 talk
Service Incident Reports
Introduced for CCRC’08 to track significant service problems – their cause and resolution – with the goal of improving service quality
They are one of a small set of Key Performance Indicators used in the regular WLCG MB reports on WLCG Operations to monitor service delivery
– The others being ticket summaries (GGUS) and Site Usability plots
A table of SIRs is included in the WLCG Quarterly Reports in the Operations section
Key Performance Indicators
GGUS Summaries & Alarms
Site Usability based on experiment tests
Service Incident Reports
GGUS Summaries
Drill-down provided at WLCG MB for ALARM tickets to ensure timely & consistent follow-up
– Number of alarms remains low (blocking issues)
– Can only be used by authorized people
TEAM tickets are an essential tool for “shifters”, with escalation to alarm status if warranted
Site Usability
[Plot: Site Usability for ATLAS]
WLCG Operations
Focuses around a daily conference call, well attended by experiments and Tier0 + Tier1 service representatives, covering issues with the WLCG services
Regular reports to the Management Board based on KPIs, Issues & Concerns
Medium-term issues + plans covered at a fortnightly Service Coordination meeting: this includes drill-down into new / open SIRs as well as major pending GGUS tickets
WLCG Operational Targets

Time Interval | Critical Tier0 Services (see MoU)          | Target
30'           | Operator response to alarm / call to x5011 | 99%
1 hour        | Operator response to alarm / call to x5011 |
4 hours       | Expert intervention in response to above   | 95%
8 hours       | Problem resolved                           | 90%
24 hours      | Problem resolved                           | 99%

99% of problems resolved in 24h
Targets approved by WLCG Overview Board
Targets discussed at WLCG Grid Deployment Board
WLCG Services
WLCG services can be broken down as follows:
1. Middleware services – generic services at Grid middleware layer, typically operated by WLCG
2. Infrastructure services – fabric-oriented services operated by the sites
3. Storage services – at all sites and critical at Tier0 / Tier1s
4. Database services – mainly at Tier0 & Tier1s
5. Network – connecting the sites (OPN and GPN)
Also essential for the experiments’ operations are:
6. Experiment services – developed, maintained and operated by the collaborations themselves (typically run in “VO boxes”)
Service Incident Reports
Service Incident Reports are provided whenever there is an incident which is outside the MoU targets
– Variation in severity and duration
Reported here are those included in the WLCG Quarterly Reports
Correlation with activity (STEP’09, LHC)
Breakdown by Service Area / Quarter follows
SIRs by Area & Quarter
N.B. variation in severity and duration (but above threshold)
Time to Resolution
Response time is within targets
Many problems resolved within 8 hours
– Too many (~30%) take > 24h
A significant number take > 96h
– Higher than targets (95-99% for T1/T0); see the sketch below
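To make the comparison above concrete, here is a minimal sketch of how a resolution-time breakdown can be tallied against the 8h / 24h / 96h thresholds; the SIR durations in it are invented example values, not the actual 2009-2010 data.

```python
# Minimal sketch: tally hypothetical SIR resolution times against the
# 8h / 24h / 96h thresholds discussed above. The durations are invented
# example values, not the real 2009-2010 SIR data.

durations_hours = [3, 6, 12, 30, 48, 5, 110, 20, 72, 150]
thresholds_hours = [8, 24, 96]

total = len(durations_hours)
for t in thresholds_hours:
    within = sum(1 for d in durations_hours if d <= t)
    print(f"resolved within {t:>3}h: {within}/{total} ({100.0 * within / total:.0f}%)")
# The 24h line can then be compared with the 99%-within-24h MoU target.
```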
Observations
Infrastructure Services:
– Rather constant number of problems, at least some of which are probably unavoidable
Middleware Services:
– Very few incidents resulting in a SIR
Network Services:
– Typically degradations: some actions underway to improve expert involvement and problem resolution
Storage & Database Services:
– Typically complex problems that sometimes cannot be resolved within a day or so
– The area on which to concentrate effort
Infrastructure & Middleware
Infrastructure services – include basic fabric
– Power and cooling, including human error
– A short spike or micro-cut can cause many hours of downtime – e.g. a 1s power surge caused 48h of downtime at ASGC in January
– Not responsible for downtimes > 96h
Middleware services
– Instabilities still exist but no prolonged outages
– Experiments have worked around the problems seen
Network Problems
Often complex and lengthy (weeks) to debug – particularly in the case of degradations
A simple model for handling network problems has been discussed at the last LHC OPN meeting & presented to the November GDB
It applies not only to OPN but also to non-OPN links and all kinds of network problems
– Cut (“straightforward”), degradation (“complex”)
Regular GGUS ticket updates are also an important component of the model (next)
The model still has to be approved by the MB, including escalation for problems not resolved within target intervals (paper at Tuesday’s F2F)
Network Degradation
[Diagram: transfers between Site A and Site B crossing network Domains A, B and C]
VO X observes high failure rates / low performance in transfers between sites A & B
After basic debugging, declared a “network issue”
Site responsibles at both site A & B informed (ticket)
They are responsible for updating it and for interactions with network contacts at their respective sites
– Ticket ownership follows the FTS model – i.e. the destination site (see the sketch below)
All additional complexity – e.g. Domain C and possibly others – transparent to VO X
– NRENs, GEANT, USLHCNET, etc.
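As a minimal illustration of the ownership rule just described (the FTS convention: the destination site owns the ticket), here is a short sketch; the data structure, function name and site names are hypothetical and used only for illustration.

```python
# Minimal sketch of the ticket-ownership rule described above: for a transfer
# problem between two sites, the GGUS ticket follows the FTS convention and is
# owned by the destination site. Names and types here are hypothetical.

from dataclasses import dataclass

@dataclass
class TransferProblem:
    source_site: str       # e.g. "Site A"
    destination_site: str  # e.g. "Site B"
    description: str

def ticket_owner(problem: TransferProblem) -> str:
    # The destination site is responsible for updating the ticket and for
    # contacting its own network experts; intermediate domains (NRENs,
    # GEANT, USLHCNET, ...) remain transparent to the VO.
    return problem.destination_site

if __name__ == "__main__":
    p = TransferProblem("Site A", "Site B", "high failure rate / low throughput")
    print(ticket_owner(p))  # -> Site B
```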
Database Problems
Numerous prolonged service / site downtimes due to various database problems – quite often DB recovery
Services affected include detector conditions data and file catalogs (LFC)
– Sites affected recently: NL-T1, ASGC
Changes of strategy being discussed by ATLAS and LHCb – e.g. FroNTier/Squid caching – and / or simplification of DB deployment models
– Requirements & timescales to be understood
– Follow-up as standing agenda item at fortnightly Service Coordination meeting
Storage Problems
Some due to issues with backend DB services
Others: configuration issues or s/w bugs
Small reduction in the overall number in recent quarters, as well as in those lasting > 96h
A high fraction of GGUS alarm and team tickets are in this area, with good reaction times seen
Operations load & impact on the service are high
Recent Problems
Shared s/w area: very common – and repetitive – cause of problems for the experiments
– Some alternatives being tested, e.g. CVMFS
Some instabilities seen by ATLAS at IN2P3 during the HI run – no longer seen
– These have led ATLAS not to use the site for some activities: import of data from T0, analysis, MC production, import of data from other T1s and export to T2s
– Some reprocessing also moved to other “clouds”
– Good communication established between site and experiment (daily reports), which will also be useful at the daily operations call
Short CASTOR outage due to corruption of a DB file – SIR in preparation
Conclusions
An analysis of SIRs has shown that there are a number of problems (typically DB / storage) not resolved within 96h, and some take weeks to fix
– Expert response always within targets
Improvements in these areas likely to be slow – particularly if service load increases in 2011
– Simplifying the model / alternatives to DBs that are being investigated may help
Prolonged downtimes will continue: implement strategies for handling them systematically
– e.g. declaration of a site out of production with workload transferred to other sites, plus a strategy for re-commissioning
However, data processing and analysis have been successful throughout the pp and HI runs of 2010
BACKUP SLIDES
MoU Tier0 Areas & Targets
Raw data recording; Data export (T0-T1) & transfers (global); Data (re-)processing; Analysis
MoU Tier1/2 Areas & Targets
Recent Problems (last MB)
Generally smooth operation on the experiment and service side
– Coped well with higher data rates during the HI run (CMS to CASTOR: 5 GB/s)
One Service Incident Report received:
– IN2P3 shared area problems for LHCb (interim SIR – GGUS:59880)
– Alternatives, such as use of CernVM FS, being investigated
Two more SIRs are pending:
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– GGUS unavailability on Tuesday November 17th
Three GGUS ALARMS
– CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
– ATLAS transfers to/from RAL (GGUS:64228)
– CNAF network problems affecting ATLAS DDM (GGUS:64459)
Other notable issues reported at the daily meetings
– Security updates in progress (CVE )
– Slow transfers to Lyon for ATLAS (GGUS:63631, GGUS:64202)
– BDII timeouts for ATLAS at BNL due to network problems (GGUS:64039)
– Database problems for ATLAS Panda and PVSS at CERN (no GGUS ticket)
GGUS LHC VOs’ tickets
Period: 2009/10 to 2010/09/30
[Table: ALARM (total and real) / TEAM / Closed but Unsolved / Still open tickets, assigned to or concerning each of ALICE, ATLAS, CMS and LHCb]
GGUS LHC VOs’ tickets to T0/T1s
Period: 2009/10 to 2010/09/30
[Table: tickets per VO (ALICE, ATLAS, CMS, LHCb) for each Tier0/Tier1 site: CERN_PROD, TRIUMF, FZK, PIC, IN2P3-CC, INFN-T1, NDGF, NIKHEF, SARA, ASGC, RAL, USCMS_FNAL_W1, BNL-ATLAS]
GGUS Summaries
GGUS LHC VOs’ top 5 'Problem Types'
Period: 2009/10 to 2010/09/30

ATLAS                    | ALICE                 | CMS                  | LHCb
file transfer 71         | local batch system 23 | file transfer 17     | 3d databases 7
file access 60           | author/authent. 6     | author/authent. 13   | vo spec software 7
vo specific software 59  | workload mgmt 5       | file access 13       | storage systems 6
storage systems 41       | data mgmt 4           | vo spec software 11  | file transfer 5
author/authent. 23       | information system 4  | workload mgmt 10     | operations 5
other 2355               | other 25              | other 140            | other 590
WLCG SIRs
Full list of SIRs can be found at: GServiceIncidents
[Backup tables: SIRs listed by quarter]