Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009.

Slides:



Advertisements
Similar presentations
CCRC’08 Jeff Templon NIKHEF JRA1 All-Hands Meeting Amsterdam, 20 feb 2008.
Advertisements

Jan 2010 Current OSG Efforts and Status, Grid Deployment Board, Jan 12 th 2010 OSG has weekly Operations and Production Meetings including US ATLAS and.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
Ian Bird LCG Project Leader Jamie Shiers Grid Support Group, CERN WLCG Project Status Report NEC 2009 September 2009.
WLCG Service Report ~~~ WLCG Management Board, 27 th January 2009.
SC4 Workshop Outline (Strong overlap with POW!) 1.Get data rates at all Tier1s up to MoU Values Recent re-run shows the way! (More on next slides…) 2.Re-deploy.
WLCG Service Report ~~~ WLCG Management Board, 27 th October
ATLAS Metrics for CCRC’08 Database Milestones WLCG CCRC'08 Post-Mortem Workshop CERN, Geneva, Switzerland June 12-13, 2008 Alexandre Vaniachine.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals
SRM 2.2: status of the implementations and GSSD 6 th March 2007 Flavia Donno, Maarten Litmaath INFN and IT/GD, CERN.
Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January.
CCRC’08 Weekly Update Plus Brief Comments on WLCG Collaboration Workshop Jamie Shiers ~~~ WLCG Management Board, 29 th April 2008.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VO-specific systems for the monitoring of.
GGUS summary ( 4 weeks ) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals 1.
WLCG Service Report ~~~ WLCG Management Board, 11 th November 2008.
WLCG Service Report ~~~ WLCG Management Board, 24 th November
WLCG Service Report ~~~ WLCG Management Board, 1 st September
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
WLCG Service Report ~~~ WLCG Management Board, 9 th August
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Report from GGUS BoF Session at the WLCG.
WLCGWLCG: Experience from Real Data Taking at the LHC WLCG Service Coordination & EGI InSPIRE SA3 / WP6EGI InSPIRE TNC2010, Vilnius,
Julia Andreeva, CERN IT-ES GDB Every experiment does evaluation of the site status and experiment activities at the site As a rule the state.
WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN.
WLCG Service Report ~~~ WLCG Management Board, 16 th December 2008.
LCG CCRC’08 Status WLCG Management Board November 27 th 2007
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
Handling ALARMs for Critical Services Maria Girone, IT-ES Maite Barroso IT-PES, Maria Dimou, IT-ES WLCG MB, 19 February 2013.
GGUS Slides for the 2012/07/24 MB Drills cover the period of 2012/06/18 (Monday) until 2012/07/12 given my holiday starting the following weekend. Remove.
WLCG Tier1 [ Performance ] Metrics ~~~ Points for Discussion ~~~ WLCG GDB, 8 th July 2009.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE1102 ATLAS CMS LHCb Totals
WLCG Service Report ~~~ WLCG Management Board, 17 th March 2009.
Jan 2010 OSG Update Grid Deployment Board, Feb 10 th 2010 Now having daily attendance at the WLCG daily operations meeting. Helping in ensuring tickets.
CCRC’08 Monthly Update ~~~ WLCG Grid Deployment Board, 14 th May 2008 Are we having fun yet?
WLCG Planning Issues GDB June Harry Renshall, Jamie Shiers.
WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September
Summary of 2008 LCG operation ~~~ Performance and Experience ~~~ LCG-LHCC Mini Review, 16 th February 2009.
4 March 2008CCRC'08 Feb run - preliminary WLCG report 1 CCRC’08 Feb Run Preliminary WLCG Report.
Julia Andreeva on behalf of the MND section MND review.
CERN IT Department CH-1211 Genève 23 Switzerland t Experiment Operations Simone Campana.
WLCG Service Report ~~~ WLCG Management Board, 16 th September 2008 Minutes from daily meetings.
EGEE-II INFSO-RI Enabling Grids for E-sciencE Operations procedures: summary for round table Maite Barroso OCC, CERN
WLCG Service Report ~~~ WLCG Management Board, 31 st March 2009.
WLCG Service Report ~~~ WLCG Management Board, 18 th September
CCRC – Conclusions from February and update on planning for May Jamie Shiers ~~~ WLCG Overview Board, 31 March 2008.
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE4004 ATLAS CMS LHCb Totals
WLCG critical services update Andrea Sciabà WLCG operations coordination meeting December 18, 2014.
Grid Deployment Board 5 December 2007 GSSD Status Report Flavia Donno CERN/IT-GD.
WLCG Service Report ~~~ WLCG Management Board, 20 th January 2009.
WLCG Service Report ~~~ WLCG Management Board, 14 th February
CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,
WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.
WLCG Service Report ~~~ WLCG Management Board, 17 th February 2009.
MND section. Summary of activities Job monitoring In collaboration with GridView and LB teams enabled full chain from LB harvester via MSG to Dashboard.
Accounting Review Summary from the pre-GDB related to CPU (wallclock) accounting Julia Andreeva CERN-IT GDB 13th April
WLCG Service Report ~~~ WLCG Management Board, 10 th November
Analysis of Service Incident Reports Maria Girone WLCG Overview Board 3 rd December 2010, CERN.
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals
WLCG Services in 2009 ~~~ dCache WLCG T1 Data Management Workshop, 15 th January 2009.
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
LCG Introduction John Gordon, STFC-RAL GDB June 11 th, 2008.
Top 5 Experiment Issues ExperimentALICEATLASCMSLHCb Issue #1xrootd- CASTOR2 functionality & performance Data Access from T1 MSS Issue.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
WLCG Management Board, 16th July 2013
1 VO User Team Alarm Total ALICE ATLAS CMS
Take the summary from the table on
SAM Offsite Coordination
IPv6 update Duncan Rand Imperial College London
Presentation transcript:

Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009

Established operation routine Daily operations meetings (Conference calls. Representatives of VOs, sites, support and of critical WLCG services). Short reports, discussing problems, announcing interventions, new releases, etc… Short meetings, but effective and providing the daily report of all operations issues Number of attending people is growing For those who can not attend, minutes are made available shortly after every meeting and

Weekly performance summary is produced in the end of every week GGUS summary reports (enabled on the request of operations) Good sign that starting from the beginning of this year we had only 4 alarm tickets. Might be useful to enable generation of plots showing time required to notify the appropriate support person, problem resolution). Service incident reports (Postmortem. Fortunately, for many weeks we have it empty) Site availability from the VO perspectives (based on VO-specific SAM tests)

4 ALICE ATLAS CMS LHCb

Setting up regular test of alarm workflow Experiments provided use-cases for possible alarm conditions (subjects of the alarm tickets) First test had been organized in the middle of February. 40 alarm tickets sent to the sites from 4 LHC VOs. The tests will be done on regular basis

Summary of the results of the recent alarm tests SiteLHCbCMSALICE CERN Test passed but only partially, the piquet was called, problem investigated, but ticket did not arrive to the service manager, ticket stayed openOK(1:10) No particular issues had been seen by ALICE. All submitted tickets passed correctly. Time varies from 3 min to 2 hours PICOK(0:14)OK(0:12) IN2P3OK(0:13)OK(0:10) GRIDKAOK(0:14) - NIKHEF/SARA Site OK (0:14), but SARA ticket was sent to Nikhef - CNAFNot OK (authorization problem) OK (1:10) FNAL -OK(0:28) ASGC -OK(6:55) RALOK (0:20)OK(2:16) In most cases workflow worked properly. Good response time.

Critical Service Follow-up (Slide from presentation of Jamie Shiers at OGF) Targets (not commitments) proposed for Tier0 services – Similar targets requested for Tier1s/Tier2s – Experience from first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable) The MoU lists targets for responding to problems (12 hours for T1s) Tier1s: 95% of problems resolved <1 working day ? Tier2s: 90% of problems resolved < 1 working day ?  Post-mortem triggered when targets not met! 7 Time IntervalIssue (Tier0 Services)Target End 2008Consistent use of all WLCG Service Standards100% 30’Operator response to alarm / call to x5011 / alarm 99% 1 hourOperator response to alarm / call to x5011 / alarm 100% 4 hoursExpert intervention in response to above95% 8 hoursProblem resolved90% 24 hoursProblem resolved99%

Discussion Anything else we should do to improve communication regarding every day operation issues? Any other indicators to be included in the daily/weekly reports? Are you happy with those which are already there? Other procedures (like tests of alarm workflow) to be followed on regular basis to improve quality of operations?