WLCG Services in 2009 ~~~ dCache WLCG T1 Data Management Workshop, 15th January 2009


Overview
- I have given myself a somewhat more general title than that proposed by Jos.
- I intend to cover the main operational issues as witnessed by the WLCG service in recent months, plus the expected "challenges" in the up-coming period.
- I will focus on the story as we tell it to the LHCC referees and other high-level review bodies, plus where and how we get this information…

Expectations
- I am not a believer in "emotive words" when describing services.
- Concrete, achievable, measurable objectives are much more useful and relevant (often referred to as "S.M.A.R.T.").
- We have some de-facto agreed "Key Performance Indicators" (which I'll explain…).
- Plus some Service Targets – derived over many years of practical experience and dialogue with the experiments – that we use to measure the performance of the service.

Critical Service Follow-up
- Targets (not commitments) proposed for Tier0 services; similar targets requested for Tier1s/Tier2s.
- Experience from the first week of CCRC'08 suggests targets for problem resolution should not be too high (if they are to be ~achievable).
- The MoU lists targets for responding to problems (12 hours for Tier1s):
  - Tier1s: 95% of problems resolved in < 1 working day?
  - Tier2s: 90% of problems resolved in < 1 working day?
- A post-mortem is triggered when targets are not met!

This activity was triggered by the experiments at the WLCG Overview Board. The proposal below has been formally accepted by the CMS Computing Management and presented to the MB and other bodies on several occasions. It continues to be the baseline against which we regularly check our response, particularly in the case of alarm tickets.

Time Interval   Issue (Tier0 Services)                               Target
End 2008        Consistent use of all WLCG Service Standards         100%
30'             Operator response to alarm / call to x5011 / alarm   99%
1 hour          Operator response to alarm / call to x5011 / alarm   100%
4 hours         Expert intervention in response to above             95%
8 hours         Problem resolved                                     90%
24 hours        Problem resolved                                     99%
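As a rough illustration of how such measurable targets can be checked, here is a minimal sketch in Python. The data model (a hypothetical Problem record with a resolution time), the helper names and the example values are assumptions for illustration only, not an existing WLCG or GGUS tool.

```python
# Minimal sketch (hypothetical data model, not a GGUS/WLCG tool): check problem
# resolution times against the Tier0 target ladder shown above.
from dataclasses import dataclass
from datetime import timedelta

# (time interval, fraction of problems that must be resolved within it)
TIER0_RESOLUTION_TARGETS = [
    (timedelta(hours=8), 0.90),
    (timedelta(hours=24), 0.99),
]
TIER1_TARGET = (timedelta(days=1), 0.95)  # proposed: 95% resolved < 1 working day

@dataclass
class Problem:                     # hypothetical record of one service problem
    site: str
    time_to_resolve: timedelta

def fraction_within(problems, limit):
    """Fraction of problems resolved within 'limit'."""
    if not problems:
        return 1.0
    return sum(p.time_to_resolve <= limit for p in problems) / len(problems)

def check_targets(problems, targets):
    """Return (limit, achieved, required, met?) for each target."""
    return [(limit, fraction_within(problems, limit), required,
             fraction_within(problems, limit) >= required)
            for limit, required in targets]

# Example with made-up resolution times for a Tier0 service:
problems = [Problem("CERN", timedelta(hours=3)),
            Problem("CERN", timedelta(hours=10)),
            Problem("CERN", timedelta(hours=30))]
for limit, achieved, required, ok in check_targets(problems, TIER0_RESOLUTION_TARGETS):
    print(f"<= {limit}: {achieved:.0%} resolved (target {required:.0%}) -> "
          f"{'OK' if ok else 'post-mortem'}")
```

A miss against any line of the ladder would be the cue for the post-mortem proposed above.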

How Are We Doing?
- Better and better, every day, every week and every month…
- But there are still avoidable holes and some areas of significant concern.
- The experiments are doing a lot of useful work, day in, day out, work-day, weekend and holiday! And they appreciate it and acknowledge it!
- I still believe that we can – easily – do better in some areas, with LESS effort and stress…

How?
- We still see "emergency" interventions that could be "avoided" – or at least foreseen and scheduled at a more convenient time.
- We still see scheduled interventions that are not sufficiently well planned – that run well into "overtime" or have to be "redone".
- Or those that are not well planned or discussed and have a big negative impact on ongoing production.
- The concrete examples that I have in mind refer to CERN and cover more or less uniformly the 25-year period that I have been working there… But they are not just at CERN.
- Take some time to plan your interventions: the experiments do understand; it will be less painful for you and typically less effort than alternatives… And that way I won't talk about you at the up-coming MB…

Getting More Concrete…
- Most of the time the WLCG services run well.
- Response to problems is usually (well) within the agreed – somewhat aggressive, IMHO – targets.
- But the problem is: what happens when things go (badly) wrong?
- In the best case there might be an unscheduled intervention for some hours – possibly following up to a weekend of effective downtime.

GGUS Summary – Last Week

VO      USER   TEAM   ALARM   TOTAL
ALICE   1      0      0       1
ATLAS
CMS     3      0      0       3
LHCb    2      0      1       3

- Tickets for ALICE are very rare (detail)
- We always look at alarm tickets (LHCb)
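As an illustration of how such a weekly per-VO summary could be tallied, here is a minimal sketch working from a hypothetical CSV export of last week's tickets; the file name, column names and the export format are assumptions, not the real GGUS interface.

```python
# Minimal sketch (hypothetical CSV export; not the real GGUS interface):
# tally last week's tickets per VO and ticket type for a summary like the one above.
import csv
from collections import Counter

TYPES = ["USER", "TEAM", "ALARM"]

def summarise(csv_path):
    """Count tickets per (VO, type) from a CSV with 'vo' and 'type' columns (assumed layout)."""
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[(row["vo"], row["type"].upper())] += 1
    return counts

def print_table(counts, vos=("ALICE", "ATLAS", "CMS", "LHCb")):
    """Print a VO / USER / TEAM / ALARM / TOTAL table."""
    print(f"{'VO':<8}" + "".join(f"{t:<8}" for t in TYPES) + "TOTAL")
    for vo in vos:
        per_type = [counts.get((vo, t), 0) for t in TYPES]
        print(f"{vo:<8}" + "".join(f"{n:<8}" for n in per_type) + f"{sum(per_type)}")

# Example usage (the file name is hypothetical):
# print_table(summarise("ggus_tickets_last_week.csv"))
```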

When This Happens…
- The most important thing is to keep people informed.
- You might not know what the problem is, and hence not be able to estimate when things will be solved – so say it! "We are currently investigating the problem. More news at xx:xx."
- An update at the daily WLCG operations call – either by dialing in or by … – is valuable and appreciated.
- Think also of providing a Service Incident Report – this (IMHO) is useful for you as service providers, plus your colleagues at other labs and your users.
- It doesn't have to be heavyweight – just a brief analysis of the problem, resolution, lessons learned, actions, …

When It Gets Worse…
- "It won't happen to us"
- It is even more important to follow the above.

In Other Words…
Change Management
- Plan and communicate changes carefully;
- Do not make untested changes on production systems – these can be extremely costly to recover from.
Incident Management
- The point is to learn from the experience and hopefully avoid similar problems in the future;
- Documenting clearly what happened, together with possible action items, is essential.

The Goal
- The goal is that – by end 2009 – the weekly WLCG operations / service report is quasi-automatically generated 3 weeks out of 4, with no major service incidents.
- We are currently very far from this target, with (typically) multiple service incidents that are either:
  - New in a given week;
  - Still being investigated or resolved several to many weeks later.
- By definition, such incidents are characterized by severe (or total) loss of a service, or even of a complete site (or even a Cloud, in the case of ATLAS).
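As a sketch of what "quasi-automatically generated" could mean in practice, the following checks whether a given week had no major incidents opened and none still dragging on from earlier weeks; the Incident record, the "major" severity label and the week-based bookkeeping are all hypothetical, not an existing WLCG data model or tool.

```python
# Minimal sketch (hypothetical incident records, not an existing WLCG tool):
# a week qualifies as "quiet" if no major incident was opened in it and none
# from earlier weeks is still open (severe/total loss of a service or site).
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Incident:                       # hypothetical record of one service incident
    site: str
    severity: str                     # e.g. "major"
    opened: date
    closed: Optional[date] = None     # None while still being investigated/resolved

def week_is_quiet(incidents, week_start):
    """True if no major incident was opened this week and none is still open at its end."""
    week_end = week_start + timedelta(days=7)
    for inc in incidents:
        if inc.severity != "major":
            continue
        opened_this_week = week_start <= inc.opened < week_end
        still_open = inc.closed is None or inc.closed >= week_end
        if opened_this_week or (inc.opened < week_start and still_open):
            return False
    return True

def fraction_of_quiet_weeks(incidents, first_week, n_weeks=4):
    """Fraction of n_weeks that were quiet; the stated target is >= 3 out of 4."""
    quiet = sum(week_is_quiet(incidents, first_week + timedelta(weeks=i))
                for i in range(n_weeks))
    return quiet / n_weeks
```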

Back to dCache…
- It is important for everybody that services supporting the main production Use Cases (import/export of data, reprocessing, MC where applicable) run smoothly and predictably – which does not mean without "hiccoughs" – with reasonable operational effort, as soon as possible.
- In reality, whilst data exchange continues to be well tested, reprocessing (also multi-VO where applicable), not to mention analysis, is still somewhat uncharted territory.
- Which means that there will be problems. And more problems. If you don't believe me, just wait…

So What Do We Do?
- The most obvious strategy is to push for testing of multi-VO re-processing activities that is as early and as thorough as possible.
- I don't (and didn't at the time of LEP either) believe in arbitrary numbers like # tape mounts, throughput etc. – what counts is "Can you support production?" under realistic and agreed conditions.
- It may be necessary to implement some policies to limit or avoid inter- or intra-VO "interference".
- For the 2009 run – and for ATLAS at least that starts with cosmics from April – there is no time for "buying your way" out of the problem…
- And that's before we get to analysis…

We Ain't Seen Nothing Yet…
- If we can get the services into a reasonable state by (or well before) mid-year, we have some chance of being ready for much larger numbers of users, much more chaotic access patterns and much more "disorganized" attempts to access and process data.
- Which is actually what we need to support – the users' ability to extract science from the data and to maximize the potential of the detectors and the machine.
- Thus I believe it is urgent that we clarify what "analysis services" mean for all sites.

So What Does This Mean?
1. Stability – up to the limits that are currently possible
2. Clarity – where not (well, always…)
- Please use the existing meetings and infrastructure where applicable.
- Join the daily WLCG operations con-call regularly – particularly when there have been problems at your site and/or when upcoming interventions are foreseen.
- So far BNL and PIC (amongst dCache WLCG T1 sites) join very regularly, NIKHEF often – others rarely or never…
- They really do help, not only with information flow but also in maintaining a calm(-er) and (more) relaxed atmosphere…

Summary
- We started off being a Data Grid
- We continue to be a Data Grid
- We will always be a Data Grid
- Data is what sets us apart
- Data is the challenge