Operations Workshop Introduction and Goals. Markus Schulz, Ian Bird. Bologna, 24th May 2005.

1 Operations Workshop Introduction and Goals. Markus Schulz, Ian Bird. Bologna, 24th May 2005

2 Outline
(LCG/EGEE Operations Workshop, Bologna, 24-26 May 2005)
- Service challenges
- Goals of workshop:
  - Operations issues
  - User support
  - Fabric management
  - Release strategy for gLite
  - Joint operations OSG/EGEE
  - Resource allocation

3 LCG Service Challenges – ramp up to LHC start-up service
(LCG Project, Service Challenges)
[Timeline figure, 2005-2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]
Milestones:
- Jun 05: Technical Design Report
- Sep 05: SC3 Service Phase
- May 06: SC4 Service Phase
- Sep 06: Initial LHC Service in stable operation
- Apr 07: LHC Service commissioned
Phases:
- SC2: Reliable data transfer (disk-network-disk); 5 Tier-1s; aggregate 500 MB/sec sustained at CERN
- SC3: Reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
- SC4: All Tier-1s, major Tier-2s; capable of supporting full experiment software chain inc. analysis; sustain nominal final grid data throughput
- LHC Service in Operation: September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
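The slide gives the SC3 target both as an absolute rate and as a fraction of the nominal rate, so the nominal figure can be estimated. The following is a minimal sketch that treats the quoted "~25%" as exact, so the derived values are approximate; the variable names are illustrative and not from the slides.

```python
# Figures quoted on the slide
sc3_rate_mb_s = 500.0            # SC3 grid data throughput target, MB/sec
sc3_fraction_of_nominal = 0.25   # "~25% of the nominal final throughput"

# Derived values (assumption: the approximate 25% is taken at face value)
nominal_mb_s = sc3_rate_mb_s / sc3_fraction_of_nominal   # SC4 target
twice_nominal_mb_s = 2 * nominal_mb_s                    # LHC service capability

print(f"nominal final throughput ~ {nominal_mb_s:.0f} MB/sec")
print(f"twice nominal            ~ {twice_nominal_mb_s:.0f} MB/sec")
```

Under that assumption, SC4 would need to sustain roughly four times the SC3 rate, and the commissioned service roughly eight times.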

4 Why Service Challenges?
To test Tier-0 ↔ Tier-1 ↔ Tier-2 services
- Network service
  - Sufficient bandwidth: ~10 Gbit/sec
  - Backup path
  - Quality of service: security, help desk, error reporting, bug fixing, ...
- Robust file transfer service
  - File servers
  - File Transfer Software (GridFTP)
  - Data Management software (SRM, dCache)
  - Archiving service: tape servers, tape robots, tapes, tape drives, ...
- Sustainability
  - Weeks in a row un-interrupted 24/7 operation
  - Manpower implications: ~7 fte/site
  - Quality of service: helpdesk, error reporting, bug fixing, ...
- Towards a stable production environment for experiments

5 Key Principles
- Service challenges result in a series of services that exist in parallel with baseline production service
- Rapidly and successively approach production needs of LHC
- Initial focus: core (data management) services
- Swiftly expand out to cover full spectrum of production and analysis chain
- Must be as realistic as possible, including end-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages
- Necessary resources and commitment pre-requisite to success!
- Effort should not be under-estimated!

6 Service Challenge 3 - Phases
High level view:
- Throughput phase
  - 2 weeks sustained in July 2005
    - “Obvious target”: GDB of July 20th
  - Primary goals:
    - 150 MB/s disk-disk to Tier1s
    - 60 MB/s disk (T0) to tape (T1s)
  - Secondary goals:
    - Include a few named T2 sites (T2 -> T1 transfers)
    - Encourage remaining T1s to start disk-disk transfers
- Service phase
  - September to end 2005
    - Start with ALICE & CMS, add ATLAS and LHCb October/November
  - All offline use cases except for analysis
  - More components: WMS, VOMS, catalogs, experiment-specific solutions
  - Implies production setup (CE, SE, …)
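To get a feel for what "2 weeks sustained" means in data volume, the primary-goal rates can be multiplied out. This is a rough sketch only: the slide does not say whether the rates are per Tier-1 or aggregate, so the function below (a hypothetical helper, not part of any grid tool) simply gives the volume moved by one stream held at the stated rate, using decimal units (1 TB = 10^6 MB).

```python
SECONDS_PER_DAY = 86_400

def sustained_volume_tb(rate_mb_s: float, days: float) -> float:
    """Volume in TB moved by one stream held at rate_mb_s for the given days."""
    return rate_mb_s * days * SECONDS_PER_DAY / 1e6

# Rates and duration from the slide; per-site vs aggregate is not stated there
disk_disk_tb = sustained_volume_tb(150, 14)   # 150 MB/s disk-disk
disk_tape_tb = sustained_volume_tb(60, 14)    # 60 MB/s disk (T0) to tape (T1s)

print(f"disk-disk: {disk_disk_tb:.1f} TB over two weeks")
print(f"disk-tape: {disk_tape_tb:.1f} TB over two weeks")
```

Any interruption during the two weeks lowers the achieved volume, which is why the target is phrased as a sustained rate rather than a peak.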

7 Basic Components For Setup Phase
- Each T1 to provide 10Gb network link to CERN
- Each T1 + T0 to provide SRM 1.1 interface to managed storage
  - This goes for the named T2s for the T2-T1 transfer tests too
- T0 to provide File Transfer Service; also at named T1s for T2-T1 transfer tests
- Baseline Services Working Group, Storage Management Workshop and SC3 Preparation Discussions have identified one additional data management service for SC3, namely the LFC
  - Not all experiments (ALICE) intend to use this
  - Nor will it be deployed for all experiments at each site
  - However, as many sites support multiple experiments, and will (presumably) prefer to offer common services, this can be considered a basic component

8 SC timescale implications
- SC3 will involve the Tier 1 sites (+ a few large Tier 2) in July
  - Must have the release to be used in SC3 available in mid-June
  - Involved sites must upgrade for July
  - Not reasonable to expect those sites to commit to other significant work (pre-production etc) on that timescale
  - T1: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and
- Expect SC3 release to include FTS, LFC, DPM, but otherwise be very similar to LCG-2.4.0
- September-December: experiment “production” verification of SC3 services; in parallel set up for SC4
- Expect “normal” support infrastructure (CICs, ROCs, GGUS) to support service challenge usage
- Bio-med also planning data challenges
  - Must make sure these are all correctly scheduled

9 Workshop goals

10 Operations issues – 1
- Metrics
  - We need a complete set of agreed metrics that:
    - Are publicly available, show evolution and history
    - Measure operations performance: reliability of service, reliability of sites, responsiveness to problems, failure rates, downtime/availability, etc.
    - Measure quality of service for the overall service and for individual sites
    - “Scheduled downtime” is still downtime …
- Deployment timescales and latency of upgrades
  - Deployment of releases takes much too long
  - Should be part of a site's quality of service metric
  - What is the problem and how can this be improved?
- General responsiveness of sites to problems
  - How can this be improved?
  - Can the ROCs help? (they should!)
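The point that "scheduled downtime" is still downtime can be made concrete with a small availability calculation. This is an illustrative sketch, not an agreed LCG/EGEE metric definition: the `Outage` record, the `availability` function and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Outage:
    hours: float
    scheduled: bool   # True if the downtime was announced in advance

def availability(period_hours: float, outages: list[Outage],
                 count_scheduled: bool = True) -> float:
    """Fraction of the period the site was up.

    With count_scheduled=True (the view taken on the slide), announced
    maintenance counts against the site just like an unplanned failure.
    """
    down = sum(o.hours for o in outages if count_scheduled or not o.scheduled)
    return (period_hours - down) / period_hours

# Hypothetical week for one site: 12 h planned upgrade, 9 h unplanned failure
week_hours = 7 * 24
outages = [Outage(hours=12, scheduled=True), Outage(hours=9, scheduled=False)]

user_view = availability(week_hours, outages)                         # all downtime
site_view = availability(week_hours, outages, count_scheduled=False)  # unplanned only

print(f"counting scheduled downtime:  {user_view:.1%}")
print(f"excluding scheduled downtime: {site_view:.1%}")
```

The gap between the two figures is the part of the service loss that a metric excluding scheduled downtime would hide from users.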

11 Operations issues – 2
- Release strategy for gLite/LCG-2.x.x/SC3
  - Should be presented
  - Must be discussed, agreed, and committed to by the sites
- Resource allocation to new VOs
  - Is still not resolved. This may be a deeper issue related to funding of resources, but there is an expectation that sites within EGEE provide some minimal level of resource to new applications; this is not happening very much.
  - The workshop should try to understand whether this is a real issue for the sites: what is the reluctance to provide resources to new VOs?
- Joint operations with OSG
  - Can we identify specific areas of collaboration?
    - Common tools, common procedures, problem tracking
  - How close can we get to the idea of non-prime shift operational support of each other's grid service?
    - Would presumably need common tools and procedures

12 User support
- The current user support infrastructure is not viewed as effective
  - By most users
  - By grid experts who need to be part of the process (NB mostly they are via the rollout-list)
- We (all of the stakeholders) need to re-think how user support should be done:
  - We do need a managed process with problem tracking and management
  - But it should be as simple as possible
  - It should provide a simple way to get to existing expertise (as in the rollout list), but encourage others to contribute
- It must work effectively for the users!!
  - This is a simple test: do the users use it because they recognise this is the way to get their problems addressed?
- This workshop must agree the way forward for user support
  - In a very aggressive manner: there is almost no credibility left

13 Fabric management
- Badly managed or unmanaged sites remain a source of operations problems
  - Last workshop recognised this
  - Very little progress since
- Need to start producing fabric management cookbook(s)
  - As proposed in the last workshop
  - Needed as part of SA1 deliverable
- This workshop must:
  - Agree what is required
  - Provide the plan, and find people to work on this
    - We need drafts of these within the next couple of months
  - Identify other ways to improve site stability and reliability
