Published by Joseph Young. Modified over 8 years ago.
1
Operations Workshop: Introduction and Goals
Markus Schulz, Ian Bird
Bologna, 24th May 2005
2
LCG/EGEE Operations Workshop, Bologna 24-26 May 2005

Outline
- Service challenges
- Goals of workshop:
  - Operations issues
  - User support
  - Fabric management
  - Release strategy for gLite
  - Joint operations OSG/EGEE
  - Resource allocation
3
LCG Project, Service Challenges

LCG Service Challenges – ramp up to LHC start-up service

[Timeline figure, 2005-2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]

Milestones:
- Jun 05 – Technical Design Report
- Sep 05 – SC3 Service Phase
- May 06 – SC4 Service Phase
- Sep 06 – Initial LHC Service in stable operation
- Apr 07 – LHC Service commissioned

Challenges:
- SC2 – Reliable data transfer (disk-network-disk) – 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3 – Reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
- SC4 – All Tier-1s, major Tier-2s – capable of supporting full experiment software chain inc. analysis – sustain nominal final grid data throughput
- LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput
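The throughput targets on this slide are linked by simple arithmetic. A minimal sketch, taking the quoted "~25%" at face value; the resulting nominal figure is an inference from the slide, not a number stated on it, and the function name is illustrative only:

```python
# Figures quoted on the slide.
SC3_THROUGHPUT_MB_S = 500        # SC3 grid data throughput, including mass storage
SC3_FRACTION_OF_NOMINAL = 0.25   # "~25% of the nominal final throughput"


def implied_nominal_mb_s(rate_mb_s: float, fraction: float) -> float:
    """Nominal rate implied by a rate quoted as a fraction of nominal."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return rate_mb_s / fraction


# Assumption: the "~25%" is treated as exactly 25%.
nominal = implied_nominal_mb_s(SC3_THROUGHPUT_MB_S, SC3_FRACTION_OF_NOMINAL)
twice_nominal = 2 * nominal  # April 2007 target: twice the nominal throughput

print(f"implied nominal throughput: ~{nominal:.0f} MB/sec")
print(f"twice nominal:              ~{twice_nominal:.0f} MB/sec")
```

Because the slide's percentage is approximate, these values are only order-of-magnitude guides to how far SC3 sits from the final service target.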
4
Why Service Challenges?
To test Tier-0, Tier-1, Tier-2 services
- Network service
  - Sufficient bandwidth: ~10 Gbit/sec
  - Backup path
  - Quality of service: security, help desk, error reporting, bug fixing, ...
- Robust file transfer service
  - File servers
  - File Transfer Software (GridFTP)
  - Data Management software (SRM, dCache)
  - Archiving service: tape servers, tape robots, tapes, tape drives, ...
- Sustainability
  - Weeks in a row un-interrupted 24/7 operation
  - Manpower implications: ~7 fte/site
  - Quality of service: helpdesk, error reporting, bug fixing, ...
- Towards a stable production environment for experiments
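To relate the ~10 Gbit/sec bandwidth figure to the MB/sec transfer rates used elsewhere in these slides, a rough unit conversion helps. The sketch below assumes decimal units and ignores protocol overhead; the 500 MB/sec comparison value is the aggregate rate quoted for SC2, and the function name is illustrative only:

```python
def gbit_to_mbyte_per_sec(gbit_per_sec: float) -> float:
    """Convert Gbit/sec to MB/sec (decimal units: 1 Gbit = 1000 Mbit, 8 bits per byte)."""
    return gbit_per_sec * 1000 / 8


link_mb_s = gbit_to_mbyte_per_sec(10)   # raw capacity of a 10 Gbit/sec link
sc2_aggregate_mb_s = 500                # aggregate rate quoted for SC2
utilisation = sc2_aggregate_mb_s / link_mb_s

print(f"10 Gbit/sec is about {link_mb_s:.0f} MB/sec before overhead")
print(f"500 MB/sec would use about {utilisation:.0%} of one such link")
```

In practice the usable rate is lower than the raw figure, which is one reason the slide also lists a backup path.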
5
Key Principles
- Service challenges result in a series of services that exist in parallel with baseline production service
- Rapidly and successively approach production needs of LHC
- Initial focus: core (data management) services
- Swiftly expand out to cover full spectrum of production and analysis chain
- Must be as realistic as possible, including end-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages
- Necessary resources and commitment pre-requisite to success!
- Effort should not be under-estimated!
6
Service Challenge 3 - Phases
High level view:
- Throughput phase
  - 2 weeks sustained in July 2005
  - “Obvious target” – GDB of July 20th
  - Primary goals: 150 MB/s disk – disk to Tier1s; 60 MB/s disk (T0) – tape (T1s)
  - Secondary goals:
    - Include a few named T2 sites (T2 -> T1 transfers)
    - Encourage remaining T1s to start disk – disk transfers
- Service phase
  - September – end 2005
  - Start with ALICE & CMS, add ATLAS and LHCb October/November
  - All offline use cases except for analysis
  - More components: WMS, VOMS, catalogs, experiment-specific solutions
  - Implies production setup (CE, SE, …)
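A sustained rate over two weeks translates into a substantial data volume. The sketch below works this out for a single stream running at each of the primary-goal rates; it assumes a perfectly constant rate and decimal terabytes, and it leaves open whether the quoted rates apply per Tier-1 or in aggregate, since the slide does not say:

```python
SECONDS_PER_DAY = 86_400


def volume_tb(rate_mb_s: float, days: float) -> float:
    """Data volume in TB (decimal, 1 TB = 1_000_000 MB) moved at a constant rate."""
    return rate_mb_s * days * SECONDS_PER_DAY / 1_000_000


THROUGHPUT_PHASE_DAYS = 14  # "2 weeks sustained"

disk_disk_tb = volume_tb(150, THROUGHPUT_PHASE_DAYS)  # 150 MB/s disk-disk goal
disk_tape_tb = volume_tb(60, THROUGHPUT_PHASE_DAYS)   # 60 MB/s disk-tape goal

print(f"150 MB/s for 14 days: {disk_disk_tb:.2f} TB")
print(f" 60 MB/s for 14 days: {disk_tape_tb:.2f} TB")
```

Any interruption lowers the delivered volume, so these figures are upper bounds for one stream at the stated rate.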
7
Basic Components For Setup Phase
- Each T1 to provide 10Gb network link to CERN
- Each T1 + T0 to provide SRM 1.1 interface to managed storage
  - This goes for the named T2s for the T2-T1 transfer tests too
- T0 to provide File Transfer Service; also at named T1s for T2-T1 transfer tests
- Baseline Services Working Group, Storage Management Workshop and SC3 Preparation Discussions have identified one additional data management service for SC3, namely the LFC
  - Not all experiments (ALICE) intend to use this
  - Nor will it be deployed for all experiments at each site
  - However, as many sites support multiple experiments, and will (presumably) prefer to offer common services, this can be considered a basic component
8
SC timescale implications
- SC3 will involve the Tier 1 sites (+ a few large Tier 2) in July
  - Must have the release to be used in SC3 available in mid-June
  - Involved sites must upgrade for July
  - Not reasonable to expect those sites to commit to other significant work (pre-production etc) on that timescale
  - T1: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and
- Expect SC3 release to include FTS, LFC, DPM, but otherwise be very similar to LCG-2.4.0
- September-December: experiment “production” verification of SC3 services; in parallel set up for SC4
- Expect “normal” support infrastructure (CICs, ROCs, GGUS) to support service challenge usage
- Bio-med also planning data challenges
  - Must make sure these are all correctly scheduled
9
Workshop goals
10
Operations issues – 1
- Metrics
  - We need a complete set of agreed metrics that:
    - Are publicly available, show evolution and history
    - Measure operations performance: reliability of service, reliability of sites, responsiveness to problems, failure rates, downtime/availability, etc.
    - Measure quality of service for the overall service and for individual sites
  - “Scheduled downtime” is still downtime …
- Deployment timescales and latency of upgrades
  - Deployment of releases takes much too long
  - Should be part of a site's quality of service metric
  - What is the problem and how can this be improved?
- General responsiveness of sites to problems
  - How can this be improved? Can the ROCs help? (they should!)
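The point that scheduled downtime is still downtime can be made concrete with a small availability calculation. This is only a sketch of one possible metric, not an agreed definition; the `Downtime` record, the `availability` function, and the example site figures are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Downtime:
    hours: float
    scheduled: bool  # True if the outage was announced in advance


def availability(period_hours: float, downtimes: list[Downtime],
                 count_scheduled: bool = True) -> float:
    """Fraction of the period during which the site was up.

    With count_scheduled=True (the view taken on the slide), announced
    outages reduce availability just like unannounced ones.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    down = sum(d.hours for d in downtimes
               if count_scheduled or not d.scheduled)
    return 1 - down / period_hours


# Hypothetical site record for a 30-day month (720 hours).
month = [Downtime(hours=24, scheduled=True), Downtime(hours=12, scheduled=False)]

strict = availability(720, month)                          # all downtime counts
lenient = availability(720, month, count_scheduled=False)  # scheduled ignored

print(f"strict:  {strict:.1%}")
print(f"lenient: {lenient:.1%}")
```

The gap between the two numbers is exactly the amount of downtime a lenient metric would hide from users of the service.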
11
Operations issues – 2
- Release strategy for gLite/LCG-2.x.x/SC3
  - Should be presented
  - Must be discussed, agreed, and committed to by the sites
- Resource allocation to new VOs
  - Is still not resolved. This may be a deeper issue related to funding of resources, but there is an expectation that sites within EGEE provide some minimal level of resource to new applications – this is not happening very much.
  - The workshop should try to understand whether this is a real issue for the sites: what is the reluctance to provide resources to new VOs?
- Joint operations with OSG
  - Can we identify specific areas of collaboration?
    - Common tools, common procedures, problem tracking
  - How close can we get to the idea of non-prime shift operational support of each other's grid service?
    - Would presumably need common tools and procedures
12
User support
- The current user support infrastructure is not viewed as effective
  - By most users
  - By grid experts who need to be part of the process (NB mostly they are via the rollout-list)
- We (all of the stakeholders) need to re-think how user support should be done:
  - We do need a managed process with problem tracking and management
  - But it should be as simple as possible
  - It should provide a simple way to get to existing expertise (as in the rollout list), but encourage others to contribute
  - It must work effectively for the users!!
    - This is a simple test – do the users use it because they recognise this is the way to get their problems addressed?
- This workshop must agree the way forward for user support
  - In a very aggressive manner – there is almost no credibility left
13
Fabric management
- Badly managed or unmanaged sites remain a source of operations problems
  - Last workshop recognised this
  - Very little progress since
- Need to start producing fabric management cookbook(s)
  - As proposed in the last workshop
  - Needed as part of SA1 deliverable
- This workshop must:
  - Agree what is required
  - Provide the plan, and find people to work on this
    - We need drafts of these within the next couple of months
  - Identify other ways to improve site stability and reliability