Operations Workshop – Introduction and Goals
Markus Schulz, Ian Bird
Bologna, 24th May 2005

Outline
- Service challenges
- Goals of the workshop:
  - Operations issues
  - User support
  - Fabric management
  - Release strategy for gLite
  - Joint operations OSG/EGEE
  - Resource allocation

LCG Service Challenges – ramp up to the LHC start-up service

[Timeline figure: SC2 → SC3 → SC4 → LHC Service Operation, running up through cosmics, first beams, first physics and the full physics run, with these milestones:]
- Jun 05 – Technical Design Report
- Sep 05 – SC3 service phase
- May 06 – SC4 service phase
- Sep 06 – initial LHC service in stable operation
- Apr 07 – LHC service commissioned

- SC2 – reliable data transfer (disk – network – disk) – 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3 – reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period; see the arithmetic sketch below)
- SC4 – all Tier-1s, major Tier-2s – capable of supporting the full experiment software chain, including analysis – sustain the nominal final grid data throughput
- LHC service in operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput
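These throughput figures pin each other down: from the slide's own numbers (SC3 at 500 MB/sec being ~25% of nominal, and a commissioned service handling twice nominal), the implied targets follow by simple arithmetic. A minimal sketch; the derived values are illustrative back-of-the-envelope numbers, not official planning figures.

```python
# Throughput ramp implied by this slide. Only the SC2/SC3 figures
# (500 MB/s, ~25% of nominal) and the "twice nominal" commissioning
# target come from the slide; the rest is derived arithmetic.

SC2_AGGREGATE_MBPS = 500          # MB/s, 5 Tier-1s sustained at CERN
SC3_THROUGHPUT_MBPS = 500         # MB/s, including mass storage
SC3_FRACTION_OF_NOMINAL = 0.25    # "~25% of the nominal final throughput"

nominal_mbps = SC3_THROUGHPUT_MBPS / SC3_FRACTION_OF_NOMINAL
commissioned_mbps = 2 * nominal_mbps   # "twice the nominal data throughput"

print(f"Per-T1 share in SC2:      {SC2_AGGREGATE_MBPS / 5:.0f} MB/s")
print(f"Nominal final throughput: {nominal_mbps:.0f} MB/s (~2 GB/s)")
print(f"Commissioning target:     {commissioned_mbps:.0f} MB/s")
```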

Why Service Challenges?
To test Tier-0 → Tier-1 → Tier-2 services:
- Network service
  - Sufficient bandwidth: ~10 Gbit/sec
  - Backup path
  - Quality of service: security, help desk, error reporting, bug fixing, …
- Robust file transfer service (see the retry sketch below)
  - File servers
  - File transfer software (GridFTP)
  - Data management software (SRM, dCache)
  - Archiving service: tape servers, tape robots, tapes, tape drives, …
- Sustainability
  - Weeks in a row of uninterrupted 24/7 operation
  - Manpower implications: ~7 FTE/site
  - Quality of service: help desk, error reporting, bug fixing, …
- Towards a stable production environment for the experiments
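The "robust file transfer service" bullet is the heart of the challenges: transfers must survive glitches without manual babysitting. Below is a minimal sketch of that idea, assuming the standard Globus GridFTP client globus-url-copy is installed and a valid grid proxy exists; the endpoints are hypothetical placeholders, not real SC3 hosts.

```python
# Minimal sketch: retrying wrapper around a GridFTP transfer, the kind of
# robustness the challenges are meant to exercise. Assumes globus-url-copy
# (the Globus GridFTP client) is on PATH and a grid proxy is valid; the
# URLs below are invented placeholders.
import subprocess
import time

SRC = "gsiftp://tier0.example.org/data/run001.raw"   # hypothetical source
DST = "gsiftp://tier1.example.org/data/run001.raw"   # hypothetical destination

def transfer_with_retry(src: str, dst: str, attempts: int = 5) -> bool:
    """Run globus-url-copy, backing off and retrying on transient failure."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            return True
        wait = 30 * attempt   # simple linear back-off between attempts
        print(f"attempt {attempt} failed (rc={result.returncode}); retrying in {wait}s")
        time.sleep(wait)
    return False

if __name__ == "__main__":
    ok = transfer_with_retry(SRC, DST)
    print("transfer", "succeeded" if ok else "failed permanently")
```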

Key Principles
- The service challenges result in a series of services that exist in parallel with the baseline production service
- Rapidly and successively approach the production needs of the LHC
- Initial focus: core (data management) services
- Swiftly expand out to cover the full spectrum of the production and analysis chain
- Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods, with recovery from glitches and longer-term outages
- The necessary resources and commitment are a pre-requisite to success!
- The effort should not be under-estimated!

Service Challenge 3 – Phases
High-level view:
- Throughput phase
  - 2 weeks sustained in July 2005
  - "Obvious target" – GDB of July 20th
  - Primary goals (implied volumes worked out in the sketch below):
    - 150 MB/s disk – disk to Tier-1s
    - 60 MB/s disk (T0) – tape (T1s)
  - Secondary goals:
    - Include a few named T2 sites (T2 → T1 transfers)
    - Encourage the remaining T1s to start disk – disk transfers
- Service phase
  - September – end of 2005
  - Start with ALICE & CMS; add ATLAS and LHCb in October/November
  - All offline use cases except analysis
  - More components: WMS, VOMS, catalogs, experiment-specific solutions
  - Implies a production setup (CE, SE, …)
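It is worth translating the throughput-phase rates into volumes, since that is what the storage systems at each site must actually absorb over the two sustained weeks. The rates come from the slide; the conversion is plain arithmetic.

```python
# Back-of-the-envelope volumes implied by the SC3 throughput-phase targets.
# Rates come from the slide; the conversion to volume is simple arithmetic.

SECONDS_PER_DAY = 86_400
DAYS = 14                        # "2 weeks sustained in July 2005"

disk_rate_mbps = 150             # MB/s disk-disk to each Tier-1
tape_rate_mbps = 60              # MB/s disk (T0) to tape (T1s)

disk_volume_tb = disk_rate_mbps * SECONDS_PER_DAY * DAYS / 1e6
tape_volume_tb = tape_rate_mbps * SECONDS_PER_DAY * DAYS / 1e6

print(f"Disk-disk per T1 over {DAYS} days: ~{disk_volume_tb:.0f} TB")
print(f"T0->tape per T1 over {DAYS} days:  ~{tape_volume_tb:.0f} TB")
```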

Basic Components for the Setup Phase
- Each T1 to provide a 10 Gb network link to CERN
- Each T1 + the T0 to provide an SRM 1.1 interface to managed storage
  - This goes for the named T2s in the T2–T1 transfer tests too
- T0 to provide the File Transfer Service; also at the named T1s for the T2–T1 transfer tests (see the usage sketch below)
- The Baseline Services Working Group, the Storage Management Workshop and the SC3 preparation discussions have identified one additional data management service for SC3, namely the LFC
  - Not all experiments (ALICE) intend to use this
  - Nor will it be deployed for all experiments at each site
  - However, as many sites support multiple experiments, and will (presumably) prefer to offer common services, it can be considered a basic component
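For concreteness, a hedged sketch of what using these basic components looks like from a script: submitting a transfer between two SRM-fronted storage systems to FTS and polling it. The glite-transfer-* commands shipped with gLite FTS, but exact names, options and state strings varied by release, so treat everything below, including the endpoints and terminal states, as assumptions for illustration rather than a verified recipe.

```python
# Illustrative sketch of driving an FTS transfer between two SRM endpoints.
# Command names/flags and job states are assumptions based on the gLite FTS
# CLI of the era; all endpoints and paths are invented placeholders.
import subprocess
import time

FTS_SERVICE = "https://fts.example.org:8443/fts"       # hypothetical FTS endpoint
SRC = "srm://t0-srm.example.org/castor/data/file001"   # hypothetical SRM 1.1 SURLs
DST = "srm://t1-srm.example.org/dpm/data/file001"

# Submit the transfer job; the job ID is printed on stdout.
job_id = subprocess.run(
    ["glite-transfer-submit", "-s", FTS_SERVICE, SRC, DST],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Poll the job until it reaches a terminal state (assumed state names).
while True:
    state = subprocess.run(
        ["glite-transfer-status", "-s", FTS_SERVICE, job_id],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(job_id, state)
    if state in ("Done", "Finished", "Failed", "Canceled"):
        break
    time.sleep(60)
```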

SC timescale implications
- SC3 will involve the Tier-1 sites (+ a few large Tier-2s) in July
  - Must have the release to be used in SC3 available in mid-June
  - Involved sites must upgrade for July
  - Not reasonable to expect those sites to commit to other significant work (pre-production etc.) on that timescale
  - T1s: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and …
- Expect the SC3 release to include FTS, LFC and DPM, but otherwise to be very similar to LCG
- September–December: experiment "production" verification of SC3 services; in parallel, set up for SC4
- Expect the "normal" support infrastructure (CICs, ROCs, GGUS) to support service-challenge usage
- Bio-med is also planning data challenges
  - Must make sure these are all correctly scheduled

Workshop goals

Operations issues – 1
- Metrics
  - We need a complete set of agreed metrics that:
    - are publicly available, showing evolution and history
    - measure operations performance: reliability of the service, reliability of sites, responsiveness to problems, failure rates, downtime/availability, etc.
    - measure quality of service, for the overall service and for individual sites
  - "Scheduled downtime" is still downtime … (made precise in the sketch below)
- Deployment timescales and latency of upgrades
  - Deployment of releases takes much too long
  - Should be part of a site's quality-of-service metric
  - What is the problem, and how can this be improved?
- General responsiveness of sites to problems
  - How can this be improved?
  - Can the ROCs help? (They should!)
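The remark that "scheduled downtime is still downtime" is easy to make precise: compute availability over raw wall-clock time, with no exclusion window for scheduled work. A minimal sketch, with invented interval data for illustration.

```python
# Minimal sketch of a site-availability metric in which scheduled downtime
# counts as downtime, per the slide. The interval data is illustrative.
from datetime import datetime, timedelta

# Downtime intervals as (start, end, scheduled?) tuples -- invented examples.
downtimes = [
    (datetime(2005, 5, 2, 8), datetime(2005, 5, 2, 20), True),   # scheduled upgrade
    (datetime(2005, 5, 10, 3), datetime(2005, 5, 10, 9), False), # unscheduled outage
]

period_start = datetime(2005, 5, 1)
period_end = datetime(2005, 6, 1)
period = period_end - period_start

# No exclusion for scheduled work: every downtime interval counts.
down = sum((end - start for start, end, _ in downtimes), timedelta())

availability = 1 - down / period
print(f"Availability (scheduled downtime counted): {availability:.2%}")
```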

Operations issues – 2
- Release strategy for gLite/LCG-2.x.x/SC3
  - Should be presented
  - Must be discussed, agreed, and committed to by the sites
- Resource allocation to new VOs
  - Is still not resolved. This may be a deeper issue related to the funding of resources, but there is an expectation that sites within EGEE provide some minimal level of resource to new applications – this is not happening very much.
  - The workshop should try to understand whether this is a real issue for the sites: what is the reluctance to provide resources to new VOs?
- Joint operations with OSG
  - Can we identify specific areas of collaboration?
    - Common tools, common procedures, problem tracking
  - How close can we get to the idea of non-prime-shift operational support of each other's grid service?
    - Would presumably need common tools and procedures

User support
- The current user support infrastructure is not viewed as effective
  - by most users
  - by the grid experts who need to be part of the process (NB: mostly they are, via the rollout list)
- We (all of the stakeholders) need to re-think how user support should be done:
  - We do need a managed process, with problem tracking and management
  - But it should be as simple as possible
  - It should provide a simple way to get to existing expertise (as in the rollout list), while encouraging others to contribute
  - It must work effectively for the users!!
    - This is the simple test: do the users use it because they recognise it as the way to get their problems addressed?
- This workshop must agree the way forward for user support
  - and in a very aggressive manner – there is almost no credibility left

Fabric management
- Badly managed or unmanaged sites remain a source of operations problems
  - The last workshop recognised this
  - Very little progress since
- Need to start producing fabric-management cookbook(s)
  - As proposed at the last workshop
  - Needed as part of an SA1 deliverable
- This workshop must:
  - agree what is required
  - provide the plan, and find people to work on it
    - We need drafts within the next couple of months
  - identify other ways to improve site stability and reliability