WLCG Collaboration Workshop: Outlook for 2009 – 2010




Jamie.Shiers@cern.ch ~~~ WLCG Grid Deployment Board, April 2009

Workshop Summary

Key Messages
- We have a good understanding of what “WLCG Operations” means and have run this model over the past year (and more…)
- We should (obviously) use it for STEP’09 as well as for 2009 / 2010 data taking, (re-)processing, analysis etc.
- It works! It is sustainable! But no doubt it can / will be improved…
- The SCOD rota has started…

Introduction
- This report is primarily about the service since last week’s MB, but I would like to start with a brief summary of the quarter, based on the draft contribution to the LCG QR.
- Although this is not an exhaustive list of all serious problems in the quarter, we can still draw some conclusions from it:

Site   When    Issue
CERN   08/01   Many user jobs killed on lxbatch due to memory problems
CERN   17/01   FTS transfer problems for ATLAS
CERN   23/01   FTS / SRM / CASTOR transfer problems for ATLAS
CERN   26/01   Backwards incompatible change on SRM affected ATLAS / LHCb
CERN   27/02   Accidental deletion of RAID volumes in C2PUBLIC
CERN   04/03   General CASTOR outage for 3 hours
CERN   14/03   CASTOR ATLAS outage for 12 hours
CNAF   21/02   Network outage to Tier2s and some Tier1s
FZK    24/01   FTS & LFC down for 3 days
ASGC   25/02   Fire affecting all site; services temporarily relocated
RAL    24/03   Site down after power glitches. Knock-on effects for several days

QR – Conclusions (1/2)
- Not all sites are yet reporting problems consistently: some appear ‘only’ in broadcast messages, which makes them very hard to track and (IMHO) impossible to learn from
- If you don’t learn, you are destined to repeat. For example, from a single joint operations meeting (Indico 55792):
    SARA: OUTAGE: From 02:00 4 April to 02:00 5 April. Service: dCache SE.
    SARA: OUTAGE: From 09:30 30 March to 21:00 30 March. Service: srm.grid.sara.nl.
    SARA: OUTAGE: From 15:13 27 March to 02:00 31 March. Service: celisa.grid.sara.nl. Fileserver malfunction.
    CERN: At Risk: From 11:00 31 March to 12:00 31 March. Service: VOMS (lcg-voms.cern.ch).
    FZK: OUTAGE: From 14:21 30 March to 20:00 30 March. Service: fts-fzk.gridka.de
    INFN-CNAF: OUTAGE: From 02:00 28 March to 19:00 3 April. Service: ENTIRE SITE.
    INFN-T1: OUTAGE: From 16:00 27 March to 17:00 3 April. Service: ENTIRE SITE.
    NDGF-T1: At risk: From 12:31 27 March to 16:31 30 March. Service: srm.ndgf.org (ATLAS).
    NDGF-T1: At risk: From 12:31 27 March to 13:27 31 March. Service: ce01.titan.uio.no.
- As per previous estimates, one site outage per month (Tier0 + Tier1s) due to power and cooling is to be expected
- It is very important to keep track of these through the daily operations meetings and weekly summaries
- We must improve on this in the current (STEP’09) quarter: all significant service / site problems need to be reported, and some minimal analysis (as discussed at the WLCG Collaboration workshop) provided spontaneously
- I believe that there should be some SERVICE METRICS, as well as experiment metrics, for STEP’09 which should reflect the above
- See GDB tomorrow; they are not new, by the way!
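To make broadcast-style entries like those quoted above easier to track, one could turn each into a downtime duration. The following is only a rough sketch, not a tool used by WLCG operations: the helper names (`parse_when`, `outage_hours`), the regular expression, and the assumption that all dates fall in 2009 are illustrative choices, and only three of the entries are included.

```python
import re
from datetime import datetime

# Three entries copied from the list above. The year is not stated in
# the entries, so 2009 is an assumption.
ENTRIES = [
    "SARA: OUTAGE: From 02:00 4 April to 02:00 5 April. Service: dCache SE.",
    "FZK: OUTAGE: From 14:21 30 March to 20:00 30 March. Service: fts-fzk.gridka.de",
    "INFN-CNAF: OUTAGE: From 02:00 28 March to 19:00 3 April. Service: ENTIRE SITE.",
]

# Hypothetical pattern for "<site>: <kind>: From <time> to <time>. Service: <name>"
PATTERN = re.compile(
    r"^(?P<site>[^:]+): (?P<kind>[^:]+): From (?P<start>\d{1,2}:\d{2} \d{1,2} \w+) "
    r"to (?P<end>\d{1,2}:\d{2} \d{1,2} \w+)\. Service: (?P<service>.+?)\.?$"
)

# Only the months that occur in the quoted entries.
MONTHS = {"March": 3, "April": 4}


def parse_when(text, year=2009):
    """Parse a string such as '02:00 4 April' into a datetime."""
    clock, day, month = text.split()
    hour, minute = (int(part) for part in clock.split(":"))
    return datetime(year, MONTHS[month], int(day), hour, minute)


def outage_hours(entry, year=2009):
    """Return the announced duration of one entry, in hours."""
    match = PATTERN.match(entry)
    if match is None:
        raise ValueError(f"unrecognised entry: {entry!r}")
    delta = parse_when(match["end"], year) - parse_when(match["start"], year)
    return delta.total_seconds() / 3600


for entry in ENTRIES:
    site = PATTERN.match(entry)["site"]
    print(f"{site}: {outage_hours(entry):.2f} h")
```

Entries with trailing free text (such as "Fileserver malfunction.") would need a looser pattern; the sketch deliberately fails loudly on anything it does not recognise.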

Site Metric(s)
- We can assume a small number of power, cooling and other infrastructure-related problems
- Important to see improvements in ordered recovery and communication!
- ALL major problems should be reported (see Olof’s presentation at the pre-CHEP workshop)
- An EGEE broadcast is not enough!
- Target: zero (0) major service interruptions or degradations for which no “Service Incident Report” is produced
- The number and type of events should preferably be lower than in Q1 2009 (but STEP’09 activity may preclude this…)
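The target above amounts to a simple count: major interruptions or degradations with no Service Incident Report should number zero. A minimal sketch of that check follows. The `Incident` record, its field names, and the sample rows are all hypothetical placeholders, not real incident data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    site: str       # placeholder label, not a real site
    major: bool     # counted as a major interruption or degradation
    has_sir: bool   # a Service Incident Report was produced


def unreported_major(incidents):
    """Return the major incidents for which no SIR exists."""
    return [i for i in incidents if i.major and not i.has_sir]


# Purely illustrative records.
incidents = [
    Incident("SITE-A", major=True, has_sir=True),
    Incident("SITE-B", major=True, has_sir=False),
    Incident("SITE-C", major=False, has_sir=False),
]

missing = unreported_major(incidents)
print(f"major incidents without SIR: {len(missing)} (target: 0)")
```

What counts as "major" is left to whoever fills in the records; the slides do not define a threshold.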

Pre-STEP Planning
- Are there any key features that the experiments are waiting for?
  - e.g. new LFC bulk methods for ATLAS: targeted for April 30
  - ATLAS have certified these already AFAIK
- Precise storage-ware versions: we should avoid using version X for STEP’09 and Y (with a different feature set) for pp data taking!
- No time for (or need of?) dedicated meetings à la CCRC’08
- Pursue the discussion on analysis support, as kicked off at the workshop, in a pre-GDB, e.g. in May?
- Foresee a small number of LCG Service Coordination Meetings at CERN; these have been suspended since late 2008 (service is globally rather stable…)
- A CMS-style report for the daily operations meeting is useful; other experiments are encouraged to adopt this: https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports
- Expect representation per experiment on all / most days: this is not always the case today, particularly for ALICE & LHCb (typically represented by IT-GS)

WLCG timeline 2009-2010
[Timeline figure: months Jan to Dec across 2009, 2010 and 2011. Items shown: SU; pp running; HI?; STEP’09 (May + June); 2009 capacity commissioned; 2010 capacity commissioned; switch to SL5/64bit completed?; deployment of glexec/SCAS, CREAM, SRM upgrades, SL5 WN; workshops (??); EGEE-III ends; EGI ... ???]
- A pre-CHEP 2010 (17-22 Oct) workshop probably does not make sense, nor does a more restricted A-P event (IMHO)

Summary
- WLCG has been in production mode for a long time now: nothing changes here!
- We should [ continue to ] follow the service closely, monitoring significant degradations and outages
- We would like to see an improvement quarter by quarter, even with increased load, for the main production services
- Analysis support will bring new problems, but good progress was reported at the workshop; we should actively follow up on this! (e.g. possible pre-GDB or other meeting ~May)
- [ IMHO we, WLCG operations, should engage with the nascent EGI organization and make sure nothing breaks! ]