CCRC Status & Readiness for Data Taking ~~~ June 2008 ~~~ June F2F Meetings.

Are We Ready?

- Based on the experience in the February and May runs of the Common Computing Readiness Challenge, an obvious question is: “Are we ready for LHC Data Taking?” at any time from mid-July 2008 on…
- The honest answer is probably: “We are ready to face LHC Data Taking”
- There are subtle, but important, differences between the two…

CCRC’08 – Objectives

- Primary objective (next) was to demonstrate that we (sites, experiments) could run together at 2008 production scale
  - This includes testing all “functional blocks”: Experiment to CERN MSS; CERN to Tier1; Tier1 to Tier2s, etc.
- Two challenge phases were foreseen:
  1. February: not all 2008 resources in place – still adapting to new versions of some services (e.g. SRM) & experiment s/w
  2. May: all 2008 resources in place – full 2008 workload, all aspects of experiments’ production chains
- N.B. failure to meet target(s) would necessarily result in discussions on de-scoping!
- Fortunately, the results suggest that this is not needed, although much still needs to be done before, and during, May!

On WLCG Readiness

- The service runs smoothly – most of the time
- Problems are typically handled rather rapidly, with a decreasing number that require escalation
- Most importantly, we have a well-proven “Service Model” that allows us to handle anything from “Steady State” to “Crisis” situations
- We have repeatedly shown that we can – typically rather rapidly – work through even the most challenging “Crisis Situation”
  - Typically, this involves short-term work-arounds followed by longer-term solutions
- It is essential that we all follow the (rather soft…) “rules” of this service model, which has proven so effective…

Pros & Cons – Managed Services

Pros:
- Predictable service level and interventions; fewer interventions, lower stress level and more productivity
- Good match of expectations with reality; steady and measurable improvements in service quality
- More time to work on the physics; more and better science, …

Cons:
- Stress, anger, frustration, burn-out; numerous unpredictable interventions, including additional corrective interventions
- Unpredictable service level; loss of service
- Less time to work on physics; less and worse science; loss and/or corruption of data, …

We need to be here (on the “Pros” side). Right?

CCRC’08 – How Does it Work?

- Experiment “shifters” use Dashboards and experiment-specific SAM tests (plus other monitoring, e.g. PhEDEx) to monitor the various production activities
- Problems spotted are reported through the agreed channels (ticket + elog entry)
- Response is usually rather rapid – many problems are fixed in < 1 hour!
- A small number of problems are raised at the daily (15:00) WLCG operations meeting
- Basically, this works!
- We review on a weekly basis whether problems were not spotted by the above → fix [+ MB report]
- With time, increase automation, decrease eye-balling
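The “increase automation, decrease eye-balling” point can be illustrated with a small script that scans test results and flags critical failures for follow-up through the agreed channels. This is only a sketch: the input format, the test names and the report_problem() stub are assumptions for illustration, not the actual Dashboard/SAM or elog interfaces used by the experiments.

```python
#!/usr/bin/env python
"""Illustrative sketch: scan SAM-style test results and flag critical
failures that a shifter would follow up with a ticket + elog entry.
The file format, test names and report_problem() are hypothetical."""
import json
import sys

CRITICAL_TESTS = {"srm-put", "srm-get", "ce-job-submit"}  # assumed names


def load_results(path):
    """Load a list of {'site', 'test', 'status'} records from a JSON file."""
    with open(path) as f:
        return json.load(f)


def find_critical_failures(results):
    """Return the records whose critical test did not end in 'OK'."""
    return [r for r in results
            if r["test"] in CRITICAL_TESTS and r["status"] != "OK"]


def report_problem(record):
    """Placeholder for the agreed channels (ticket + elog entry)."""
    print("ESCALATE: site=%(site)s test=%(test)s status=%(status)s" % record)


if __name__ == "__main__":
    failures = find_critical_failures(load_results(sys.argv[1]))
    for rec in failures:
        report_problem(rec)
    # A non-zero exit code lets a cron job or shifter page on failures.
    sys.exit(1 if failures else 0)
```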

Baseline Versions for May CCRC’08

Storage-ware – CCRC’08 versions by implementation:
- CASTOR: SRM v…, b/e: …
- dCache: …, p1, p2, p3 (cumulative)
- DPM: (see below)
- StoRM: …

M/W component     Patch #        Status
LCG CE            Patch #1752    Released gLite 3.1 Update 20
FTS (T0)          Patch #1740    Released gLite 3.0 Update 42
FTS (T1)          Patch #1671    Released gLite 3.0 Update 41
gFAL/lcg_utils    Patch #1738    Released gLite 3.1 Update 20
DPM               Patch #1706    Released gLite 3.1 Update 18

Missed: FTM – required at WLCG Tier1s
Missed: DBs – a similar table is needed!

Baseline Versions for May CCRC’08

Storage-ware – CCRC’08 versions by implementation:
- CASTOR: SRM v…, b/e: …
- dCache: …, p1, p2, p3 (cumulative)
- DPM: (see below)
- StoRM: …

M/W component     Patch #        Status
LCG CE            Patch #1752    Released gLite 3.1 Update 20
FTS (T0)          Patch #1740    Released gLite 3.0 Update 42
FTS (T1)          Patch #1671    Released gLite 3.0 Update 41
FTM               Patch #1458    Released gLite 3.1 Update 10
gFAL/lcg_utils    Patch #1738    Released gLite 3.1 Update 20
DPM               Patch #1706    Released gLite 3.1 Update 18
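The patch numbers in the table lend themselves to a simple automated compliance check. The sketch below hard-codes the baseline from the slide; get_installed_patches() is a hypothetical stub standing in for whatever inventory source a site actually uses, and the numbers in the example at the bottom are made up purely for illustration.

```python
"""Sketch of a baseline-compliance check against the May CCRC'08 table.

BASELINE is taken from the slide above; get_installed_patches() is a
hypothetical stub, not a real gLite or site-inventory API."""

# Middleware component -> required gLite patch number (from the table)
BASELINE = {
    "LCG CE": 1752,
    "FTS (T0)": 1740,
    "FTS (T1)": 1671,
    "FTM": 1458,
    "gFAL/lcg_utils": 1738,
    "DPM": 1706,
}


def get_installed_patches(site):
    """Hypothetical stub: return {component: patch_number} for a site."""
    raise NotImplementedError("replace with the site's own inventory source")


def check_baseline(installed):
    """Return (component, installed, required) for each non-compliant item."""
    problems = []
    for component, required in BASELINE.items():
        have = installed.get(component)
        if have is None or have < required:
            problems.append((component, have, required))
    return problems


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example = {"LCG CE": 1752, "FTS (T1)": 1600, "DPM": 1706}
    for component, have, need in check_baseline(example):
        print("%s: installed=%s, baseline patch #%d" % (component, have, need))
```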

CCRC’08 – Conclusions (LHCC referees)

- The WLCG service is running (reasonably) smoothly
- The functionality matches: what has been tested so far – and what is (known to be) required
- We have a good baseline on which to build
- (Big) improvements over the past year are a good indication of what can be expected over the next!
- (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!

WLCG Operations

- I am sure that most – if not all – of you are familiar with:
  - Daily “CERN” operations meetings at 09:00 in 513 R-068
  - Daily “CCRC” operations meetings at 15:00 in 28 R-006 (will move to 513 R-068 if the phone is ever connected)
- In addition, the experiments have (twice-)daily operations meetings, as well as continuous operations in control rooms
- These are pretty light-weight and a (long-)proven way of ensuring information flow / problem dispatching

The WLCG Service

- Is a combination of services, procedures, documentation, monitoring, alarms, accounting, … and is based on the “WLCG Service Model”
- It includes both “experiment-specific” as well as “baseline” components
- No-one (that I’ve managed to find…) has a complete overview of all aspects
- It is essential that we work together and feel joint ownership – and pride – in the complete service
- It is on this that we will be judged – and not individual components…
- (I’m sure that I am largely preaching to the choir…)

CCRC’08 – Areas of Opportunity

- Tier2s: MC well run in; distributed analysis still to be scaled up to (much) larger numbers of users
- Tier1s: data transfers (T0-T1, T1-T1, T1-T2, T2-T1) now well debugged and working sufficiently well (most of the time…); reprocessing still needs to be fully demonstrated for ATLAS (includes conditions!!!)
- Tier0: best reviewed in terms of the experiments’ “Critical Services” lists
  - These strongly emphasize data/storage management and database services!
- We know how to run stable, reliable services
  - IMHO, these take less effort to run than ‘unreliable’ ones…
  - But they require some minimum amount of discipline…

Data / Storage Management

- There is no doubt that this is a complex area, and one which is still changing
- We don’t really know what load and/or access patterns ‘real’ production will give
- But if there is one area that can be expected to be ‘hot’, it is this one…
- IMHO (again) it is even more important that we remain calm and disciplined – “fire-fighting” mode doesn’t work well, except possibly for very short-term issues
- In the medium term, we can expect (significant?) changes in the way we do data management
- Given our experience, e.g. with SRM v2.2, how can this be achieved in a non-disruptive fashion, at the same time as (full-scale?) data-taking and production???

CCRC’09 – Outlook

- SL(C)5
- CREAM
- Oracle 11g
- SRM v2.2++
  - Other DM fixes…
- SCAS [new authorization framework]
- …
- 2009 resources; 2009 hardware
- Revisions to Computing Models
- EGEE III transitioning to more distributed operations
- Continued commissioning, 7+7 TeV, transitioning to normal(?) data-taking (albeit low luminosity?)
- New DG, …

Count-down to LHC Start-up!

- Experiments have been told to be ready for beam(s) at any time from the 2nd half of July
- CCRC’08 [ran] until the end of May…
- Basically, that means we have ONE MONTH to make any necessary remaining preparations / configuration changes for first data taking (which won’t have been tested in CCRC…)
- There are quite a few outstanding issues – not clear if these are yet fully understood
- There must be some process for addressing these
  - If the process is the ‘daily CASTOR+SRM operations meeting’, then each sub-service absolutely must be represented!
- Almost certainly too many / too complex to address them all – prioritize and address systematically
  - Some will simply remain as “features” to be addressed later – and tested in CCRC’09?

Post-Mortems Required

- RAL power micro-cut (8.5 h downtime of CASTOR)
  - See next slide
- NIKHEF cooling problems (4 day downtime of WNs)
- CERN CASTOR + SRM problems
  - The postmortem of the CERN-PROD SRM problems on Saturday 24/5/2008 (morning) can be found at … The problem affected all endpoints.
  - Problems on June 5th (5 hour downtime): https://prod-grid-logger.cern.ch/elog/Data+Operations/13

RAL Power Cut

- We lost power on one phase at about 07:00, but by the time pagers went off, on-call staff were already in transit to RAL and were not able to respond until the normal start of the working day (which is within our 2 hour out-of-hours target).
- We suffered a very short (I am told milliseconds) loss of power/spike that took out one whole phase. As we have no UPS at RAL (we will have one in the new machine room), this caused a crash/reboot of over 1/3 of our disk servers.
- Timeline:
  - Restart commenced: about 09:00
  - CASTOR databases ready: 12:00
  - Disk servers ready: 13:45
  - Last CASTOR instance restarted: 16:44
- So: about 2 hours because we were out of hours and had to respond and assess; 3 hours for Oracle, concurrent with 4:45 for a clean disk server restart; 3 hours for CASTOR restart and testing before release.
- Additionally, experience highlights the potential gap in the on-call system when people are in transit.
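Since the slide quotes several elapsed times, a quick check that they follow from the listed timestamps is shown below. The date used in the helper is an arbitrary placeholder (the slide does not give one); only the clock times matter.

```python
"""Check of the RAL recovery timeline arithmetic on the slide above.

Timestamps are taken directly from the slide; the interval labels follow
the interpretation given there (response, Oracle, disk servers, CASTOR)."""
from datetime import datetime


def t(hhmm):
    """Parse an HH:MM string; the date is an arbitrary placeholder."""
    return datetime.strptime("2000-01-01 " + hhmm, "%Y-%m-%d %H:%M")


power_lost = t("07:00")
restart_commenced = t("09:00")
databases_ready = t("12:00")
disk_servers_ready = t("13:45")
last_castor_restarted = t("16:44")

intervals = [
    ("Out-of-hours response and assessment", restart_commenced - power_lost),
    ("Oracle (CASTOR databases)", databases_ready - restart_commenced),
    ("Clean disk server restart (concurrent)", disk_servers_ready - restart_commenced),
    ("CASTOR restart and testing", last_castor_restarted - disk_servers_ready),
    ("Power loss to last instance back", last_castor_restarted - power_lost),
]

for label, delta in intervals:
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    print("%-40s %dh%02dm" % (label, hours, rem // 60))
```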

CCRC’08 – Conclusions

- The WLCG service is running (reasonably) smoothly
- The functionality matches: what has been tested so far – and what is (known to be) required – & the experiments are happy!
- We have a good baseline on which to build
- (Big) improvements over the past year are a good indication of what can be expected over the next!
- (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!