CCRC’08 Planning & Requirements Jamie Shiers ~~~ LHC OPN, 10th March 2008

Agenda
- Common Computing Readiness Challenge (CCRC’08) – What is it? Who does it concern? Why?
- Brief reminder of the Computing Models of the LHC experiments – what has changed
- Status & Outlook
- Lessons Learned
- Conclusions

Background
- For many years, the LHC experiments have been preparing for data taking.
- On the computing side, this has meant a series of “Data Challenges” designed to verify their computing models and offline software / production chains.
- To a large extent, these challenges have been independent of each other, whereas in reality (almost all) sites have to support (almost all) experiments simultaneously.
- Are there bottlenecks or unforeseen couplings between the experiments and / or the services? There certainly are at the level of support personnel!

LHC: One Ring to Bind them… (slide from G. Dissertori)
- LHC: 27 km long, 100 m underground
- ATLAS: general purpose, pp, heavy ions
- CMS + TOTEM: general purpose, pp, heavy ions
- ALICE: heavy ions, pp
- LHCb: pp, B-physics, CP violation


LHC Computing is Complicated!
- Despite high-level diagrams (next), the Computing TDRs and other very valuable documents, it is very hard to maintain a complete view of all of the processes that form part of even one experiment’s production chain.
- Both detailed views of the individual services and the high-level “WLCG” view are required… It is ~impossible (for an individual) to focus on both.
- We need to work together as a team, sharing the necessary information, aggregating as required, etc.
- The needed information must be logged & accessible (service interventions, changes, etc.).
- This is critical when offering a smooth service with affordable manpower.

Early enthusiasts discuss LHC Computing…

CCRC’08 – Motivation and Goals (M. Kasemann, WLCG Workshop on the Common VO Challenge, 2 September 2007)
What if:
- the LHC is operating and the experiments take data?
- all experiments want to use the computing infrastructure simultaneously?
- the data rates and volumes to be handled at the Tier0, Tier1 and Tier2 centres are the sum of ALICE, ATLAS, CMS and LHCb as specified in the experiments’ computing models?
Each experiment has done data challenges, computing challenges, tests, dress rehearsals, … at a schedule defined by the experiment. This will stop: we will no longer be the masters of our own schedule… once the LHC starts to operate. We need to prepare for this – together.
A combined challenge by all experiments should be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to the data taking in 2008. This should be done well in advance of the start of data taking, in order to identify flaws and bottlenecks and allow them to be fixed. We must do this challenge as the WLCG collaboration: centres and experiments.

CCRC’08 – Proposed Scope (CMS) (M. Kasemann)
Test data transfers at 2008 scale:
- experiment site to CERN mass storage
- CERN to Tier1 centres
- Tier1 to Tier1 centres
- Tier1 to Tier2 centres
- Tier2 to Tier2 centres
Test storage-to-storage transfers at 2008 scale: required functionality and required performance.
Test data access at Tier0 and Tier1 at 2008 scale: CPU loads should be simulated in case this impacts data distribution and access.
Tests should be run concurrently. CMS proposes to use artificial data, which can be deleted after the challenge.

CCRC’08 – Constraints & Preconditions (M. Kasemann)
- Mass storage systems are prepared: SRM 2.2 deployed at all participating sites; CASTOR, dCache and other data management systems installed with the appropriate versions.
- Data transfers are commissioned for CMS: only commissioned links can be used.
- Participating centres have their 2008 capacity.

CCRC’08 – Proposed Schedule (M. Kasemann)
- Duration of the challenge: 4 weeks.
- Based on the current CMS schedule, there is a window of opportunity during February 2008.
- In March a full-detector cosmics run is scheduled, with all components and magnetic field – the first time with the final detector geometry.
- Document performance and lessons learned within 4 weeks.

CCRC’08 – Proposed Organization (M. Kasemann)
Coordination: (1 + 4 + nT1)
WLCG overall coordination (1):
- maintains the overall schedule
- coordinates the definition of goals and metrics
- coordinates regular preparation meetings
- during CCRC’08, coordinates operations meetings with experiments and sites
- coordinates the overall success evaluation
Each experiment (4):
- coordinates the definition of the experiment’s goals and metrics
- coordinates the experiment’s preparations, including applications for load driving (certified and tested before the challenge)
- during CCRC’08, coordinates the experiment’s operations
- coordinates the experiment’s success evaluation
Each Tier1 (nT1):
- coordinates the Tier1 preparation and participation
- ensures the readiness of the centre on the defined schedule
- contributes to the summary document

A Comparison with LEP…
- In January 1989, we were expecting e+e− collisions in the summer of that year…
- The “MUSCLE” report was 1 year old and “Computing at CERN in the 1990s” was yet to be published (July 1989).
- It took quite some time for the offline environment (CERNLIB + experiment s/w) to reach maturity – some key components had not even been designed!
- Major changes in the computing environment were about to strike: we had just migrated to CERNVM – the Web was around the corner, as was distributed computing (SHIFT). (Not to mention OO & early LHC computing!)

Startup woes – the BaBar experience (CHEP2K, Padua)

CCRC’08 Preparations…
- Monthly face-to-face meetings have been held since the “kick-off” during the WLCG Collaboration workshop in BC.
- Fortnightly con-calls with Asia–Pacific sites started in January 2008.
- Weekly planning con-calls were suspended during February: restart?
- Daily “operations” meetings at 15:00 started mid-January.
- Quite successful in defining the scope of the challenge, the required services, and the setup & configuration at sites…
- Communication – including the tools we have – remains a difficult problem… but feedback from sites regarding the information they require, plus “adoption” of a common way of presenting information (modelled on LHCb), all help.
- We are arguably (much) better prepared than for any previous challenge.
- There are clearly some lessons for the future – both for the May CCRC’08 challenge and for the longer term.

Pros & Cons – Managed Services (Design, Implementation, Deployment & Operation)
Pros: predictable service level and interventions; fewer interventions; lower stress level and more productivity; good match of expectations with reality; steady and measurable improvements in service quality; more time to work on the physics; more and better science; …
Cons: stress, anger, frustration, burn-out; numerous unpredictable interventions, including additional corrective interventions; unpredictable service level; loss of service; less time to work on physics; less and worse science; loss and / or corruption of data; …

Middle- / Storage-ware Versions
- The baseline versions required at each site were defined iteratively – particularly during December and January.
- The collaboration between, and work of, the teams involved was highly focused and responsive.
- Some bugs took longer to fix than might be expected; some old bugs re-appeared; some fixes did not make it in time for kick-off – let alone the pre-challenge “week of stability”.
- Some remaining (hot) issues with storage; very few new issues discovered! (load-related)
- On occasion, lack of clarity on the motivation and timescales for proposed versions.
- These are all issues that can be fixed relatively easily – goals for the May preparation…

Weekly Operations Review
Based on 3 agreed metrics:
1. Experiments’ scaling factors for the functional blocks exercised
2. Experiments’ critical services lists
3. MoU targets
For all four experiments (ALICE, ATLAS, CMS, LHCb):
- Create a GridMap based on functional blocks, using existing “dashboard” information; use 3-monthly averages as a “baseline” (to signal anomalies).
- Need contact(s) in the experiments to expedite this!
Need to follow up with the experiments on “check-lists” for the “critical services” – as well as additional tests.
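To make the “3-monthly averages as baseline” idea concrete, the following is an illustrative sketch only (not an existing WLCG or dashboard tool): it flags days whose daily value deviates strongly from a rolling ~3-month baseline. The data series and threshold are hypothetical.

```python
"""Sketch: flag anomalous days against a rolling 3-month baseline (hypothetical data)."""
import random
from statistics import mean, stdev

def anomalies(series, window=90, n_sigma=3.0):
    """Return indices of points deviating more than n_sigma from the rolling baseline."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]            # the previous ~3 months
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) > n_sigma * sigma:
            flagged.append(i)
    return flagged

# Hypothetical daily aggregated values (e.g. GB exported per day) from a dashboard.
random.seed(1)
daily_gb = [1000 + random.gauss(0, 50) for _ in range(120)]
daily_gb[110] = 400   # artificial dip to show how a bad day would be flagged
print(anomalies(daily_gb))   # prints the flagged day indices, including 110
```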

CCRC’08 → Production Data Taking & Processing
- CCRC’08 leads directly into production data taking & processing. Some ‘rough edges’ are likely to remain for most of this year (at least…).
- An annual “pre-data-taking” exercise – again with February and May (earlier?) phases – may well make sense (CCRC’09).
- Demonstrate that we are ready for this year’s data taking with any revised services in place and debugged…
- Possibly the most important point: objectives need to be SMART: Specific, Measurable, Achievable, Realistic & Time-bounded.
- We are still commissioning the LHC Computing Systems, but we need to be no later than – and preferably ahead of – the LHC!

WLCG Services – In a Nutshell…
- ALL WLCG services: WLCG / “Grid” standards
- KEY PRODUCTION services: + expert call-out by operator
- CASTOR / Physics DBs / Grid Data Management: + 24 x 7 on-call
Notes:
- Summary slide on WLCG Service Reliability shown to OB/MB/GDB during December 2007.
- On-call service established beginning of February 2008 for CASTOR/FTS/LFC (not yet the backend DBs).
- Grid/operator alarm mailing lists exist – they need to be reviewed & the procedures documented / broadcast.

Critical Service Follow-up
- Targets (not commitments) proposed for Tier0 services; similar targets requested for Tier1s/Tier2s.
- Experience from the first week of CCRC’08 suggests targets for problem resolution should not be set too high (if they are to be ~achievable). The MoU lists targets for responding to problems (12 hours for Tier1s).
- ¿ Tier1s: 95% of problems resolved in < 1 working day ?
- ¿ Tier2s: 90% of problems resolved in < 1 working day ?
- A post-mortem is triggered when targets are not met!
Proposed Tier0 targets:
- End 2008: consistent use of all WLCG Service Standards – 100%
- 30': operator response to alarm / call to x5011 – 99%
- 1 hour: operator response to alarm / call to x5011 – 100%
- 4 hours: expert intervention in response to the above – 95%
- 8 hours: problem resolved – 90%
- 24 hours: problem resolved – 99%
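As a minimal sketch of how such a resolution target could be checked against ticket data (the ticket list and the 8-hour working day are hypothetical; real inputs would come from GGUS / Remedy):

```python
"""Sketch: check a "95% of problems resolved within one working day" target."""

# Hypothetical resolution times (in working hours) for one site over one month.
resolution_hours = [2, 5, 1, 30, 4, 7, 3, 6, 2, 12]

def meets_target(times, limit_hours=8.0, required_fraction=0.95):
    """True if at least required_fraction of problems were resolved within limit_hours."""
    within = sum(1 for t in times if t <= limit_hours)
    return within / float(len(times)) >= required_fraction

if meets_target(resolution_hours):
    print("target met")
else:
    print("target missed: trigger a (light-weight) post-mortem")
```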

Tier0 – Tier1 Data Export
- We need to sustain 2008-scale exports for at least ATLAS & CMS for at least two weeks.
- The short experience that we have is not enough to conclude that this is a solved problem.
- The overall system still appears to be too fragile – sensitive to ‘rogue users’ (what does this mean?) and / or DB de-tuning.
- (Further) improvements in reporting, problem tracking & post-mortems are needed to streamline this area.
- We need to ensure that exports are done to all Tier1 sites at the required rates and that the right fraction of data is written to tape.
- Once we are confident that this can be done reproducibly, we need to mix in further production activities.
- If we have not achieved this by end-February, what next? Continue running in March & April – we need to demonstrate exports at the required rates for weeks at a time, reproducibly! Re-adjust targets to something achievable, e.g. reduce from the assumed 55% LHC efficiency to 35%?


Recommendations
- To improve communications with Tier2s and the DB community, 2 new mailing lists have been set up, as well as regular con-calls with Asia–Pacific sites (time zones…).
- Follow-up on the lists of “Critical Services” must continue, implementing not only the appropriate monitoring, but also ensuring that the WLCG “standards” are followed for Design, Implementation, Deployment and Operation.
- Clarify reporting and problem escalation lines (e.g. operator call-out triggered by named experts, …) and introduce (light-weight) post-mortems when MoU targets are not met.
- We must continue to improve on open & transparent reporting, as well as further automation in monitoring, logging & accounting.
- We should foresee “data taking readiness” challenges in future years – probably with a similar schedule to this year’s – to ensure that the full chain (new resources, new versions of experiment + AA s/w, middleware, storage-ware) is ready.

And Record Openly Any Problems…
Example elog entry: “The intervention is now complete and tier1 and tier2 services are operational again except for enabling of internal scripts. Two problems encountered:
1. A typo crept in somewhere, dteam became deam in the configuration. Must have happened a while ago and was a reconfiguration problem waiting to happen.
2. fts103, when rebooted for the kernel upgrade (as were the rest), decided it wanted to reinstall itself instead and failed since this was not a planned install. Again an accident waiting to happen. Something to check for next time. Consequently the tier-two service is running in degraded mode with only one webservice box. If you had to choose a box for this error to occur on, it would be this one. Should be running in non-degraded mode sometime later this afternoon.”
People are actively using the elog-books – even though we will have to review overlap with other tools, cross-posting, etc.

What Has Changed? (wrt 2005…)
- The view of the Computing Models was clearly too simplistic. These have evolved with experience – and will probably continue to evolve during first data taking…
- Various activities at the pit / Tier0 “smooth out” peaks & troughs from the accelerator cycle and desynchronize the experiments from each other, e.g. merging of small files, first-pass processing, …
- Currently assuming 50 ks / 24 h of accelerator operation, even though accelerator operations assumes 35% (a quick arithmetic check follows below).
- Bulk (pp) data is driven by ATLAS & CMS – their models still differ significantly.
- ATLAS has ~twice the number of Tier1s and keeps ~2(.8) copies of RAW and derived data across these. Tier1 sites are “paired” so that output from re-processing must be sent between Tier1s. They also use the “FTS” to deliver calibration data (KB/s) to some sites.
- CMS does not make “hard” associations between Tier2s and Tier1s – for reliability (only), a Tier2 may fetch (or store?) data from any accessible Tier1.
- All experiments have understood – and demonstrated – “catch-up”; buffers are required “everywhere” to protect against “long weekend” effects.
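A quick check of the arithmetic behind these duty-factor assumptions, using only the numbers quoted above: 50 ks of operation per 24-hour day corresponds to a duty factor of roughly 58% (in line with the ~55% planning figure quoted earlier), whereas the ~35% assumed by accelerator operations would correspond to only about 30 ks per day.

```latex
\frac{50\,\mathrm{ks}}{86.4\,\mathrm{ks}} \approx 0.58,
\qquad
0.35 \times 86.4\,\mathrm{ks} \approx 30\,\mathrm{ks\ per\ day}
```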

Preparations for May and Beyond…
- Aim to agree on baseline versions for May during April’s F2F meetings.
- Based on versions as close to production as possible at that time (and not (pre-)pre-certification!).
- Aim for stability from April 21st at least – the start of the collaboration workshop… This gives very little time for fixes!
- Beyond May we need to be working in continuous full production mode!
- March & April will also be active preparation & continued testing – preferably at full scale!
- CCRC’08 “post-mortem” workshop: June.

Service Summary – No Clangers!
- From a service point of view, things are running reasonably smoothly and progressing (reasonably) well.
- There are issues that need to be followed up (e.g. post-mortems in case of “MoU-scale” problems, problem tracking in general…), but these are both relatively few and reasonably well understood.
- But we need to hit all aspects of the service as hard as is required for 2008 production, to ensure that it can handle the load – and resolve any problems that this reveals…

Scope & Timeline
- We will not achieve sustained exports from ATLAS + CMS (+ others) at nominal 2008 rates for 2 weeks by the end of February.
- There are also aspects of individual experiments’ work-plans that will not fit into the Feb 4–29 slot.
- We need to continue through March, April & beyond. After all, the WLCG Computing Service is in full production mode & this is its purpose!
- We need to get away from the mind-set of “challenge” then “relax” – it’s full production, all the time!

LHC Outlook
The LHC upgrade – “Super LHC” – is now likely to be phased:
1. Replace the final focus (inner triplets), aiming at β* = 0.25 m, during a shutdown.
2. Improve the injector chain in steps: first a new proton linac (twice the energy, better performance).
3. Replace the booster by a Low Power Superconducting Proton Linac (LPSPL, 4 GeV).
4. Replace the PS by PS2 (up to 50 GeV).
The latter could be operational in 2017. Further, more futuristic steps could be a superconducting SPS (up to 1000 GeV) or even doubling the LHC energy.

Summary
- “It went better than we expected but not as well as we hoped.”
- Sounds a little like Bilbo Baggins in “A Long-Expected Party”: “I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve.”
- But we agreed to measure our progress against quantitative metrics: Specific, Measurable, Achievable, Realistic, Timely.

Well, How Did We Do?
Remember that prior to CCRC’08 we:
a) were not confident that we were / would be able to support all aspects of all experiments simultaneously;
b) had discussed possible fall-backs if this were not demonstrated – the only conceivable “fall-back” was de-scoping…
Now we are reasonably confident of the former. Do we need to retain the latter as an option?
Despite being rather late with a number of components (not desirable), things settled down reasonably well.
Given the much higher “bar” for May, we need to be well prepared!

CCRC’08 Summary
- The preparations for this challenge have proceeded (largely) smoothly – we have both learnt and advanced a lot simply through these combined efforts.
- As a focusing activity, CCRC’08 has already been very useful.
- We will learn a lot about our overall readiness for 2008 data taking.
- We are also learning a lot about how to run smooth production services in a more sustainable manner than in previous challenges.
- It is still very manpower-intensive and schedules remain extremely tight: full 2008 readiness is still to be shown!
- More reliable – as well as automated – reporting is needed.
- Maximize the use of the up-coming F2Fs (March, April) as well as the WLCG Collaboration workshop to fully profit from these exercises.
- June on: continuous production mode (all experiments, all sites), including tracking / fixing problems as they occur.

BACKUP SLIDES

Handling Problems…
- Need to clarify the current procedures for handling problems – there is some mismatch of expectations with reality, e.g. no GGUS TPMs on weekends / holidays / nights… c.f. a problem submitted with maximum priority at 18:34 on a Friday…
- Use of on-call services & expert call-out as appropriate: {alice-,atlas-}grid-alarm; {cms-,lhcb-}operator-alarm.
- Contacts are needed on all sides – sites, services & experiments – e.g. who do we call in case of problems?
- Complete & open reporting in case of problems is essential! Only this way can we learn and improve! It should not require Columbo to figure out what happened…
- Trigger post-mortems when MoU targets are not met. This should be a light-weight operation that clarifies what happened and identifies what needs to be improved for the future.
- Once again, the problem is at least partly about communication!

FTS “corrupted proxies” issue
- The proxy is only delegated if required; the condition is lifetime < 4 hours.
- The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be re-delegated is the one that does it – the proxy then stays on the server for ~8 hours or so. The default lifetime is 12 hours.
- We found a race condition in the delegation: if two clients (as is likely) detect at the same time that the proxy needs to be renewed, they both try to do it, and this can result in the delegation requests being mixed up – so that what finally ends up in the DB is the certificate from one request and the key from the other. We don’t detect this and the proxy remains invalid for the next ~8 hours.
- The real fix requires a server-side update (ongoing). For the quick fix there are two options: … [being deployed]
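The two checks described above (lifetime below 4 hours, and a certificate/key mix-up) can be illustrated with a small sketch. This is not the FTS server-side fix nor the CERN clean-up cron job – just a hedged illustration assuming the delegated proxy is available as a PEM file (certificate plus unencrypted key) at a hypothetical path, using only standard openssl options:

```python
"""Illustrative sketch only: sanity-check a delegated proxy file."""
import subprocess
import sys

PROXY = sys.argv[1] if len(sys.argv) > 1 else "/tmp/delegated_proxy.pem"  # hypothetical path
FOUR_HOURS = 4 * 3600   # re-delegation threshold quoted above (lifetime < 4 h)

def run(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = p.communicate()
    return p.returncode, out.strip()

# 1. Lifetime check: 'openssl x509 -checkend N' exits non-zero if the
#    certificate expires within N seconds.
rc, _ = run(["openssl", "x509", "-in", PROXY, "-noout",
             "-checkend", str(FOUR_HOURS)])
needs_redelegation = (rc != 0)

# 2. Consistency check: the race described above leaves a certificate from one
#    delegation and a key from another, so their RSA moduli no longer match.
_, cert_mod = run(["openssl", "x509", "-in", PROXY, "-noout", "-modulus"])
_, key_mod = run(["openssl", "rsa", "-in", PROXY, "-noout", "-modulus"])
corrupted = (cert_mod != key_mod)

if corrupted:
    print("corrupted proxy: cert/key mismatch - delete it so the next submit re-delegates")
elif needs_redelegation:
    print("proxy lifetime below 4 h: the next submit will re-delegate")
else:
    print("proxy looks OK")
```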

ATLAS CCRC’08 Problems (February)
There seem to have been 4 unrelated problems causing full or partial interruption of the Tier0 to Tier1 exports of ATLAS.
1. On Thursday 14th, in the evening, the CASTOR CMS instance developed a problem which built up an excessive load on the server hosting the srm.cern.ch request pool. This is the SRM v1 request spool node shared between all endpoints. By 03:00 the server was at 100% CPU load. It recovered at 06:00 and processed requests until 08:10, when it stopped processing requests until 10:50. There were 2 service outages totalling 4:40 hours. S. Campana entered the complete failure of ATLAS exports in the CCRC’08 elog at 10:17, in the second failure time window, and also reported the overnight failures as being from 03:30 to 05:30. This was replied to by J. van Eldik at 16:50 as a ‘site fixed’ notification with the above explanation, asking S. Campana for confirmation from the ATLAS monitoring; this was confirmed in the elog at 18:30. During the early morning of the 15th the operator log received several high-load alarms for the server, followed by a ‘no contact’ at 06:30. This led to a standard ticket being opened. The server is on contract type D with importance 60. It was followed up by a sysadmin at 08:30, who was able to connect via the serial console but not obtain a prompt, while lemon monitoring showed the high load. They requested advice on whether to reboot or not via the castor.support workflow. This was replied to at 11:16 with the diagnosis of a monitoring problem caused by a pile-up of rfiod processes.
Note: SRM v1.1 deployment at CERN coupled the experiments – this is not the case for SRM v2.2!

ATLAS problems (cont.)
2. Another SRM problem was observed by S. Campana around 18:30 on Friday. He observed ‘connection timed out’ errors from srm.cern.ch for some files. He made an entry in the elog, submitted a GGUS ticket and sent an e-mail to castor.support, hence generating a Remedy ticket. GGUS tickets are not followed at the weekend, nor are castor.support tickets, which are handled by the weekly service manager on duty during working hours. The elog is not part of the standard operations workflow. A reply to the CASTOR ticket was made at 10:30 on Monday 18th, asking if the problem was still being seen. At this time S. Campana replied that he was unable to tell, as a new problem, the failure of delegated credentials to FTS, had started. An elog entry that this problem was ‘site fixed’ was made at 16:50 on the 18th, with the information that there was a (hardware) problem on a disk server which made several thousand files unavailable until Saturday. Apparently the server failure did not trigger its removal from CASTOR as it should have; this was done by hand on Saturday evening by one of the team doing regular checks. The files would then have been restaged from tape. The GGUS ticket also arrived at CERN on Monday. (to be followed)

ATLAS problems (end)
3. There was a castoratlas interruption on Saturday 16 Feb. This triggered an SMS to a CASTOR support member (not the piquet), who restored the service by midnight. An elog entry was made at 16:52 on Monday. At the time there was no operator log alarm, as the repair pre-empted this.
4. For several days there have been frequent failures of FTS transfers due to corrupt delegated proxies. This has been seen at CERN and at several Tier1s. It is thought to be a bug that came in with a recent gLite release. This stopped ATLAS transfers on the Monday morning. The workaround is to delete the delegated proxy and its database entry; the next transfer will recreate them. This is being automated at CERN by a cron job that looks for such corrupted proxies. It is not yet clear how much this affected ATLAS during the weekend. The lemon monitoring shows that ATLAS stopped, or reduced, the load generator about midday on Sunday.

Some (Informal) Observations (HRR)
- The CCRC’08 elog is for internal information and problem solving but does not replace, and is not part of, existing operational procedures.
- Outside of normal working hours GGUS and CERN Remedy tickets are not looked at.
- Currently the procedure for ATLAS to raise critical operations issues themselves is to send an e-mail to the list atlas-grid-alarm. This is seen by the 24-hour operator, who may escalate to the sysadmin piquet, who can in turn escalate to the FIO piquet. Users who can submit to this list are K. Bos, S. Campana, M. Branco and A. Nairz.
- It would be good for IT operations to know what to expect from ATLAS operations when something changes. This may already be in the dashboard pages. (Formal follow-up to come…)

Monitoring, Logging & Reporting
Need to follow up on:
- Accurate & meaningful presentation of the status of the experiments’ productions wrt their stated goals.
- “Critical Services” – need input from the experiments on “check-lists” for these services, as well as additional tests.
- MoU targets – what can we realistically measure & achieve?
- The various views that are required need to be taken into account, e.g. sites (depending on the VOs supported), overall service coordination, production managers, project management & oversight.
- March / April F2Fs plus the collaboration workshop; review during the June CCRC’08 “post-mortem”.

Supporting the Experiments
- Need to focus our activities so that we support the experiments in as efficient & systematic a manner as possible.
- Where should we focus this effort to have maximum effect?
- What “best practices” and opportunities for “cross-fertilization” can we find?
- The bottom line: it is in everybody’s interest that the services run as smoothly and reliably as possible and that the experiments maximize the scientific potential of the LHC and their detectors…
- Steady, systematic improvements with clear monitoring, logging & reporting against “SMART” metrics seem to be the best approach to achieving these goals.

Draft List of SRM v2.2 Issues
Priorities to be discussed & agreed:
- Protecting spaces from (mis-)usage by generic users – concerns dCache, CASTOR.
- Tokens for PrepareToGet / BringOnline / srmCopy (input) – concerns dCache, DPM, StoRM.
- Implementations fully VOMS-aware – concerns dCache, CASTOR.
- Correct implementation of GetSpaceMetaData (correct size to be returned, at least for T1D1) – concerns dCache, CASTOR.
- Selecting tape sets (by means of tokens, directory paths, ??) – concerns dCache, CASTOR, StoRM.