CCRC’08 Jeff Templon NIKHEF JRA1 All-Hands Meeting Amsterdam, 20 feb 2008.



At least our users aren’t malicious

What happens when:
- Each experiment streams data into the T0
- The experiments’ data model is followed: T0-T1-T2
- The necessary computing (reconstruction, calibration) is done
- Sites try to reach the MoU targets for uptime
- Graduate students try to analyse the data as it comes in
- All four experiments try this at the same time, at scale: CCRC

CCRC Goals
For February, we have the following three sets of metrics:
1. The scaling factors published by the experiments for the various functional blocks that will be tested. These are monitored continuously by the experiments and reported on at least weekly.
2. The lists of Critical Services, also defined by the experiments. These are complementary to the above and provide additional detail as well as service targets. It is a goal that all such services are handled in a standard fashion, i.e. as for other IT-supported services, with appropriate monitoring, procedures, alarms and so forth. Whilst there is no commitment to the problem-resolution targets (as short as 30 minutes in some cases), the follow-up on these services will be through the daily and weekly operations meetings.
3. The services that a site must offer and the corresponding availability targets, based on the WLCG MoU. These will also be tracked by the operations meetings.
Phase 2 of CCRC follows in May.

Scaling Factors
- We won’t make it in February... functional problems:
  - Castor can’t handle the load
  - At least one experiment framework can’t handle the load
  - FTS corrupt-proxy problems (race condition)
- Note how “data driven” HEP is: without a functional data flow, a test at scale is not possible!

FTS “corrupted proxies” issue
- The proxy is only delegated if required; the condition is lifetime < 4 hours.
- The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be re-delegated is the one that does it; the proxy then stays on the server for ~8 hours or so (the default lifetime is 12 hours).
- We found a race condition in the delegation: if two clients (as is likely) detect at the same time that the proxy needs to be renewed, they both try to do it, and the delegation requests can get mixed up, so that what finally ends up in the DB is the certificate from one request and the key from the other.
- We don’t detect this, and the proxy remains invalid for the next ~8 hours.
- The real fix requires a server-side update (ongoing).
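The mix-up can be sketched as a classic interleaved-write race. This is a toy model, not the real FTS delegation code: the two-field store and the cert-/key- naming are illustrative assumptions; the point is that certificate and key are written in two separate steps, so two concurrent delegations can leave a mismatched pair behind.

```python
import threading

db = {}                      # server-side delegation store (toy stand-in)
_db_lock = threading.Lock()  # serialises delegations in the fixed variant

def delegate_racy(client_id: str) -> None:
    # Two independent writes: another client can interleave between them,
    # leaving the certificate from one request and the key from another.
    db["cert"] = f"cert-{client_id}"
    db["key"] = f"key-{client_id}"

def delegate_atomic(client_id: str) -> None:
    # Fix sketch: store the matching pair in one serialised update, so the
    # DB always holds a certificate and key from the same request.
    with _db_lock:
        db["cert"] = f"cert-{client_id}"
        db["key"] = f"key-{client_id}"

def proxy_consistent() -> bool:
    # The proxy is usable only if cert and key come from the same delegation.
    return db["cert"].split("-", 1)[1] == db["key"].split("-", 1)[1]
```

With `delegate_racy`, the interleaving "A writes cert; B writes cert and key; A writes key" leaves a mismatched pair that no client detects, matching the ~8 hours of invalid proxy described above. Serialising the write of the pair (here with a lock; in FTS, the server-side fix the slide mentions) closes the window.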

BDII Scaling Problem
- BDII/SRM at NIKHEF / SARA
- Discovery was only possible via monitoring of:
  - job success by the experiments (not always optimal)
  - site services by the site
  - a coupled phenomenon
- The BDII developer heard about the situation via a ‘vocal site person’
- Active support:
  - checking the deployment scenario
  - asking for log files
  - making recommendations

Post mortem by developer


Weekly Operations Review
Based on 3 agreed metrics:
1. Experiments’ scaling factors for functional blocks exercised
2. Experiments’ critical services lists
3. MoU targets

Experiment | Scaling Factors | Critical Services | MoU Targets
-----------+-----------------+-------------------+------------
ALICE      |                 |                   |
ATLAS      |                 |                   |
CMS        |                 |                   |
LHCb       |                 |                   |

Critical Services


Site performance

99% availability means < ~100 minutes of downtime per week! (A week has 10,080 minutes, so 1% of it is about 101 minutes.)
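The arithmetic behind that figure (assuming the 99% reading of the slide, whose leading number was lost in transcription) is a one-liner:

```python
# Weekly downtime budget implied by an availability target.
def weekly_downtime_minutes(availability: float) -> float:
    minutes_per_week = 7 * 24 * 60   # 10,080 minutes in a week
    return (1.0 - availability) * minutes_per_week

print(round(weekly_downtime_minutes(0.99), 1))   # 100.8
```

The same function makes the MoU tiers concrete: 99.9% leaves about 10 minutes per week, 95% about 8.4 hours.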


GGUS is not fast enough...

Middleware Coverage
- AAA: already reasonably well stressed
- FTS: broader range of usage, target SRMs, new SRM interface, higher rate (race condition...)
- SRMs: stressed to the max
- WMS: unknown to what extent; LHCb apparently using the RB
- CREAM: big miss. Should push extremely hard to get this ready for phase 2
- glexec / LCMAPS-server: another big miss
- All products could use improvement in logging / diagnostics / monitoring!

Handling Problems…
- Need to clarify current procedures for handling problems; there is some mismatch of expectations with reality
  - e.g. no GGUS TPMs on weekends / holidays / nights… cf. a problem submitted with max. priority at 18:34 on a Friday…
  - Use of on-call services & expert call-out as appropriate: {alice-,atlas-}grid-alarm; {cms-,lhcb-}operator-alarm
  - Contacts are needed on all sides (sites, services & experiments), e.g. who do we call in case of problems?
- Complete & open reporting in case of problems is essential!
  - Only this way can we learn and improve!
  - It should not require Columbo to figure out what happened…
- Trigger post-mortems when MoU targets are not met
  - This should be a light-weight operation that clarifies what happened and identifies what needs to be improved for the future
  - Once again, the problem is at least partly about communication!

Don’t panic
- Many EGEE / JRA1 services are in considerably better shape than the experiments’ middleware
- BUT this is no licence to slow down or slack off:
  - experiment efforts are often much more focused; they can catch up quickly
  - EGEE services are more critical: problems here affect all VOs / the entire site. You *must* do better!
  - If the experiments catch up and pass us, they will be merciless