WLCG Tier1 [Performance] Metrics ~~~ Points for Discussion ~~~ WLCG GDB, 8th July 2009

The Perennial Question

During this presentation and discussion we will attempt to sharpen and answer the question: how can a Tier1 know that it is doing OK? We will look at:
- What we can (or do) measure (automatically);
- What else is important, but harder to measure (at least today);
- How to understand what "OK" really means.

Resources

In principle, we know what resources are pledged, can determine what is actually installed(?), and can measure what is currently being used.
- If installed capacity is significantly(?) lower than pledged, this is an anomaly and the site in question "is not doing OK".
- Actual utilization may vary, and can even exceed "available" capacity for a given VO (particularly for CPU; less likely for storage(?)). This should also be signalled as an anomaly to be understood (it is: poor utilization over prolonged periods impacts future funding, even if there are good reasons for it).
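The pledged/installed/used comparison above could be sketched as a simple anomaly check. This is an illustrative assumption, not an agreed WLCG procedure; the tolerance values and the `resource_anomalies` helper are hypothetical.

```python
# Hypothetical sketch of the resource-anomaly check described above.
# Thresholds are illustrative assumptions, not WLCG policy.

def resource_anomalies(pledged, installed, used, tolerance=0.9):
    """Return a list of anomaly descriptions for one site/VO resource type."""
    anomalies = []
    # Installed capacity significantly below pledge -> site "not doing OK".
    if installed < tolerance * pledged:
        anomalies.append(
            f"installed ({installed}) below {tolerance:.0%} of pledge ({pledged})")
    # Utilisation exceeding nominally available capacity (plausible for CPU).
    if used > installed:
        anomalies.append(
            f"utilisation ({used}) exceeds installed capacity ({installed})")
    # Prolonged poor utilisation is also flagged, since it impacts future funding.
    if installed > 0 and used < 0.5 * installed:
        anomalies.append(
            f"utilisation ({used}) under 50% of installed capacity ({installed})")
    return anomalies
```

For example, a site with 1000 units pledged, 800 installed and 700 used would be flagged only for the pledge shortfall.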

Services

Here we have extensive tests (OPS, VO) coupled with production use. A "test" can pass even when experiment production is (severely) impacted, and some things are simply not realistic, or too expensive, to test. But again, significant anomalies should be identified and understood.
- Automatic testing is one measure; GGUS tickets are another (number of tickets, including alarm tickets, and the time taken for their resolution).
- This can no doubt be improved iteratively, with additional tests / monitoring added (e.g. tape metrics).
A site which is "green", has few or no tickets open for more than days or weeks, and draws no "complaints" at the operations meeting is doing OK, surely? Can things be improved for reporting and long-term traceability? (Expecting the answer YES.)
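One way to picture the "green site" verdict above is a rule that combines automated test results with GGUS-style ticket statistics. All thresholds here (`min_pass`, `max_age_days`) and the function itself are assumptions for the sake of sketching, not agreed values.

```python
# Hypothetical "green site" verdict combining automated (OPS/VO) test results
# with GGUS-style ticket statistics.  Thresholds are illustrative assumptions.

def site_is_green(test_pass_fraction, open_tickets, oldest_open_days,
                  alarm_tickets=0, min_pass=0.95, max_age_days=14):
    """A site is 'green' if tests mostly pass, no alarm ticket is open,
    and no ordinary ticket has lingered beyond max_age_days."""
    if test_pass_fraction < min_pass:
        return False    # automated tests failing too often
    if alarm_tickets > 0:
        return False    # any open alarm ticket is disqualifying
    if open_tickets > 0 and oldest_open_days > max_age_days:
        return False    # tickets open for "days | weeks"
    return True
```

The point of such a rule is exactly the caveat on the slide: a site can be "green" here while production is still impacted, which is why the ticket and complaints channels matter alongside the tests.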

The Metrics…

For STEP'09, as well as at other times, explicit metrics have been set against sites and for well-defined activities. Can such metrics allow us to "roll up" the previous issues into a single view? If not, what is missing from what we currently do? Is it realistic to expect experiments to set such targets:
- During the initial period of data taking? (Will it be known at all what the "targets" actually are?)
- In the longer "steady state" situation? Processing & reprocessing? MC production? Analysis??? (largely not at T1s…)
Probable answer: only if it is useful for them to monitor their own production (which it should be…).

So…

Definition: a site is "doing ok" if (and only if?):
1. It has provided (usable) resources that match those pledged & requested;
2. The services are running smoothly, pass the tests and meet reliability and availability targets;
3. "WLCG operations" metrics on handling scheduled and unscheduled service interruptions and degradations are met (more automation / reporting required here);
4. It is meeting or exceeding metrics for "functional blocks" (this latter will require some work: is it "fair" to expect experiments to define these? Must be as simple as "autopilot settings").
And in practice? Some examples from CMS…
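The four criteria above can be rolled up into a single verdict, which is essentially what a dashboard view would do. The criterion names and the `site_doing_ok` helper below are assumptions for illustration only.

```python
# Illustrative roll-up of the four "doing ok" criteria into a single verdict.
# Names and structure are assumptions made for this sketch.

def site_doing_ok(resources_ok, services_ok, operations_ok, functional_ok):
    """Return (overall verdict, list of failed criteria)."""
    criteria = {
        "resources match pledge & request": resources_ok,
        "services green, reliable & available": services_ok,
        "WLCG operations metrics met": operations_ok,
        "functional-block metrics met": functional_ok,
    }
    failed = [name for name, ok in criteria.items() if not ok]
    return (not failed, failed)
```

Returning the list of failed criteria, rather than a bare boolean, keeps the single view actionable: a site that is "not doing ok" immediately sees which of the four areas needs attention.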

Example from CMS: Site Readiness

Regularly used to evaluate CMS sites; results are shown at the weekly Facilities meeting. Several metrics are taken into account:
- SAM availability must be > 90% for a Tier-1;
- Job Robot success rate must be > 90% for a Tier-1 (submitting about 600 jobs per day per site);
- The number of commissioned links must be high enough.
Other metrics are measured but not included: "backfill" jobs, production/analysis success rates from the Dashboard, link quality, etc.
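The Tier-1 readiness thresholds quoted on this slide could be expressed as a simple check. Note that `min_links` is an assumption: the slide only says the number of commissioned links "must be high enough", without giving a figure.

```python
# Sketch of the CMS Site Readiness evaluation for a Tier-1, using the
# thresholds quoted on the slide.  min_links is a hypothetical placeholder
# for "high enough"; the real CMS criterion is not stated here.

def tier1_ready(sam_availability, job_robot_success, commissioned_links,
                min_links=4):
    return (sam_availability > 0.90          # SAM availability > 90%
            and job_robot_success > 0.90     # Job Robot success rate > 90%
            and commissioned_links >= min_links)
```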

So…

Definition: a site is "doing ok" if (and only if?):
1. It has provided (usable) resources that match those pledged & requested;
2. The services are running smoothly, pass the tests and meet reliability and availability targets;
3. "WLCG operations" metrics on handling scheduled and unscheduled service interruptions and degradations are met (more automation / reporting required here);
4. It is meeting or exceeding metrics for "functional blocks" (this latter will require some work: is it "fair" to expect experiments to define these? Must be as simple as "autopilot settings").
If it is as simple as this, all we have to do is:

Make it so…

The Dashboard Again… CHEP 2006

WLCG Tier1 (Performance) Metrics

# | Metric
1 | Site is providing (usable) resources that match those pledged & requested.
2 | The services are running smoothly, pass the tests and meet reliability and availability targets.
3 | "WLCG operations" metrics on handling scheduled and unscheduled service interruptions and degradations are met.
4 | Site is meeting or exceeding metrics for "functional blocks".