Benchmarking in WLCG
Helge Meinhard, CERN-IT
HEPiX Fall 2015 at BNL, 16-Oct-2015



Background
- Need for CPU benchmarking well established: resource requests, pledges, installed capacity, accounting; reference for procurement
- HS06 pretty well established and recognised
- Some latent feeling of uneasiness, mostly among the experiments, less among the sites

GDB Discussion of 09-Sep-2015
- Series of discussions and reports in WLCG bodies: Grid Deployment Board (GDB) and Management Board (MB), mostly status updates by the providers (Manfred, Michele, HM)
- 09-Sep-2015: GDB discussion taking it the other way round, with contributions by the LHC experiments on machine/job features and on benchmarking
- Long discussions highlighting uncomfortable feelings; a number of areas to be improved, even though for some issues it may be a matter of communication
- In my opinion a little confused; attempt to structure it (Manfred Alef, Ian Bird, Michel Jouvin, HM, ...)
- 15-Sep-2015: MB discussion on the follow-up; four areas for follow-up identified, coordinated by a small group

Area 1: CPU Power Seen by the Job
- Predict the power of a compute slot (batch, cloud) for the running job; often needed for job matching and masonry
- Two approaches: use HS06 (via MJF), which may underestimate because of advanced CPU features, or measure within the running job, which may be imprecise if the workload changes
- Needs to be fast and to require access to the job slot only; HS06 is clearly inappropriate: it takes hours, requires a licence, and expects access to the full machine
- Known candidates: LHCb Python script, ROOTmark, Dhrystone/Whetstone, KitValidation, HTCondor benchmarks, ...
- Some work done, but not conclusive yet; ideally one single choice for all experiments and application types
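As an illustration of the measure-within-the-job approach, below is a minimal sketch in the spirit of the LHCb Python script; the workload, iteration count and score unit are invented, so this is not the actual LHCb benchmark:

    #!/usr/bin/env python
    # Minimal fast-benchmark sketch (hypothetical): time a fixed
    # single-threaded arithmetic workload in the current job slot and
    # report a throughput-like score in an arbitrary unit (not HS06).
    import math
    import os

    def fast_benchmark(iterations=3000000):
        """Return a throughput score for this job slot (higher is faster)."""
        cpu_start = sum(os.times()[:2])       # user + system CPU seconds
        x = 0.0
        for i in range(1, iterations):
            x += math.sqrt(i) * math.log(i)   # fixed floating-point workload
        cpu_used = sum(os.times()[:2]) - cpu_start
        return iterations / cpu_used

    if __name__ == "__main__":
        # Repeat a few times and keep the best run, to reduce the impact
        # of transient load on the worker node.
        print(max(fast_benchmark() for _ in range(3)))

Such a script runs in seconds to minutes and touches only its own slot, matching the requirements above.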

Area 2: Whole-Server Benchmark
- Benchmark a whole server (or farm) precisely; needed for procurements, pledges, installed capacity, CPU accounting, ...
- Requires (possibly long-running) benchmark programs controlling the full machine
- HEP-SPEC06 (HS06) emerged from a common WLCG/HEPiX activity back in 2007/2008
- No known issue with HS06 per se: no evidence of scaling issues beyond 10%, the initial objective; the choice of boundary conditions for running HS06 has served the community well
- However, applications, machines and industry-standard benchmarks have evolved since; we should move to a new benchmark soon, following proper verification against typical experiment applications
- Candidates: a (subset of the?) new SPEC CPU benchmark suite (expected to be released soon), a Geant4 benchmark, ...
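A schematic of the whole-server idea, assuming the usual recipe of one single-threaded benchmark copy per hardware thread run concurrently; this is not the actual HS06/SPEC harness, and the workload and score unit are invented:

    # Whole-server sketch: run one benchmark copy per logical CPU in
    # parallel, so the aggregate reflects contention effects such as
    # hyper-threading; the workload and score unit are invented.
    import math
    import multiprocessing
    import time

    ITERATIONS = 2000000

    def one_copy(_):
        """One single-threaded benchmark copy; returns a throughput score."""
        start = time.time()
        x = 0.0
        for i in range(1, ITERATIONS):
            x += math.sqrt(i)
        return ITERATIONS / (time.time() - start)

    if __name__ == "__main__":
        n = multiprocessing.cpu_count()        # one copy per logical CPU
        pool = multiprocessing.Pool(n)
        scores = pool.map(one_copy, range(n))  # all copies run concurrently
        pool.close()
        pool.join()
        print("copies: %d, aggregate score: %.0f" % (n, sum(scores)))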

Area 3: Accounting
- Acute or latent suspicion that accounting numbers are inaccurate
- Accounted work = CPU time used times slot power in HS06, with slot power taken as the HS06 of the machine divided by the number of slots, or as the average HS06 per slot of a whole compute farm or CE
- Increasingly inaccurate due to increased machine sophistication, with factors including symmetric multi-threading / hyper-threading, turbo boost, virtualisation, ...
- Exactly the same reasons why areas (1) and (2) are potentially very different!
- Check whether this is the only source of discrepancy (and unhappiness)
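A minimal sketch of the accounting arithmetic described above; all numbers are invented for illustration:

    # Accounted work = CPU time used x slot power in HS06. The two
    # common per-slot power estimates can differ once hyper-threading
    # and overbooking enter; all numbers below are invented.
    machine_hs06 = 400.0           # whole-server HS06 score
    job_slots = 40                 # e.g. one slot per logical core (20c/40t)
    farm_avg_hs06_per_slot = 11.5  # alternative: farm/CE-wide average

    cpu_time_hours = 10.0          # CPU time consumed by the job

    work_machine = cpu_time_hours * machine_hs06 / job_slots  # 100 HS06-hours
    work_farm = cpu_time_hours * farm_avg_hs06_per_slot       # 115 HS06-hours
    print(work_machine, work_farm)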

Area 4: Storage and Transport of Benchmarking Information
- Current attempt: machine/job features (MJF); definitely the right direction
- Deployment is easy, but still risks taking long due to a chicken-and-egg situation
- Needed at least for a precise estimate of the lower limit of job slot performance, and for proper per-job accounting
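For illustration, under the machine/job features proposal a job reads per-machine and per-slot values from directories pointed to by the MACHINEFEATURES and JOBFEATURES environment variables; a minimal sketch, where the key names hs06 and hs06_job follow the MJF proposal and the fallback logic is ours:

    # Sketch of a job reading its slot power via machine/job features.
    # MACHINEFEATURES and JOBFEATURES point to directories of one-value
    # files; key names 'hs06' and 'hs06_job' follow the MJF proposal.
    import os

    def read_feature(env_var, key):
        """Return the value of one MJF key, or None if unavailable."""
        directory = os.environ.get(env_var)
        if not directory:
            return None
        try:
            with open(os.path.join(directory, key)) as f:
                return f.read().strip()
        except IOError:
            return None

    machine_hs06 = read_feature("MACHINEFEATURES", "hs06")  # whole machine
    slot_hs06 = read_feature("JOBFEATURES", "hs06_job")     # this job slot
    print("machine HS06: %s, slot HS06: %s" % (machine_hs06, slot_hs06))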

MB Decision
- Quite some expertise around... and even more diverging views ;-)
- Co-ordination and planning needed
- Establish a small group mandated to plan concrete steps to tackle issues (1) to (4): include all LHC experiments, selected site representatives, and benchmarking and accounting experts; report back to the MB (and GDB)
- Subject to MB approval, kick off activities around issues (1) to (4) with clearly defined objectives and target dates; in particular for (1) and (2), collaborate with HEPiX and their benchmarking experts
- The group is now in the process of being formed

Benchmarking WG
Michele Michelotto, INFN Padova

Two Separate Problems
- A benchmark to understand the potential throughput, in terms of events per second, that a computing worker node can produce
- A fast benchmark: a program that tries to "guesstimate" how many events can be produced in a time slot on a batch node or on a virtual machine

The HS06
- A benchmark to understand the potential throughput, in terms of events per second, that a computing worker node can produce
- Needed for procurements and tenders
- Also used to describe the potential throughput of a cluster of worker nodes, of a Tier-2, a Tier-1, etc.
- Hence it is used for CPU power pledges

The HS06 (continued)
- The benchmark was SPEC CPU 2000 and, since the 2006 suite, is HEP-SPEC06 (aka HS06)
- A long run time is acceptable, because it runs only once in the life of a worker node
- Few-percent precision, aiming to correlate with full simulation; HS06 turned out to agree well also with reconstruction and other applications
- Easy to run for computing vendors
- Must be maintained, must be stable, must be free or at least cheap
- Needs a clear recipe defining it (compiler, optimisation, 32/64-bit, etc.), because it is a unit of measurement, like a pint
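For reference, SPEC-style suites aggregate per-benchmark results as a geometric mean of ratios against a reference machine, which is part of why the recipe must be pinned down; a sketch with invented numbers:

    # SPEC-style aggregation sketch: the overall score is the geometric
    # mean of the per-benchmark ratios. The ratios below are invented.
    def spec_style_score(ratios):
        product = 1.0
        for r in ratios:
            product *= r
        return product ** (1.0 / len(ratios))

    print(spec_style_score([95.2, 101.7, 88.4]))  # one ratio per benchmark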

Fast Benchmark
- Requested mainly by the WLCG community via the GDB
- Users want to know the performance of the job slot they are given
- They tried to use the HS06 of the machine to estimate it, but several factors get in the way:
- Should they take the HS06 of the worker node and divide by logical cores or by physical cores? (20 job slots for a 20-core/40-thread WN?)
- Some sites overbook a little, e.g. for a 20-core/40-thread worker node they define 30 job slots
- HS06 does not describe the actual load on the worker node, because of dynamic frequency scaling of the CPU clock (clock throttling) but also because of competition for other hardware and software resources on the node
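The ambiguity in per-slot power is easy to see with numbers (all invented):

    # Per-slot HS06 for a 20-core/40-thread worker node with a
    # whole-server score of 400 HS06 (numbers invented).
    machine_hs06 = 400.0

    per_physical_core = machine_hs06 / 20  # 20.0 HS06 per slot
    per_logical_core = machine_hs06 / 40   # 10.0 HS06 per slot
    per_overbooked = machine_hs06 / 30     # ~13.3 HS06 per slot (30 slots)

    print(per_physical_core, per_logical_core, per_overbooked)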

HS06/core and LHCb/core
[Plot: HS06 per core against LHCb script score per core]

Fast Benchmark (continued)
- The fast benchmark runs for one to a few minutes instead of hours
- It measures the performance of the provided job slot (on a batch farm as well as in a cloud environment) and takes into account the actual load on the slot (single-threaded)
- It reflects the load during the few minutes it is running, but the load may be better (CPU time in the slot is wasted) or worse (the job is aborted) once the actual job runs
- Sometimes the worker node is in a bad state, so the fast benchmark gives a very poor result, uncorrelated with HS06/slot
- Running on bare metal or virtualised can make a small difference (a big difference if you use special instructions available only on the real machine)

Sometimes LHCb Gets Slow
- The HS06/LHCb score ratio is around 1.2 to 1.6
- Occasionally it can go above 2.0

HS06 vs LHCb
[Plot: HS06 against the LHCb script score]

Other Fast Benchmarks: Dhrystone
[Plot: Dhrystone results]

HS06 vs Whetstone-double
[Plot: HS06 against Whetstone-double results]

Measuring on a Production Cluster
- Manfred compared, on a GridKa production cluster, the LHCb.py score with the HS06 per slot
- [Plot: HS06/LHCb ratio against load]

Future of HS06
- In the past, SPEC changed the SPEC CPU benchmark frequently: 1989, 1992, 1995, 2000, 2006
- Now the core is rather stable; performance increases come mainly from more cores per processor, and attention has moved to other issues such as power consumption
- SPEC has been working for several years on the next CPU version, tentatively called v6
- Rumours of a public release in 2014 were too optimistic
- We tried to revive the HEPiX working group, but it was too early; when the next version of SPEC is released, we will need to check again with the experiments and the community of experts
- Off the record...