
CMS Multicore jobs at RAL
Andrew Lahiff, RAL
WLCG Multicore TF Meeting, 1st July 2014

Introduction
Reminder:
–RAL supports all LHC VOs
–The batch system is HTCondor
Multi-core configuration:
–Change made since the last RAL report at a multi-core task force meeting: the order in which the negotiator considers groups was changed, so that multi-core groups are considered before single-core groups (see the sketch below). Multi-core slots are ‘expensive’ to obtain, so we don’t want to lose them too quickly.
–Changes made since CMS started submitting multi-core jobs: none
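A minimal sketch of how this group ordering can be expressed in HTCondor, assuming the multi-core accounting groups have "multicore" in their names (the group names and values are illustrative assumptions, not RAL's actual configuration). GROUP_SORT_EXPR is evaluated for each accounting group, and groups with smaller values are negotiated first:

    # Illustrative negotiator configuration, not RAL's actual settings.
    # Keep the catch-all <none> group last, consider groups whose names
    # contain "multicore" first, then all remaining single-core groups.
    GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.40282e+38, \
                        ifThenElse(regexp("multicore", AccountingGroup), 1.0, 2.0))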

CMS usage
CMS multi-core jobs over the past month, in addition to the 1-3K running CMS single-core jobs.

Draining
Wasted CPUs due to draining:
–CPUs are wasted by draining nodes in order to provide multi-core slots
–less than about 3% of the total number of CPUs (a sketch of a draining configuration follows)
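On HTCondor, this kind of draining is typically done with the condor_defrag daemon; the following is a hedged, illustrative sketch of such a configuration (the knob values are assumptions, not RAL's actual settings). The rate limits exist precisely to bound the wasted CPUs described above:

    # Illustrative condor_defrag settings, not RAL's actual values.
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG

    # Bound how aggressively nodes are drained, since draining wastes CPUs
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 4.0
    DEFRAG_MAX_CONCURRENT_DRAINING = 10
    DEFRAG_MAX_WHOLE_MACHINES = 20

    # Stop draining a node once 8 cores are free (enough for one
    # multi-core slot) rather than emptying it completely
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
    DEFRAG_SCHEDULE = graceful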

Wall times
Multi-core job wall times, since 1st June.

CPU efficiency
CPU efficiency of single-core jobs since 1st June, for jobs with wall time > 5 hours.
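For reference, the efficiency quoted here is presumably the usual batch-system definition (the per-core normalisation is an assumption, but it is what makes single-core and multi-core numbers comparable):

    efficiency = CPU time / (number of cores × wall time)

So an 8-core pilot that keeps only one core busy for its whole lifetime would show roughly 12.5%.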

CPU efficiency of multi-core jobs since 1st June, for jobs with wall time > 5 hours.

CPU efficiency vs wall time for multi-core jobs: the longest jobs generally have poor CPU efficiency.

CPU efficiency
Overall CPU efficiency for CMS, June 2014:
–Single-core jobs: 80.1%
–Multi-core jobs: 58.2%

Why is the CPU efficiency low?
CMS multi-core pilots are internally running single-core jobs:
–How many single-core jobs are running as a function of time?
Simple analysis (a sketch is shown below):
–A script runs as a cron job on the CE every 20 minutes, executing
  condor_ssh_to_job "ls glide_*/execute | wc -l"
 for all running CMS multi-core jobs
–This gives the number of single-core jobs running within the multi-core pilot
–This data is used to reconstruct the number of single-core jobs running as a function of time within each multi-core pilot
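A minimal sketch of such a cron script, assuming the pilots run under the user "cmspilot" and request 8 CPUs; the constraint, user name and log path are illustrative assumptions, not RAL's actual script:

    #!/bin/bash
    # Hedged sketch of the 20-minute sampling script described above.
    LOG=/var/log/cms-mcore-payloads.log
    NOW=$(date +%s)

    # All running (JobStatus == 2) CMS multi-core pilots on this CE
    condor_q -constraint 'Owner == "cmspilot" && JobStatus == 2 && RequestCpus == 8' \
             -format '%d.' ClusterId -format '%d\n' ProcId |
    while read -r JOB; do
        # Each payload job gets a sandbox under glide_*/execute inside the
        # pilot, so counting entries approximates the number of running
        # single-core jobs at this instant
        N=$(condor_ssh_to_job "$JOB" 'ls glide_*/execute | wc -l' 2>/dev/null)
        echo "$NOW $JOB ${N:-0}" >> "$LOG"
    done

Sampling like this every 20 minutes gives a per-pilot time series of payload counts, from which plots like the following examples can be reconstructed.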

Why is the CPU efficiency low?
Example job: the pilot ran for ~9.5 hours but only used all 8 CPUs for less than 3 hours.

Why is the CPU efficiency low?
Another example job: it ran for over 4 hours using only 1 of the 8 requested cores.

Summary
CMS multi-core pilots are running single-core jobs:
–Resources are wasted in providing the multi-core slots
–Resources are wasted by inefficient use of CPUs within the multi-core pilots
Until multi-core jobs at sites are more common than single-core jobs, perhaps it’s best for multi-core pilots to only run multi-core jobs?
–Longer-lived pilots, however, would help with the above issues