
CMS Multicore jobs at RAL
Andrew Lahiff, RAL
WLCG Multicore TF Meeting, 1st July 2014

Introduction
Reminder:
–RAL supports all LHC VOs
–The batch system is HTCondor
Multi-core configuration:
–Change made since the last RAL report at a multi-core task force meeting: the order in which the negotiator considers groups was changed, so that multi-core groups are considered before single-core groups (see the sketch below). Multi-core slots are ‘expensive’ to obtain, so we don’t want to lose them too quickly.
–Changes made since CMS started submitting multi-core jobs: none
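A minimal sketch of how this group ordering can be expressed in HTCondor, assuming the multi-core accounting groups have "multicore" in their names (the group names and values are illustrative assumptions, not RAL's actual configuration). GROUP_SORT_EXPR is evaluated for each accounting group, and groups with smaller values are negotiated first:

    # Illustrative negotiator configuration, not RAL's actual settings.
    # Keep the catch-all <none> group last, consider groups whose names
    # contain "multicore" first, then all remaining single-core groups.
    GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.40282e+38, \
                        ifThenElse(regexp("multicore", AccountingGroup), 1.0, 2.0))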

CMS usage
CMS multi-core jobs over the past month, in addition to the 1-3K running CMS single-core jobs.

Draining
Wasted CPUs due to draining:
–CPUs are wasted by draining nodes in order to provide multi-core slots
–less than about 3% of the total number of CPUs (a sketch of a draining configuration follows)
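On HTCondor, this kind of draining is typically done with the condor_defrag daemon; the following is a hedged, illustrative sketch of such a configuration (the knob values are assumptions, not RAL's actual settings). The rate limits exist precisely to bound the wasted CPUs described above:

    # Illustrative condor_defrag settings, not RAL's actual values.
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG

    # Bound how aggressively nodes are drained, since draining wastes CPUs
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 4.0
    DEFRAG_MAX_CONCURRENT_DRAINING = 10
    DEFRAG_MAX_WHOLE_MACHINES = 20

    # Stop draining a node once 8 cores are free (enough for one
    # multi-core slot) rather than emptying it completely
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
    DEFRAG_SCHEDULE = graceful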

Wall times
Multi-core job wall times, since 1st June.

CPU efficiency
CPU efficiency of single-core jobs since 1st June, for jobs with wall time > 5 hours.
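For reference, the efficiency quoted here is presumably the usual batch-system definition (the per-core normalisation is an assumption, but it is what makes single-core and multi-core numbers comparable):

    efficiency = CPU time / (number of cores × wall time)

So an 8-core pilot that keeps only one core busy for its whole lifetime would show roughly 12.5%.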

CPU efficiency of multi-core jobs since 1st June, for jobs with wall time > 5 hours.

CPU efficiency vs wall time for multi-core jobs: the longest jobs generally have poor CPU efficiency.

CPU efficiency
Overall CPU efficiency for CMS, June 2014:
–Single-core jobs: 80.1%
–Multi-core jobs: 58.2%

Why is the CPU efficiency low?
CMS multi-core pilots are internally running single-core jobs:
–How many single-core jobs are running as a function of time?
Simple analysis (a sketch is shown below):
–A script runs as a cron job on the CE every 20 minutes, executing
  condor_ssh_to_job "ls glide_*/execute | wc -l"
 for all running CMS multi-core jobs
–This gives the number of single-core jobs running within the multi-core pilot
–This data is used to reconstruct the number of single-core jobs running as a function of time within each multi-core pilot
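A minimal sketch of such a cron script, assuming the pilots run under the user "cmspilot" and request 8 CPUs; the constraint, user name and log path are illustrative assumptions, not RAL's actual script:

    #!/bin/bash
    # Hedged sketch of the 20-minute sampling script described above.
    LOG=/var/log/cms-mcore-payloads.log
    NOW=$(date +%s)

    # All running (JobStatus == 2) CMS multi-core pilots on this CE
    condor_q -constraint 'Owner == "cmspilot" && JobStatus == 2 && RequestCpus == 8' \
             -format '%d.' ClusterId -format '%d\n' ProcId |
    while read -r JOB; do
        # Each payload job gets a sandbox under glide_*/execute inside the
        # pilot, so counting entries approximates the number of running
        # single-core jobs at this instant
        N=$(condor_ssh_to_job "$JOB" 'ls glide_*/execute | wc -l' 2>/dev/null)
        echo "$NOW $JOB ${N:-0}" >> "$LOG"
    done

Sampling like this every 20 minutes gives a per-pilot time series of payload counts, from which plots like the following examples can be reconstructed.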

Why is the CPU efficiency low?
Example job: the pilot ran for ~9.5 hours but only used all 8 CPUs for less than 3 hours.

Why is the CPU efficiency low?
Another example job: it ran for over 4 hours using only 1 of the 8 requested cores.

Summary
CMS multi-core pilots are running single-core jobs:
–Resources are wasted in providing the multi-core slots
–Resources are wasted by inefficient use of CPUs within the multi-core pilots
Until multi-core jobs at sites are more common than single-core jobs, perhaps it’s best for multi-core pilots to only run multi-core jobs?
–Longer-lived pilots, however, would help with the above issues