Multi-core jobs at the RAL Tier-1
Andrew Lahiff, Alastair Dewhurst, John Kelly
February 25th 2014

Introduction
- RAL supports many VOs
  - ALICE, ATLAS, CMS, LHCb
  - ~12 non-LHC VOs
- Size of batch system
  - Currently over 9300 job slots
- Used Torque/Maui for many years
  - Reliability/stability degraded as the number of cores/WNs increased
- Migrated to HTCondor last year
  - Started running production ATLAS/CMS jobs in June, in parallel with the production Torque/Maui system
  - In November fully moved to HTCondor
- Multi-core usage
  - ATLAS: almost continuously since November
  - CMS: have run a few test jobs
  - No interest so far from other VOs

Job submission
- Submission of multi-core jobs to RAL: ATLAS and CMS only, using ARC CEs (no CREAM!)
- Configuration
  - Setup with only a single queue
  - Any VO can run multi-core jobs if they want to: just request > 1 CPU, e.g. add to the xRSL/Globus RSL job description (see the sketch below):
        (count=8)
  - Any number of CPUs can be requested (2, 4, 8, 31, ...)
    - But a job requesting 402 CPUs, for example, is unlikely to run
  - Memory requirements
    - Jobs also request how much memory they need, e.g. (memory=2000)
    - For multi-core jobs this is the memory required (in MB) per core
    - The total memory required is passed from the CE to HTCondor
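A minimal xRSL sketch of an 8-core job as it might be submitted to one of the ARC CEs. The executable, job name and output files are illustrative; only (count=8) and the per-core (memory=2000) come from the slide:

    &(executable="run_multicore.sh")
     (jobname="mcore_test")
     (count=8)
     (memory=2000)
     (stdout="stdout.txt")
     (stderr="stderr.txt")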

Worker node configuration
- Multi-core worker nodes in HTCondor: 3 possibilities
  1. Divide resources evenly (default)
     - 1 core per slot, memory divided equally between slots
     - Fixed number of slots
  2. Define slot types (see the sketch below)
     - Specify the (static) resources available to each slot; they don't need to be divided equally
     - Fixed number of slots
  3. Partitionable slots (see next slide)
- Clearly, choices 1 & 2 are not suitable for environments where jobs have different memory & core requirements
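For illustration, a minimal sketch of option 2 for a hypothetical 16-core, 32 GB worker node; the slot layout and sizes are invented for the example, not a RAL configuration:

    # Eight static single-core slots
    NUM_SLOTS_TYPE_1 = 8
    SLOT_TYPE_1 = cpus=1, mem=2048
    # One static 8-core slot
    NUM_SLOTS_TYPE_2 = 1
    SLOT_TYPE_2 = cpus=8, mem=16384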

Worker node configuration
- We're using partitionable slots
  - All WNs are configured to have partitionable slots
  - Each WN has a single partitionable slot with a fixed set of resources (cores, memory, disk, swap, ...)
  - These resources can then be divided up as necessary for jobs
    - Dynamic slots are created from the partitionable slot
    - When dynamic slots exit, their resources merge back into the partitionable slot
    - When any single resource is used up (e.g. memory), no more dynamic slots can be created
  - Enables us to run jobs with different memory requirements, as well as jobs requiring different numbers of cores
- Our configuration:
        NUM_SLOTS = 1
        SLOT_TYPE_1 = cpus=100%,mem=100%,auto
        NUM_SLOTS_TYPE_1 = 1
        SLOT_TYPE_1_PARTITIONABLE = TRUE
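A quick way to see the partitionable slots and the resources they still have free is condor_status; the hostname and numbers in the example output below are purely illustrative:

    $ condor_status -constraint 'PartitionableSlot =?= True' -af Name Cpus TotalCpus Memory
    slot1@wn0001.example.org 2 16 3500

Each WN should show one partitionable slot like this, plus one dynamic slot per running job.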

Fairshares
- Initially just had a single accounting group for all ATLAS production jobs
  - Single & multi-core jobs in the same accounting group
  - Problem: fairshare could be maintained by running only single-core jobs
- Added accounting groups for ATLAS & CMS multi-core jobs (a sketch of the group setup follows below)
- Accounting groups are forced to be specified for jobs using SUBMIT_EXPRS on the CEs
  - Easy to include groups for multi-core jobs, e.g.:
        AccountingGroup = ... ifThenElse(regexp("patl",Owner) && RequestCpus > 1, "group_ATLAS.prodatls_multicore", \
                              ifThenElse(regexp("patl",Owner), "group_ATLAS.prodatls", \
                              ...
        SUBMIT_EXPRS = $(SUBMIT_EXPRS) AccountingGroup
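For context, a hedged sketch of the negotiator-side group configuration that such accounting groups rely on. The ATLAS group names follow the slide, while the CMS group name, the quotas and the surplus setting are illustrative assumptions rather than RAL's production values:

    GROUP_NAMES = group_ATLAS, group_ATLAS.prodatls, group_ATLAS.prodatls_multicore, \
                  group_CMS, group_CMS.prodcms_multicore
    # Dynamic quotas are fractions of the parent group's share
    # (or of the whole pool for top-level groups)
    GROUP_QUOTA_DYNAMIC_group_ATLAS = 0.4
    GROUP_QUOTA_DYNAMIC_group_ATLAS.prodatls = 0.5
    GROUP_QUOTA_DYNAMIC_group_ATLAS.prodatls_multicore = 0.2
    GROUP_QUOTA_DYNAMIC_group_CMS = 0.3
    GROUP_QUOTA_DYNAMIC_group_CMS.prodcms_multicore = 0.2
    # Allow groups to use idle capacity beyond their quota
    GROUP_ACCEPT_SURPLUS = True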

Defrag daemon
- If lots of single-core jobs are idle & running, how does a multi-core job start?
  - By default it probably won't
- condor_defrag daemon
  - Finds worker nodes to drain, drains them & cancels draining when necessary
    - Draining = no new jobs can start, but existing jobs continue to run
  - Many configurable parameters (see the sketch below), including:
    - How often to run
    - Maximum number of machines running multi-core jobs
    - Maximum number of concurrently draining machines
    - Expression that specifies which machines to drain
    - Expression that specifies when to cancel draining
    - Expression that specifies which machines are already running multi-core jobs
    - Expression that specifies which machines are more desirable to drain
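For reference, a hedged sketch of the condor_defrag knobs corresponding to the bullets above; the values shown are illustrative only, not the RAL settings:

    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    # How often the defrag daemon considers draining more machines (seconds)
    DEFRAG_INTERVAL = 600
    # Maximum number of machines kept "whole" for multi-core jobs
    DEFRAG_MAX_WHOLE_MACHINES = 200
    # Maximum number of machines draining at any one time
    DEFRAG_MAX_CONCURRENT_DRAINING = 20
    # Which machines may be selected for draining
    DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True
    # Which machines already count as able to run a multi-core job
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus && Offline =!= True
    # When to cancel draining on a machine
    DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)
    # Which machines are more desirable to drain
    DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput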

Defrag daemon
- Improvements to the default configuration (1)
  - condor_defrag was originally designed to enable whole-node jobs to run
    - We currently want to run 8-core jobs, not whole-node jobs
  - Definition of "whole" machines
    - Only drain up to 8 cores, not entire machines
    - Changed the default
          DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus && Offline =!= True
      to
          DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && Offline =!= True
    - Here Cpus = free CPUs

Defrag daemon
- Improvements to the default configuration (2)
  - Which machines are more desirable to drain
    - We assume that:
      - On average all jobs have the same wall time
      - The wall time used by jobs is not related to the wall time they request
    - Weighting factor: number of cores available for draining / number of cores that need draining, i.e. used CPUs / (8 - free CPUs)
    - Changed the default
          DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput
      to
          DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))
      (ExpectedMachineGracefulDrainingBadput is the job runtime in CPU-seconds that would be lost if graceful draining were initiated at the time the machine's ad was published, assuming jobs run to their walltime limit)
  - Why make this change?
    - Previously only the older 8-core machines were selected for draining
    - Now the machines with the most cores are selected for draining, so 8 free cores become available more quickly
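As a worked example of the new rank expression (the machine sizes are illustrative): a fully loaded, hypothetical 32-core WN has Cpus = 0 free, so its rank is (32 - 0)/(8 - 0) = 4; a fully loaded 8-core WN scores (8 - 0)/(8 - 0) = 1; and any WN that already has 8 or more free cores gets rank -10, so it is not drained further. Since machines with higher rank are drained first, the WNs with the most cores are preferred, and 8 free cores appear after the fewest job completions.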

Defrag daemon
- Effect of the change to DEFRAG_RANK
- Effect of switching off condor_defrag
  - Will the number of running multi-core jobs quickly drop to zero?
  - The number of running multi-core jobs decreased only slowly
  [Plot: running multi-core jobs vs time, with the point at which all draining worker nodes were cancelled marked]
- Conclusion: we need to drain slots continuously to maintain the number of running multi-core jobs

Recent jobs
- Past month
  [Plots: all jobs idle/running; multi-core jobs idle/running]
- The limit on the number of running ATLAS multi-core jobs has been their fairshare
- Originally saw a sawtooth pattern in the number of running multi-core jobs
  - ATLAS then made a change to keep a smoother supply of multi-core jobs

Recent jobs
- Wall times since 1st Feb
  [Plots: wall time distributions for ATLAS production (single core), ATLAS analysis (single core) and ATLAS production (multi core)]
- ATLAS 'timefloor' parameter
  - Analysis pilots that have been running for < 1 hour will pick up another job
  - To help with draining, we might change this or add a short analysis queue
  - Might add the timefloor parameter to the multi-core queue so that the slots are kept for longer

Recent jobs
- Wait times since 1st Feb
  [Plots: wait time distributions for ATLAS production (single core) and ATLAS production (multi core)]

Recent jobs
- Time to drain 8 slots
  [Plot: time taken to drain 8 slots]

Wasted resources
- The defrag daemon is currently quite simple
  - Not aware of demand (e.g. idle multi-core jobs)
  - Drains a constant number of worker nodes all the time
- Added monitoring of wasted CPUs due to draining (one way to measure this is sketched below)
  [Plot: wasted CPUs over the past month]
  - We can clearly see the wastage; it's not hidden within a multi-core pilot running a mixture of single & multi-core jobs
- Important question: can we reduce the number of wasted CPUs?
  - Attempting to reduce wasted resources (next slide)
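A sketch of the idea rather than the exact monitoring used here (the hostnames and numbers in the example output are hypothetical): sum the free cores on worker nodes that are currently draining, since those cores sit idle until the drain completes:

    $ condor_status -constraint 'PartitionableSlot && Draining =?= True' -af Name Cpus
    slot1@wn0123.example.org 5
    slot1@wn0456.example.org 3

Summing the Cpus column at regular intervals gives the number of CPUs wasted by draining at each point in time.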

Wasted resources
- Method to reduce wasted resources
  - A cron which sets the parameters of the defrag daemon based on the numbers of running & idle multi-core jobs (see the sketch below)
    - In particular the number of concurrently draining WNs
    - Runs every 30 mins
  - 1st attempt is very simple and considers 3 situations:
    - Many idle multi-core jobs, few running multi-core jobs: need aggressive draining
    - Many idle multi-core jobs, many running multi-core jobs: need less aggressive draining, just enough to maintain the number of currently running multi-core jobs
    - Otherwise: very little draining required
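A minimal sketch of such a cron job, not the production script: the thresholds and draining levels are invented for illustration, "condor-defrag-host" stands for the machine running condor_defrag, and condor_config_val -rset assumes the necessary SETTABLE_ATTRS/administrator permissions are in place.

    #!/bin/bash
    # Count idle (JobStatus == 1) and running (JobStatus == 2) multi-core jobs across all schedds
    IDLE=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 1' -af ClusterId | wc -l)
    RUNNING=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 2' -af ClusterId | wc -l)

    if [ "$IDLE" -gt 100 ] && [ "$RUNNING" -lt 200 ]; then
        # Many idle, few running: aggressive draining
        DRAINING=40
    elif [ "$IDLE" -gt 100 ]; then
        # Many idle, many running: just maintain the current level
        DRAINING=15
    else
        # Little demand: very little draining
        DRAINING=3
    fi

    # Push the new limit to the defrag daemon and get it to re-read its configuration
    condor_config_val -name condor-defrag-host -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $DRAINING"
    condor_reconfig -name condor-defrag-host -daemon defrag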

Wasted resources
- Example from 24th Feb
  [Plot: multi-core jobs only, showing the periods of aggressive and less aggressive draining]

Information system / accounting
- BDII hasn't been a priority
  - ATLAS and CMS don't even use it!
  - Currently:
        $ ldapsearch -x -h arc-ce01.gridpp.rl.ac.uk -p 2135 -LLL -b o=grid | grep "GlueHostArchitectureSMPSize:"
        GlueHostArchitectureSMPSize: 1
  - May fix this if someone needs it
- Accounting
  - ARC CEs send accounting records directly to the APEL brokers
  - With the latest ARC rpms, multi-core jobs are handled correctly

Future plans
- Reducing wasted resources
  - Improvements to the script which adjusts the defrag daemon parameters
  - Make multi-core slots "sticky"
    - It takes time & resources to provide multi-core slots
    - After a multi-core job runs, prefer to run another multi-core job rather than multiple single-core jobs
  - Try to run "short" jobs on slots which are draining?
- Multiple multi-core jobs per WN (see the sketch below)
  - By default, condor_defrag assumes "whole node" rather than "multi-core" jobs
  - Currently there will be at most 1 multi-core job per WN (unless there aren't many single-core jobs to run)
  - Modify the condor_defrag configuration as necessary to allow this
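One possible way to allow this, sketched under the assumption of a 16-core threshold (an illustration, not a stated RAL plan): treat a WN as "whole" only once two 8-core slots' worth of cores are free, so draining continues past the first 8 cores:

    # Keep draining until 16 cores are free (or the machine is completely empty)
    DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 16)) && Offline =!= True
    # Cancel draining at the same point
    DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)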