HTCondor & ARC CEs
Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier
GridPP 32, 25 March 2014, Pitlochry, Scotland


Outline
1. RAL batch system
   – Background
   – Why we migrated to HTCondor
   – Experience so far
   – Features
2. ARC CEs

HTCondor

RAL batch system
Batch system at the RAL Tier-1
– 656 worker nodes, 9312 slots, HEPSPEC06
– Soon growing beyond this number of slots
Torque + Maui had been used for many years
– Had many issues
– Severity & number of problems increased with the size of the farm
– Increasing effort just to keep it working
In August 2012 we started looking for an alternative
[Plot: jobs running over the past 4 years, as of Aug 2013]

Choosing a new batch system
Considered, tested & eventually rejected the following technologies:
– LSF, Univa Grid Engine
  Avoid commercial products unless absolutely necessary
– Open-source Grid Engines
  Competing products; not sure which has the best long-term future
  Communities appear less active than those of SLURM & HTCondor
  Existing Tier-1s use Univa Grid Engine rather than the open-source forks
– Torque 4 + Maui
  Maui problematic
  Torque 4 seems less scalable than the alternatives
  – Easily able to cause problems with high job submission / query rates
– SLURM
  Carried out extensive testing and comparison with HTCondor
  Found that for our use case it was:
  – Very fragile, easy to break
  – Unable to work reliably above 6000 job slots
For more information, see [link in original slides]

Choosing a new batch system
HTCondor chosen as the replacement for Torque + Maui
– Has the features we require
– Seems very stable
– Easily able to run 16,000 simultaneous jobs
  Prior to deployment into production we didn't try larger numbers of jobs
  – Didn't expect to exceed this number of slots within the next few years
  – See later slide
  Didn't do any tuning – it "just worked"
– Is more customizable than the other batch systems we evaluated

Migration to HTCondor
Timeline
2012 Aug  – Started evaluating alternatives to Torque/Maui
2013 June – Began testing HTCondor with ATLAS & CMS
2013 Aug  – Choice of HTCondor approved by RAL Tier-1 management
2013 Sept – Declared HTCondor a production service; moved 50% of pledged CPU resources to HTCondor (upgrading WNs to SL6 as well as migrating to HTCondor)
2013 Nov  – Migrated the remaining resources to HTCondor

Experience so far
Experience
– Very stable operation: no crashes or memory leaks
– Job start rate much higher than Torque + Maui, even when throttled
– Staff able to spend time investigating improvements & new features, not just fire-fighting
Problems
– During early testing, job submission hung when one of an HA pair of central managers was switched off
  Quickly fixed & released by the developers
– Experienced jobs dying 2 hours after a network break between CEs and WNs
  Quickly fixed & released
– Found a problem affecting HTCondor-G job submission to an ARC CE with an HTCondor batch system
  Quickly fixed & released
– i.e. support from the HTCondor developers has been very good
– Self-inflicted problems
  Accidentally had Quattor configured to restart HTCondor on worker nodes whenever the configuration changed, rather than running "condor_reconfig" – jobs were killed

HTCondor
Somewhat different to Torque, SLURM, Grid Engine & LSF
– Jobs state their needs (Requirements) & preferences (Rank)
– Machines state their needs (Requirements) & preferences (Rank)
– Both are represented as ClassAds
  AttributeName = Value
  AttributeName = Expression
– The negotiator (match-maker) matches job ClassAds with machine ClassAds, taking into account:
  Requirements of the machine & the job
  Rank of the machine & the job
  Priorities
Daemons
– Startd: represents resource providers (machines)
– Schedd: represents resource consumers (jobs)
– Collector: stores job & machine ClassAds
– Negotiator: matches jobs to machines
– Master: watches over the other daemons & restarts them if necessary
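A minimal submit description file showing how a job states Requirements & Rank (a sketch – the executable and values are hypothetical; the syntax is standard HTCondor):

# job.sub
universe       = vanilla
executable     = analysis.sh
request_cpus   = 1
request_memory = 2000
# Only match machines satisfying this expression
requirements   = (OpSysAndVer == "SL6")
# Among matching machines, prefer those with more memory
rank           = Memory
queue 1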

HTCondor architecture
[Diagram: CEs run a condor_schedd; worker nodes run a condor_startd; the central manager runs the condor_collector & condor_negotiator]

Initial install
HTCondor is not difficult, but it's very configurable
– "HTCondor... there's a knob for that"
– Gives people the impression that it's complicated?
The most basic install & configuration is trivial
– Installs the collector, negotiator, schedd and startd on the same machine
# yum install condor
# service condor start
You can immediately run jobs:
-bash-4.1$ condor_submit condor.sub
Submitting job(s).
1 job(s) submitted to cluster 1.
-bash-4.1$ condor_q
-- Submitter: lcg0732.gridpp.rl.ac.uk : <…> : lcg0732.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED   RUN_TIME    ST PRI SIZE CMD
 1.0     alahiff  3/4  10:…   0+00:00:02  R           sleep 60
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Pre-requisites
A shared filesystem is not required
– HTCondor can transfer files between WNs and CEs
  Doesn't use scp, so no ssh configuration is required
  Can use encryption
  Can configure limits on concurrent uploads, downloads, file sizes, etc. (see the sketch below)
Resources
– For "small" sites (including the RAL Tier-1) the central manager uses very little CPU & memory
– CEs need lots of memory
  There is a condor_shadow process for each running job, requiring ~1 MB per slot
  Using the 32-bit HTCondor rpms on the CEs (only) reduces the memory requirements
– At RAL: 5 CEs, with a schedd on each
  Each CE has 20 GB RAM
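A sketch of the relevant condor_config knobs for file-transfer limits (the knob names are standard; the values are arbitrary examples, not RAL's settings):

# Limit concurrent file transfers handled by the schedd
MAX_CONCURRENT_UPLOADS   = 10
MAX_CONCURRENT_DOWNLOADS = 10
# Refuse jobs whose input/output sandboxes exceed these sizes (MB)
MAX_TRANSFER_INPUT_MB  = 2000
MAX_TRANSFER_OUTPUT_MB = 2000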

Features
Queues
– HTCondor has no concept of queues in the Torque sense
  We see no reason to have such queues
– Jobs just request their requirements: CPUs, memory, disk, OS, ...
Fairshares
– Hierarchical accounting groups
Highly-available central manager
– Shared filesystem not required; HTCondor replicates the required data
Python API
– Useful for monitoring (see the sketch below)
condor_gangliad
– Generates lots of ganglia metrics for all HTCondor daemons
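A minimal monitoring sketch using the htcondor Python bindings (assumes the bindings are installed and the script runs on a machine in the pool; the constraint and attributes are illustrative):

import htcondor

# Query the local schedd for running jobs (JobStatus == 2) and count them per owner
schedd = htcondor.Schedd()
counts = {}
for ad in schedd.query('JobStatus == 2', ['Owner']):
    counts[ad['Owner']] = counts.get(ad['Owner'], 0) + 1
for owner, n in counts.items():
    print("%s: %d" % (owner, n))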

Job isolation
Features we're using
– Per-job PID namespaces
  Jobs cannot interfere with, or even see, other processes running on the WN, even those owned by the same user
Features we're testing
– CPU affinity (currently being tested on production WNs)
  Ensures jobs can only use the number of cores they requested
– Cgroups
  Makes it impossible for quickly-forking processes to escape control; more accurate memory accounting
  CPU: ensures jobs use their share of cores, or more if others are idle
  Memory: protects both the WN & other jobs from jobs using too much memory
– "Mount under scratch" (bind mounts)
  Each job has its own /tmp & /var/tmp
  Haven't got this working yet with glexec jobs (but it should be possible)
Features we haven't tried (yet)
– Named chroots
(A sketch of the corresponding knobs follows this list.)
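The corresponding condor_config knobs on the worker nodes might look like this (standard knob names; the values are illustrative, not necessarily RAL's configuration):

# Per-job PID namespaces
USE_PID_NAMESPACES = True
# Cgroup-based process tracking & limits (requires a cgroup set up for HTCondor)
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
# Restrict each job to the cores it requested
ENFORCE_CPU_AFFINITY = True
# Give each job a private /tmp and /var/tmp via bind mounts
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp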

WN health-check
Implemented using a startd cron (see the sketch below)
– Can put arbitrary information into ClassAds, including output from scripts
– We put information about problems into the WN ClassAds
  Used to prevent new jobs from starting in the event of problems
  Checks CVMFS, disk, swap, ...
  If there is a problem with the ATLAS CVMFS, only new ATLAS jobs are prevented from starting
– Example query:
# condor_status -constraint 'NODE_STATUS =!= "All_OK" && PartitionableSlot == True' -autoformat Machine NODE_STATUS
lcg0975.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
lcg1120.gridpp.rl.ac.uk Problem: CVMFS for lhcb.cern.ch
lcg1125.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
lcg1127.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
lcg1406.gridpp.rl.ac.uk Problem: CVMFS for alice.cern.ch
lcg1423.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
lcg1426.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
lcg1520.gridpp.rl.ac.uk Problem: CVMFS for atlas.cern.ch
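A sketch of how such a health check can be wired up as a startd cron (the STARTD_CRON knobs are standard; the script path is hypothetical – the script prints ClassAd attributes such as NODE_STATUS to stdout):

# Run the health-check script every 5 minutes and merge its output into the WN ClassAd
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/wn-healthcheck
STARTD_CRON_WNHEALTH_PERIOD = 300s
# Simplified START expression: only accept jobs when the node reports itself healthy
# (RAL's real policy is finer-grained, blocking only the affected VO)
START = $(START) && (NODE_STATUS =?= "All_OK")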

Scalability
Modular architecture helps with scalability
– Can add additional schedds
– Can set up multi-tier collectors
– Can even use multiple negotiators
Simultaneous numbers of running jobs
– Prior to deployment into production: worker nodes configured with a high number of slots each
  Reached 16,000 running jobs without tuning
  – Didn't attempt any more than this at the time
– More recent testing: worker nodes configured with a high number of slots each
  Exceeded 30,000 running jobs successfully
  – Didn't attempt any more than this

Multi-core jobs
WN configuration
– Partitionable slots: WN resources (CPU, memory, disk, ...) are divided up as necessary for jobs
Partitioning of resources
– We're using dynamic allocation for multi-core jobs
– Could statically partition resources if we wanted
condor_defrag daemon
– Finds WNs to drain, drains them, then cancels the draining when necessary
– Works immediately out-of-the-box, but we're tuning it (see the sketch below) to:
  Minimize wasted CPUs
  Ensure the start-up rate of multi-core jobs is adequate
  Maintain the required number of running multi-core jobs
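An illustrative condor_defrag tuning fragment (standard knob names; the values are examples of the kind of tuning described, not RAL's production numbers):

# How often the defrag daemon considers draining more machines
DEFRAG_INTERVAL = 600
# Limit concurrent draining to reduce wasted CPU time
DEFRAG_MAX_CONCURRENT_DRAINING = 4
# Rate at which new drains start, to sustain the multi-core job start-up rate
DEFRAG_DRAINING_MACHINES_PER_HOUR = 2.0
# Stop draining once enough machines have a whole 8-core block free
DEFRAG_MAX_WHOLE_MACHINES = 10
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
# Only consider partitionable multi-core WNs as drain candidates
DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 8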

Dynamic WNs
HTCondor was designed to make use of opportunistic resources
– No hard-wired list of WNs
– No restarting of any services
– No pre-configuration of potential WNs
  WNs advertise themselves to the collector
  With the appropriate security permissions, they can join the pool and run jobs
Dynamic provisioning of virtual WNs
– Can use the existing power-management functionality in HTCondor
– condor_rooster
  Designed to wake up powered-down physical WNs as needed
  Can be configured to run a command that instantiates a VM instead (see the sketch below)
– Easy to configure HTCondor on virtual WNs to drain and then shut down the VM after a certain time
– Tested successfully at RAL (not yet in production); see our CHEP & HEPiX Fall talks
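A sketch of repurposing condor_rooster for VM provisioning (the ROOSTER_* knobs are standard; the wake-up script is hypothetical – in the physical-WN case it would send a wake-on-LAN packet, here it instantiates a VM):

# How often rooster checks for demand
ROOSTER_INTERVAL = 300
# Which offline machine ads are candidates for "waking" (simplified expression)
ROOSTER_UNHIBERNATE = Offline =?= True
# Instead of waking hardware, launch a virtual worker node
ROOSTER_WAKEUP_CMD = "/usr/local/bin/instantiate-vm-wn"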

Other features
Some interesting features
– Job hooks
  Example: WNs pulling work, rather than work being pushed to WNs
– Flocking
  Redirect idle jobs to a different HTCondor pool (see the sketch below)
– Job router
  Transforms jobs from one type to another
  Can redirect idle jobs to:
  – Other batch systems (HTCondor, PBS, LSF, Grid Engine)
  – Other grid sites (ARC, CREAM, ...)
  – Clouds (EC2)
– Jobs can be virtual machines
  HTCondor can manage virtual machines (VMware, Xen, KVM)
– HTCondor-CE
  A CE which is a standard HTCondor installation with a different config
  OSG is starting to phase this in
Upcoming
– Live job I/O statistics
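A minimal flocking sketch (the knobs are standard; the host names are hypothetical):

# On the submit machine: where to send jobs that cannot be matched locally
FLOCK_TO = cm.otherpool.example.com
# On the other pool's central manager: who is allowed to flock in
FLOCK_FROM = ce01.example.com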

ARC CEs

HTCondor & grid middleware
EMI-3 CREAM CE
– HTCondor is not officially supported
  BLAH supports HTCondor, so job submission works!
– HTCondor support in YAIM doesn't exist in EMI-3
  We modified the appropriate YAIM function so that the blah configuration file is generated correctly
– The script for publishing dynamic information doesn't exist in EMI-3
  We wrote our own, based on the scripts in old CREAM CEs
  Updated it to support partitionable slots
– An APEL parser for HTCondor doesn't exist in EMI-3
  We wrote a script which writes PBS-style accounting records from condor history files, which are then read by the PBS APEL parser (see the sketch below)
– Relatively straightforward to get an EMI-3 CREAM CE working
  We will make our scripts available to the community
  – Should eventually be in EMI
  Milan Tier-2 was also helpful
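A sketch of the idea behind such an accounting script (condor_history and its -constraint/-format options are standard; turning the output into PBS-style records is site-specific):

# Extract owner, wallclock & CPU time for completed jobs from the condor history
condor_history -constraint 'JobStatus == 4' \
    -format "%s|" Owner \
    -format "%v|" RemoteWallClockTime \
    -format "%v\n" RemoteUserCpu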

ARC CEs
– Successfully being used by some ATLAS & CMS Tier-2s outside NorduGrid (with SLURM, Grid Engine, ...)
  Including LRZ-LMU (ATLAS) and the Estonia Tier-2 (CMS)
  Glasgow and Imperial were the first UK sites to try ARC CEs
– The LHC VOs and ARC CEs
  ATLAS and CMS: fine
  – At RAL, ATLAS & CMS only have access to ARC CEs
  LHCb: added the ability to DIRAC to submit to ARC CEs
  – Not yet at the point of LHCb using only ARC CEs
  ALICE can't use them yet, but will work on this

Benefits
Benefits of ARC CEs
– Support HTCondor better than CREAM CEs do
  Especially in the upcoming new release
– Simpler than CREAM CEs
  Just a single config file: /etc/arc.conf (see the sketch below)
  No YAIM
  No Tomcat
  No MySQL
  – Information about jobs is currently stored in files
– The ARC CE accounting publisher (JURA) can send accounting records directly to APEL using SSM
  An APEL publisher node is not required
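A minimal illustrative /etc/arc.conf fragment for an HTCondor backend (section and option names follow the ARC conventions of the time; the host name and paths are hypothetical examples):

[common]
hostname="arc-ce01.example.com"
lrms="condor"

[grid-manager]
controldir="/var/spool/arc/jobstatus"
sessiondir="/var/spool/arc/grid"

[queue/grid]
name="grid"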

ARC CE architecture
GridFTP + LDAP (pre-Web-service) interfaces
– Used by HTCondor-G (e.g. ATLAS & CMS pilot factories, gLite-WMS)
[Diagram: GridFTP/LDAP ARC CE architecture]

ARC CE architecture
Web service interface
– Works with RFC proxies only; not used by HTCondor-G
[Diagram: Web-service ARC CE architecture]

HTCondor integration
EMI-3
– 3.0.3 ARC rpms
– A number of modifications required to the batch system backend scripts
  Main issue: the scripts were written before HTCondor had partitionable slots
UMD-3
– 4.0.0 ARC rpms
– Better
NorduGrid svn trunk
– Contains 10 patches to the HTCondor backend scripts submitted by RAL
  We have write access to the NorduGrid subversion repository
– There will be a new official release soon
  As of 19th March, a release candidate is almost ready
Future
– A new HTCondor job submission script written from scratch (RAL)
– ...

Experience so far
No major problems
– We have generally been able to just leave the ARC CEs alone
Issues
– October: the script checking the status of jobs started taking so long that it could no longer keep up
  Optimizations were made (now in the nightly builds)
  The problem hasn't happened again since
– GGUS ticket open about DefaultSE in the VO views
  Should be easy to fix

Summary
Due to scalability problems with Torque + Maui, we migrated to HTCondor
We are happy with the choice we made based on our requirements
– Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
We have both ARC & CREAM CEs working with HTCondor
– Relatively easy to get working
– Aim to phase out the CREAM CEs

Questions? Comments?