Integrating HTCondor with ARC
Andrew Lahiff, STFC Rutherford Appleton Laboratory
HTCondor/ARC CE Workshop, Barcelona
ARC CE architecture (simplified)
[Diagram: client tools talk to the A-REX job interface and a GridFTP server for info, jobs & files; A-REX drives the LRMS job management scripts, and infoprovider scripts feed the LDAP + BDII information system. A-REX = ARC Resource-coupled EXecution service.]
LRMS scripts
LRMS job management scripts
– submit-LRMS-job: does any required preparation (e.g. creating the job submission script) and submits jobs to the batch system
– scan-LRMS-job: queries the batch system about running, idle & completed jobs, and extracts information about completed jobs
– cancel-LRMS-job: kills jobs when requested by users
Information system
– LRMS.pm: generates information about the cluster as a whole
LRMS scripts
In /usr/share/arc:
– submit: submit-boinc-job, submit-condor-job, submit-DGBridge-job, submit-fork-job, submit-ll-job, submit-lsf-job, submit-pbs-job, submit-sge-job, submit-SLURM-job
– cancel: cancel-boinc-job, cancel-condor-job, cancel-DGBridge-job, cancel-fork-job, cancel-ll-job, cancel-lsf-job, cancel-pbs-job, cancel-sge-job, cancel-SLURM-job
– scan: scan-boinc-job, scan-condor-job, scan-DGBridge-job, scan-fork-job, scan-ll-job, scan-lsf-job, scan-pbs-job, scan-sge-job, scan-SLURM-job
– info: Boinc.pm, Condor.pm, DGBridge.pm, LL.pm, LSF.pm, PBS.pm, SGE.pm, SGEmod.pm, SLURM.pm, SLURMmod.pm
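On a CE configured for HTCondor, the relevant backend files can be listed directly (output as expected when the standard ARC packages are installed):

    $ ls /usr/share/arc | grep -i condor
    cancel-condor-job
    Condor.pm
    scan-condor-job
    submit-condor-job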
ARC CE + HTCondor history
Some main points
– before 4.1.0: partitionable slots not supported
– 4.1.0 [2014]: partitionable slots supported, lots of fixes
– 5.0.0 [2015]: scan-condor-job can read directly from per-job history files; job priority passed to HTCondor
Configuration
The LRMS is specified in /etc/arc.conf:

    [common]
    ...
    lrms="condor"
    ...

ARC will submit jobs to the local schedd
Other HTCondor-related configuration
– condor_requirements: sets the requirements expression for jobs submitted to each queue; we use:
    (Opsys == "linux") && (OpSysAndVer == "SL6")
– condor_rank: sets the rank expression for jobs
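Putting this together, a minimal sketch of the arc.conf fragment (ini-style syntax; the queue name "grid" is an example, not from the talk):

    [common]
    lrms="condor"

    [queue/grid]
    name="grid"
    # Added to the requirements of every job submitted to this queue
    condor_requirements="(Opsys == "linux") && (OpSysAndVer == "SL6")"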
Queues
At RAL we've only ever had one queue per CE with HTCondor + ARC
– jobs need to request the number of cores, memory, CPU time & wall time they require
Can specify an HTCondor requirements expression for each queue
– this gets added to jobs submitted to the queue
The default requested memory can be taken from the queue configuration (see the sketch below)
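A sketch of the per-queue default memory option (the value is an example, not from the talk):

    [queue/grid]
    name="grid"
    # Used when a job's description doesn't request memory (MB; value invented)
    defaultmemory="3000"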
Job parameters passed to HTCondor
Parameters
– job description
– number of cores
– memory
– job priority
Limits (added to each job's periodic remove expression)
– memory limit (same as the requested memory)
– wall clock time limit
– CPU time limit
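To make the mapping concrete, here is a hypothetical fragment of the submit description that submit-condor-job might generate; all values are invented and the real script's output differs in detail:

    # Sketch only: standard submit keywords plus custom +Attributes
    request_cpus = 8
    request_memory = 16384           # MB
    priority = -38                   # becomes JobPrio in the job ClassAd
    +JobCpuLimit = 576000            # seconds
    +JobTimeLimit = 345600           # seconds
    +JobMemoryLimit = 16777216       # KB; same as the requested memory
    periodic_remove = (RemoteUserCpu + RemoteSysCpu > JobCpuLimit) || \
                      (RemoteWallClockTime > JobTimeLimit) || \
                      (ResidentSetSize > JobMemoryLimit)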
Job parameters passed to HTCondor
Example ATLAS job (fragment of the job ClassAd):

    ...
    JobDescription = "mc15_13TeV_3618"
    RequestCpus = 8
    RequestMemory = ...
    JobPrio = -38
    JobCpuLimit = ...
    JobTimeLimit = ...
    JobMemoryLimit = ...
    PeriodicRemove = false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit
    ...
Something you may want to hack
/usr/share/arc/submit-condor-job
– we've commented out this line:
    REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"
– because we use cgroups with a soft memory limit, and our SYSTEM_PERIODIC_REMOVE expression already causes jobs using > 3x the requested memory to be removed
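For reference, a sketch of what such an expression could look like, assuming ResidentSetSize is in KB and RequestMemory in MB (HTCondor's usual units); the exact expression used at RAL isn't shown here:

    # condor_config on the schedds: remove jobs whose resident set size
    # exceeds 3x the requested memory
    SYSTEM_PERIODIC_REMOVE = (ResidentSetSize > 3 * 1024 * RequestMemory)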
Accounting
An APEL publisher node is not required
– the JURA component of ARC sends accounting data directly to the APEL central broker
Scaling factors
– unlike some other batch systems (e.g. Torque), HTCondor doesn't scale wall & CPU time
– this is good! Limits on normalized time always confuse people
Accounting
The worker node HTCondor configuration contains, e.g.:

    ScalingFactor = 2.67
    STARTD_ATTRS = $(STARTD_ATTRS) ScalingFactor

On the CEs:

    MachineScalingFactor = "$$([ScalingFactor])"
    SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineScalingFactor

ClassAds for completed jobs then contain:

    MATCH_EXP_MachineScalingFactor = "2.670000000000000E+00"

An ARC auth plugin runs for jobs entering the FINISHING state
– it scales CPU & wall time appropriately before JURA sends the accounting data to APEL
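A sketch of how such a plugin can be hooked in via arc.conf's authplugin mechanism (the plugin path is hypothetical; %I is substituted with the job ID):

    [grid-manager]
    # Run a script when a job enters the FINISHING state; the script (not shown)
    # would multiply the recorded CPU & wall times by the machine's scaling factor
    authplugin="FINISHING timeout=60,onfailure=pass /usr/local/bin/scale-accounting %I"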
Useful files
<control dir>/job.<ID>.errors
– dump of the executable submitted to HTCondor
– stdout/err from condor_submit
– for completed jobs: the full job ClassAd + extracted information
<session dir>/<job ID>/log
– the HTCondor job user log
/var/log/arc/gm-jobs.log
– log of jobs starting & finishing
– includes the grid job ID, local HTCondor job ID, uid & DN
Monitoring
Nagios checks & metrics for consistency between the numbers of running/idle jobs reported by ARC & HTCondor
– inconsistencies are a sign of problems, e.g. scan-condor-job not being able to keep up
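A minimal sketch of such a consistency check (the gm-jobs parsing and the threshold are assumptions, not from the talk):

    #!/bin/bash
    # Compare job counts between HTCondor and A-REX
    condor_jobs=$(condor_q -format '%d\n' ProcId | wc -l)
    arc_inlrms=$(gm-jobs 2>/dev/null | grep -c 'INLRMS')
    echo "jobs in the LRMS: HTCondor=$condor_jobs ARC=$arc_inlrms"
    diff=$((condor_jobs - arc_inlrms))
    # Alert if the two views diverge too much (threshold invented)
    [ "${diff#-}" -gt 50 ] && echo "WARNING: ARC/HTCondor job count mismatch"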
ARC CEs around the world
From the information system:
– 73 ARC CEs in total
– 26 use HTCondor; the rest are mostly SLURM and PBS/Torque