Grid job submission using HTCondor
Andrew Lahiff
Overview
Introduction
Examples
–Direct submission
–Job Router
–Match making
Introduction
You don’t need EMI WMS or DIRAC to submit jobs to the grid
HTCondor can submit to:
–CEs (ARC, CREAM, UNICORE, Globus, …)
–Batch systems (HTCondor, PBS, LSF, SGE)
–Clouds (EC2, Google Compute Engine)
–BOINC
Grid manager daemon manages grid jobs
–Started automatically by the schedd, one instance per user
–Grid ASCII Helper Protocol (GAHP) server
   Runs under the grid manager
   Communicates with the remote service
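The backend is selected by the grid_resource line of a grid universe submit description. A minimal sketch of the syntax for a few backend types (host names and URLs here are illustrative, not taken from the talk; the exact form each backend expects is documented in the HTCondor manual):

universe = grid

# ARC CE (the type used in the examples that follow)
grid_resource = nordugrid arc-ce.example.ac.uk

# CREAM CE: service URL, batch system, queue
# grid_resource = cream https://cream-ce.example.ac.uk:8443/ce-cream/services/CREAM2 pbs grid

# Remote HTCondor pool (Condor-C): remote schedd and its central manager
# grid_resource = condor schedd.example.ac.uk pool.example.ac.uk

# Amazon EC2
# grid_resource = ec2 https://ec2.amazonaws.com/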
Introduction
This is not new
–People were doing this 10 years ago
–Europe moved on to Torque/Maui, LCG CE, gLite WMS, …
–ATLAS & CMS pilot submission is done using HTCondor-G
Features
–Input & output files can be transferred as necessary
–Can monitor jobs using standard HTCondor commands, e.g. condor_q
–Can use MyProxy for proxy renewal
–Very configurable
   Policies for dealing with failures, retries, etc.
   Resubmission of jobs to other sites under specified conditions, e.g. idle for too long
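As a sketch of the last two points, the submit description can carry MyProxy settings and periodic policy expressions; the server name, thresholds, and attribute choices below are illustrative, not from the talk:

# Renew the proxy via a MyProxy server (illustrative host name)
MyProxyHost = myproxy.example.ac.uk:7512

# Put a job on hold if it has been idle in the queue for more than 24 hours
periodic_hold = (JobStatus == 1) && ((time() - EnteredCurrentStatus) > 24*3600)

# Release held jobs after 20 minutes so they can be tried again
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 20*60)

Expressions like these only hold and release the job; moving it to a different CE is what the job router (later in this talk) is for.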
Direct submission
Simplest method – specify directly the CE you want your jobs to run on
Works with the “out-of-the-box” HTCondor configuration
–i.e. after “yum install condor; service condor start”
Example 1
-bash-4.1$ cat grid3.sub
universe = grid
grid_resource = nordugrid t2arc01.physics.ox.ac.uk
executable = test.sh
Log=log.$(Cluster).$(Process)
Output=out.$(Cluster).$(Process)
Error=err.$(Cluster).$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
x509userproxy=/tmp/x509up_u33133
queue 1

-bash-4.1$ condor_submit grid3.sub
Submitting job(s).
1 job(s) submitted to cluster
Example 1
Check job status:

-bash-4.1$ condor_q
-- Submitter: lcgui03.gridpp.rl.ac.uk : : lcgui03.gridpp.rl.ac.uk
 ID      OWNER     SUBMITTED   RUN_TIME   ST PRI SIZE CMD
         alahiff   5/30 15:    :00:00     I           test.sh
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

-bash-4.1$ condor_q -grid
-- Submitter: lcgui03.gridpp.rl.ac.uk : : lcgui03.gridpp.rl.ac.uk
 ID      OWNER     STATUS     GRID->MANAGER   HOST           GRID_JOB_ID
         alahiff   INLRMS:R   nordugrid->[?]  t2arc01.phys   N84MDm8BdAknD0VB
Example 1
Log file:

-bash-4.1$ cat log
( ) 05/30 15:42:27 Job submitted from host: ( )
...
( ) 05/30 15:42:36 Job submitted to grid resource
    GridResource: nordugrid t2arc01.physics.ox.ac.uk
    GridJobId: nordugrid t2arc01.physics.ox.ac.uk N84MDm8BdAknD0VBFmzXO77mABFKDmABFKDmpmMKDmABFKDmBX8aZn
...
( ) 05/30 15:57:35 Job executing on host: nordugrid t2arc01.physics.ox.ac.uk
...
Example 2
-bash-4.1$ cat grid3.sub
universe = grid
executable = test.sh
Log=log.$(Cluster).$(Process)
Output=out.$(Cluster).$(Process)
Error=err.$(Cluster).$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
x509userproxy=/tmp/x509up_u33133

grid_resource = nordugrid arc-ce01.gridpp.rl.ac.uk
queue 1

grid_resource = nordugrid heplnv146.pp.rl.ac.uk
queue 1

grid_resource = nordugrid t2arc01.physics.ox.ac.uk
queue 1

grid_resource = cream sge long
queue 1
Example 2
-bash-4.1$ condor_submit grid3.sub
Submitting job(s)....
4 job(s) submitted to cluster

-bash-4.1$ condor_q -grid
-- Submitter: lcgui03.gridpp.rl.ac.uk : : lcgui03.gridpp.rl.ac.uk
 ID      OWNER     STATUS      GRID->MANAGER    HOST           GRID_JOB_ID
         alahiff   INLRMS:Q    nordugrid->[?]   arc-ce01.gri   rLvKDmUsfAknCIXD
         alahiff   INLRMS:R    nordugrid->[?]   heplnv146.pp   uLXMDmQsfAknOGRD
         alahiff   INLRMS:R    nordugrid->[?]   t2arc01.phys   piXMDmQsfAknD0VB
         alahiff   REALLY-RUN  cream->sge/long  cccreamceli    CREAM
Example 2
-bash-4.1$ condor_status -grid
Name                                   NumJobs  Allowed  Wanted  RunningJobs  IdleJobs
cream
nordugrid arc-ce01.gridpp.rl.ac.uk
nordugrid heplnv146.pp.rl.ac.uk
nordugrid t2arc01.physics.ox.ac.uk
Job router
Optional daemon
–Transforms jobs from one type to another according to a configurable policy
–E.g. can transform vanilla jobs to grid jobs
Configuration (a sketch of a route using several of these options follows the basic configuration below)
–Limits on jobs & idle jobs per CE
–What to do with held jobs
–When to kill jobs
–How to detect failed jobs
–Black-hole throttling
–Can specify an arbitrary script which updates routes
   E.g. create routes for all CMS CEs from the BDII
Job router
Most basic configuration: just a list of CEs

JOB_ROUTER_ENTRIES = \
  [ GridResource = "nordugrid arc-ce01.gridpp.rl.ac.uk"; \
    name = "arc-ce01.gridpp.rl.ac.uk"; ] \
  [ GridResource = "nordugrid arc-ce02.gridpp.rl.ac.uk"; \
    name = "arc-ce02.gridpp.rl.ac.uk"; ] \
  [ GridResource = "nordugrid arc-ce03.gridpp.rl.ac.uk"; \
    name = "arc-ce03.gridpp.rl.ac.uk"; ] \
  [ GridResource = "nordugrid arc-ce04.gridpp.rl.ac.uk"; \
    name = "arc-ce04.gridpp.rl.ac.uk"; ]
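A route can also carry the policy knobs listed on the previous slide. A hedged sketch of a single route, using attribute names from the HTCondor JobRouter documentation but with limits and thresholds that are purely illustrative:

JOB_ROUTER_ENTRIES = \
  [ GridResource = "nordugrid arc-ce01.gridpp.rl.ac.uk"; \
    name = "arc-ce01.gridpp.rl.ac.uk"; \
    MaxJobs = 200; \
    MaxIdleJobs = 20; \
    FailureRateThreshold = 0.03; \
    JobFailureTest = other.RemoteWallClockTime < 1800; ]

# A route-generating script can be hooked in instead of a static list
# (script path is hypothetical):
# JOB_ROUTER_ENTRIES_CMD     = /usr/local/bin/routes_from_bdii.sh
# JOB_ROUTER_ENTRIES_REFRESH = 600

MaxJobs and MaxIdleJobs limit routed and idle jobs per CE, FailureRateThreshold is the black-hole throttle, and JobFailureTest defines which completed jobs count as failures.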
Job router
Example – submit 20 jobs
–Checking status: summary by route:

-bash-4.1$ condor_router_q -S
   JOBS ST Route                      GridResource
      5  I arc-ce01.gridpp.rl.ac.uk   nordugrid arc-ce01.gridpp.rl.ac.uk
      1  R arc-ce02.gridpp.rl.ac.uk   nordugrid arc-ce02.gridpp.rl.ac.uk
      4  I arc-ce02.gridpp.rl.ac.uk   nordugrid arc-ce02.gridpp.rl.ac.uk
      1  R arc-ce03.gridpp.rl.ac.uk   nordugrid arc-ce03.gridpp.rl.ac.uk
      4  I arc-ce03.gridpp.rl.ac.uk   nordugrid arc-ce03.gridpp.rl.ac.uk
      1  R arc-ce04.gridpp.rl.ac.uk   nordugrid arc-ce04.gridpp.rl.ac.uk
      4  I arc-ce04.gridpp.rl.ac.uk   nordugrid arc-ce04.gridpp.rl.ac.uk
Job router
–Checking status: individual jobs

-bash-4.1$ condor_q -grid
-- Submitter: lcgvm21.gridpp.rl.ac.uk : : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER     STATUS      GRID->MANAGER   HOST           GRID_JOB_ID
         alahiff   IDLE        nordugrid->[?]  arc-ce03.gri   HL9NDmGVt9jnzEJD
         alahiff   PREPARING   nordugrid->[?]  arc-ce04.gri   18kLDmMVt9jnE6QD
         alahiff   INLRMS:Q    nordugrid->[?]  arc-ce01.gri   5VKKDmHVt9jnCIXD
         alahiff   IDLE        nordugrid->[?]  arc-ce02.gri   oKSLDmHVt9jnc1XD
         alahiff   IDLE        nordugrid->[?]  arc-ce03.gri   gpeKDmHVt9jnzEJD
         alahiff   PREPARING   nordugrid->[?]  arc-ce04.gri   7iFMDmMVt9jnE6QD
         alahiff   INLRMS:Q    nordugrid->[?]  arc-ce01.gri   5fmKDmHVt9jnCIXD
         alahiff   PREPARING   nordugrid->[?]  arc-ce04.gri   KTeMDmMVt9jnE6QD
         alahiff   IDLE        nordugrid->[?]  arc-ce02.gri   7QBLDmHVt9jnc1XD
         alahiff   IDLE        nordugrid->[?]  arc-ce03.gri   sZPKDmHVt9jnzEJD
…
Match making
Add ClassAds corresponding to grid resources
–Jobs can then be matched to them, just like jobs are matched to worker nodes
Could use a simple cron to update from the BDII
–Not everything from the BDII (!), just the basic essentials
   GlueCEUniqueID, GlueHostMainMemoryRAMSize, …
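A minimal sketch of what such a cron job might do: write one ClassAd per CE to a file and push it to the collector with condor_advertise. The attribute selection, the resource_name attribute, and the file name are our own illustrative choices, and a real ad would need a few more attributes before the negotiator will match jobs against it:

-bash-4.1$ cat /tmp/ce-ad.txt
MyType = "Machine"
Name = "t2arc01.physics.ox.ac.uk"
GlueCEUniqueID = "t2arc01.physics.ox.ac.uk:2811/nordugrid-Condor-gridAMD"
GlueHostMainMemoryRAMSize = 4096
resource_name = "t2arc01.physics.ox.ac.uk"

-bash-4.1$ condor_advertise UPDATE_STARTD_AD /tmp/ce-ad.txt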
Match making
Can run queries to see what CEs exist:

$ condor_status -constraint 'regexp("ox",GlueCEUniqueID) == True' -autoformat GlueCEUniqueID
t2arc01.physics.ox.ac.uk:2811/nordugrid-Condor-gridAMD
t2ce02.physics.ox.ac.uk:8443/cream-pbs-shortfive
t2ce04.physics.ox.ac.uk:8443/cream-pbs-shortfive
t2ce06.physics.ox.ac.uk:8443/cream-pbs-shortfive

Can then do things like this to select sites in JDLs:

requirements = RegExp("gridpp.rl.ac.uk", GlueCEUniqueId)

Looks similar to basic EMI WMS functionality, but
–Much simpler
–More transparent
–Easier to debug
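To actually run on whichever CE the negotiator picks, a grid universe job can use $$() substitution, so the matched ad fills in the contact string. A sketch, assuming the advertised ads carry the ARC host name in a resource_name attribute (our own naming, as in the sketch above):

universe      = grid
executable    = test.sh
# only advertised CE ads matching this expression can be matched to the job
requirements  = RegExp("gridpp.rl.ac.uk", GlueCEUniqueID)
# the matched ad's resource_name is substituted here at match time
grid_resource = nordugrid $$(resource_name)
queue 1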
Conclusion
With the eventual decommissioning of EMI WMS, non-LHC VOs need alternatives
Do all non-LHC VOs really make use of the advanced functionality in EMI WMS?
–Some VOs submitting to RAL clearly don’t
–Some others do use it but might not really need to
More users are becoming familiar with HTCondor as
–Sites in the UK are migrating to it
–CERN may migrate to it
Submission to the grid is no different to local submission
Conclusion
Lots of possibilities
–Submission to specific CEs from a UI with HTCondor installed
   Already works today
–Route jobs from the user’s UI (job router, flocking or remote submission) to a central WMS-like service based on HTCondor?
–A service similar to OSG Connect?
   An HTCondor submit UI where people can log in & submit jobs to many resources
–…