Derek Wright Computer Sciences Department University of Wisconsin-Madison New Ways to Fetch Work The new hook infrastructure in Condor.


1 New Ways to Fetch Work
The new hook infrastructure in Condor 7.1.*
Derek Wright, Computer Sciences Department, University of Wisconsin-Madison
wright@cs.wisc.edu

2 CondorProject.org What’s the problem?
› Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system:
  Specialized scheduling needs
  Jobs live in their own database or in storage other than a Condor job queue

3 Fetch vs. push
› Instead of trying to get these jobs into a condor_schedd, or trying to push them to the condor_startd, just get the condor_startd to fetch (pull) the work
  Lower latency: avoids the overhead of matchmaking and the schedd
  Fetching only requires an outbound network connection, which makes life easier if you “glide-in” behind a firewall

4 What’s the dumb solution?
› Put code directly into the condor_startd that can talk directly to the other scheduling system(s)
  We’d have to support other protocols
  We’d have to link even more libraries and dependencies into our code
  Very inflexible

5 Another dumb solution…
› “Make it a web service!”
› Mostly the same problems:
  What protocol?
  What format to describe the jobs?
  Add a dependency on libcurl?
› What if I don’t want a webserver to be handling my jobs?
› Security? Authentication? Privacy?

6 Our solution (hopefully not dumb)
› Make a system of “hooks” that you can plug into:
  A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
  The hook invocation points have to be hard-coded into Condor, but then anyone can implement their own hooks to do what they want

7 Why isn’t that dumb?
› All the logic, code, libraries, etc., to fetch jobs from any given system live completely outside of the Condor source and binaries
› New hooks can be installed without a new version of Condor
› No new library dependencies for us
› Hooks are written by people who know what they’re doing…

8 How does Condor communicate with hooks?
› Passing around ASCII ClassAds via standard input and standard output
› Some hooks get control data via a command-line argument (argv)
› Hooks can be written in any language (scripts, binaries, whatever you want) so long as you can read STDIN and write STDOUT
› Decades of UNIX wisdom can’t be wrong!
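The stdin/stdout exchange above can be sketched in a few lines of Python. This is only a minimal illustration, assuming each ClassAd arrives as one `Attr = expression` pair per line; the helper names `parse_classad` and `format_classad` are hypothetical, and values are kept as raw expression strings rather than evaluated.

```python
def parse_classad(text):
    """Parse 'Attr = expr' lines into a dict of raw expression strings."""
    ad = {}
    for line in text.splitlines():
        # Split on the first '=' only, so '==' inside an expression survives.
        name, sep, expr = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = expr.strip()
    return ad


def format_classad(ad):
    """Serialize a dict back into one 'Attr = expr' line per attribute."""
    return "".join(f"{name} = {expr}\n" for name, expr in ad.items())


if __name__ == "__main__":
    # Illustrative slot ad; the attribute values are made up.
    sample = 'Name = "slot1@example"\nMemory = 1024\n'
    ad = parse_classad(sample)
    print(ad["Memory"])
    print(format_classad(ad), end="")
```

A real hook would read the text from `sys.stdin` instead of a literal string; a full ClassAd parser would also need to handle expressions that this line-based sketch does not.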

9 What hooks are available?
› Hooks for fetching work (condor_startd):
  FETCH_WORK
  REPLY_FETCH
  EVICT_CLAIM
› Hooks for running jobs (condor_starter):
  PREPARE_JOB
  UPDATE_JOB_INFO
  JOB_EXIT

10 HOOK_FETCH_WORK
› Invoked by the startd whenever it wants to try to fetch new work
  How often is controlled by the FetchWorkDelay expression
› Hook gets a current copy of the slot ClassAd
› Hook prints the job ClassAd to STDOUT
› If STDOUT is empty, there’s no work
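A fetch hook along these lines might look like the sketch below. Everything specific here is an assumption for illustration: `PENDING_JOBS` stands in for the external scheduling system, the memory threshold is a made-up matching rule, and the job attributes shown (Cmd, Owner, JobUniverse, Iwd) are plausible examples rather than a confirmed list of what the starter requires.

```python
import sys

# Hypothetical stand-in for the external system's job store:
# (minimum slot memory in MB, job ClassAd attributes as raw expressions).
PENDING_JOBS = [
    (512, {"Cmd": '"/bin/echo"', "Arguments": '"hello"',
           "Owner": '"nobody"', "JobUniverse": "5", "Iwd": '"/tmp"'}),
]


def slot_memory(slot_text):
    """Return the slot's Memory attribute as an int, or 0 if absent or unparsable."""
    for line in slot_text.splitlines():
        name, sep, expr = line.partition("=")
        if sep and name.strip() == "Memory":
            try:
                return int(expr.strip())
            except ValueError:
                return 0
    return 0


def fetch_work(slot_text, pending):
    """Return a job ClassAd as text, or '' to signal that there is no work."""
    memory = slot_memory(slot_text)
    for needed, job_ad in pending:
        if needed <= memory:
            return "".join(f"{k} = {v}\n" for k, v in job_ad.items())
    return ""


if __name__ == "__main__":
    # The slot ClassAd arrives on stdin; the job ClassAd (if any) goes to stdout.
    sys.stdout.write(fetch_work(sys.stdin.read(), PENDING_JOBS))
```

Printing nothing is the whole "no work" protocol, so the hook needs no special exit status or message for the empty case.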

11 HOOK_REPLY_FETCH
› Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK
› Gives your external system a chance to know what happened
› argv[1]: “accept” or “reject”
› Gets a copy of slot and job ClassAds
› Condor ignores all output
› Optional hook
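As a rough sketch of a reply hook, the script below reads the decision from argv[1] and both ClassAds from stdin. Two things are assumptions, not facts from the slides: that the two ads are separated by a line of five dashes (check the Job Hooks manual section for the actual delimiter and ordering), and that the fetch hook previously added a custom `ExternalJobId` attribute to the job ad.

```python
import sys

SEPARATOR = "-----"  # assumed delimiter between the two ClassAds


def split_ads(text):
    """Split stdin text into a list of ClassAd dicts, one per delimited section."""
    ads, current = [], {}
    for line in text.splitlines():
        if line.strip() == SEPARATOR:
            ads.append(current)
            current = {}
            continue
        name, sep, expr = line.partition("=")
        if sep and name.strip():
            current[name.strip()] = expr.strip()
    ads.append(current)
    return ads


def handle_reply(decision, text):
    """Map the startd's decision to a (job id, new state) pair for the external system."""
    if decision not in ("accept", "reject"):
        raise ValueError(f"unexpected decision: {decision!r}")
    job_id = None
    # Search every ad so the sketch does not depend on which ad comes first.
    for ad in split_ads(text):
        if "ExternalJobId" in ad:  # hypothetical attribute
            job_id = ad["ExternalJobId"]
            break
    status = "running" if decision == "accept" else "pending"
    return job_id, status


if __name__ == "__main__":
    if len(sys.argv) >= 2:
        job_id, status = handle_reply(sys.argv[1], sys.stdin.read())
        # Condor ignores the hook's output; stderr is used here only for local logging.
        print(f"job {job_id} is now {status}", file=sys.stderr)
```

In practice the last step would update the external database rather than print, so that a rejected job becomes eligible to be handed out again.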

12 HOOK_EVICT_CLAIM
› Invoked if the startd has to evict a claim that’s running fetched work
› Informational only: you can’t stop or delay this train once it’s left the station
› STDIN: both slot and job ClassAds
› STDOUT: > /dev/null

13 HOOK_PREPARE_JOB
› Invoked by the condor_starter when it first starts up (only if defined)
› Opportunity to prepare the job execution environment
  Transfer input files, executables, etc.
› INPUT: both slot and job ClassAds
› OUTPUT: ignored, but starter won’t continue until this hook exits
› Not specific to fetched work

14 HOOK_UPDATE_JOB_INFO
› Periodically invoked by the starter to let you know what’s happening with the job
› INPUT: both ClassAds
  Job ClassAd is updated with additional attributes computed by the starter: ImageSize, JobState, RemoteUserCpu, etc.
› OUTPUT: ignored

15 HOOK_JOB_EXIT
› Invoked by the starter whenever the job exits for any reason
› argv[1] indicates what happened:
  “exit”: Died a natural death
  “evict”: Booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.)
  “remove”: Removed by condor_rm
  “hold”: Held by condor_hold

16 HOOK_JOB_EXIT …
› “HUH!?! condor_rm? What are you talking about?”
  The starter hooks can be defined even for regular Condor jobs*, local universe, etc.
› INPUT: copy of the job ClassAd with extra attributes about what happened: ExitCode, JobDuration, etc.
› OUTPUT: ignored
* Except for dumb exceptions… the schedd doesn’t distinguish rm vs. hold when telling the starter to go away (yet). Argh!
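A job-exit hook mostly dispatches on argv[1] and picks a few attributes out of the job ClassAd. The sketch below assumes one `Attr = expression` pair per line on stdin; the `NEXT_STATE` mapping and the state names in it are hypothetical, chosen only to show how an external system might react to each of the four reasons.

```python
import sys

# Hypothetical mapping from the starter's exit reason to the external system's job state.
NEXT_STATE = {
    "exit": "done",
    "evict": "requeue",
    "remove": "cancelled",
    "hold": "held",
}


def summarize_exit(reason, job_text):
    """Combine the exit reason (argv[1]) with attributes from the job ClassAd."""
    if reason not in NEXT_STATE:
        raise ValueError(f"unexpected exit reason: {reason!r}")
    ad = {}
    for line in job_text.splitlines():
        name, sep, expr = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = expr.strip()
    return {
        "state": NEXT_STATE[reason],
        "exit_code": ad.get("ExitCode"),    # raw expression string, or None if absent
        "duration": ad.get("JobDuration"),
    }


if __name__ == "__main__":
    if len(sys.argv) >= 2:
        summary = summarize_exit(sys.argv[1], sys.stdin.read())
        # Output is ignored by Condor; stderr is used here only for local logging.
        print(summary, file=sys.stderr)
```

Given the footnote above, a hook like this should not rely on telling "remove" and "hold" apart for jobs that came through the schedd.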

17 Defining hooks
› Each slot can have its own hook “keyword”
  Prefix for config file parameters
  Can use different sets of hooks to talk to different external systems on each slot
  Global keyword used when the per-slot keyword is not defined
› Keyword is inserted by the startd into its copy of the job ClassAd and given to the starter

18 Defining hooks: example
  # Most slots fetch work from the database system
  STARTD_JOB_HOOK_KEYWORD = DB
  # Slot4 fetches and runs work from a web service
  SLOT4_JOB_HOOK_KEYWORD = WEB
  # The database system needs to both provide work and
  # know the reply for each attempted claim
  DB_DIR = /usr/local/condor/fetch/db
  DB_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
  DB_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
  # The web system only needs to fetch work
  WEB_DIR = /usr/local/condor/fetch/web
  WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php
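The keyword lookup described on the two slides above (per-slot keyword first, global keyword as fallback, keyword used as the parameter prefix) can be modeled with a small function. This is an illustration of the lookup rule only, not Condor's code: `resolve_hook` is a hypothetical name, the config is a plain dict, and `$(MACRO)` expansion is not modeled, so the paths are written out in full.

```python
def resolve_hook(config, slot_id, hook_name):
    """Return the hook path configured for a slot, or None if no hook applies."""
    keyword = config.get(
        f"SLOT{slot_id}_JOB_HOOK_KEYWORD",      # per-slot keyword wins
        config.get("STARTD_JOB_HOOK_KEYWORD"),  # otherwise the global keyword
    )
    if keyword is None:
        return None
    # The keyword is the prefix of the hook parameter name.
    return config.get(f"{keyword}_HOOK_{hook_name}")


if __name__ == "__main__":
    config = {
        "STARTD_JOB_HOOK_KEYWORD": "DB",
        "SLOT4_JOB_HOOK_KEYWORD": "WEB",
        "DB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/db/fetch_work.php",
        "DB_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/db/reply_fetch.php",
        "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
    }
    print(resolve_hook(config, 1, "FETCH_WORK"))
    print(resolve_hook(config, 4, "FETCH_WORK"))
    print(resolve_hook(config, 4, "REPLY_FETCH"))
```

With this configuration, slot 4 has a fetch hook but no reply hook, which matches the comment that the web system only needs to fetch work.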

19 Semantics of fetched jobs
› condor_startd treats them just like any other kind of job:
  All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc.)
  Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc.
  Fetched work != Backfill

20 Semantics continued
› If the startd is unclaimed and fetches a job, a claim is created
› If that job completes, the claim is reused and the startd fetches again
› Keep fetching until either:
  The claim is evicted by Condor
  The fetch hook returns no more work

21 Limitations for fetched jobs
› No schedd/shadow means no “standard universe” for checkpointing, migration, and remote system calls
  Could use stand-alone checkpointing
  Application-specific checkpointing
› Other features that are unavailable:
  User policy expressions (e.g. periodic hold)
  No DAGMan (you’re on your own)
  …

22 Limitations of the hooks
› If the starter can’t run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it
  We need a HOOK_STARTER_FAILURE
› No hook when the starter is about to evict you (so you can checkpoint)
  Can implement this yourself with a wrapper script and the SoftKillSig attribute
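The wrapper-script workaround could be sketched roughly as follows. This assumes a Unix host, that the signal the job is configured to receive on soft kill is SIGTERM, and that the wrapped application checkpoints and exits when it receives SIGUSR1; all three are illustrative choices, and `run_wrapped` is a hypothetical helper.

```python
import signal
import subprocess
import sys


def run_wrapped(cmd, soft_kill=signal.SIGTERM, checkpoint_sig=signal.SIGUSR1):
    """Run cmd; on the soft-kill signal, ask the child to checkpoint.

    Returns (child exit code, whether a soft kill was seen).
    """
    state = {"child": None, "evicted": False}

    def on_soft_kill(signum, frame):
        state["evicted"] = True
        child = state["child"]
        if child is not None and child.poll() is None:
            # Assumed convention: the application checkpoints on this signal.
            child.send_signal(checkpoint_sig)

    # Install the handler before starting the child so no signal is missed.
    previous = signal.signal(soft_kill, on_soft_kill)
    try:
        state["child"] = subprocess.Popen(cmd)
        code = state["child"].wait()
    finally:
        signal.signal(soft_kill, previous)
    return code, state["evicted"]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        exit_code, _ = run_wrapped(sys.argv[1:])
        # A negative code means the child was killed by a signal.
        sys.exit(exit_code if exit_code >= 0 else 1)
```

The job ClassAd's Cmd would then point at the wrapper, with the real executable passed as its arguments.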

23 More information
› New section in the Condor 7.1 manual:
  Chapter 4: Miscellaneous Concepts
  4.4: Job Hooks
› http://www.cs.wisc.edu/condor/manual/v7.1/4_4Job_Hooks.html
› Any questions?

