New Ways to Fetch Work: The New Hook Infrastructure in Condor
Derek Wright, Computer Sciences Department, University of Wisconsin-Madison


New Ways to Fetch Work: The New Hook Infrastructure in Condor 7.1.*
Derek Wright, Computer Sciences Department, University of Wisconsin-Madison
CondorProject.org

What's the problem?
› Users wanted to take advantage of Condor's resource management daemon (condor_startd) to run jobs, but they had their own scheduling systems:
  They have specialized scheduling needs
  Their jobs live in their own database or in storage other than a Condor job queue

Fetch vs. push
› Instead of trying to get these jobs into a condor_schedd, or trying to push them to the condor_startd, just have the condor_startd fetch (pull) the work:
  Lower latency, since we skip the overhead of matchmaking and the schedd
  Fetching only requires an outbound network connection, which makes life easier if you "glide in" behind a firewall

What's the dumb solution?
› Put code directly into the condor_startd that can talk directly to the other scheduling system(s):
  We'd have to support other protocols
  We'd have to link even more libraries and dependencies into our code
  Very inflexible

Another dumb solution…
› "Make it a web service!"
› Mostly the same problems:
  What protocol? What format to describe the jobs? Add a dependency on libcurl?
› What if I don't want a webserver handling my jobs?
› Security? Authentication? Privacy?

Our solution (hopefully not dumb)
› Make a system of "hooks" that you can plug into:
  A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
  The hook invocation points have to be hard-coded into Condor, but then anyone can implement their own hooks to do what they want

Why isn't that dumb?
› All the logic, code, libraries, etc., to fetch jobs from any given system live completely outside of the Condor source and binaries
› New hooks can be installed without a new version of Condor
› No new library dependencies for us
› Hooks are written by people who know what they're doing…

How does Condor communicate with hooks?
› By passing ASCII ClassAds via standard input and standard output
› Some hooks get control data via a command-line argument (argv)
› Hooks can be written in any language (scripts, binaries, whatever you want), so long as you can read and write STDIN/STDOUT
› Decades of UNIX wisdom can't be wrong!
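To make the I/O convention concrete, here is a minimal sketch of the parsing side of a hook. It is a deliberate simplification: it handles only flat "Attr = Value" lines and ignores nested ads and ClassAd expressions; the slot-ad attributes shown are illustrative.

```python
# Minimal sketch of the hook I/O convention: old-style ASCII ClassAds
# ("Attr = Value" lines) arrive on STDIN and are printed on STDOUT.
# This parser is a simplification that ignores nested ads and expressions.

def parse_classad(text):
    """Parse 'Attr = Value' lines into a dict of raw string values."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        attr, _, value = line.partition("=")
        value = value.strip()
        if value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip quotes from string-valued attributes
        ad[attr.strip()] = value
    return ad

# A real hook would call parse_classad(sys.stdin.read()); here is the idea
# applied to a literal slot ad:
slot_ad = parse_classad('Name = "slot1@example.host"\nCpus = 1\n')
```

A hook written this way needs nothing beyond the standard library, which is exactly the point of the STDIN/STDOUT design.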

What hooks are available?
› Hooks for fetching work (condor_startd):
  FETCH_WORK
  REPLY_FETCH
  EVICT_CLAIM
› Hooks for running jobs (condor_starter):
  PREPARE_JOB
  UPDATE_JOB_INFO
  JOB_EXIT

HOOK_FETCH_WORK
› Invoked by the startd whenever it wants to try to fetch new work, controlled by the FetchWorkDelay expression
› The hook gets a current copy of the slot ClassAd on STDIN
› The hook prints the job ClassAd to STDOUT
› If STDOUT is empty, there's no work
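A fetch-work hook body might look like the sketch below: query the external system for one job and print its ClassAd, or print nothing when the queue is empty. The `next_job()` stand-in and the attribute list are assumptions, not a real schema.

```python
# Hypothetical FETCH_WORK hook body: ask an external queue for one job and
# print its ClassAd to STDOUT; print nothing when there is no work.
import sys

def format_job_ad(job):
    """Render a minimal job ClassAd in the ASCII syntax the startd reads."""
    return "\n".join([
        'Cmd = "%s"' % job["cmd"],
        'Owner = "%s"' % job["owner"],
        'Arguments = "%s"' % job.get("args", ""),
        "JobUniverse = 5",  # vanilla universe
    ])

def next_job():
    # Stand-in for a query against your database or web service.
    return {"cmd": "/bin/sleep", "owner": "nobody", "args": "60"}

if __name__ == "__main__":
    job = next_job()
    if job is not None:
        sys.stdout.write(format_job_ad(job) + "\n")
    # Writing nothing to STDOUT tells the startd the queue is empty.
```

Because an empty STDOUT means "no work", the hook can simply exit without printing when its queue is drained.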

HOOK_REPLY_FETCH
› Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK
› Gives your external system a chance to know what happened
› argv[1]: "accept" or "reject"
› Gets a copy of both the slot and job ClassAds
› Condor ignores all output
› Optional hook
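A reply hook only has to read argv[1] and tell the external system what happened; the status names below ("claimed", "requeued") are made-up examples of what your queue might store.

```python
# Hypothetical REPLY_FETCH hook body: argv[1] tells us whether the startd
# accepted the fetched job. The external queue's status names are made up.
import sys

def record_reply(reply):
    """Map the startd's verdict to a status the external queue would store."""
    return {"accept": "claimed", "reject": "requeued"}.get(reply, "unknown")

if __name__ == "__main__":
    verdict = sys.argv[1] if len(sys.argv) > 1 else ""
    # Condor ignores this hook's output, so send any logging to STDERR.
    sys.stderr.write("fetch reply: %s -> %s\n" % (verdict, record_reply(verdict)))
```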

HOOK_EVICT_CLAIM
› Invoked if the startd has to evict a claim that's running fetched work
› Informational only: you can't stop or delay this train once it's left the station
› STDIN: both the slot and job ClassAds
› STDOUT: > /dev/null

HOOK_PREPARE_JOB
› Invoked by the condor_starter when it first starts up (only if defined)
› An opportunity to prepare the job execution environment:
  Transfer input files, executables, etc.
› INPUT: both the slot and job ClassAds
› OUTPUT: ignored, but the starter won't continue until this hook exits
› Not specific to fetched work
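As one example of "prepare the execution environment", a hook could stage the files named in the job ad's TransferInput attribute (a comma-separated list). The staging-directory logic below is an assumption about what your setup needs.

```python
# Hypothetical PREPARE_JOB hook helper: copy each file named in the job ad's
# TransferInput attribute (comma-separated list) into a staging directory.
import os
import shutil

def stage_inputs(transfer_input, dest_dir):
    """Copy each file in the TransferInput list into dest_dir; return the copies."""
    os.makedirs(dest_dir, exist_ok=True)
    staged = []
    for path in (p.strip() for p in transfer_input.split(",")):
        if not path:
            continue
        target = os.path.join(dest_dir, os.path.basename(path))
        shutil.copy(path, target)
        staged.append(target)
    return staged
```

Since the starter blocks until this hook exits, anything staged here is guaranteed to be in place before the job runs.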

HOOK_UPDATE_JOB_INFO
› Periodically invoked by the starter to let you know what's happening with the job
› INPUT: both ClassAds
  The job ClassAd is updated with additional attributes computed by the starter: ImageSize, JobState, RemoteUserCpu, etc.
› OUTPUT: ignored

HOOK_JOB_EXIT
› Invoked by the starter whenever the job exits for any reason
› argv[1] indicates what happened:
  "exit": died a natural death
  "evict": booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.)
  "remove": removed by condor_rm
  "hold": held by condor_hold

HOOK_JOB_EXIT …
› "Huh?! condor_rm? What are you talking about?"
  The starter hooks can be defined even for regular Condor jobs*, local universe, etc.
› INPUT: a copy of the job ClassAd with extra attributes about what happened: ExitCode, JobDuration, etc.
› OUTPUT: ignored

* Except for dumb exceptions… the schedd doesn't distinguish rm vs. hold when telling the starter to go away (yet). Argh!
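The exit reasons above map naturally onto a small dispatch table in the hook; the statuses the sketch reports back ("completed", "preempted", …) are hypothetical names for whatever your external queue tracks.

```python
# Hypothetical JOB_EXIT hook body: the starter passes the exit reason in
# argv[1] and the job ad (with ExitCode, JobDuration, ...) on STDIN.
import sys

EXIT_STATUS = {
    "exit": "completed",    # died a natural death
    "evict": "preempted",   # booted off by the startd
    "remove": "removed",    # condor_rm
    "hold": "held",         # condor_hold
}

def classify_exit(reason):
    """Map the starter's argv[1] value to a status for the external system."""
    return EXIT_STATUS.get(reason, "unknown")

if __name__ == "__main__":
    reason = sys.argv[1] if len(sys.argv) > 1 else ""
    # This hook's output is ignored by Condor, so log to STDERR.
    sys.stderr.write("job finished: %s\n" % classify_exit(reason))
```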

Defining hooks
› Each slot can have its own hook "keyword":
  The keyword is the prefix for the config file parameters
  You can use different sets of hooks to talk to different external systems on each slot
  A global keyword is used when the per-slot keyword is not defined
› The keyword is inserted by the startd into its copy of the job ClassAd and given to the starter

Defining hooks: example

# Most slots fetch work from the database system
STARTD_JOB_HOOK_KEYWORD = DB

# Slot4 fetches and runs work from a web service
SLOT4_JOB_HOOK_KEYWORD = WEB

# The database system needs to both provide work and
# know the reply for each attempted claim
DB_DIR = /usr/local/condor/fetch/db
DB_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
DB_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php

# The web system only needs to fetch work
WEB_DIR = /usr/local/condor/fetch/web
WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php

Semantics of fetched jobs
› The condor_startd treats them just like any other kind of job:
  All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc.)
  Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc.
  Fetched work != backfill

Semantics continued
› If the startd is unclaimed and fetches a job, a claim is created
› If that job completes, the claim is reused and the startd fetches again
› The startd keeps fetching until either:
  The claim is evicted by Condor
  The fetch hook returns no more work

Limitations for fetched jobs
› No schedd/shadow means no "standard universe" for checkpointing, migration, and remote system calls:
  Could use stand-alone checkpointing
  Or application-specific checkpointing
› Other features that are unavailable:
  User policy expressions (e.g. periodic hold)
  No DAGMan (you're on your own)
  …

Limitations of the hooks
› If the starter can't run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it:
  We need a HOOK_STARTER_FAILURE
› There is no hook when the starter is about to evict you (so you can checkpoint):
  You can implement this yourself with a wrapper script and the SoftKillSig attribute
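The wrapper-script workaround can be sketched as follows: set SoftKillSig = "SIGTERM" in the job ad, make the wrapper the job's executable, and checkpoint when the soft-kill signal arrives. The checkpoint callback here is a hypothetical stand-in; a real job would write whatever state it needs to resume.

```python
# Sketch of a wrapper that approximates an "about to be evicted" hook:
# with SoftKillSig = "SIGTERM" on the job, the starter delivers SIGTERM
# first, giving us a window to checkpoint before the job goes away.
import signal
import subprocess
import sys

def run_with_checkpoint(cmd, on_evict):
    """Run cmd; if SIGTERM arrives, call on_evict() then stop the child."""
    child = subprocess.Popen(cmd)

    def handler(signum, frame):
        on_evict()           # e.g. write a checkpoint file
        child.terminate()    # forward the soft kill to the real job

    signal.signal(signal.SIGTERM, handler)
    return child.wait()

if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(run_with_checkpoint(sys.argv[1:], lambda: None))
```

This is only an approximation: the checkpoint must fit inside whatever grace period the startd allows before the hard kill.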

More information
› A new section in the Condor 7.1 manual:
  Chapter 4: Miscellaneous Concepts
  4.4: Job Hooks
  7.1/4_4Job_Hooks.html
› Any questions?