Condor Project Computer Sciences Department University of Wisconsin-Madison Condor and DAGMan Barcelona, 2006
2 Agenda: Extended user’s tutorial; Advanced Uses of Condor (Java programs, DAGMan, Stork, MW, Grid Computing); Case studies, and a discussion of your application’s needs
3 Some jobs have dependencies… Condor can help solve dependency problems
4 Frieda learns DAGMan (Directed Acyclic Graph Manager). DAGMan allows Frieda to specify the dependencies between her Condor jobs, so Condor manages the jobs automatically. Dependency example: do not run job B until job A has completed successfully.
5 What is a DAG? Directed Acyclic Graph. A DAG is the data structure used by DAGMan to represent dependencies. (Figure: a four-node DAG with nodes A, B, C, and D.)
6 DAG Definitions: DAGs have one or more nodes (or vertices). Dependencies are represented by arcs (or edges), which are arrows that go from parent to child. No cycles! (Figure: a four-node DAG with nodes A, B, C, and D.)
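The "no cycles" rule can be checked mechanically. The following is a minimal sketch, not DAGMan's own code: it uses Kahn's algorithm (a swapped-in, standard technique) and assumes the four-node figure has arcs from A to B, A to C, B to D, and C to D. The function name `topological_order` is hypothetical.

```python
from collections import deque

def topological_order(edges, nodes):
    """Return nodes in an order that respects every parent-to-child arc.

    Raises ValueError if the arcs contain a cycle.
    """
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Nodes with no parents can run first.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any node never released is part of (or behind) a cycle.
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

# Assumed arcs for the diamond in the figure.
diamond_nodes = ["A", "B", "C", "D"]
diamond_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(topological_order(diamond_edges, diamond_nodes))  # prints ['A', 'B', 'C', 'D']
```

Adding an arc from D back to A would make the function raise, which is the situation the "No cycles" rule forbids.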
7 Condor and DAGs: Each node represents a Condor job. Dependencies define the possible order of job execution. (Figure: the four-node DAG with nodes labeled Job A, Job B, Job C, and Job D.)
8 Defining a DAG to Condor
A DAG input file defines a DAG:

    # file name: diamond.dag
    Job A a.submit
    Job B b.submit
    Job C c.submit
    Job D d.submit
    Parent A Child B C
    Parent B C Child D

(Figure: the diamond DAG, with A as parent of B and C, and B and C as parents of D.)
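To make the file format concrete, here is a minimal sketch of reading such a file in Python. It is not DAGMan's parser: it only understands `Job` and `Parent ... Child ...` lines plus whole-line `#` comments, and the function name `parse_dag` is hypothetical.

```python
def parse_dag(text):
    """Return (jobs, edges) from the text of a simple DAG input file.

    jobs maps node name to submit description file name;
    edges is a list of (parent, child) pairs.
    """
    jobs = {}
    edges = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            split_at = [w.upper() for w in words].index("CHILD")
            parents = words[1:split_at]
            kids = words[split_at + 1:]
            # Every listed parent gets an arc to every listed child.
            edges.extend((p, c) for p in parents for c in kids)
    return jobs, edges

diamond = """\
# file name: diamond.dag
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
Parent A Child B C
Parent B C Child D
"""
jobs, edges = parse_dag(diamond)
print(edges)  # prints [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
```

Note how the two `Parent` lines expand into four arcs: one line can name several parents and several children.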
9 Submit Description File
For node B:

    # file name: b.submit
    universe = vanilla
    executable = B
    input = B.in
    output = B.out
    error = B.err
    log = B.log
    queue

For node C:

    # file name: c.submit
    universe = standard
    executable = C
    input = C.in
    output = C.out
    error = C.err
    log = C.log
    queue
10 Submitting the DAG to Condor: To submit the entire DAG, run: condor_submit_dag diamond.dag. The condor_submit_dag command creates a submit description file for DAGMan, and DAGMan itself is submitted as a Condor job!
11 A DAGMan requirement: The submit description file for each job must specify a log file. Log files may be separate or shared by different jobs within the DAG. The log files are used to synchronize job submission.
12 Nodes: Job execution at a node either succeeds or fails, based on the return value of the job: 0 means success, nonzero means failure.
13 Advanced DAGMan Tricks: Retry of a node; Abort the entire DAG; Setting a variable (a VARS entry); Throttles and DAGs; PRE and POST scripts (editing the DAG); Nested DAGs (loops and more)
14 Retry
Before a node is marked as failed, retry it N times. In the DAG input file:

    Retry C 4

(to rerun node C four times before calling the node failed)
Retry N times, unless the node returns a specific exit code. In the DAG input file:

    Retry C 4 UNLESS-EXIT 2
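The retry rule can be sketched as a small loop. This is an illustration of the behavior described on the slide, not DAGMan's implementation; it assumes that "retry N times" means one initial attempt plus up to N reruns, and the names `run_with_retry`, `run_once`, and `unless_exit` are hypothetical.

```python
def run_with_retry(run_once, retries, unless_exit=None):
    """Call run_once() until it returns 0 or the retries are used up.

    run_once is any callable returning an exit code.
    Returns (final_exit_code, attempts_made).
    """
    attempts = 0
    while True:
        code = run_once()
        attempts += 1
        if code == 0:
            return code, attempts  # success, stop retrying
        if unless_exit is not None and code == unless_exit:
            return code, attempts  # this exit code means "do not retry"
        if attempts > retries:
            return code, attempts  # initial attempt plus all retries failed

# A fake job that fails twice, then succeeds (stands in for node C).
codes = iter([1, 1, 0])
print(run_with_retry(lambda: next(codes), retries=4))  # prints (0, 3)
```

With `unless_exit=2`, a job returning 2 on its second attempt would stop there and the node would be treated as failed without further reruns.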
15 Abort the Entire DAG: If a specific error value should cause the entire DAG to stop, place in the DAG input file: Abort-DAG-On B 3 (here B is the name of the node and 3 is the returned error code).
16 VARS: An entry in the DAG input file intended to reduce the number of unique submit description files needed. It defines a variable and a value associated with a node; the value is then used in a substitution macro.
17 Invented Example: A Binary Tree. Assume that a single executable processes each node. But, handling is different based on a node’s position as a left or right child. (Figure: a binary tree with nodes Root, A, B, C, D, E, and F.)
18 The DAG Input File

    # tree example, file is tree.dag
    Job root node.submit
    Job A node.submit
    Vars A position="left"
    Job B node.submit
    Vars B position="right"
    Job C node.submit
    Vars C position="left"
    ...
    Parent root Child A B
    ...

(Figure: the binary tree with nodes Root, A, B, C, D, E, and F.)
19 The Submit Description File

    # file name is node.submit
    executable = process.exe
    arguments = $(position)
    log = node.log
    queue

The job at node A has the command line: process.exe left
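The substitution step can be illustrated with a few lines of Python. This is a sketch of the idea only, not how Condor expands macros internally; the function name `expand_macros` is hypothetical, and leaving unknown macros untouched is a choice made for this sketch.

```python
import re

def expand_macros(line, variables):
    """Replace each $(name) in line with variables[name], if defined."""
    def lookup(match):
        name = match.group(1)
        # Sketch-only choice: leave undefined macros as they are.
        return variables.get(name, match.group(0))
    return re.sub(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)", lookup, line)

# Node A carries position="left" from its Vars line.
print(expand_macros("arguments = $(position)", {"position": "left"}))
# prints: arguments = left
```

The same node.submit line yields `arguments = right` for node B, which is why one submit description file can serve every node in the tree.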
20 Throttles
Throttles control the number of job submissions at one time.
Maximum number of jobs submitted:

    % condor_submit_dag -maxjobs 40 bigdag.dag

Maximum number of jobs idle:

    % condor_submit_dag -maxidle 10 bigdag.dag
21 Throttling Example: Submit a DAG with 200,000 nodes and no dependencies between jobs. Use DAGMan to throttle the jobs, because Condor is scalable, but will have problems with 200,000 simultaneous job submissions. (Figure: a row of independent nodes A1, A2, A3, …)
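A DAG input file of this shape is tedious to write by hand, so it would typically be generated. The sketch below assumes node names A1, A2, ... as in the figure and a single shared submit description file; the function name `independent_dag` and the default file name `node.submit` are assumptions for illustration.

```python
def independent_dag(count, submit_file="node.submit"):
    """Return DAG input file text for `count` nodes with no dependencies."""
    lines = [f"# {count} independent nodes, no Parent/Child lines"]
    for i in range(1, count + 1):
        lines.append(f"Job A{i} {submit_file}")
    return "\n".join(lines) + "\n"

print(independent_dag(3), end="")
# prints:
# # 3 independent nodes, no Parent/Child lines
# Job A1 node.submit
# Job A2 node.submit
# Job A3 node.submit
```

Writing `independent_dag(200000)` to bigdag.dag and submitting it with the `-maxjobs` throttle lets DAGMan feed the jobs to Condor a bounded number at a time.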
22 DAGMan scripts
DAGMan allows PRE and/or POST scripts. They are not necessarily scripts: any executable will do. They run before (PRE) or after (POST) the job, and they run on the submit machine. In the DAG input file:

    Job A a.submit
    Script PRE A before-script
    Script POST A after-script
23 Node A within the DAG (figure): before-script, then the Condor job described in a.submit, then after-script.
24 PRE script: A PRE script can make decisions: Should I pass different arguments to the job? Should I change a submit description file? This is lazy decision making.
25 POST script: The POST script is always run, independent of the Condor job’s return value. The POST script can change the return value: DAGMan marks the node failed for a non-zero return value from the POST script. The POST script can look at the error code or output files and return 0 (success) or non-zero (failure) based on deeper knowledge.
26 Pre-defined variables
In the DAG input file:

    Job A a.submit
    Script PRE A before-script $JOB
    Script POST A after-script $JOB $RETURN

$JOB and $RETURN are (optional) arguments to the script. $JOB becomes the string that defines the node name. $RETURN becomes the return value from the Condor job defined by the node.
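A POST script that uses both arguments might look like the sketch below. It assumes the script is invoked with the node name and the job's return value as its two arguments (the `$JOB $RETURN` form), and that the job writes its output to a file named after the node with a `.out` suffix, as in the earlier B.out example. The function names and the "empty output means failure" rule are illustrative assumptions, not DAGMan behavior.

```python
import os
import sys

def post_verdict(job_return, output_path):
    """Return the exit code this POST script should report to DAGMan."""
    if job_return != 0:
        return 1  # the job itself already failed
    if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
        return 1  # job returned 0 but produced no output: treat as failure
    return 0

def main(argv):
    node, job_return = argv[1], int(argv[2])
    # Assumed naming convention: node B writes B.out.
    return post_verdict(job_return, node + ".out")

# Only act as a script when both arguments (node name, return value) are given.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Because DAGMan judges the node by the POST script's return value, this is the place to encode checks that a bare exit code cannot express.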
27 Script Throttles
Throttles control the number of scripts running at one time:

    % condor_submit_dag -maxpre 10 bigdag.dag

OR

    % condor_submit_dag -maxpost 30 bigdag.dag
28 Nested DAGs
Idea: any DAG node can be a script that does the following:
1. Make a decision
2. Create a DAG input file
3. Call condor_submit_dag -no_submit
4. Outer DAG waits for inner DAG
The DAG node will not complete until the inner (nested) DAG finishes.
Why? To implement a fixed-length loop, or to modify behavior on the fly.
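Steps 2 and 3 can be sketched as follows for the fixed-length loop case. The sketch writes an inner DAG that chains the same submit description file a given number of times and returns the command line a node script would then run; it does not run the command. The function name `write_loop_dag` and the node names `step0`, `step1`, ... are hypothetical, and the flag is spelled `-no_submit`, the form listed among condor_submit_dag's options.

```python
def write_loop_dag(path, iterations, submit_file):
    """Write an inner DAG that runs submit_file `iterations` times in sequence.

    Returns the command (as an argument list) that would prepare the
    inner DAG for submission without submitting it.
    """
    lines = [f"Job step{i} {submit_file}" for i in range(iterations)]
    # Chain the nodes so each iteration waits for the previous one.
    for i in range(iterations - 1):
        lines.append(f"Parent step{i} Child step{i + 1}")
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return ["condor_submit_dag", "-no_submit", path]
```

A node script could decide `iterations` at run time (step 1), call this function, and then hand the returned list to `subprocess.run`; the loop length is fixed once the inner DAG file has been written.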
29 Nested DAG Example A BC D V W Z X Y C is