Published by Dorothy Elliott. Modified over 8 years ago.
1
Condor DAGMan: Introduction & Update
Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
pfc@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)
3
Why is This Important?
› Most real science involves complex sequences of tasks – on many resources at many sites.
    E.g., move data, compute, check, move back, etc.
› … and many types of jobs working together
    Condor, Grid (Condor-G), MPI, shell scripts, etc.
› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.
4
What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
(Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D)
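The "no loops" rule can be checked mechanically. The following is a minimal Python sketch, not DAGMan's own code: it stores the diamond DAG as a dictionary of child lists (a representation assumed here for illustration) and uses a standard topological sort, which fails exactly when the graph contains a loop. The function name topological_order is hypothetical.

```python
from collections import deque

def topological_order(children):
    """Return nodes so that every parent comes before its children.

    `children` maps each node name to a list of its child nodes.
    Raises ValueError if the graph contains a loop (is not a DAG).
    """
    # Collect every node, including ones that only appear as children.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    # Count how many parents each node is still waiting for.
    pending_parents = {n: 0 for n in nodes}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] += 1

    ready = deque(sorted(n for n in nodes if pending_parents[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children.get(node, []):
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)

    # Nodes on a loop never reach zero pending parents.
    if len(order) != len(nodes):
        raise ValueError("graph contains a loop, so it is not a DAG")
    return order

# The diamond from the slide: A -> B, A -> C, B -> D, C -> D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(topological_order(diamond))  # ['A', 'B', 'C', 'D']
```

Adding an edge from D back to A would make the same call raise ValueError.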
5
Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.
(Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D)
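To make the file format concrete, here is a small Python sketch that reads the two line types used in diamond.dag (Job and Parent ... Child ...) into dictionaries. It is an illustration only: the function name parse_dag is hypothetical, keywords are treated as case-insensitive as an assumption, and every other line type is rejected rather than handled.

```python
def parse_dag(text):
    """Parse the Job and Parent/Child lines of a .dag file.

    Returns (jobs, children):
      jobs     maps node name -> submit file
      children maps node name -> list of child node names
    Only the two line types shown in diamond.dag are supported.
    """
    jobs = {}
    children = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "PARENT":
            # Everything before "Child" is a parent, everything after a child.
            split_at = [w.upper() for w in words].index("CHILD")
            parents = words[1:split_at]
            kids = words[split_at + 1:]
            for parent in parents:
                children.setdefault(parent, []).extend(kids)
        else:
            raise ValueError(f"unsupported line: {line!r}")
    return jobs, children

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, children = parse_dag(DIAMOND)
print(jobs)
print(children)
```

Note that a single Parent line can name several parents and several children; "Parent B C Child D" makes D a child of both B and C.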
6
Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor, so DAGMan itself will be robust in case of failure, machine reboots, etc.
7
Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
(Diagram labels: DAGMan, Condor Job Queue, .dag File; nodes A, B, C, D)
8
Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
(Diagram labels: DAGMan, Condor Job Queue; nodes A, B, C, D)
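"At the appropriate times" means a node is submitted only once all of its parents have completed. The Python sketch below simulates that rule on the diamond DAG. It is a toy model under stated assumptions, not DAGMan's implementation: every job is assumed to succeed, and the names ready_nodes, in_queue and waves are hypothetical.

```python
def ready_nodes(parents, done, in_queue):
    """Nodes not yet submitted whose parents have all completed."""
    return sorted(
        node
        for node, needed in parents.items()
        if node not in done
        and node not in in_queue
        and all(p in done for p in needed)
    )

# Diamond DAG, stored as node -> list of parents.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

done, in_queue = set(), set()
waves = []
while len(done) < len(parents):
    batch = ready_nodes(parents, done, in_queue)
    in_queue.update(batch)   # "submit" the batch to the job queue
    waves.append(batch)
    done.update(in_queue)    # assumption: every queued job succeeds
    in_queue.clear()

print(waves)  # [['A'], ['B', 'C'], ['D']]
```

B and C become ready together once A finishes, while D is held back until both of them are done.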
9
Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
(Diagram labels: DAGMan, Condor Job Queue, Rescue File; nodes A, B, D, and a failed node marked X)
10
Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
(Diagram labels: DAGMan, Condor Job Queue, Rescue File; nodes A, B, C, D)
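The failure and recovery behaviour can be sketched in a few lines of Python. This is a simplified model, not the real rescue file format: the saved state is just the set of completed nodes, jobs are simulated by a callable, and the names run_dag and succeeds are hypothetical.

```python
def run_dag(parents, succeeds, done=()):
    """Run nodes until no further progress is possible.

    parents:  node -> list of parent nodes
    succeeds: callable(node) -> bool, a stand-in for running the real job
    done:     nodes completed in an earlier run (the rescued state)
    Returns (done, failed) as sets.
    """
    done, failed = set(done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue  # never re-run finished or failed nodes in this run
            if all(p in done for p in parents[node]):
                if succeeds(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed

parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: C fails. B still runs, but D can never start.
done, failed = run_dag(parents, succeeds=lambda node: node != "C")
print(sorted(done), sorted(failed))  # ['A', 'B'] ['C']

# Second run: C has been fixed. Start from the saved state,
# so A and B are not run again.
done2, failed2 = run_dag(parents, succeeds=lambda node: True, done=done)
print(sorted(done2), sorted(failed2))  # ['A', 'B', 'C', 'D'] []
```

The point of saving the state is visible in the second call: only C and D are executed, because A and B are already recorded as complete.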
11
Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
(Diagram labels: DAGMan, Condor Job Queue; nodes A, B, C, D)
12
Additional DAGMan Features
› Provides other knobs handy for job management…
    nodes can have PRE & POST scripts
    job submission can be “throttled”
    NEW: failed nodes can be automatically re-tried a configurable number of times
13
PRE & POST Scripts
› Execute locally on the submit host before or after job submission…
› Example:
    # diamond.dag
    PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
› PRE/POST scripts are part of the node
(Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D; PRE attached to Job A, POST attached to Job D)
14
DAG “Throttling”
› You can tell DAGMan to limit the maximum number of jobs it submits at any one time:
    condor_submit_dag -maxjobs N
    useful for managing resource limitations (e.g., licenses)
› You can also limit the number of simultaneous PRE or POST scripts.
    Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
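The effect of a submission limit can be illustrated with a short Python simulation. This is a sketch of the throttling idea only, not of how condor_submit_dag implements -maxjobs: jobs are assumed to finish in submission order, the sketch requires a limit of at least 1, and the name throttled_submit is hypothetical.

```python
from collections import deque

def throttled_submit(ready, max_jobs):
    """Simulate submitting `ready` jobs with at most `max_jobs` queued at once.

    Returns (peak queue length, list of jobs in completion order).
    Assumption: jobs finish in the order they were submitted.
    """
    if max_jobs < 1:
        raise ValueError("this sketch needs a limit of at least 1")
    waiting = deque(ready)
    queue = deque()
    finished = []
    peak = 0
    while waiting or queue:
        # Top up the queue until the limit is reached.
        while waiting and len(queue) < max_jobs:
            queue.append(waiting.popleft())
        peak = max(peak, len(queue))
        # One job completes, which frees a slot.
        finished.append(queue.popleft())
    return peak, finished

# 7000 ready nodes, but never more than 50 in the queue at a time.
peak, finished = throttled_submit([f"node{i}" for i in range(7000)], max_jobs=50)
print(peak, len(finished))  # 50 7000
```

All 7000 nodes still run; the limit only bounds how many are in flight at once, which is what protects a scarce resource such as a license pool.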
15
Node RETRY
› Tells DAGMan to re-run a node multiple times if necessary…
› Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
(Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D)
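The retry logic amounts to a bounded loop around the node. The Python sketch below assumes that "RETRY B 5" means one initial attempt plus up to five re-runs, and that exit code 0 means success; both are assumptions made for illustration, and run_with_retry is a hypothetical name.

```python
def run_with_retry(run_once, retries):
    """Run a node, re-running it up to `retries` extra times on failure.

    run_once: callable() -> int exit code (0 is treated as success)
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    for _ in range(retries + 1):  # first attempt plus the retries
        attempts += 1
        if run_once() == 0:
            return True, attempts
    return False, attempts

# A flaky job that fails twice and then succeeds.
exit_codes = iter([1, 1, 0])
print(run_with_retry(lambda: next(exit_codes), retries=5))  # (True, 3)

# A job that always fails uses up every attempt.
print(run_with_retry(lambda: 1, retries=5))  # (False, 6)
```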
16
DAGMan Progress
› Testing… lots of testing.
    10,000+ node DAGs run smoothly
    Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
    Lots of bugs fixed
17
DAGMan Progress (cont’d)
› New features
    Improved logging (timestamps, etc.)
    More efficient recovery
    Node RETRY capability
    DAG info in condor_q (with -dag flag)
    Robust in more failure cases
    Recursive DAGs for conditional execution
› DAGMan for Windows (Ray Pingree)
18
DAGMan Success
› DAGMan is becoming part of the common framework for running on the grid.
    Particle Physics Data Grid (PPDG)
    Grid Physics Network (GriPhyN)
    Many Supercomputing 2001 demos
    more…
19
DAGMan in the GriPhyN Architecture
(Diagram by Ian Foster (Argonne). Components: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Labels: DAG; DAGMAN, Kangaroo; GRAM; GridFTP; GRAM; SRM; GSI, CAS; MDS; MCAT; GriPhyN catalogs; GDMP; MDS; Globus)
20
DAGMan in PPDG Tools diagram by Jim Amundson (Fermilab)
21
What’s Next?
› More flexible control of node execution
    Currently implicit: “all my parents returned 0”.
    Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?
› 1st step: represent DAG nodes internally as ClassAds
    Allows DAGMan to decide when to run nodes based on arbitrary requirements
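The three example rules can be written as predicates over the parents' results. The Python sketch below is only an illustration of the idea: it uses plain functions rather than ClassAd expressions, the Result record and all function names are hypothetical, and "ran for more than two hours" is read as applying to each parent individually.

```python
from dataclasses import dataclass

@dataclass
class Result:
    exit_code: int
    runtime_hours: float

def default_rule(results):
    """The current implicit rule: every parent returned 0."""
    return all(r.exit_code == 0 for r in results.values())

def long_running_rule(results):
    """All parents returned 0 AND each ran for more than two hours."""
    return all(
        r.exit_code == 0 and r.runtime_hours > 2 for r in results.values()
    )

def mixed_rule(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"].exit_code == 0 and results["B"].exit_code == 42

# Results of a node's two parents, keyed by parent name.
results = {"A": Result(exit_code=0, runtime_hours=3.5),
           "B": Result(exit_code=42, runtime_hours=0.2)}

print(default_rule(results))       # False: B did not return 0
print(long_running_rule(results))  # False: B failed and ran too briefly
print(mixed_rule(results))         # True
```

Under the current implicit rule this child would never run, while the third rule lets it run precisely because B returned 42.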
22
What’s Next? (cont’d)
› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.
(Diagram labels: DAGMan, Condor-G, Condor, DaP Scheduler)
23
Thank You!
› Interested in seeing more?
    Come to the DAGMan BoF
        Wednesday 9am - noon
        Room 3393, Computer Sciences (1210 W. Dayton St.)
    Email us: condor-admin@cs.wisc.edu
    Try it! http://www.cs.wisc.edu/condor