Condor DAGMan: Managing Job Dependencies with Condor

Condor DAGMan: What is DAGMan? What is it good for? How does it work? What's next?

DAGMan: Directed Acyclic Graph Manager. DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "don't run job B until job A has completed successfully").
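
In the simplest case, that one dependency is the whole workflow. A minimal sketch of the corresponding DAG file (simple.dag, a.sub, and b.sub are hypothetical file names):

# simple.dag
Job A a.sub
Job B b.sub
Parent A Child B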

Typical Scenarios:
- Jobs whose output needs to be summarized or post-processed once they complete.
- Jobs that need data to be generated or pre-processed before they can use it.
- Jobs that require data to be staged to/from remote repositories before they start or after they finish.

What is a DAG? A DAG (directed acyclic graph) is the data structure used by DAGMan to represent these dependencies. Each job is a "node" in the DAG, and each node can have any number of "parents" or "children" (or neither), as long as there are no loops! [Figure: the "diamond" DAG, with job A at the top, jobs B and C in the middle, and job D at the bottom.] A DAG is the natural data structure to represent a workflow of jobs with dependencies: children may not run until their parents have finished, which is why the graph is directed (there is a direction to the flow of work). In this example, called a "diamond" DAG, job A must run first; when it finishes, jobs B and C can run together; when they are both finished, D can run; and when D is finished, the DAG is finished. Loops, where two jobs are each descended from the other, are prohibited because they would lead to deadlock: in a loop, neither node could run until the other finished, so neither would start. This restriction is what makes the graph acyclic.

An Example DAG: jobs whose output needs to be summarized or post-processed once they complete. [Figure: example DAG with jobs A, B, C, and D.]

Another Example DAG: jobs that need data to be generated or pre-processed before they can use it. [Figure: example DAG with jobs A, B, C, and D.]

Defining a DAG A DAG is defined by a .dag file listing all its nodes and any dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

This is all it takes to specify the example "diamond" DAG.

Defining a DAG (cont'd) Each node in the DAG runs a Condor job, specified by a Condor submit file (a.sub, b.sub, c.sub, and d.sub in the diamond example). These are normal Condor submit files, the same ones you would use to submit the jobs by hand.
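
As an illustrative sketch of one such submit file (the executable and file names here are hypothetical), a.sub might look like:

# a.sub -- an ordinary Condor submit file (hypothetical contents)
universe   = vanilla
executable = generate_data
output     = a.out
error      = a.err
log        = diamond.log
queue

DAGMan follows the events in the job's log file to track each node's progress, so every node's submit file should specify one.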

Submitting a DAG To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon and begin running your jobs:

% condor_submit_dag diamond.dag

The DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it. Just like any other Condor job, you get fault tolerance in case the machine crashes or reboots, or if there's a network outage, and you're notified when it's done and whether it succeeded or failed.

% condor_q
-- Submitter: foo.bar.edu : <128.105.175.133:1027> : foo.bar.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
2456.0   user     3/8  19:22   0+00:00:02 R  0   3.2  condor_dagman -f -
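
While the DAG runs, you can also watch DAGMan's own log for progress. A minimal sketch (the <dagfile>.dagman.out naming is DAGMan's default; exact file names can vary by Condor version):

% tail -f diamond.dag.dagman.out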

Running a DAG DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies. First, job A is submitted alone. [Figure: DAGMan reads the .dag file and places job A in the Condor job queue, while B, C, and D wait.]

Running a DAG (cont'd) DAGMan holds jobs and submits them to the Condor queue at the appropriate times. Once job A completes successfully, jobs B and C are submitted at the same time. [Figure: jobs B and C in the Condor job queue; D still held by DAGMan.]

Running a DAG (cont'd) In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG. If job C fails, DAGMan will wait until job B completes, and then will exit, creating a rescue file; job D will not run. In its log, DAGMan provides additional details of which node failed and why. [Figure: job C fails; DAGMan writes a rescue file.]

Recovering a DAG Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. Since jobs A and B have already completed, DAGMan will start by re-submitting job C. [Figure: DAGMan reads the rescue file and places job C back in the Condor job queue.]
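
From the command line, recovery is a sketch like the following, assuming an older-style Condor that writes the rescue file as diamond.dag.rescue and expects you to submit it directly (more recent HTCondor versions number the rescue file, e.g. diamond.dag.rescue001, and use it automatically when you re-submit the original DAG):

% condor_submit_dag diamond.dag.rescue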

Recovering a DAG (cont'd) Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Figure: job D is submitted once C succeeds.]

Finishing a DAG Once the DAG is complete, the DAGMan job itself is finished, and exits. [Figure: all of the DAG's jobs have left the Condor job queue.]

Additional Features DAGMan provides some other handy features for job management: nodes can have PRE & POST scripts, and job submission can be "throttled".

PRE & POST Scripts Each node can have a PRE or POST script, executed as part of the node:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
Script PRE B stage-in.sh
Script POST B stage-out.sh

PRE and POST scripts execute locally on the submitting machine, before the job is submitted or after it completes. The PRE & POST scripts are part of the node: if any part of the node fails, the node was not successful and its children will not start.
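
As a hypothetical illustration of what such a script might contain (stage-in.sh, the server name, and the paths are invented for this example; any real script just needs to exit 0 on success):

#!/bin/sh
# stage-in.sh -- PRE script for node B: fetch its input file.
# Runs on the submit machine before b.sub is submitted;
# a non-zero exit marks the node as failed.
scp fileserver.example.edu:/data/b.input . || exit 1
exit 0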

Submit Throttling DAGMan can limit the maximum number of jobs it will submit to Condor at once:

% condor_submit_dag -maxjobs N diamond.dag

This is useful for managing resource limitations (e.g., storage). Example: you have 1000 jobs, each of which requires 1 GB of disk space, but only 100 GB of disk.
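
Applying that to the example above (diamond.dag stands in for the 1000-job DAG), capping DAGMan at 100 node jobs in the queue keeps peak disk usage within the 100 GB available:

% condor_submit_dag -maxjobs 100 diamond.dag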

Summary DAGMan: manages dependencies, holding and running jobs only at the appropriate times; monitors job progress; is fault-tolerant; is recoverable in case of job failure; and provides additional features to Condor.

Future Work More sophisticated management of remote data transfer and staging to maximize CPU throughput. Keep the pipeline full! That is, always try to have data ready when a CPU becomes available, while adhering to disk and network limitations; integration with Kangaroo, etc. Better integration with Condor tools (condor_q, etc. displaying DAG information).

Conclusion Interested in seeing more? Come to the DAGMan demo: Wednesday, 9am to noon, Room 3393, Computer Sciences (1210 W. Dayton St.). Email me: <pfc@cs.wisc.edu>. Try it: http://www.cs.wisc.edu/condor