Intermediate Condor: DAGMan Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison

Before we begin… Any questions on the lectures or exercises up to this point?

DAGMan
DAGMan (the Directed Acyclic Graph Manager) allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. Example: “Don’t run job B until job A has completed successfully.”

What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a node in the DAG. Each node can have any number of “parent” or “child” nodes, as long as there are no loops!
[Diagram: a diamond of nodes A, B, C, D is OK; a graph whose edges form a loop is not OK.]

Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies. For example:
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
[Diagram: the resulting diamond DAG, with A at the top, B and C in the middle, and D at the bottom.]

DAG Files…
This complete DAG has five files. One DAG file:
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
And four submit files, one per node:
  Universe = Vanilla
  Executable = analysis
  …
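For reference, a complete node submit file might look like the following. This is a minimal sketch: “analysis” is the executable named on the slide, but the output, error, and log file names are assumptions for illustration.
  # a.sub -- minimal vanilla-universe submit file (file names assumed)
  Universe   = vanilla
  Executable = analysis
  Output     = a.out
  Error      = a.err
  Log        = diamond.log
  Queue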

Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan process to begin running your jobs:
  % condor_submit_dag diamond.dag
condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
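Once submitted, you can monitor the DAG like any other Condor job. A quick sketch (condor_q’s -dag option groups node jobs under their DAGMan parent; the .dagman.out file name is derived from the DAG file name):
  % condor_q -dag
  % tail -f diamond.dag.dagman.out   # DAGMan's own progress log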

Running a DAG
DAGMan acts as a scheduler, managing the submission of your jobs to Condor based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and feeds jobs A, B, C, D into the Condor job queue.]

Running a DAG (cont’d)
DAGMan holds & submits jobs to the Condor queue at the appropriate times. For example, after A finishes, it submits B & C.
[Diagram: A has completed; B and C are now in the Condor job queue, with D still held back.]

Running a DAG (cont’d)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Diagram: one node has failed; DAGMan writes a rescue file recording the DAG’s state.]

Recovering a DAG
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan reads the rescue file and resubmits the failed node, skipping work already completed.]
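In practice, recovery is a one-liner. A sketch, assuming the DAG file is diamond.dag: DAGMan names the rescue file diamond.dag.rescue001, and recent versions automatically use the newest rescue file when you resubmit the original DAG:
  % condor_submit_dag diamond.dag   # picks up diamond.dag.rescue001 if present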

Recovering a DAG (cont’d)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: the remaining node D enters the Condor job queue; the DAG proceeds normally.]

Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: the queue is empty; all four nodes are done.]

Example of a LIGO Inspiral DAG

Use of Condor by the LIGO Scientific Collaboration
Condor handles tens of millions of jobs per year running on the LDG, and up to 500,000 jobs per DAG. Condor standard universe checkpointing is widely used, saving us from having to manage this ourselves. At Caltech, 30 million jobs were processed using 22.8 million CPU hours on 1324 CPUs in the last 30 months. For example, to search one year of data for gravitational waves from the inspiral of binary neutron star and black hole systems takes ~2 million jobs, and months to run on several thousand ~2.6 GHz nodes. (This statement is from 2010, so “last 30 months” is not relative to now. Also, I think they now run up to 1 million jobs per DAG.)

DAGMan & Fancy Features
DAGMan doesn’t have a lot of “fancy features”:
- No loops
- Not much assistance in writing very large DAGs (script it yourself)
The focus is on a solid core:
- Add the features people need in order to run large DAGs well.
- People build systems on top of DAGMan (cf. Pegasus, on your own time).

DAGMan & Log Files
For each job, Condor generates a log file. DAGMan reads this log to see what has happened. If DAGMan dies (crash, power failure, etc.):
- Condor will restart DAGMan
- DAGMan re-reads the log file
- DAGMan then knows everything it needs to know
Principle: DAGMan can recover its state from files, without relying on a service (the Condor queue, a database, …).

Advanced DAGMan Tricks
- Throttles and degenerative DAGs
- Sub-DAGs
- Pre and post scripts: editing your DAG

Throttles
Failed nodes can be automatically retried a configurable number of times:
- Can retry N times
- Can retry N times, unless the node returns a specific exit code
Throttles control job submissions:
- Max jobs submitted
- Max scripts running
These are important when working with large DAGs. (See the sketch below.)
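A sketch of both mechanisms, using standard DAGMan syntax (the node names, retry counts, and exit-code value are illustrative). Retries go in the DAG file:
  Retry A 3                    # retry node A up to 3 times
  Retry B 3 UNLESS-EXIT 2      # but give up immediately if B exits with code 2
Throttles are set when submitting the DAG:
  % condor_submit_dag -maxjobs 50 -maxpre 5 -maxpost 5 large.dag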

Degenerative DAG
Submit a DAG with:
- 200,000 nodes
- No dependencies
Use DAGMan to throttle the jobs:
- Condor is scalable, but it will have problems if you submit 200,000 jobs simultaneously
- DAGMan can help you get scalability even if you don’t have dependencies (see the sketch below)
[Diagram: independent nodes A1, A2, A3, … with no edges between them.]
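A degenerative DAG is just node definitions with no Parent/Child lines. A sketch, where the generating shell loop, file names, and throttle value are assumptions:
  % for i in $(seq 1 200000); do echo "Job A$i a$i.sub"; done > big.dag
  % condor_submit_dag -maxidle 1000 big.dag   # keep at most ~1000 jobs idle at a time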

Sub-DAG
Idea: any given DAG node can be another DAG:
- SUBDAG EXTERNAL Name DAG-file
The DAG node will not complete until the recursive DAG finishes. Interesting idea: a previous node could generate this DAG node. Why?
- Simpler DAG structure
- Implement a fixed-length loop
- Modify behavior on the fly
(See the sketch below.)
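A minimal sketch, with assumed file and node names: the outer DAG treats the entire inner DAG as a single node B.
  # outer.dag (illustrative)
  Job A a.sub
  SUBDAG EXTERNAL B inner.dag
  Job D d.sub
  Parent A Child B
  Parent B Child D
Node B succeeds only when every node of inner.dag has finished.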

Sub-DAG
[Diagram: an outer diamond DAG of nodes A, B, C, D, where one node expands into its own inner DAG of nodes V, W, X, Y, Z.]

DAGMan Scripts
DAGMan allows pre & post scripts:
- They don’t have to be scripts: any executable will do
- They run before (pre) or after (post) the job
- They run on the same computer you submitted from
Syntax:
  JOB A a.sub
  SCRIPT PRE A before-script $JOB
  SCRIPT POST A after-script $JOB $RETURN

So What?
A pre script can make decisions:
- Where should my job run? (Particularly useful to make a job run in the same place as the last job.)
- What should my job do (e.g. command-line arguments)?
- Generate a sub-DAG
- Lazy decision making
- No-op jobs
A post script can change the return value:
- DAGMan decides a job failed on a non-zero return value
- The post script can look at the error code, output files, etc. and return zero or non-zero based on deeper knowledge (see the sketch below).
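A sketch of such a post script, matching the SCRIPT POST line from the previous slide. The output-file naming convention and the success test are assumptions for illustration:
  #!/bin/sh
  # after-script: invoked as "after-script <job> <return-code>"
  JOB=$1
  RETURN=$2
  # succeed only if the job exited 0 AND produced a non-empty output file
  if [ "$RETURN" -eq 0 ] && [ -s "$JOB.out" ]; then
      exit 0   # DAGMan marks the node as successful
  else
      exit 1   # DAGMan marks the node as failed (and may retry it)
  fi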

Let’s try it out!
Exercises with DAGMan.

Questions?
Questions? Comments? Feel free to ask me questions later: Alain Roy
Upcoming sessions:
- Now – 3:15pm: Hands-on exercises
- 3:15pm – 3:30pm: Break
- 3:30pm – 4:55pm: Final Condor session