Intermediate HTCondor: Workflows (OSG Summer School 2015), Monday pm. Greg Thain, Center for High Throughput Computing, University of Wisconsin-Madison.

Before we begin...

Any questions on the lectures or exercises up to this point?

Quick Review: 1

  Universe = vanilla
  Executable = runme.sh
  Arguments = 1 2 true
  Output = out
  Error = err
  Log = log
  queue

Quick Review: 2

  Universe = vanilla
  Executable = runme.sh
  Arguments = 1 2 true
  Output = out.$(PROCESS)
  Error = err.$(PROCESS)
  Log = log.$(PROCESS)
  Queue 10000
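
As a reminder of how a submit file like this is used (the submit file name here is an assumption): submitting it creates a single cluster of 10000 procs, and $(PROCESS) expands to each job's proc number, so the outputs land in out.0 through out.9999.

  % condor_submit review2.sub
  % condor_q          # one cluster, 10000 procs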

Workflows

Often, you don't have independent tasks! Common example: you want to analyze a set of images.
1. You need to generate N images (once)
2. You need to analyze all N images (one job per image)
3. You need to summarize all results (once)

Do you really want to do this manually?

[Diagram: Generate -> Analyze -> Summarize]

Workflows: The HTC definition

Workflow: a graph of jobs to run, in which one or more jobs must succeed before one or more others can start running.

Example of a LIGO Inspiral DAG

[Figure: a large LIGO inspiral analysis DAG]

DAGMan

DAGMan: HTCondor's workflow manager
Directed Acyclic Graph (DAG) Manager (Man)
Allows you to specify the dependencies between your HTCondor jobs
Manages the jobs and their dependencies
That is, it manages a workflow of HTCondor jobs

What is a DAG?

A DAG is the structure used by DAGMan to represent these dependencies.
Each job is a node in the DAG.
Each node can have any number of "parent" or "child" nodes, as long as there are no loops!

[Diagrams: an acyclic graph (OK) and a graph containing a loop (not OK)]

So, what's in a node?

Each node consists of an optional pre-script, the job itself, and an optional post-script.
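
In a DAG file, these optional scripts are attached to a node with SCRIPT lines. A minimal sketch, with hypothetical script names:

  JOB A a.sub
  SCRIPT PRE  A prepare_inputs.sh
  SCRIPT POST A check_outputs.sh

The pre-script runs on the submit machine before the node's job is submitted, and the post-script runs after the job finishes; if the post-script exits non-zero, the node is treated as failed.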

Defining a DAG

A DAG is defined by a .dag file, listing each of its nodes and their dependencies. For example:

  # Comments are good
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

This describes a diamond-shaped DAG: Job A runs first, then Jobs B and C, then Job D.
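
To tie this back to the image-analysis example from earlier, here is a deliberately simplified sketch of what its DAG file might look like; the node and submit file names are assumptions, and a real version would have one analysis node per image:

  JOB Generate  generate.sub
  JOB Analyze   analyze.sub
  JOB Summarize summarize.sub
  PARENT Generate CHILD Analyze
  PARENT Analyze  CHILD Summarize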

DAG Files...

This complete DAG has five files.

One DAG file:
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

Four submit files:
  Universe = Vanilla
  Executable = analysis...
  Universe = ...

Submitting a DAG

To start your DAG, just run condor_submit_dag with your .dag file, and HTCondor will start a DAGMan process to manage your jobs:

  % condor_submit_dag diamond.dag

condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable.
Thus the DAGMan daemon itself runs as an HTCondor job, so you don't have to baby-sit it.
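
Along with submitting the DAGMan job, condor_submit_dag writes a few bookkeeping files next to your .dag file. One worth knowing about (the name below follows the usual default pattern) is DAGMan's own log, which you can watch to follow progress:

  % condor_submit_dag diamond.dag
  % tail -f diamond.dag.dagman.out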

DAGMan is an HTCondor job

DAGMan itself is an HTCondor job with a job id, so:

  % condor_rm job_id_of_dagman
  % condor_hold job_id_of_dagman
  % condor_q -dag    # is magic

DAGMan submits jobs, one cluster per node.
Don't confuse DAGMan-as-a-job with the jobs that DAGMan submits.

DAGMan: Running a DAG

DAGMan acts as a job scheduler, managing the submission of your jobs to HTCondor based on the DAG dependencies.

[Diagram: DAGMan reads the .dag file, holds nodes A, B, C, D, and feeds jobs into the HTCondor job queue]

DAGMan: Running a DAG (cont'd)

DAGMan submits jobs to HTCondor at the appropriate times. For example, after A finishes, it submits B & C.

[Diagram: B and C now sit in the HTCondor job queue; D is still held by DAGMan]

DAGMan: Finishing a DAG

Once the DAG is complete, the DAGMan job itself is finished, and exits.

DAGMan: Successes and Failures

A job fails if it exits with a non-zero exit code.
In case of a job failure, DAGMan runs other jobs until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.

DAGMan: Recovering a DAG

Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
Another example of reliability for HTC!
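
In practice, recovery usually just means re-running condor_submit_dag on the same .dag file; by default DAGMan notices the most recent rescue file (which normally follows the naming pattern shown below) and resumes from it, skipping nodes that already succeeded:

  % condor_submit_dag diamond.dag
  # picks up diamond.dag.rescue001 if it exists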

DAGMan: Recovering a DAG (cont'd)

Once that job completes, DAGMan will continue the DAG as if the failure never happened.

DAGMan & Fancy Features

DAGMan doesn't have a lot of "fancy features":
- No loops
- Not much assistance in writing very large DAGs (script it yourself; see the sketch below)

The focus is on a solid core:
- Add the features people need in order to run large DAGs well
- People build systems on top of DAGMan
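
Since DAGMan leaves large-DAG generation up to you, a common pattern is a small script that writes the .dag file. A minimal sketch for the earlier image-analysis workflow; all file names, node names, and the image count are assumptions:

  #!/bin/sh
  # write a DAG with one Analyze node per image
  dag=images.dag
  echo "JOB Generate generate.sub"   >  $dag
  echo "JOB Summarize summarize.sub" >> $dag
  for i in $(seq 1 100); do
      echo "JOB Analyze$i analyze.sub"            >> $dag
      echo "VARS Analyze$i image=\"image$i.dat\"" >> $dag
      echo "PARENT Generate CHILD Analyze$i"      >> $dag
      echo "PARENT Analyze$i CHILD Summarize"     >> $dag
  done

Each VARS line defines a per-node macro that the node's submit file can reference as $(image), so a single analyze.sub can serve all of the analysis nodes.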

Related Software

Pegasus:
- Writes DAGs based on an abstract description
- Runs the DAG on an appropriate resource (HTCondor, OSG, EC2...)
- Locates data, coordinates execution
- Uses DAGMan, works with large workflows

Makeflow:
- User writes a makefile, not a DAG
- Works with HTCondor, SGE, Work Queue...
- Handles data transfers to remote systems
- Does not use DAGMan

DAGMan: Reliability

For each job, HTCondor generates a log file.
DAGMan reads this log to see what has happened.
If DAGMan dies (crash, power failure, etc.):
- HTCondor will restart DAGMan
- DAGMan re-reads the log file
- DAGMan knows everything it needs to know
Principle: DAGMan can recover its state from files, without relying on a service (HTCondor queue, database...).
Recall: HTC requires reliability!

Let's try it out!

Exercises with DAGMan.

Questions?

Questions? Comments?
Feel free to ask me questions later:

Bonus Material...

Use of HTCondor by the LIGO Scientific Collaboration

HTCondor handles tens of millions of jobs per year running on the LDG, and up to 500k jobs per DAG.
HTCondor standard universe checkpointing is widely used, saving us from having to manage this.
At Caltech, 30 million jobs were processed using 22.8 million CPU hours on 1324 CPUs in the last 30 months.
For example, to search 1 yr. of data for GWs from the inspiral of binary neutron star and black hole systems takes ~2 million jobs, and months to run on several thousand ~2.6 GHz nodes.
(Statement from 2010; "last 30 months" is not relative to now. Also, I think they do up to 1 million jobs per DAG now.)

Example workflow: Bioinformatics

[Figure] From Mason, Sanders, State (Yale)

Example workflow: Astronomy

[Figure] From Berriman & Good (IPAC)