Intermediate HTCondor: More Workflows
Monday pm
Greg Thain
Center for High Throughput Computing
University of Wisconsin-Madison

Before we begin…
Any questions on the lectures or exercises up to this point?

Life cycle of an HTCondor job
[diagram: job states, starting from the submit file and moving through Idle, Running, and Complete, with Held and Suspend as side states; finished jobs are recorded in the history file]

Life cycle of an HTCondor job
[diagram: the same state diagram, annotated to show that output files are transferred back when the job moves from Running to Complete]

What about long-running jobs?
[diagram: the same state diagram; a long-running job may be evicted from Running back to Idle before it completes]

What about long-running jobs?
[diagram: the same state diagram, annotated to show that when a job is evicted from Running back to Idle, its output files are removed by default]

WHEN_TO_TRANSFER_OUTPUT

  Universe = vanilla
  Executable = gronk
  should_transfer_files = yes
  when_to_transfer_output = ON_EXIT_OR_EVICT
  Arguments = $(OUTPUT)
  queue
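For comparison, the default setting only transfers output when the job exits on its own. A minimal sketch of the same submit file with the default value, using the same hypothetical executable:

  Universe = vanilla
  Executable = gronk
  should_transfer_files = yes
  # ON_EXIT is the default: output is transferred only on
  # normal exit, so an evicted job loses its partial output
  when_to_transfer_output = ON_EXIT
  Arguments = $(OUTPUT)
  queue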

ON_EXIT_OR_EVICT
[diagram: the same state diagram, annotated to show that with ON_EXIT_OR_EVICT the output files are saved both on eviction (Running back to Idle) and on completion]

Advanced DAGMan Tricks
- DAGMan variables
- DAGs without dependencies
- Throttles
- Sub-DAGs
- Retries
- Pre and post scripts: editing your DAG
- SPLICEs: DAGs as subroutines

Throttles
Throttles control how fast DAGMan submits jobs:
- Max jobs idle: condor_submit_dag -maxidle XX work.dag
- Max_jobs_submitted
- Max scripts running: condor_submit_dag -maxpre XX -maxpost XX
Useful for a "big bag of tasks": the schedd holds everything in memory, so submitting everything at once can overwhelm it. (A sketch follows below.)
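A minimal sketch of throttled submission; the limits and DAG file name here are hypothetical:

  $ condor_submit_dag -maxidle 100 -maxpre 5 -maxpost 5 work.dag

DAGMan also supports per-category throttles inside the DAG file itself, via the CATEGORY and MAXJOBS keywords:

  # inside work.dag: at most 50 "bigbag" jobs in the queue at once
  CATEGORY A1 bigbag
  CATEGORY A2 bigbag
  MAXJOBS bigbag 50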

DAGMan variables

  # Diamond dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

DAGMan variables (cont)

  # Diamond dag
  Job A a.sub
  Job B a.sub
  Job C a.sub
  Job D a.sub
  VARS A OUTPUT="A.out"
  VARS B OUTPUT="B.out"
  VARS C OUTPUT="C.out"
  VARS D OUTPUT="D.out"
  Parent A Child B C
  Parent B C Child D

DAGMan variables (cont)

  # a.sub
  Universe = vanilla
  Executable = gronk
  Arguments = $(OUTPUT)
  queue
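Putting the two files together: all four nodes share the one submit file, and DAGMan expands $(OUTPUT) per node, so each job gets its own argument. A usage sketch (the DAG file name is hypothetical):

  $ condor_submit_dag diamond.dag
  # node A runs: gronk A.out
  # node B runs: gronk B.out
  # ...
  # node D runs: gronk D.out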

Retries
Failed nodes can be automatically retried a configurable number of times. Helps when jobs crash intermittently.

  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  RETRY D 5
  Parent A Child B C
  Parent B C Child D
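Retrying is wasted effort when a failure is permanent. DAGMan's UNLESS-EXIT qualifier skips further retries for a designated exit code; a small sketch (the code value 2 is a hypothetical choice):

  # retry node D up to 5 times, but give up immediately if the
  # job exits with code 2 (e.g., a permanent input error)
  RETRY D 5 UNLESS-EXIT 2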

DAGs without dependencies
Submit a DAG with 200,000 nodes and no dependencies, and use DAGMan to throttle the job submissions: HTCondor is scalable, but it will have problems if you submit 200,000 jobs simultaneously.
[diagram: nodes A1, A2, A3, … with no edges between them]
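Such a DAG is nothing but JOB lines, with no PARENT/CHILD lines, so it is easy to generate mechanically. A sketch with hypothetical file names:

  $ for i in $(seq 1 200000); do echo "JOB A$i a.sub"; done > nodep.dag
  $ head -3 nodep.dag
  JOB A1 a.sub
  JOB A2 a.sub
  JOB A3 a.sub
  $ condor_submit_dag -maxidle 1000 nodep.dag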

Shishkabob DAG
Used for breaking one long job into several short ones, which are easier to schedule.
[diagram: a linear chain A1 → A2 → A3]
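A minimal sketch of such a chain (file names hypothetical); each node starts only after the previous one finishes:

  # shishkabob.dag
  JOB A1 step.sub
  JOB A2 step.sub
  JOB A3 step.sub
  PARENT A1 CHILD A2
  PARENT A2 CHILD A3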

Sub-DAGs
Idea: any given DAG node can itself be another DAG:

  SUBDAG EXTERNAL Name DAG-file

The DAG node will not complete until the sub-DAG finishes. Interesting idea: a previous node could generate this DAG node (see the sketch below), which lets you:
- Keep the DAG structure simpler
- Implement a fixed-length loop
- Modify behavior on the fly
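A minimal sketch of that generation trick, with hypothetical file names. Because a SUBDAG EXTERNAL file is not read until its node is ready to run, an earlier node can write it:

  # outer.dag
  JOB Generate gen.sub
  SUBDAG EXTERNAL Inner inner.dag
  PARENT Generate CHILD Inner
  # gen.sub's job writes inner.dag, deciding at run time
  # how many nodes the inner DAG should contain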

Sub-DAGs
[diagram: an outer DAG with nodes A, B, C, D, where one node is a sub-DAG containing nodes V, W, X, Y, Z]

Sub-DAG Syntax

  # Diamond dag
  Job A a.sub
  Job B b.sub
  SUBDAG EXTERNAL C c.dag
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

DAGMan scripts
DAGMan allows pre and post scripts:
- Run before (pre) or after (post) the node's job
- Run on the same computer you submitted from
- Don't have to be scripts: any executable works
Syntax:

  JOB A a.sub
  SCRIPT PRE A before-script $JOB
  SCRIPT POST A after-script $JOB $RETURN

So What?
A pre script can make decisions:
- Where should my job run? (Particularly useful to make a job run in the same place as the last job.)
- What should my job do?
- Generate a sub-DAG
A post script can change the return value: DAGMan decides a node has failed when it returns a non-zero value, so the post script can look at the error code, output files, etc. and return zero or non-zero based on deeper knowledge (a sketch follows below).
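A minimal sketch of such a post script in shell; the success criterion here (a non-empty output file named after the node) is hypothetical:

  #!/bin/sh
  # after-script: invoked as "after-script $JOB $RETURN"
  node=$1      # DAGMan node name
  status=$2    # the job's exit code
  # Deeper knowledge: even if the job exited 0, fail the node
  # when its output file is missing or empty
  if [ "$status" -eq 0 ] && [ -s "$node.out" ]; then
      exit 0   # DAGMan marks the node as succeeded
  fi
  exit 1       # non-zero: DAGMan marks the node as failed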

SPLICEs: DAGs as subroutines

[diagram: the diamond DAG A, B, C, D repeated three times as splices, followed by a node E]

SPLICE Syntax

  SPLICE name dagfile.dag

Creates a new node with the DAG as the node. CHILD / PARENT / etc. all work on these nodes.

Example

  JOB E E.submit
  SPLICE DIAMOND diamond.dag
  PARENT DIAMOND CHILD E
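Conceptually, the example behaves like this hand-expanded DAG. The prefixed node names below are illustrative only; DAGMan generates its own internal names for spliced nodes:

  # equivalent, hand-expanded form of the spliced example
  JOB DIAMOND_A a.sub
  JOB DIAMOND_B b.sub
  JOB DIAMOND_C c.sub
  JOB DIAMOND_D d.sub
  JOB E E.submit
  PARENT DIAMOND_A CHILD DIAMOND_B DIAMOND_C
  PARENT DIAMOND_B DIAMOND_C CHILD DIAMOND_D
  PARENT DIAMOND_D CHILD E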

Let’s try it out! Exercises with DAGMan.

Questions? Comments? Feel free to ask me questions later: