
Condor DAGMan Warren Smith

12/11/2009 TeraGrid Science Gateways Telecon

Basics

Condor provides workflow support with DAGMan (Directed Acyclic Graph Manager)
–Each node has a Condor job
–Each edge is a dependency

Example DAGMan file:

Job A setup.condor
Job B sweep1.condor
Job C sweep2.condor
Job D analyze.condor
Parent A Child B C
Parent B C Child D

–Job is used to name Condor submit files
–Parent/Child specifies dependencies
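Each Job line in the DAG names an ordinary Condor submit file for that node. As a rough sketch (the universe, executable name, and file names here are illustrative assumptions, not taken from the slides), setup.condor for node A might look like:

```
# setup.condor - hypothetical submit file for node A of the example DAG
universe    = vanilla
executable  = setup.sh
output      = setup.out
error       = setup.err
log         = setup.log
queue
```

DAGMan runs condor_submit on a file like this when the node becomes runnable, and watches the job's log to learn when it finishes.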

Managing a DAG

condor_submit_dag
–Creates a local job to manage the DAG
–Monitors the jobs that make up the DAG
–Submits jobs when their dependencies are satisfied

lslogin2% condor_submit_dag example8.dag
Checking all your submit files for log file names.
This might take a while... Done
File for submitting this DAG to Condor : example8.dag.condor.sub
Log of DAGMan debugging messages : example8.dag.dagman.out
Log of Condor library output : example8.dag.lib.out
Log of Condor library error messages : example8.dag.lib.err
Log of the life of condor_dagman itself : example8.dag.dagman.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster

Use condor_q -dag and condor_q to monitor
Use condor_rm to remove the DAG or individual jobs in the DAG

DAGMan Node

Each node can have pre- and post-scripts
–Run on your submit system
–SCRIPT PRE JobName ExecutableName [arguments]
–SCRIPT POST JobName ExecutableName [arguments]

The Condor job runs only if the PRE script succeeds
The POST script runs after the node's job executes
The result of the POST script is the result of the node

Example:

Job A setup.condor
Job B sweep1.condor
Job C sweep2.condor
Job D analyze.condor
Script PRE A retrieve.sh
Script POST B check.sh $JOB $RETURN
Script POST C check.sh $JOB $RETURN
Script POST D archive.sh
Parent A Child B C
Parent B C Child D
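A POST script such as check.sh receives the values substituted for $JOB and $RETURN as its command-line arguments. A minimal sketch of that logic, written here as a shell function purely for illustration (a real check.sh would be a standalone script invoked by DAGMan), could be:

```shell
# Hypothetical POST-script logic; DAGMan would run the real script as
#   check.sh $JOB $RETURN
# and its exit status would become the result of the node.
check_node() {
    job="$1"    # node name (substituted for $JOB)
    ret="$2"    # job exit code (substituted for $RETURN)

    if [ "$ret" -ne 0 ]; then
        # A non-zero POST result marks the whole node as failed
        echo "node $job: job exited with code $ret" >&2
        return 1
    fi
    echo "node $job: job succeeded"
    return 0
}
```

Because the POST result is the result of the node, a script like this can also mask expected non-zero exit codes by returning 0 for them.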

Managing Failures

Retry statement for any node
–Retry B 3

POST script can analyze what happened and try to correct it
–Can be used with Retry

ABORT-DAG-ON to exit immediately if the DAG can't recover
–Triggered on the exit code of a node
–ABORT-DAG-ON B 12

Rescue DAG
–Condor executes a DAG as far as it can, even when individual nodes fail
–A rescue DAG is generated if the DAG didn't fully complete
–Includes comments and marks which nodes completed
–Can be resubmitted as-is, or edited and then submitted
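Combining these features with the example DAG from the Basics slide, a fault-tolerant DAG file might look like this (the retry counts and the abort exit value are illustrative assumptions):

```
Job A setup.condor
Job B sweep1.condor
Job C sweep2.condor
Job D analyze.condor
Parent A Child B C
Parent B C Child D
Retry B 3
Retry C 3
ABORT-DAG-ON A 12
```

Here nodes B and C are each retried up to three times, while an exit code of 12 from node A aborts the whole DAG immediately.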

TeraGrid Condor-G Matchmaking

Matchmaking: selecting a resource for a job
–A job provides requirements and preferences for a host
–A resource provides them for jobs
–Jobs are paired to resources by satisfying all requirements of both job and resource, and optimizing the preferences of each

TeraGrid supports matchmaking of Condor-G jobs
–Can be used with DAGMan
–Can't express everything you might want (for example, "run job A on the same machine as job B")
–Available from several TeraGrid systems
–Your Condor install can be authorized for matchmaking

System Information

You can find information about systems using condor_status
–Each row describes a queue on a system

lslogin2% condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
…
tacc.lonestar.deve LINUX X86_64 Unclaimed Idle :00:00
tacc.lonestar.high LINUX X86_64 Unclaimed Idle :00:00
tacc.lonestar.norm LINUX X86_64 Unclaimed Idle :00:00
tacc.lonestar.seri LINUX X86_64 Unclaimed Idle :00:00
…

Load average tries to describe how busy a queue is
–(slots used + slots requested) / slots used
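The load-average formula above can be sketched as a small shell helper (queue_load is a hypothetical name, not a Condor tool; awk is used only for the floating-point division):

```shell
# Hypothetical helper computing the queue load metric from the slide:
#   (slots used + slots requested) / slots used
queue_load() {
    used="$1"
    requested="$2"
    awk -v u="$used" -v r="$requested" 'BEGIN { printf "%.2f\n", (u + r) / u }'
}
```

For example, a queue with 4 slots in use and 4 more requested reports a load of 2.00, twice as much work as it can currently run.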

First Job with Matchmaking

executable = /bin/hostname
arguments = --fqdn
transfer_executable = false
output = example2.out
error = example2.err
log = example2.log
requirements = (Name=="tacc.lonestar.development")
universe = grid
x509userproxy = /home/teragrid/tg458637/.globus/userproxy.pem
grid_resource = $$(GramResource)
globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))
queue

Notes on First Job

x509userproxy
–Don't have to specify it when not matchmaking
–Need to specify it when matchmaking

Requirements
–What the job requires of any machine it gets matched to
–A boolean expression
–Variables such as Name refer to the machine being matched to
–$$() substitutions such as $$(Queue) also refer to the machine being matched to
–$$() is only needed outside of requirements

Second Job with Matchmaking

executable = /bin/hostname
arguments = --fqdn
transfer_executable = false
output = example2-$(CLUSTER).$(PROCESS).out
error = example2-$(CLUSTER).$(PROCESS).err
log = example2-$(CLUSTER).$(PROCESS).log
requirements = ((Name=="tacc.lonestar.development") || \
               (Name=="tacc.ranger.development") || \
               (Name=="loni-lsu.queenbee.workq") || \
               (Name=="ncsa.abe.debug") || \
               (Name=="ncsa.dtf.debug") || \
               (Name=="sdsc.dtf.dque") || \
               (Name=="purdue.steele.tg_workq"))
rank = LoadAvg - CurMatches * 0.25
universe = grid
x509userproxy = /home/teragrid/tg458637/.globus/userproxy.pem
grid_resource = $$(GramResource)
globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))
queue 10

Notes on Second Job

$(Cluster)
–The ID for the set of jobs submitted by this script
$(Process)
–The ID (0 to n-1) of a job within a cluster

Requirements is (mostly) a boolean expression
–() to group, || for or, && for and
–Comparison operators: <, <=, >, >=, ==, !=
–A few others, since expressions actually have 3 values: true, false, undefined

Rank expression is used to identify the best machine
–Higher rank is better
–CurMatches is the number of jobs matched to that machine in the current round
–100 is the max load average

queue 10
–Submits 10 copies of this job

Ranking Machines

It's a bit of an art at this point
–Let me know what works for you
–Let Warren know if additional information is needed

Working on providing queue wait time predictions
–QBETS predictions have been available
–Most likely moving to a new technology over the next few months

Additional Information

Condor User Guide

TeraGrid
–Condor-G page
–Condor-G matchmaking wiki
–condorg_userguide