What’s New in DAGMan HTCondor Week 2013

Slides:



Advertisements
Similar presentations
EGEE is a project funded by the European Union under contract IST EGEE Tutorial Turin, January Hands on Job Services.
Advertisements

Using EC2 with HTCondor Todd L Miller 1. › Introduction › Submitting an EC2 job (user tutorial) › New features and other improvements › John Hover talking.
DAGMan Hands-On Kent Wenger University of Wisconsin Madison, Madison, WI.
Slide 1 of 10 Job Event Basics A Job Event is the name for the collection of components that comprise a scheduled job. On the iSeries a the available Job.
Condor Project Computer Sciences Department University of Wisconsin-Madison Stork An Introduction Condor Week 2006 Milan.
Condor DAGMan Warren Smith. 12/11/2009 TeraGrid Science Gateways Telecon2 Basics Condor provides workflow support with DAGMan Directed Acyclic Graph –Each.
Intermediate HTCondor: More Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Intermediate Condor: DAGMan Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Reliability and Troubleshooting with Condor Douglas Thain Condor Project University of Wisconsin PPDG Troubleshooting Workshop 12 December 2002.
Server selection Multiple servers Add a server UDN selection Channel selection Time selection Duration selection Channel window Time window Current time.
Intermediate HTCondor: Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Communicating with Users about HTCondor and High Throughput Computing Lauren Michael, Research Computing Facilitator HTCondor Week 2015.
Utilizing Condor and HTC to address archiving online courses at Clemson on a weekly basis Sam Hoover 1 Project Blackbird Computing,
CONDOR DAGMan and Pegasus Selim Kalayci Florida International University 07/28/2009 Note: Slides are compiled from various TeraGrid Documentations.
High Throughput Computing with Condor at Purdue XSEDE ECSS Monthly Symposium Condor.
HTCondor workflows at Utility Supercomputing Scale: How? Ian D. Alderman Cycle Computing.
1 Shell Programming – Extra Slides. 2 Counting the number of lines in a file #!/bin/sh #countLines1 filename=$1#Should check if arguments are given count=0.
Bigben Pittsburgh Supercomputing Center J. Ray Scott
Experiences with a HTCondor pool: Prepare to be underwhelmed C. J. Lingwood, Lancaster University CCB (The Condor Connection Broker) – Dan Bradley
Part 6: (Local) Condor A: What is Condor? B: Using (Local) Condor C: Laboratory: Condor.
Part 8: DAGMan A: Grid Workflow Management B: DAGMan C: Laboratory: DAGMan.
Intermediate Condor: Workflows Rob Quick Open Science Grid Indiana University.
HTCondor and Workflows: An Introduction HTCondor Week 2015 Kent Wenger.
Intermediate HTCondor: More Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Intermediate Condor: Workflows Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
E-infrastructure shared between Europe and Latin America 1 Workload Management System-WMS Luciano Diaz Universidad Nacional Autónoma de México - UNAM Mexico.
Peter F. Couvares Computer Sciences Department University of Wisconsin-Madison Condor DAGMan: Managing Job.
Grid Compute Resources and Job Management. 2 Job and compute resource management This module is about running jobs on remote compute resources.
Condor Project Computer Sciences Department University of Wisconsin-Madison Condor and DAGMan Barcelona,
Peter Couvares Computer Sciences Department University of Wisconsin-Madison Condor DAGMan: Introduction &
Grid Compute Resources and Job Management. 2 Grid middleware - “glues” all pieces together Offers services that couple users with remote resources through.
If condition1 then statements elif condition2 more statements […] else even more statements fi.
Group, group, group One after the other: cmd1 ; cmd2 One or both: cmd1 && cmd2 Only one of them: cmd1 || cmd2 Cuddling (there):( cmd1 ; cmd2 ) Cuddling.
HUBbub 2013: Developing hub tools that submit HPC jobs Rob Campbell Purdue University Thursday, September 5, 2013.
JSS Job Submission Service Massimo Sgaravatto INFN Padova.
HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.
1 Getting Started with OSG Connect ~ an Interactive Tutorial ~ Emelie Harstad, Mats Rynge, Lincoln Bryant, Suchandra Thapa, Balamurugan Desinghu, David.
HTCondor and Workflows: Tutorial HTCondor Week 2016 Kent Wenger.
Exploreengage elevate explore engage elevate Presented By: Laura Murphy, Turnkey Technologies.
Improvements to Configuration
Intermediate HTCondor: More Workflows Monday pm
Condor DAGMan: Managing Job Dependencies with Condor
Operations Support Manager - Open Science Grid
Scheduling Policy John (TJ) Knoeller Condor Week 2017.
Intermediate HTCondor: Workflows Monday pm
Scheduling Policy John (TJ) Knoeller Condor Week 2017.
Examples Example: UW-Madison CHTC Example: Global CMS Pool
WP1 WMS release 2: status and open issues
An Introduction to Workflows with DAGMan
Workload Management System ( WMS )
HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017.
Chapter 11 – Processes and Services
Tango Administrative Tools
Things you may not know about HTCondor
I2G CrossBroker Enol Fernández UAB
Grid Compute Resources and Job Management
CREAM-CE/HTCondor site
Using Stork An Introduction Condor Week 2006
Unix Scripting Session 4 March 27, 2008.
Software Testing With Testopia
UNIX PROCESSES.
HTCondor and Workflows: An Introduction HTCondor Week 2013
Troubleshooting Your Jobs
University of Wisconsin-Milwaukee
HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.
GENIUS Grid portal Hands on
CST8177 Scripting 2: What?.
Condor-G Making Condor Grid Enabled
Frieda meets Pegasus-WMS
Submitting with Subway
Presentation transcript:

What’s New in DAGMan HTCondor Week 2013 Kent Wenger

DAGMan Introduction A B C D Directed Acyclic Graph Manager HTCondor’s workflow management tool User specifices dependencies between jobs, DAGMan manages them automatically Features to handle many variations/special cases Scales to very large workflows Handles many error conditions See Tuesday tutorial slides 2

Workflow log file Greatly reduces DAGMan’s file descriptor usage User log names in submit files can include macros Control with DAGMAN_ALWAYS_USE_NODE_LOG, -dont_use_default_node_log New in 7.9.0 Must be disabled with pre-7.9.0 schedd Bug caused HTCondor-C jobs to fail (7.9.0-7.9.5) If dagman_log is specified in submit file, DAGMan’s file “wins” 3

Top-level VARS setting applies to splices foo.dag: Splice bar bar.dag Vars bar+baz state="Wisconsin" bar.dag Job baz baz.sub baz.sub: arguments = $(state) Not for sub-DAGs New in 7.9.0 4

Set job attributes w/ VARS Begin macroname with a “+” character to define a ClassAd attribute For example, the following VARS specification: Vars NodeE +A="\"bob\"" would allow the HTCondor submit description file for NodeE to use the following line: arguments = "$$([A])" Like +A=“bob” in submit file Doesn’t work for scheduler universe New in 7.9.4 5

Suppressing emails from node jobs Config: DAGMAN_SUPPRESS_NOTIFICATION Command line: -suppress_notification -dont_suppress_notification Default is suppressing notification New in 7.9.1 Default for all jobs became NEVER in 7.9.2 6

Status in DAGMan’s ClassAd > condor_q -l 59 | grep DAG_ DAG_Status = 0 DAG_InRecovery = 0 DAG_NodesUnready = 1 DAG_NodesReady = 4 DAG_NodesPrerun = 2 DAG_NodesQueued = 1 DAG_NodesPostrun = 1 DAG_NodesDone = 3 DAG_NodesFailed = 0 DAG_NodesTotal = 12 Sub-DAGs count as one node New in 7.9.5 7

More info in node status file … Nodes total: 12 Nodes done: 8 Nodes pre: 0 Nodes queued: 3 Nodes post: 0 Nodes ready: 0 Nodes un-ready: 1 Nodes failed: 0 New in 7.9.3 8

DAGMAN_USE_STRICT defaults to 1 Questionable settings that might cause subtle problems become immediate fatal errors (instead of just warnings) DAGMAN_USE_STRICT range is 0-3 Default setting of 1 is new in 7.9.4 (was 0) 9

Log files in /tmp are errors Default node log or individual job logs Can cause DAG to fail (because /tmp may get cleared out) Setting DAGMAN_USE_STRICT to 0 allows DAG to run (dangerously) New in 7.9.4 10

Minor changes DAGMAN_LOG_ON_NFS_IS_ERROR is ignored when both CREATE_LOCKS_ON_LOCAL_DISK and ENABLE_USERLOG_LOCKING are True DAGMan will now try twice to write a POST script terminated event, rather than trying once and exiting 11

Relevant Links DAGMan: http://research.cs.wisc.edu/htcondor/dagman/dagman.html For more questions: htcondor-admin@cs.wisc.edu 12