Working with Condor

Links: Condor’s homepage:  Condor manual (for the version currently used): 

Table of contents: Condor overview, Useful Condor commands, Vanilla universe, Macros, Standard universe, Java universe, Matlab in Condor, ClassAds, DagMan

Condor overview Condor is a system for running lots of jobs on a (preferably large) cluster of computers. Condor is a specialized workload management system for compute-intensive jobs.

Condor overview Condor's inner structure: Condor is built of several daemons:
condor_master: responsible for keeping all the rest of the Condor daemons running.
condor_startd: represents a given machine to the Condor pool. It advertises attributes about the machine it is running on. Must run on machines accepting jobs.
condor_schedd: responsible for submitting jobs to Condor. It manages the job queue (each submit machine has its own!). Must run on machines submitting jobs.
condor_collector: runs only on the Condor server. Responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send updates to the collector.
condor_negotiator: runs only on the Condor server. Responsible for all the match-making within the Condor system.
condor_ckpt_server: runs only on the checkpointing server. This is the checkpoint server; it services requests to store and retrieve checkpoint files.
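To see which of these daemons are actually running on a particular Linux machine, a quick generic check (an ordinary shell command, not a Condor-specific tool) is simply:
ps -ef | grep condor_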

Condor overview Condor uses user priorities to allocate machines to users in a fair manner.
A lower numerical value for user priority means higher priority.
Each user starts out with the best user priority, 0.5.
If the number of machines a user currently has is greater than his priority, then his user priority will worsen (numerically increase) over time.
If the number of machines a user currently has is lower than his priority, then his priority will improve over time.
Use condor_userprio [-allusers] to see user priorities.
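As a rough illustration of the fair-share arithmetic (a simplification of what the negotiator actually does): machines are handed out in inverse proportion to priority values, so a user with effective priority 0.5 would be offered about twice as many machines as a user whose priority has worsened to 1.0, assuming both have enough idle jobs waiting.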

Useful Condor commands condor_status
Shows all of the computers connected to Condor (not all of them are accepting jobs).
Useful arguments:
-claimed: shows only machines running Condor jobs (and who runs them).
-available: shows only machines which are willing to run jobs now.
-long: displays entire ClassAds (discussed later on).
-constraint <expression>: shows only resources matching the given expression.
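For example (the host name and the constraint expression are just illustrations):
condor_status -claimed
condor_status -available
condor_status -constraint 'Arch == "X86_64" && Memory >= 1024'
condor_status -long somehost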

Useful Condor commands condor_status
Attributes Arch:
INTEL means a 32-bit Linux machine
X86_64 means a 64-bit Linux machine
Activity:
"Idle": there is no job activity
"Busy": a job is busy running
"Suspended": a job is currently suspended
"Vacating": a job is currently checkpointing
"Killing": a job is currently being killed
"Benchmarking": the startd is running benchmarks

Useful Condor commands condor_status
More attributes State:
"Owner": the machine owner is using the machine, and it is unavailable to Condor.
"Unclaimed": the machine is available to run Condor jobs, but a good match is either not available or not yet found.
"Matched": the Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
"Claimed": the machine is claimed by a remote machine and is probably running a job.
"Preempting": a Condor job is being preempted (possibly via checkpointing) in order to clear the machine, either for a higher-priority job or because the machine owner wants the machine back.

Useful Condor commands condor_q
Shows the state of jobs submitted from the calling computer (the one running condor_q).
Useful arguments:
-analyze: performs a schedulability analysis on jobs. Useful to see why a scheduled job isn't running, and whether it is ever going to run.
-dag: sorts DAG jobs under their DAGMan.
-constraint <expression>: shows only jobs matching the given ClassAd expression.
-global (-g): gets the global queue.
-run: gets information about running jobs.
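For example (1234.0 is a placeholder job id):
condor_q
condor_q -analyze 1234.0
condor_q -global -run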

Useful Condor commands condor_rm
Removes a scheduled job from the queue (of the scheduling computer).
condor_rm cluster.proc: remove the given job
condor_rm cluster: remove the given cluster of jobs
condor_rm user: remove all jobs owned by user
condor_rm -all: remove all jobs

Vanilla universe jobs The Vanilla universe is used for running jobs without special needs and features. In the Vanilla universe Condor runs the job the same way it would run without Condor. Start with a simple example.c:
#include <stdio.h>

int main() {
    printf("hello condor");
    return 0;
}

Compile as usual: gcc example.c -o example

Vanilla universe jobs In order to submit the job to Condor we use the condor_submit command. Usage: condor_submit <submit description file>. A simple submit file (sub_example):
Universe = Vanilla
Executable = example
Log = test.log
Output = test.out
Error = test.error
Queue
Notice that the submission commands are case insensitive.
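Submitting the job and then checking on it looks like this:
condor_submit sub_example
condor_q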

Vanilla universe jobs There are a few other useful commands:
arguments = arg1 arg2 ...: run the executable with the given arguments.
Input = <filename>: the given file is used as standard input.
environment = "<name>=<value> <name>=<value> ...": runs the job with the given environment variables.
In order to use spaces inside an entry, surround it with single quote marks.
To insert a quotation mark, use a double quote mark twice, for example: environment = "a=""quote"" b='a ''b'' c'"
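Putting these together, a submit file fragment might look like the following (the file names and values are invented for illustration):
executable  = example
arguments   = alpha 42
input       = data.in
environment = "MODE=fast TMPDIR=/tmp"
queue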

Vanilla universe jobs getenv = <True | False>
If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job ClassAd. The job will therefore execute with the same set of environment variables that the user had at submit time.
Defaults to False.

Vanilla universe jobs A more advanced submission:
Universe = Vanilla
Executable = example
Log = test.$(cluster).$(process).log
Output = test.$(cluster).$(process).out
Error = test.$(cluster).$(process).error
Queue 7
Here we see a use of predefined macros: $(cluster) gives us the value of the ClusterId job ClassAd attribute, and $(process) supplies the value of the ProcId job ClassAd attribute.

Macros More on Macros:
A macro is defined as follows: <macro_name> = string
It can then be used by writing $(macro_name).
$$(attribute) is used to get a ClassAd attribute of the machine running the job.
$ENV(variable) gives us the environment variable 'variable', taken from the submitting machine's environment at submit time.
For more on macros see the condor_submit section of the Condor manual.
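A sketch of how these could appear in a submit file (the macro, attribute and directory names are just examples):
# user-defined macro
results_dir = results
log        = $(results_dir)/test.$(cluster).$(process).log
# $$() is resolved from the matched machine's ClassAd
arguments  = --os $$(OpSys)
# $ENV() is taken from the submitting user's environment
initialdir = $ENV(HOME)/condor_jobs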

Other universes Standard universe Java universe

Standard universe The Standard universe provides checkpointing and remote system calls.
Remote system calls: all system calls made by the job running in Condor are made on the submitting computer.
Checkpointing: save a snapshot of the current state of the running job, so the job can be restarted from the saved state in case of: migration to another computer, machine crash or failure.

Standard universe In order to execute a program in the Standard universe it must be relinked with Condor's library. To do so, use condor_compile with your usual link command. Example:
condor_compile gcc example.c
To manually cause a checkpoint use condor_checkpoint hostname.
There are some restrictions on jobs running in the standard universe:
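A minimal standard-universe submit file for the relinked binary might look like this (a sketch, assuming the executable was linked as "example"):
universe   = standard
executable = example
log        = std.log
output     = std.out
error      = std.error
queue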

Standard universe - restrictions Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system(). Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory. Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed. Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

Standard universe - restrictions Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed. Memory mapped files are not allowed. This includes system calls such as mmap() and munmap(). File locks are allowed, but not retained between checkpoints. All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error. Your job must be statically linked (on Digital Unix (OSF/1), HP-UX, and Linux, and therefore in our school). Reading from or writing to files larger than 2 GB is not supported.

Java universe Used to run Java programs. Example submit description file:
universe = java
executable = Example.class
arguments = Example
output = Example.output
error = Example.error
queue
Notice that the first argument is the main class of the job.
The JVM must be informed when submitting jar files; this is done in the following way: jar_files = example.jar
To run on a machine with a specific Java version: Requirements = (JavaVersion == "1.5.0_01")
Options to the Java VM itself can be set in the submit description file: java_vm_args = -DMyProperty=Value -verbose:gc ...
These options go after the java command but before the main class (usage: java [options] class [args...]). Do not use this to set the classpath (Condor handles that itself).
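For instance, a submit file that uses a jar and passes a JVM option could look roughly like this (the file names and the heap flag are illustrative only):
universe     = java
executable   = Example.class
arguments    = Example input.txt
jar_files    = example.jar
java_vm_args = -Xmx256m
output       = Example.output
error        = Example.error
queue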

Matlab Functions Matlab functions/scripts are written in .m files. Structure:
function [ret_var =] func_name(arg1, arg2, ...)
...
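A toy function following this structure (purely illustrative):
function s = addnums(a, b)
% ADDNUMS  Return the sum of two numbers.
s = a + b;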

Running Matlab functions in Condor First method: calling Matlab.
What we want to do is run: matlab -nodisplay -nojvm -nosplash -r 'func(arg1, arg2, ...)'
Instead of transferring the Matlab executable we'll write a script (run.csh):
#!/bin/csh -f
matlab -nodisplay -nojvm -nosplash -r "$*"

Running Matlab functions in Condor First method: calling Matlab.
The submission file:
executable = run.csh
log        = mat.log
error      = mat.error
output     = mat.output
universe   = vanilla
getenv     = True
arguments  = func(arg1, arg2, ...)
queue 1
Notice that in order to run Matlab we must set getenv = True.

Running Matlab functions in Condor Second method: compiling the function.
First, we compile our Matlab script, example.m, into an executable: mcc -mv example.m
The -v option is not mandatory; it is used to show details of the compilation process.
The files required for running will be "example" and example.ctf.
The compiled function requires Matlab's shared libraries in order to run, so we'll send Condor a script which defines the necessary environment variables and then runs the executable.

Running Matlab functions in Condor Second method: compiling the function.
The script:
#!/bin/tcsh
setenv LD_LIBRARY_PATH /usr/local/stow/matlab R14SP2/lib/matlab R14SP2/bin/glnx86:/usr/local/stow/matlab R14SP2/lib/matlab R14SP2/sys/os/glnx86:/usr/local/stow/matlab R14SP2/lib/matlab R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:/usr/local/stow/matlab R14SP2/lib/matlab R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386:/usr/local/stow/matlab R14SP2/lib/matlab R14SP2/sys/opengl/lib/glnx86:
setenv XAPPLRESDIR /usr/local/stow/matlab R14SP2/lib/matlab R14SP2/X11/app-defaults
setenv LD_PRELOAD /lib/libgcc_s.so.1
./multi $1 $2

ClassAds ClassAds are a flexible mechanism for representing the characteristics and constraints of machines and jobs in the Condor system Condor acts as a matchmaker for ClassAds. ClassAds are analogous to the classified advertising section in a newspaper. All machines running Condor advertise their attributes. A machine also advertises under what conditions it is willing to run a job, and what type of job it would prefer. When submitting a job, you specify your requirements and preferences. These attributes are bundled up into a job ClassAd.

ClassAds ClassAd expressions are formed by composing literals, attribute references and other sub-expressions with operators and functions.
Literals may be:
Integer (including TRUE, which is 1, and FALSE, which is 0)
Real
String: a list of characters between two double-quote characters. Use \ to include the following character in the string, irrespective of what that character is.
The UNDEFINED keyword (case insensitive)
The ERROR keyword (case insensitive)

ClassAds Attributes: a pair (name, expression) is called an attribute. The attribute name is case insensitive. An optional scope resolution prefix may be added: "MY." and "TARGET.".
MY. refers to an attribute defined in the current ClassAd.
TARGET. refers to an attribute defined in the ClassAd against which the current ClassAd is being evaluated.
If no scope prefix is given, first try "MY."; if not found, try "TARGET."; if not found, try the ClassAd environment; if still not found, the value is UNDEFINED.
If there is a circular dependency between two ClassAds (e.g. A uses B and B uses A) then the value is ERROR.

ClassAds Operators: the operators are similar to those of the C language. All operators are case insensitive for strings, with the following exceptions:
=?= : the "is identical to" operator (similar to ==)
=!= : the "is not identical to" operator (similar to !=)
Operator precedence is essentially that of C; see the ClassAd section of the Condor manual for the full precedence table.

ClassAds Predefined functions. Examples:
Integer strcmp(AnyType Expr1, AnyType Expr2)
String strcat(AnyType Expr1 [, AnyType Expr2 ...])
Boolean isInteger(AnyType Expr)
Function names are case insensitive. For a full list of the functions, refer to the user manual.
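As a hedged illustration (not from the slides), such functions can be used directly inside submit-file expressions, e.g.:
requirements = (strcmp(TARGET.Arch, "X86_64") == 0)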

ClassAds When submitting a job, one gives requirements; only machines satisfying them may run the job. One can also rank the machines available to run the job, and the highest-ranked machine is chosen to run it. This is done using the Requirements and Rank commands in the submission file.

ClassAds submission commands Requirements = <expression>
The job will run on a machine only if the requirements expression evaluates to TRUE on that machine.
Example: requirements = Memory >= 64 && Arch == "intel"
The running machine must have at least 64 MB of RAM and an INTEL architecture.
The computers in our school have two possible architecture names: "INTEL" if it's a 32-bit computer or "X86_64" if it's a 64-bit computer.

ClassAds submission commands By default Condor adds the following to the requirements of a job:
Arch and OpSys the same as on the submitting computer.
Disk >= DiskUsage. The DiskUsage attribute is initialized to the size of the executable plus the size of any files specified in a transfer_input_files command.
(Memory * 1024) >= ImageSize, to ensure the target machine has enough memory to run your job.
If Universe is set to Vanilla, FileSystemDomain is set equal to the submit machine's FileSystemDomain.
In order to see a submitted job's requirements (along with everything else about the job) use condor_q -l.
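For example (1234.0 is a placeholder job id):
condor_q -l 1234.0 | grep -i requirements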

ClassAds submission commands rank = <expression>
Sorts all matching machines by the given expression; Condor will give the job the machine with the highest rank.
The expression is a numeric expression (boolean sub-expressions evaluate to 1.0 or 0.0).
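For instance, this (illustrative) rank prefers machines with more memory and strongly favours 64-bit machines:
rank = Memory + 10000 * (Arch == "X86_64")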

DagMan Use a directed acyclic graph (DAG) to represent a set of jobs to be run in a certain order. A basic DAG submit file:
JOB name1 submit_file1
JOB name2 submit_file2
...
If "DONE" is specified at the end of a JOB line then that job is considered complete and is not submitted.

DagMan Additional DAG commands:
SCRIPT: sets processing to be done before/after running the job. These "scripts" run on the submitting machine.
SCRIPT PRE job_name executable [arguments]
Runs the executable before job_name is submitted.
SCRIPT POST job_name executable [arguments]
Runs the executable after job_name has completed its execution under Condor.
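A small sketch with made-up script names:
SCRIPT PRE  A fetch_input.sh dataset1
SCRIPT POST A check_output.sh A.out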

DagMan Additional DAG commands:
PARENT ... CHILD: used to describe the dependencies between the jobs.
PARENT p1 p2 ... CHILD c1 c2 ...
Makes all the p's parents of all the c's (i.e. the c's will be submitted only after all the p's have completed their execution).
RETRY
RETRY jobName NumOfRetries [UNLESS-EXIT value]
If the job fails, it is run again at most NumOfRetries times.
If UNLESS-EXIT is specified and the value returned by the job equals value, then no further retries will be attempted.

DagMan Additional DAG commands:
VARS: defines macros that can be used in the submit description file of a job.
VARS jobName macroname = "string" [macroname2 = "string" ...]
ABORT-DAG-ON: aborts the entire DAG if a specific node returns a specific value. Stops all nodes within the DAG immediately, including nodes currently running.
ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]
By default the return value of the DAG is the value returned from the aborted node. If RETURN is specified then the return value is DAGReturnValue.
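For example (the macro name and file name are invented), the DAG line
VARS A infile="dataset_A.txt"
lets a.submit refer to the value with $(infile), e.g. arguments = $(infile).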

DagMan Example DAG file:
JOB A a.submit
JOB B b.submit
JOB C a.submit
PARENT A CHILD B C
RETRY C 3
ABORT-DAG-ON A 2
Submission of DAGs is done with: condor_submit_dag file.dag
In order to specify the maximum number of jobs submitted by the DagMan, add the argument: -maxjobs numOfJobs
If any node in a DAG fails, the DagMan continues to run the remainder of the nodes until no more forward progress can be made. Then it creates a rescue file (input_file.rescue), in which, for each node that completed its execution, the corresponding JOB line ends with DONE. Submitting this file continues the DAG execution.

DagMan It is possible to create a visualization of the DAG:
Add a line to the DAG file: DOT dot_file.dot
Submit the DAG
dot -Tps dot_file.dot -o dag.ps
A DAG inside a DAG:
Suppose you want to include inner.dag in outer.dag
Execute condor_submit_dag -no_submit inner.dag
Include the following JOB line in outer.dag: JOB jobName inner.dag.condor.sub
inner.dag.condor.sub is the submission file for inner.dag