Alain Roy Computer Sciences Department University of Wisconsin-Madison 24-June-2002 Using and Administering Condor
Good evening! › Thank you for having me! › I am: Alain Roy, Computer Science Ph.D. in Quality of Service with the Globus Project, now working with the Condor Project
Condor Tutorials Remaining › Monday (Today) 17:00-19:00: Using and administering Condor › Tuesday 17:00-19:00: Using Condor on the Grid
Review: What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility. Run lots of jobs over a long period of time, Not a short burst of “high-performance” › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy
Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines… Crash (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & managed by someone else
What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be: Dedicated Opportunistic
What’s Condor Good For? › Managing a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and sends you e-mail when they complete Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. Condor can handle inter-job dependencies (DAGMan)
What’s Condor Good For? (cont’d) › Robustness Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion If an execute machine crashes, you only lose work done since the last checkpoint Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover (Story)
What’s Condor Good For? (cont’d) › Giving your job the agility to access more computing resources Checkpointing allows your job to run on “opportunistic resources” (not dedicated) Checkpointing also provides “migration” - if a machine is no longer available, move! With remote system calls, run on systems which do not share a filesystem - You don’t even need an account on a machine where your job executes
Other Condor features › Implement your policy on when the jobs can run on your workstation › Implement your policy on the execution order of the jobs › Keep a log of your job activities
A Condor Pool In Action
A Bit of Condor Philosophy › Condor brings more computing to everyone A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done. A large collaboration can use Condor to control its dedicated pool with hundreds of machines.
The Idea › Computing power is everywhere; we try to make it usable by anyone.
Remember Frieda? Today we’ll revisit Frieda’s Condor explorations in more depth
I have 600 simulations to run. Where can I get help?
Install a Personal Condor!
Installing Condor › Download Condor for your operating system › Available as a free download from the Condor web site › Available for most Unix platforms and Windows NT
So Frieda Installs Personal Condor on her machine… › What do we mean by a “Personal” Condor? Condor on your own workstation, no root access required, no system administrator intervention needed—easy to set up.
Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?
Your Personal Condor will... › Keep an eye on your jobs and will keep you posted on their progress › Keep a log of your job activities › Add fault tolerance to your jobs › Implement your policy on when the jobs can run on your workstation
What’s in a Personal Condor? › Everything that is in Condor, just on one machine. › Condor daemons: Condor_master Condor_collector—Stores ClassAds for jobs, machines Condor_negotiator—Matchmaking Condor_schedd—Submits, monitors jobs Condor_startd—Starts jobs Condor_starter—Launches a job Condor_shadow—Monitors remote job
A Condor Pool of One (Diagram: a single machine running condor_master, condor_collector, condor_negotiator, condor_schedd, condor_shadow, condor_startd, condor_starter, and the Condor job itself.)
condor_master › Starts up all other Condor daemons › If there are any problems and a daemon exits, it restarts the daemon and sends e-mail to the administrator › Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
condor_master (cont’d) › Acts as the server for many Condor remote administration commands: condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
condor_startd › Represents a machine to the Condor system › Responsible for starting, suspending, and stopping jobs › Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon)
condor_schedd › Represents users to the Condor system › Maintains the persistent queue of jobs › Responsible for contacting available machines and sending them jobs › Services user commands which manipulate the job queue: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …
condor_collector › Collects information from all other Condor daemons in the pool “Directory Service” / Database for a Condor pool › Each daemon sends a periodic update called a “ClassAd” to the collector › Services queries for information: Queries from other Condor daemons Queries from users (condor_status)
condor_negotiator › Performs “matchmaking” in Condor › Gets information from the collector about all available machines and all idle jobs › Tries to match jobs with machines that will serve them › Both the job and the machine must satisfy each other’s requirements
Frieda wants more… › She decides to use the graduate students’ computers when they aren’t in use, and get her work done sooner. › In exchange, they can use the Condor pool too.
Frieda’s Condor pool… (Diagram: Frieda’s computer acts as the Central Manager; the graduate students’ desktop computers join the pool.)
A larger Condor pool (Diagram: a Submitter runs condor_master, condor_schedd, condor_shadow; the Central Manager runs condor_master, condor_negotiator, condor_collector; a Submitter/Executor runs condor_master, condor_schedd, condor_startd, condor_shadow, condor_starter, and a Condor job; an Executor runs condor_master, condor_startd, condor_starter, and a Condor job.)
Happy Day! Frieda’s organization purchased a Beowulf Cluster! › Other scientists in her department have realized the power of Condor and want to share it. › The Beowulf cluster and the graduate student computers can be part of a single Condor pool.
Frieda’s Condor pool… (Diagram: the Central Manager, the graduate students’ desktop computers, and the Beowulf cluster, all in one pool.)
How would you set it up? › Grad student machines: Submitters and Executors › Beowulf cluster machines: Executors only › An independent machine for the collector/negotiator: it’s a big job, so take it off of Frieda’s computer; the collector and negotiator could even be split onto separate machines (see the sketch below)
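A minimal configuration sketch of that layout (the host name is made up; DAEMON_LIST and CONDOR_HOST are standard Condor configuration macros, shown here only to illustrate the division of roles):

# Central manager (a dedicated machine, not Frieda's desktop)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# Grad student desktops (submit and execute)
DAEMON_LIST = MASTER, SCHEDD, STARTD

# Beowulf cluster nodes (execute only)
DAEMON_LIST = MASTER, STARTD

# On every machine, point at the central manager
CONDOR_HOST = central-manager.frieda-lab.edu

Each DAEMON_LIST above belongs in the local configuration file of the corresponding machine type.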
Frieda collaborates… › She wants to share her Condor pool with scientists from another lab.
Condor Flocking › Condor pools can work cooperatively
How would you set it up? › Two independent pools Each has its own collector/negotiator › Set up flocking from one pool to another (by machine or by pool) with the FLOCK_TO and FLOCK_FROM settings (sketch below) › Can be uni- or bi-directional
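A sketch of that configuration, assuming FLOCK_TO is set on Frieda's submit machines and FLOCK_FROM on the other pool's side (pool and domain names are made up):

# On Frieda's submit machines: try the other lab's pool when ours is busy
FLOCK_TO = condor.otherlab.edu

# In the other lab's pool: accept flocked jobs from Frieda's machines
FLOCK_FROM = *.frieda-lab.edu

Making it bi-directional just means setting the mirror-image macros in the other pool as well.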
Questions So Far?
How do you run a job? › It doesn’t matter if you have: Personal Condor Large Condor pool Condor pool with flocking › Four steps 1. Write program 2. Write submit file 3. Give it to Condor 4. Condor gives you the results
Step 1: Writing a program › Condor has universes Vanilla Universe: Run anything Less capable Java Universe: Works better for Java Standard Universe: Checkpointing Remote I/O Can’t work with all programs
Step 1: Vanilla Universe › You can run any program C/C++/Perl/Python/Fortran/Java/Lisp… No checkpointing: if your job is interrupted or the machine crashes, Condor has to restart it from the beginning. Can do anything you could do if you were logged in.
Step 1: Java Universe › Works better for Java programs › Checks for valid Java environment › Distinguishes Java environment exceptions from program exceptions (wrapper program) › No checkpointing (it could happen though) › Remote I/O
Step 1: Standard Universe › Requires re-linking your program: condor_compile gcc -o simple simple.o › Allows checkpointing and remote I/O › Restrictions on behavior: No threading Limited networking Restrictions on compiler used
Step 2: Write submit file

Executable   = simple
Universe     = vanilla
Arguments    = First
Log          = simple.log
Output       = simple.output
Error        = simple.error
Requirements = Memory > 512
Queue

Note: This assumes a shared filesystem
Step 2: Write submit file

Executable            = simple
Universe              = vanilla
Arguments             = First
Log                   = simple.log
Output                = simple.output
Error                 = simple.error
Transfer_input_files  = data.in
Transfer_output_files = data.out
Requirements          = Memory > 512
Queue

Note: This does not assume a shared filesystem
Step 2: Write submit file

Executable   = simple
Universe     = standard
Arguments    = First
Log          = simple.log
Output       = simple.output
Error        = simple.error
Requirements = Memory > 512
Queue

Note: This does not assume a shared filesystem, but uses remote I/O
Step 2: Submit Files › Condor is helpful: it expands your Requirements into a complete expression: Requirements = memory > 512 becomes… Requirements = (OpSys == "Linux") && (memory > 512) && … › Queue can take a parameter (example below; more later) › A single file can submit many jobs
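For example, a sketch of a submit file whose Queue statement takes a count (the file names are illustrative); each queued process gets its own number in $(Process):

Executable = simple
Universe   = vanilla
Arguments  = $(Process)
Output     = simple.$(Process).out
Error      = simple.$(Process).err
Log        = simple.log
Queue 3

This queues three jobs (processes 0, 1, and 2) in a single cluster, each writing to its own output and error files.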
Step 3: Give it to Condor › condor_submit submit.desc › condor_q

-- Submitter: dsonokwa.cs.wisc.edu : : dsonokwa.cs.wisc.edu
 ID    OWNER   SUBMITTED  RUN_TIME  ST PRI SIZE CMD
 5.0   roy     6/15 20:   :00:02    R           simple First

1 jobs; 0 idle, 1 running, 0 held
Step 4: Condor gives it back › The program’s output is where you asked it to be. › Condor left a log file documenting what it did. › Condor optionally sends you an e-mail telling you it’s done.
Step 4: Condor gives it back

000 ( )  06/15 21:00:01  Job submitted from host:
001 ( )  06/15 21:00:01  Job executing on host:
005 ( )  06/15 21:00:06  Job terminated.
         (1) Normal termination (return value 0)
             Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
             Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
             Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
             Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
Step 4: Condor gives it back

Date: Sat, 15 Jun (CDT)
From: Condor Project
Message-Id:
To:
Subject: [Condor] Condor Job

This is an automated e-mail from the Condor system on machine
"beak.cs.wisc.edu". Do not reply.

Your condor job exited with status 0.
Job: /scratch/roy/condor/simple/simple First
Clusters and Processes › If your submit file describes multiple jobs, we call this a “cluster”. › Each job within a cluster is called a “process” or “proc”. › If you only specify one job, you still get a cluster, but it has only one process. › A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) › Process numbers always start at 0.
Example Submit Description File for a Cluster

# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe   = standard
Executable = simple
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/roy/condor/run.$(Process)
Queue 500
Questions So Far?
condor_q › Find out the status of your jobs from your condor_schedd › condor_q cluster: all jobs in a cluster › condor_q cluster.proc: a particular job › condor_q -sub name: jobs for a particular user (examples below)
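For example (the cluster number and user name are hypothetical):

condor_q 23           # all jobs in cluster 23
condor_q 23.5         # just process 5 of cluster 23
condor_q -sub frieda  # all jobs submitted by frieda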
Temporarily halt a Job › Use condor_hold to place a job on hold Kills job if currently running Will not attempt to restart job until released › Use condor_release to remove a hold and permit job to be scheduled again
condor_rm › You submitted a job, but you want to cancel it › condor_rm clusterid: condor_rm 6 removes all jobs in cluster 6 › condor_rm clusterid.procid: condor_rm 6.3 removes that specific job › condor_rm -all: all of your jobs › You can only remove your own jobs › Removals are reflected in the job log
condor_status › Find status of pool from condor_collector (simplified view here):

Name           OpSys   Arch   State      Activity
carmi.cs.wisc  LINUX   INTEL  Unclaimed  Idle
coral.cs.wisc  LINUX   INTEL  Unclaimed  Idle
doc.cs.wisc.e  LINUX   INTEL  Unclaimed  Idle
dsonokwa.cs.w  LINUX   INTEL  Unclaimed  Idle
...

(followed by a summary of Machines / Owner / Claimed / Unclaimed counts for LINUX, SOLARIS, and Total)
condor_status › condor_status -run: which machines are running jobs › condor_status -sub: whose jobs are running? › condor_status -constraint: restrict the output to a subset defined by the user (see the example below)
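For example, a plausible constraint query (not from the slides) that lists only the idle Linux machines:

condor_status -constraint '(OpSys == "LINUX") && (State == "Unclaimed")'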
DAGMan › Directed Acyclic Graph Manager › DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. › (e.g., “Don’t run job “B” until job “A” has completed successfully.”)
What is a DAG? › A DAG is the data structure used by DAGMan to represent these dependencies. › Each job is a “node” in the DAG. › Each node can have any number of “parent” or “child” nodes – as long as there are no loops! (Diagram: the diamond DAG with Job A on top, Jobs B and C in the middle, and Job D at the bottom.)
Defining a DAG › A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

› Each node will run the Condor job specified by its accompanying Condor submit file (a sketch of one follows).
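As an illustration, each node's submit file is an ordinary Condor submit description. A minimal sketch of what a.sub might contain (the executable and file names are assumptions):

# a.sub -- submit description for node A
Universe   = vanilla
Executable = node_a
Output     = a.out
Error      = a.err
Log        = diamond.log
Queue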
Submitting a DAG › To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs: % condor_submit_dag diamond.dag › condor_submit_dag submits a Scheduler Universe Job with DAGMan as the executable. › Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
DAGMan Running a DAG › DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. (Diagram: DAGMan reads the .dag file and feeds jobs into the Condor job queue.)
DAGMan Running a DAG (cont’d) › DAGMan holds & submits jobs to the Condor queue at the appropriate times. (Diagram: jobs entering the Condor job queue as their parents finish.)
DAGMan Running a DAG (cont’d) › In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. (Diagram: a failed node and the resulting Rescue File.)
DAGMan Recovering a DAG › Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. (Diagram: the Rescue File restoring the DAG’s state.)
DAGMan Recovering a DAG (cont’d) › Once that job completes, DAGMan will continue the DAG as if the failure never happened. (Diagram: the remaining nodes continuing through the Condor job queue.)
DAGMan Finishing a DAG › Once the DAG is complete, the DAGMan job itself is finished, and exits. (Diagram: the finished DAG with an empty Condor job queue.)
Additional DAGMan Features › Provides other handy features for job management… nodes can have PRE & POST scripts failed nodes can be automatically retried a configurable number of times job submission can be “throttled” (sketch below)
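A sketch of how those features are expressed (the script names and throttle value are made up; SCRIPT, RETRY, and -maxjobs are standard DAGMan syntax):

# In diamond.dag: PRE/POST scripts and automatic retries
Script Pre  A prepare_inputs.sh
Script Post D collect_results.sh
Retry B 3

# On the command line: throttle DAGMan to 10 submitted jobs at a time
condor_submit_dag -maxjobs 10 diamond.dag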
Questions So Far?
What if each job needed to run for 20 days? What if I wanted to interrupt a job with a higher priority job?
Condor’s Standard Universe to the rescue! › Condor can support various combinations of features/environments in different “Universes” › Different Universes provide different functionality for your job: Vanilla: runs any serial job; Java: well suited for Java programs; Standard: support for transparent process checkpoint and restart
Process Checkpointing › Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file Memory, CPU, I/O, etc. › The process can then be restarted from right where it left off › Typically no changes to your job’s source code needed – however, your job must be relinked with Condor’s Standard Universe support library
Linking for Standard Universe To do this, just place “condor_compile” in front of the command you normally use to link your job: condor_compile gcc -o myjob myjob.c OR condor_compile f77 -o myjob filea.f fileb.f
Limitations in the Standard Universe › Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not: fork() Use kernel threads Use some forms of IPC, such as pipes and shared memory › Many typical scientific jobs are OK
When will Condor checkpoint your job? › Periodically, if desired For fault tolerance › To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority) Preemptive-resume scheduling › When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart commands
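For example (the host name is made up; both tools are the ones named above):

condor_checkpoint c01.cs.wisc.edu  # ask the jobs on that machine to write a checkpoint now
condor_vacate c01.cs.wisc.edu      # checkpoint the jobs and evict them from the machine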
Administering Condor › Condor provides extensive configuration files One per pool, one per machine, or anything in between › Extensive documentation Online manual Heavily commented sample configuration file
I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes. (Boss Fat Cat) Policy Configuration
The Machine (Startd) Policy Expressions START - When is this machine willing to start a job? RANK - Job preferences SUSPEND - When to suspend a job CONTINUE - When to continue a suspended job PREEMPT - When to nicely stop running a job KILL - When to forcibly kill a job that is being preempted
Frieda’s Current Settings

START    = True
RANK     =
SUSPEND  = False
CONTINUE =
PREEMPT  = False
KILL     = False
Frieda’s New Settings for the Chemistry nodes

START    = True
RANK     = Department == "Chemistry"
SUSPEND  = False
CONTINUE =
PREEMPT  = False
KILL     = False
Submit file with Custom Attribute

Executable  = chem-job
Universe    = standard
+Department = "Chemistry"
Queue
What if “Department” is not specified?

START    = True
RANK     = (Department =!= UNDEFINED) && (Department == "Chemistry")
SUSPEND  = False
CONTINUE =
PREEMPT  = False
KILL     = False
Another example

START    = True
RANK     = (Department =!= UNDEFINED) && ((Department == "Chemistry")*2 + (Department == "Physics"))
SUSPEND  = False
CONTINUE =
PREEMPT  = False
KILL     = False
The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle. (Boss Fat Cat) Policy Configuration, cont
So Frieda decides she wants the desktops to: › START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low › SUSPEND jobs as soon as activity is detected › PREEMPT jobs if the activity continues for 5 minutes or more › KILL jobs if they take more than 5 minutes to preempt
Macros in the Config File

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad   = 0.3
HighLoad         = 0.5
KeyboardBusy     = (KeyboardIdle < 10)
CPU_Busy         = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle         = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
MachineBusy      = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer    = (CurrentTime - EnteredCurrentActivity)
Desktop Machine Policy

START    = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND  = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT  = (Activity == "Suspended") && $(ActivityTimer) > 300
KILL     = $(ActivityTimer) > 300
Policy Review › Users submitting jobs can specify Requirements and Rank expressions › Administrators can specify Startd Policy expressions individually for each machine (START, SUSPEND, etc.) › Expressions can use any job or machine ClassAd attribute › Custom attributes easily added › Bottom Line: Enforce almost any policy!
Administrator Commands › condor_vacate - Leave a machine now › condor_on - Start Condor › condor_off - Stop Condor › condor_reconfig - Reconfigure on-the-fly › condor_config_val - View/set configuration › condor_userprio - User priorities › condor_stats - View detailed usage accounting stats
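A few usage sketches (the host name is made up; the commands and the CONDOR_HOST variable are standard):

condor_off c01.cs.wisc.edu        # stop Condor on that machine
condor_on c01.cs.wisc.edu         # start it again
condor_reconfig c01.cs.wisc.edu   # re-read the configuration files on-the-fly
condor_config_val CONDOR_HOST     # which central manager is this machine using?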
Questions So Far?
Security in Condor › Since version 6.3.3, Condor has greatly improved security › Multiple authentication methods: X509 (Using GSI) Kerberos Filesystem (shared filesystem, known user) › Encryption: 3DES Blowfish
Security in Condor › Authentication Based on users, with optional wildcards Users can be given different permissions: Read Write Administrator Config
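A configuration sketch using the host-based HOSTALLOW_* macros of this Condor series (the slide describes user-level permissions, whose exact form depends on the authentication method in use; the host names here are made up):

HOSTALLOW_READ          = *.cs.wisc.edu
HOSTALLOW_WRITE         = *.cs.wisc.edu
HOSTALLOW_ADMINISTRATOR = central-manager.cs.wisc.edu
HOSTALLOW_CONFIG        = central-manager.cs.wisc.edu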
Version Numbers in Condor › Odd minor numbers are development releases: 6.3.1, 6.3.2, 6.5.0… Compatibility not guaranteed within a series, like 6.3.x. › Even minor numbers are stable releases 6.2.2, 6.4.0, 6.4.1… Compatibility guaranteed within a series, like 6.4.x.
Questions? Comments? › Web: ›