1
Alain Roy Computer Sciences Department University of Wisconsin-Madison roy@cs.wisc.edu http://www.cs.wisc.edu/condor 24-June-2002 Using and Administering Condor
2
www.cs.wisc.edu/condor Добрый вечер! (Good evening!) › Thank you for having me! › I am: Alain Roy; Computer Science Ph.D. in Quality of Service, with the Globus Project; working with the Condor Project
3
www.cs.wisc.edu/condor Condor Tutorials Remaining › Monday (Today) 17:00-19:00: Using and administering Condor › Tuesday 17:00-19:00: Using Condor on the Grid
4
www.cs.wisc.edu/condor Review: What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility. Run lots of jobs over a long period of time, not a short burst of “high-performance” computing › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy
5
www.cs.wisc.edu/condor Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines… Crash (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & managed by someone else
6
www.cs.wisc.edu/condor What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be: Dedicated Opportunistic
7
www.cs.wisc.edu/condor What’s Condor Good For? › Managing a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. Condor can handle inter-job dependencies (DAGMan)
8
www.cs.wisc.edu/condor What’s Condor Good For? (cont’d) › Robustness Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion If an execute machine crashes, you only lose work done since the last checkpoint Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover (Story)
9
www.cs.wisc.edu/condor What’s Condor Good For? (cont’d) › Giving your job the agility to access more computing resources Checkpointing allows your job to run on “opportunistic resources” (not dedicated) Checkpointing also provides “migration” - if a machine is no longer available, move! With remote system calls, run on systems which do not share a filesystem - You don’t even need an account on a machine where your job executes
10
www.cs.wisc.edu/condor Other Condor features › Implement your policy on when the jobs can run on your workstation › Implement your policy on the execution order of the jobs › Keep a log of your job activities
11
www.cs.wisc.edu/condor A Condor Pool In Action
12
www.cs.wisc.edu/condor A Bit of Condor Philosophy › Condor brings more computing to everyone. A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done. A large collaboration can use Condor to control its dedicated pool with hundreds of machines.
13
www.cs.wisc.edu/condor The Idea: Computing power is everywhere; we try to make it usable by anyone.
14
www.cs.wisc.edu/condor Remember Frieda? Today we’ll revisit Frieda’s Condor explorations in more depth
15
www.cs.wisc.edu/condor I have 600 simulations to run. Where can I get help?
16
www.cs.wisc.edu/condor Install a Personal Condor!
17
www.cs.wisc.edu/condor Installing Condor › Download Condor for your operating system › Available as a free download from http://www.cs.wisc.edu/condor › Available for most Unix platforms and Windows NT
18
www.cs.wisc.edu/condor So Frieda Installs Personal Condor on her machine… › What do we mean by a “Personal” Condor? Condor on your own workstation, no root access required, no system administrator intervention needed—easy to set up.
19
www.cs.wisc.edu/condor Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?
20
www.cs.wisc.edu/condor Your Personal Condor will... › Keep an eye on your jobs and will keep you posted on their progress › Keep a log of your job activities › Add fault tolerance to your jobs › Implement your policy on when the jobs can run on your workstation
21
www.cs.wisc.edu/condor What’s in a Personal Condor? › Everything that is in Condor, just on one machine. › Condor daemons: condor_master; condor_collector (stores ClassAds for jobs and machines); condor_negotiator (matchmaking); condor_schedd (submits and monitors jobs); condor_startd (starts jobs); condor_starter (launches a job); condor_shadow (monitors a remote job)
22
www.cs.wisc.edu/condor A Condor Pool of One [Diagram: a single machine running condor_master, condor_schedd, condor_collector, condor_negotiator, condor_startd, condor_starter, condor_shadow, and the Condor job]
23
www.cs.wisc.edu/condor condor_master › Starts up all other Condor daemons › If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator › Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shutdown the currently running version and start the new version
24
www.cs.wisc.edu/condor condor_master (cont’d) › Acts as the server for many Condor remote administration commands: condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
25
www.cs.wisc.edu/condor condor_startd › Represents a machine to the Condor system › Responsible for starting, suspending, and stopping jobs › Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon)
26
www.cs.wisc.edu/condor condor_schedd › Represents users to the Condor system › Maintains the persistent queue of jobs › Responsible for contacting available machines and sending them jobs › Services user commands which manipulate the job queue: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …
27
www.cs.wisc.edu/condor condor_collector › Collects information from all other Condor daemons in the pool “Directory Service” / Database for a Condor pool › Each daemon sends a periodic update called a “ClassAd” to the collector › Services queries for information: Queries from other Condor daemons Queries from users (condor_status)
28
www.cs.wisc.edu/condor condor_negotiator › Performs “matchmaking” in Condor › Gets information from the collector about all available machines and all idle jobs › Tries to match jobs with machines that will serve them › Both the job and the machine must satisfy each other’s requirements
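The two-sided matching described on this slide can be sketched in Python. This is a toy illustration, not Condor's ClassAd evaluator: ads are plain dicts, requirements are Python callables, and the function name `match` and all attribute values are hypothetical.

```python
# Toy two-sided matchmaking: a job and a machine match only if each
# satisfies the other's requirements. This is not Condor's ClassAd engine.

def match(job, machines):
    """Return the name of the first machine that mutually matches the job, or None."""
    for machine in machines:
        job_ok = job["requirements"](machine)      # the job's view of the machine
        machine_ok = machine["requirements"](job)  # the machine's view of the job
        if job_ok and machine_ok:
            return machine["Name"]
    return None

# Hypothetical machine ads.
machines = [
    {"Name": "small", "Memory": 256, "requirements": lambda job: True},
    {"Name": "picky", "Memory": 1024,
     "requirements": lambda job: job.get("Department") == "Chemistry"},
    {"Name": "big", "Memory": 1024, "requirements": lambda job: True},
]

# Hypothetical job ad: needs more than 512 MB of memory.
job = {"Owner": "frieda", "requirements": lambda m: m["Memory"] > 512}

print(match(job, machines))
```

Here "small" is rejected by the job, "picky" rejects the job, and only "big" satisfies both sides.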
29
www.cs.wisc.edu/condor Frieda wants more… › She decides to use the graduate students’ computers when they aren’t using them, and get done sooner. › In exchange, they can use the Condor pool too.
30
www.cs.wisc.edu/condor Frieda’s Condor pool… Frieda’s Computer: Central Manager Graduate Student’s Desktop Computers
31
www.cs.wisc.edu/condor A larger Condor pool [Diagram: four machine roles. Submitter: condor_master, condor_schedd, condor_shadow. Collector: condor_master, condor_negotiator, condor_collector. Submitter/Executor: condor_master, condor_schedd, condor_startd, condor_shadow, condor_starter, Condor job. Executor: condor_master, condor_startd, condor_starter, Condor job.]
32
www.cs.wisc.edu/condor Happy Day! Frieda’s organization purchased a Beowulf Cluster! › Other scientists in her department have realized the power of Condor and want to share it. › The Beowulf cluster and the graduate student computers can be part of a single Condor pool.
33
www.cs.wisc.edu/condor Frieda’s Condor pool… Central Manager Graduate Student’s Desktop Computers Beowulf Cluster
34
www.cs.wisc.edu/condor How would you set it up? › Grad student machines: submitters and executors › Beowulf cluster machines: executors only › Independent machine for collector/negotiator. Big job: take it away from Frieda’s computer. Could split collector and negotiator
35
www.cs.wisc.edu/condor Frieda collaborates… › She wants to share her Condor pool with scientists from another lab.
36
www.cs.wisc.edu/condor Condor Flocking › Condor pools can work cooperatively
37
www.cs.wisc.edu/condor How would you set it up? › Two independent pools. Each has its own collector/negotiator › Set up flocking from one pool to another, by machine or by pool: FLOCK_TO, FLOCK_FROM › Can be uni- or bi-directional
38
www.cs.wisc.edu/condor Questions So Far?
39
www.cs.wisc.edu/condor How do you run a job? › It doesn’t matter if you have: Personal Condor Large Condor pool Condor pool with flocking › Four steps 1. Write program 2. Write submit file 3. Give it to Condor 4. Condor gives you the results
40
www.cs.wisc.edu/condor Step 1: Writing a program › Condor has universes Vanilla Universe: Run anything Less capable Java Universe: Works better for Java Standard Universe: Checkpointing Remote I/O Can’t work with all programs
41
www.cs.wisc.edu/condor Step 1: Vanilla Universe › You can run any program C/C++/Perl/Python/Fortran/Java/Lisp… No checkpointing: if your job is interrupted or the machine crashes, Condor has to restart it from the beginning. Can do anything you could do if you were logged in.
42
www.cs.wisc.edu/condor Step 1: Java Universe › Works better for Java programs › Checks for valid Java environment › Distinguishes Java environment exceptions from program exceptions (wrapper program) › No checkpointing (though it could be added) › Remote I/O
43
www.cs.wisc.edu/condor Step 1: Standard Universe › Requires re-linking your program: condor_compile gcc -o simple simple.o › Allows checkpointing and remote I/O › Restrictions on behavior: no threading, limited networking, restrictions on compiler used
44
www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This assumes a shared filesystem
45
www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Transfer_input_files = data.in
    Transfer_output_files = data.out
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem
46
www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = standard
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem; it relies on remote I/O instead
47
www.cs.wisc.edu/condor Step 2: Submit Files › Condor is helpful: it expands your requirements into a complete expression. Requirements = memory > 512 becomes… Requirements = (OpSys == "Linux") && (memory > 512) && … › Queue can take a parameter (more later) › A single file can submit many jobs
48
www.cs.wisc.edu/condor Step 3: Give it to Condor
› condor_submit submit.desc
› condor_q
    -- Submitter: dsonokwa.cs.wisc.edu : : dsonokwa.cs.wisc.edu
     ID    OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
     5.0   roy    6/15 20:51  0+00:00:02 R  0   0.0  simple First
    1 jobs; 0 idle, 1 running, 0 held
49
www.cs.wisc.edu/condor Step 4: Condor gives it back › The program’s output is where you asked it to be. › Condor left a log file documenting what it did. › Condor optionally sends you an email telling you it’s done.
50
www.cs.wisc.edu/condor Step 4: Condor gives it back
    000 (34364.000.000) 06/15 21:00:01 Job submitted from host:
    001 (34364.000.000) 06/15 21:00:01 Job executing on host:
    005 (34364.000.000) 06/15 21:00:06 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
51
www.cs.wisc.edu/condor Step 4: Condor gives it back
    Date: Sat, 15 Jun 2002 21:00:06 -0500 (CDT)
    From: Condor Project
    Message-Id:
    To: roy@cs.wisc.edu
    Subject: [Condor] Condor Job 34364.0
    This is an automated email from the Condor system on machine "beak.cs.wisc.edu". Do not reply.
    Your condor job exited with status 0.
    Job: /scratch/roy/condor/simple/simple First
52
www.cs.wisc.edu/condor Clusters and Processes › If your submit file describes multiple jobs, we call this a “cluster”. › Each job within a cluster is called a “process” or “proc”. › If you only specify one job, you still get a cluster, but it has only one process. › A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) › Process numbers always start at 0.
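The numbering rules on this slide can be made concrete with a small Python sketch. The helper names `job_ids` and `parse_job_id` are hypothetical and are not part of Condor; they only mirror the "cluster, period, process" convention with process numbers starting at 0.

```python
# Sketch of Condor's job ID convention: "<cluster>.<process>",
# where process numbers within a cluster start at 0.

def job_ids(cluster, count):
    """List the job IDs for a cluster containing `count` processes."""
    return [f"{cluster}.{proc}" for proc in range(count)]

def parse_job_id(job_id):
    """Split a job ID such as '23.5' into (cluster, process) integers."""
    cluster, proc = job_id.split(".")
    return int(cluster), int(proc)

print(job_ids(23, 3))
print(parse_job_id("23.5"))
```

A single-job submit file still yields a one-element list, matching the point that one job is a cluster with one process.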
53
www.cs.wisc.edu/condor Example Submit Description File for a Cluster
    # Example condor_submit input file that defines
    # a whole cluster of jobs at once
    Universe = standard
    Executable = simple
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/roy/condor/run.$(Process)
    Queue 500
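To see what `Queue 500` combined with `$(Process)` produces, here is a rough Python sketch of the substitution. It only imitates the macro expansion for this one attribute; the function name `expand_initial_dirs` is hypothetical, and Condor's real macro handling is more general.

```python
# Imitate how $(Process) takes the values 0..N-1 for "Queue N".

def expand_initial_dirs(template, count):
    """Return the InitialDir value each of the `count` processes would see."""
    return [template.replace("$(Process)", str(proc)) for proc in range(count)]

dirs = expand_initial_dirs("/home/roy/condor/run.$(Process)", 500)
print(dirs[0])
print(dirs[-1])
print(len(dirs))
```

Each process therefore runs in its own directory, so the shared file names `my_job.stdout`, `my_job.stderr`, and `my_job.log` do not collide between processes.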
54
www.cs.wisc.edu/condor Questions So Far?
55
www.cs.wisc.edu/condor condor_q › Find out status of your jobs, from your condor_schedd. › condor_q cluster: all jobs in a cluster › condor_q cluster.proc: particular job › condor_q -sub name: jobs for a particular user
56
www.cs.wisc.edu/condor Temporarily halt a Job › Use condor_hold to place a job on hold Kills job if currently running Will not attempt to restart job until released › Use condor_release to remove a hold and permit job to be scheduled again
57
www.cs.wisc.edu/condor condor_rm › You submitted a job, but you want to cancel it › condor_rm clusterid (e.g., condor_rm 6: all jobs in cluster 6) › condor_rm clusterid.procid (e.g., condor_rm 6.3: a specific job) › condor_rm -all: all of your jobs › Can only remove your jobs › Reflected in job log
58
www.cs.wisc.edu/condor condor_status › Find status of pool from condor_collector (simplified view here)
    Name           OpSys  Arch   State      Activity
    carmi.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    coral.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    doc.cs.wisc.e  LINUX  INTEL  Unclaimed  Idle
    dsonokwa.cs.w  LINUX  INTEL  Unclaimed  Idle
    ...
               Machines  Owner  Claimed  Unclaimed
    LINUX            12      2        0         10
    SOLARIS28         5      0        0          5
    Total            17      2        0         15
59
www.cs.wisc.edu/condor condor_status › condor_status -run: which machines are running jobs › condor_status -sub: whose jobs are running? › condor_status -constraint: show only the subset of machines matching a user-supplied expression
60
www.cs.wisc.edu/condor DAGMan › Directed Acyclic Graph Manager › DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. › (e.g., “Don’t run job “B” until job “A” has completed successfully.”)
61
www.cs.wisc.edu/condor What is a DAG? › A DAG is the data structure used by DAGMan to represent these dependencies. › Each job is a “node” in the DAG. › Each node can have any number of “parent” or “children” nodes, as long as there are no loops! [Diagram: diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, Job D at the bottom]
62
www.cs.wisc.edu/condor Defining a DAG › A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor job specified by its accompanying Condor submit file [Diagram: diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, Job D at the bottom]
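As a rough illustration of what the `Job` and `Parent ... Child ...` lines encode, the Python sketch below parses the diamond example and derives one valid run order. It handles only those two keywords and is not DAGMan's parser; `parse_dag` is a hypothetical helper, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return (submit_files, parents) from the Job and Parent/Child lines."""
    submit_files = {}   # node name -> submit file
    parents = {}        # node name -> set of parent node names
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            split_at = [w.lower() for w in words].index("child")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return submit_files, parents

submit_files, parents = parse_dag(DIAMOND)
# static_order() yields every node after all of its parents.
order = list(TopologicalSorter(parents).static_order())
print(order)
```

A always comes first and D last; B and C may appear in either order because neither depends on the other.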
63
www.cs.wisc.edu/condor Submitting a DAG › To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs: % condor_submit_dag diamond.dag › condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. › Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
64
www.cs.wisc.edu/condor DAGMan Running a DAG › DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [Diagram: DAGMan reads the .dag file (nodes A, B, C, D); job A is in the Condor job queue]
65
www.cs.wisc.edu/condor DAGMan Running a DAG (cont’d) › DAGMan holds & submits jobs to the Condor queue at the appropriate times. [Diagram: DAG nodes A, B, C, D; jobs B and C are in the Condor job queue]
66
www.cs.wisc.edu/condor DAGMan Running a DAG (cont’d) › In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [Diagram: DAG nodes A, B, D with one node marked X as failed; Condor job queue; rescue file]
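The "continue until no progress, then save state" behavior can be imitated in a few lines of Python. This is a simplified simulation under stated assumptions: jobs run instantly, the "rescue" state is just the set of finished nodes, and `run_dag` is a hypothetical name. The real rescue file format is not shown in these slides.

```python
# Simulate running a DAG where some node fails, then resuming from saved state.

def run_dag(parents, succeeds, already_done=()):
    """Run every node whose parents all completed; return (done, remaining)."""
    done, failed = set(already_done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # all parents completed successfully
                (done if succeeds(node) else failed).add(node)
                progress = True
    return done, set(parents) - done

parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First run: node C fails, so D can never start.
done, remaining = run_dag(parents, succeeds=lambda node: node != "C")
print(sorted(done), sorted(remaining))

# Later run: resume from the saved state once C is able to succeed.
done, remaining = run_dag(parents, succeeds=lambda node: True, already_done=done)
print(sorted(done), sorted(remaining))
```

On the first run B still completes even though C fails, which mirrors the point that work continues wherever it can; the second run skips A and B and only runs C and D.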
67
www.cs.wisc.edu/condor DAGMan Recovering a DAG › Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [Diagram: DAG nodes A, B, C, D restored from the rescue file; job C is in the Condor job queue]
68
www.cs.wisc.edu/condor DAGMan Recovering a DAG (cont’d) › Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Diagram: DAG nodes A, B, C, D; job D is in the Condor job queue]
69
www.cs.wisc.edu/condor DAGMan Finishing a DAG › Once the DAG is complete, the DAGMan job itself is finished, and exits. [Diagram: DAG nodes A, B, C, D; the Condor job queue is empty]
70
www.cs.wisc.edu/condor Additional DAGMan Features › Provides other handy features for job management: nodes can have PRE & POST scripts; failed nodes can be automatically retried a configurable number of times; job submission can be “throttled”
71
www.cs.wisc.edu/condor Questions So Far?
72
www.cs.wisc.edu/condor What if each job needed to run for 20 days? What if I wanted to interrupt a job with a higher priority job?
73
www.cs.wisc.edu/condor Condor’s Standard Universe to the rescue! › Condor can support various combinations of features/environments in different “Universes” › Different Universes provide different functionality for your job: Vanilla—runs any Serial Job Java—well suited for Java programs Standard – Support for transparent process checkpoint and restart
74
www.cs.wisc.edu/condor Process Checkpointing › Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file Memory, CPU, I/O, etc. › The process can then be restarted from right where it left off › Typically no changes to your job’s source code needed – however, your job must be relinked with Condor’s Standard Universe support library
75
www.cs.wisc.edu/condor Linking for Standard Universe To do this, just place “condor_compile” in front of the command you normally use to link your job:
    condor_compile gcc -o myjob myjob.c
OR
    condor_compile f77 -o myjob filea.f fileb.f
76
www.cs.wisc.edu/condor Limitations in the Standard Universe › Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not: call fork(); use kernel threads; use some forms of IPC, such as pipes and shared memory › Many typical scientific jobs are OK
77
www.cs.wisc.edu/condor When will Condor checkpoint your job? › Periodically, if desired (for fault tolerance) › To free the machine to do a higher priority task (a higher priority job, or a job from a user with higher priority): preemptive-resume scheduling › When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command
78
www.cs.wisc.edu/condor Administering Condor › Condor provides extensive configuration files One per pool, one per machine, or anything in between › Extensive documentation Online manual Heavily commented sample configuration file
79
www.cs.wisc.edu/condor I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes. (Boss Fat Cat) Policy Configuration
80
www.cs.wisc.edu/condor The Machine (Startd) Policy Expressions
    START: when this machine is willing to start a job
    RANK: job preferences
    SUSPEND: when to suspend a job
    CONTINUE: when to continue a suspended job
    PREEMPT: when to nicely stop running a job
    KILL: when to immediately kill a job that is being preempted
81
www.cs.wisc.edu/condor Frieda’s Current Settings
    START = True
    RANK =
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False
82
www.cs.wisc.edu/condor Frieda’s New Settings for the Chemistry nodes
    START = True
    RANK = Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False
83
www.cs.wisc.edu/condor Submit file with Custom Attribute
    Executable = chem-job
    Universe = standard
    +Department = "Chemistry"
    Queue
84
www.cs.wisc.edu/condor What if “Department” is not specified?
    START = True
    RANK = Department =!= UNDEFINED && Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False
85
www.cs.wisc.edu/condor Another example
    START = True
    RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False
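To make the intent of this RANK expression concrete, here is a Python sketch of the value it is meant to produce for different jobs, with each comparison evaluated before the addition. It mirrors the arithmetic only; it is not a ClassAd evaluator, and the function name `rank` and the "Biology" value are hypothetical.

```python
# Intended value of the RANK expression: Chemistry jobs score 2,
# Physics jobs score 1, everything else (including no Department) scores 0.

def rank(job):
    """Compute the machine's preference for a job ad given as a dict."""
    department = job.get("Department")
    if department is None:  # mirrors: Department =!= UNDEFINED
        return 0
    # Booleans count as 1 or 0 in the arithmetic.
    return (department == "Chemistry") * 2 + (department == "Physics")

for job in ({"Department": "Chemistry"}, {"Department": "Physics"},
            {"Department": "Biology"}, {}):
    print(job, rank(job))
```

A machine with this setting prefers Chemistry jobs over Physics jobs, and both over jobs from anyone else.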
86
www.cs.wisc.edu/condor The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle. (Boss Fat Cat) Policy Configuration, cont
87
www.cs.wisc.edu/condor So Frieda decides she wants the desktops to: › START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low › SUSPEND jobs as soon as activity is detected › PREEMPT jobs if the activity continues for 5 minutes or more › KILL jobs if they take more than 5 minutes to preempt
88
www.cs.wisc.edu/condor Macros in the Config File
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    BackgroundLoad = 0.3
    HighLoad = 0.5
    KeyboardBusy = (KeyboardIdle < 10)
    CPU_Idle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
    CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
    MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
    ActivityTimer = (CurrentTime - EnteredCurrentActivity)
89
www.cs.wisc.edu/condor Desktop Machine Policy
    START = $(CPU_Idle) && KeyboardIdle > 300
    SUSPEND = $(MachineBusy)
    CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
    PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
    KILL = $(ActivityTimer) > 300
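One way to check that this policy does what Frieda asked is to evaluate the expressions for a few machine states. The Python sketch below restates the macros and policy as plain booleans. It is an illustration only: the function name `policy` and the sample values are hypothetical, and CPU_Busy is taken here to mean "non-Condor load at or above HighLoad", which is an assumption.

```python
# Evaluate the desktop policy expressions for one snapshot of machine state.

BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5

def policy(load_avg, condor_load_avg, keyboard_idle, activity, activity_timer):
    """Return the truth value of each policy expression (times in seconds)."""
    non_condor_load = load_avg - condor_load_avg
    cpu_idle = non_condor_load <= BACKGROUND_LOAD
    cpu_busy = non_condor_load >= HIGH_LOAD      # assumed definition of CPU_Busy
    keyboard_busy = keyboard_idle < 10
    machine_busy = cpu_busy or keyboard_busy
    return {
        "START": cpu_idle and keyboard_idle > 300,
        "SUSPEND": machine_busy,
        "CONTINUE": cpu_idle and keyboard_idle > 120,
        "PREEMPT": activity == "Suspended" and activity_timer > 300,
        "KILL": activity_timer > 300,
    }

# Desktop untouched for 10 minutes with almost no load: jobs may start.
print(policy(0.1, 0.0, 600, "Idle", 0))
# Owner touched the keyboard 2 seconds ago while a Condor job runs: suspend.
print(policy(1.2, 1.0, 2, "Busy", 50))
# Job has been suspended for 400 seconds: preempt it.
print(policy(1.2, 1.0, 2, "Suspended", 400))
```

Subtracting CondorLoadAvg matters in the second case: the machine looks loaded, but nearly all of that load is the Condor job itself, so it is the keyboard activity that triggers SUSPEND.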
90
www.cs.wisc.edu/condor Policy Review › Users submitting jobs can specify Requirements and Rank expressions › Administrators can specify Startd Policy expressions individually for each machine (Start,Suspend,etc) › Expressions can use any job or machine ClassAd attribute › Custom attributes easily added › Bottom Line: Enforce almost any policy!
91
www.cs.wisc.edu/condor Administrator Commands › condor_vacate: Leave a machine now › condor_on: Start Condor › condor_off: Stop Condor › condor_reconfig: Reconfig on-the-fly › condor_config_val: View/set config › condor_userprio: User priorities › condor_stats: View detailed usage accounting stats
92
www.cs.wisc.edu/condor Questions So Far?
93
www.cs.wisc.edu/condor Security in Condor › Since version 6.3.3, Condor has greatly improved security › Multiple authentication methods: X509 (Using GSI) Kerberos Filesystem (shared filesystem, known user) › Encryption: 3DES Blowfish
94
www.cs.wisc.edu/condor Security in Condor › Authentication Based on users, with optional wildcards roy@cs.wisc.edu *@cs.wisc.edu Users can be given different permissions: Read Write Administrator Config
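The idea of granting permission levels to user names with optional wildcards can be sketched in Python as follows. This is only an analogy using shell-style patterns from the standard library: the table `ALLOW`, the function `is_allowed`, and the example user `mallory@example.org` are hypothetical, and Condor's actual matching rules and configuration syntax are not shown in these slides.

```python
from fnmatch import fnmatchcase

# Hypothetical permission table: level -> list of user patterns.
ALLOW = {
    "READ": ["*@cs.wisc.edu"],
    "WRITE": ["*@cs.wisc.edu"],
    "ADMINISTRATOR": ["roy@cs.wisc.edu"],
}

def is_allowed(user, level, allow=ALLOW):
    """True if `user` matches any pattern listed for the permission level."""
    return any(fnmatchcase(user, pattern) for pattern in allow.get(level, []))

print(is_allowed("frieda@cs.wisc.edu", "READ"))
print(is_allowed("frieda@cs.wisc.edu", "ADMINISTRATOR"))
print(is_allowed("mallory@example.org", "READ"))
```

The wildcard entry covers every user in the domain for ordinary access, while the administrator level names one specific user.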
95
www.cs.wisc.edu/condor Version Numbers in Condor › Odd minor numbers are development releases: 6.3.1, 6.3.2, 6.5.0… Compatibility not guaranteed within a series, like 6.3.x. › Even minor numbers are stable releases 6.2.2, 6.4.0, 6.4.1… Compatibility guaranteed within a series, like 6.4.x.
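The odd/even rule is easy to check mechanically. The Python sketch below assumes version strings of the form major.minor.patch, as in the examples on this slide; the helper names `is_stable` and `same_series` are hypothetical.

```python
# Classify Condor version numbers by the parity of the minor number.

def is_stable(version):
    """True for stable releases (even minor number), False for development."""
    minor = int(version.split(".")[1])
    return minor % 2 == 0

def same_series(a, b):
    """True if two versions share major.minor, e.g. both are 6.4.x."""
    return a.split(".")[:2] == b.split(".")[:2]

for v in ("6.2.2", "6.3.1", "6.4.0", "6.5.0"):
    print(v, "stable" if is_stable(v) else "development")
```

Compatibility is only promised when `is_stable` is true for both versions and `same_series` holds between them.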
96
www.cs.wisc.edu/condor Questions? Comments? › Web: www.cs.wisc.edu/condor › Email: condor-admin@cs.wisc.edu