1. Installing and Using Condor
Condor Project, Computer Sciences Department, University of Wisconsin-Madison
2. What is Condor?
› High-Throughput Computing system
  Emphasizes long-term productivity
› Many features for local and global computing
› Limited focus for today: managing a cluster of machines and the jobs that will run on them
3. Condor Pool Machine Roles
› Central Manager
  Matches jobs to machines
  Daemons: master, collector, negotiator
› Submit Machine
  Manages jobs
  Daemons: master, schedd
› Execute Machine
  Runs jobs
  Daemons: master, startd
› Every machine plays one or more of these roles (a configuration sketch follows)
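A role is simply a choice of which daemons the master runs on that host, so it can be expressed in configuration. A minimal sketch (DAEMON_LIST is a real Condor setting, used again on slide 69; the groupings here are illustrative):

  # Central Manager
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
  # Submit machine
  DAEMON_LIST = MASTER, SCHEDD
  # Execute machine
  DAEMON_LIST = MASTER, STARTD
  # One machine playing every role (e.g. a "Personal Condor")
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD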
4. Condor Daemon Layout
[Diagram: a Personal Condor / Central Manager host, where the master spawns the collector, negotiator, schedd, and startd processes.]
5. condor_master
› Starts up all other Condor daemons
› Runs on all Condor hosts
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Acts as the server for many Condor remote administration commands (examples below):
  condor_reconfig, condor_restart
  condor_off, condor_on
  condor_config_val
  etc.
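For example, the remote administration commands can name a target host, and the master on that host carries out the request. A brief sketch (the hostname is illustrative):

  # Re-read configuration files on a remote host
  condor_reconfig node4.cs.wisc.edu
  # Shut down and restart all daemons under that host's master
  condor_restart node4.cs.wisc.edu
  # Stop, and later restart, just the startd
  condor_off -startd node4.cs.wisc.edu
  condor_on -startd node4.cs.wisc.edu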
6. Central Manager: condor_collector
› Collects information from all other Condor daemons in the pool
  "Directory Service" / database for a Condor pool
  Each daemon sends a periodic update ClassAd to the collector
› Services queries for information:
  Queries from other Condor daemons
  Queries from users (condor_status)
› Runs only on the Central Manager(s)
› At least one collector per pool
7. Condor Pool Layout: Collector
[Diagram: the Central Manager's master spawns the collector and negotiator; ClassAd communication pathways connect the other daemons to the collector.]
8. Central Manager: condor_negotiator
› Performs "matchmaking" in Condor
› Each "negotiation cycle" (typically 5 minutes):
  Gets information from the collector about all available machines and all idle jobs
  Tries to match jobs with machines that will serve them
  Both the job and the machine must satisfy each other's requirements
› Only one negotiator per pool (ignoring HAD)
› Runs only on the Central Manager(s)
9. Condor Pool Layout: Negotiator
[Diagram: the Central Manager's master spawns the collector and negotiator, with ClassAd communication pathways between them.]
10. Execute Hosts: condor_startd
› Represents a machine to the Condor system
› Responsible for starting, suspending, and stopping jobs
› Enforces the wishes of the machine owner (the owner's "policy"; more on this in the administrator's tutorial)
› Creates a "starter" for each running job
› One startd runs on each execute node
11. Condor Pool Layout: startd
[Diagram: the Central Manager runs master, collector, negotiator, and schedd; cluster nodes run master and startd; workstations run master, startd, and schedd. ClassAd communication pathways connect each startd to the collector.]
12. Submit Hosts: condor_schedd
› Condor's scheduler daemon
› One schedd runs on each submit host
› Maintains the persistent queue of jobs
› Responsible for contacting available machines and sending them jobs
› Services user commands that manipulate the job queue:
  condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, ...
› Creates a "shadow" for each running job
13. Condor Pool Layout: schedd
[Diagram: the same pool layout, highlighting the schedds running on the Central Manager and the workstations, with their ClassAd communication pathways to the collector.]
14. Condor Pool Layout: master
[Diagram: the same pool layout, highlighting that every host (Central Manager, cluster nodes, workstations) runs a master, which spawns that host's other daemons.]
15. Job Startup
[Diagram: on the submit machine, condor_submit hands the job to the schedd, which spawns a shadow for the running job; the Central Manager's collector and negotiator match the job to an execute machine, whose startd spawns a starter that runs the job, linked against the Condor syscall library.]
16. Condor ClassAds
17. What is a ClassAd?
› Condor's internal data representation
  Similar to a classified ad in a newspaper
  Or Craigslist
  Or 58.com, baixing.com, ganji.com
› Represents an object and its attributes
  Usually many attributes (see the example below)
› Can also describe what an object matches with
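For concreteness, here is a small hand-written ClassAd in the attribute = value style that the later slides show (the attribute values here are illustrative, not captured from a real pool):

  MyType = "Machine"
  Name = "slot1@node4.example.edu"
  OpSys = "LINUX"
  Arch = "X86_64"
  Memory = 2048
  LoadAvg = 0.05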
18. ClassAd Types
› Condor has many types of ClassAds
  A Job ClassAd represents a job to Condor
    condor_q -long shows full job ClassAds
  A Machine ClassAd represents a machine within the Condor pool
    condor_status -long shows full machine ClassAds
  Other ClassAds represent other pieces of the Condor pool
› Job and Machine ClassAds are matched to each other by the negotiator daemon
19. ClassAds Explained
› ClassAds can contain a lot of details
  The job's executable is "cosmos"
  The machine's load average is 5.6
› ClassAds can specify requirements
  My job requires a machine with Linux
› ClassAds can specify rank
  This machine prefers to run jobs from the physics group
(A sketch of how this looks in a submit file follows.)
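In a submit description file, requirements and rank are written as ClassAd expressions. A minimal sketch (the expressions themselves are illustrative):

  # The job requires a 64-bit Linux machine...
  Requirements = OpSys == "LINUX" && Arch == "X86_64"
  # ...and, among matching machines, prefers more memory
  Rank = Memory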
20. Example Machine Ad
[root@creamce ~]# condor_status -l
Machine = "creamce.foo"
EnteredCurrentState = 1305040012
JavaVersion = "1.4.2"
CpuIsBusy = false
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
TotalVirtualMemory = 1605580
LoadAvg = 0.0
CondorLoadAvg = 0.0
...
[root@creamce ~]#
21. Hostname Configuration
[root@test17 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.1.1.161 test01.epikh test01
10.1.1.162 test02.epikh test02
10.1.1.163 test03.epikh test03
10.1.1.164 test04.epikh test04
10.1.1.165 test05.epikh test05
10.1.1.166 test06.epikh test06
10.1.1.167 test07.epikh test07
10.1.1.168 test08.epikh test08
10.1.1.169 test18.epikh test18
10.1.1.171 test09.epikh test09
10.1.1.172 test10.epikh test10
10.1.1.173 test11.epikh test11
10.1.1.174 test12.epikh test12
10.1.1.175 test13.epikh test13
10.1.1.176 test14.epikh test14
10.1.1.177 test15.epikh test15
10.1.1.178 test16.epikh test16
10.1.1.179 test17.epikh test17
[root@test17 ~]# hostname
test##.epikh
[root@test17 ~]#
22. Normal Condor Installation (Don't Do This Today)
› Go to Condor's Yum repository page
  http://www.cs.wisc.edu/condor/yum/
› Follow the instructions there
  Use condor-stable-rhel5.repo
  Ignore the optional steps
23. Normal Condor Installation (Don't Do This Today)
› Example:
  cd /etc/yum.repos.d
  wget http://www.cs.wisc.edu/condor/yum/repo.d/condor-stable-rhel5.repo
  yum install condor.x86_64
  service condor start
  ps -ef | grep condor
24. Condor Install For Today
› We'll use a locally-cached copy of Condor:
  cd /root
  wget http://10.4.11.28/~jfrey/condor/condor-7.6.0-1.rhel5.x86_64.rpm
  yum localinstall condor-7.6.0-1.rhel5.x86_64.rpm
  service condor start
  ps -ef | grep condor
25. Good Install Results
[root@creamce ~]# ps -ef | grep condor
condor 10898     1 0 21:32 ?     00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor 10899 10898 0 21:32 ?     00:00:00 condor_collector -f
condor 10900 10898 0 21:32 ?     00:00:00 condor_negotiator -f
condor 10901 10898 0 21:32 ?     00:00:00 condor_schedd -f
condor 10902 10898 0 21:32 ?     00:00:00 condor_startd -f
root   10903 10901 0 21:32 ?     00:00:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 101
root   10945 10763 0 21:38 pts/0 00:00:00 grep condor
[root@creamce ~]# condor_status
Name        OpSys  Arch   State     Activity LoadAv Mem ActvtyTime
creamce.foo LINUX  X86_64 Unclaimed Idle     0.000  768 0+00:04:42
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX     1     0       0         1       0          0        0
       Total     1     0       0         1       0          0        0
[root@creamce ~]#
26. Running a Job
› Create a regular user account and switch to it:
  adduser joe
  su - joe
› Create a submit description file
› Call condor_submit
› Monitor the job's status with condor_q (the full sequence is sketched below)
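Putting the steps together (a sketch; "joe" is the account created above, date.sub is the file name used on slide 28, and its contents come from the next slide):

  [root@creamce ~]# adduser joe
  [root@creamce ~]# su - joe
  [joe@creamce ~]$ cat > date.sub     <- paste in the submit file from slide 27
  [joe@creamce ~]$ condor_submit date.sub
  [joe@creamce ~]$ condor_q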
27. Simple Submit Description File
# simple submit description file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /bin/date     <- Job's executable
#Input     = /dev/null     <- Job's STDIN
Output     = date.out      <- Job's STDOUT
Error      = date.err      <- Job's STDERR
Log        = date.log      <- Log the job's activities
Queue                      <- Put the job in the queue
28. Submitting the Job
[jfrey@creamce ~]$ condor_submit date.sub
Submitting job(s).
1 job(s) submitted to cluster 4.
[jfrey@creamce ~]$ condor_q
-- Submitter: creamce.foo : : creamce.foo
 ID  OWNER SUBMITTED  RUN_TIME   ST PRI SIZE CMD
 4.0 jfrey 5/10 22:19 0+00:00:00 I  0   0.1  date
1 jobs; 1 idle, 0 running, 0 held
[jfrey@creamce ~]$ condor_q
-- Submitter: creamce.foo : : creamce.foo
 ID  OWNER SUBMITTED  RUN_TIME   ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
[jfrey@creamce ~]$
29. Try a Longer Job
› The 'I' in condor_q means the job is idle
› While a job is running, condor_q will show an 'R' and the RUN_TIME will increase
› To see a job as it runs, try making a script that sleeps for a minute:
  #!/bin/sh
  echo Hello
  sleep 60
  echo Goodbye
› Don't forget to run chmod 755 on it (a matching submit file is sketched below)
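A submit description file for this script might look like the following (a sketch following the pattern of slide 27; "longer.sh" is an assumed name for the script above):

  Universe   = vanilla
  Executable = longer.sh
  Output     = longer.out
  Error      = longer.err
  Log        = longer.log
  Queue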
30. Sample Job Log
[jfrey@creamce ~]$ cat date.log
000 (005.000.000) 05/10 22:28:41 Job submitted from host:...
001 (005.000.000) 05/10 22:28:42 Job executing on host:...
005 (005.000.000) 05/10 22:28:42 Job terminated.
    (1) Normal termination (return value 0)
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
[jfrey@creamce ~]$
31. Jobs, Clusters, and Processes
› If the submit description file describes multiple jobs, it is called a cluster
› Each cluster has a cluster number, which is unique to the job queue on a machine
› Each individual job within a cluster is called a process, and process numbers always start at zero
› A Condor Job ID is the cluster number, a period, and the process number (e.g., 2.1)
  A cluster can have a single process
    Job ID = 20.0 (cluster 20, process 0)
  Or a cluster can have more than one process
    Job IDs: 21.0, 21.1, 21.2 (cluster 21, processes 0, 1, 2)
32. Submitting Several Jobs
# Example submit file for a cluster of 2 jobs
# with separate output, error and log files
Universe   = vanilla
Executable = /bin/date
Log    = date_0.log
Output = date_0.out
Error  = date_0.err
Queue      <- Job 102.0 (cluster 102, process 0)
Log    = date_1.log
Output = date_1.out
Error  = date_1.err
Queue      <- Job 102.1 (cluster 102, process 1)
33. Submitting Many Jobs
# Example submit file for a cluster of 10 jobs
# with separate output, error and log files
Universe   = vanilla
Executable = /bin/date
Log    = date_$(cluster).$(process).log
Output = date_$(cluster).$(process).out
Error  = date_$(cluster).$(process).err
Queue 10     <- Jobs 102.0 through 102.9
$(cluster) and $(process) are replaced with each job's cluster and process ID.
34. Removing Jobs
› To remove a job from the Condor queue, use condor_rm
› You can only remove jobs that you own
› A privileged user can remove any job
  "root" on UNIX / Linux
  "administrator" on Windows
35. Removing Jobs (continued)
› Remove an entire cluster:
  condor_rm 4      <- Removes the whole cluster
› Remove a specific job from a cluster:
  condor_rm 4.0    <- Removes a single job
› Or remove all of your jobs with -a (DANGEROUS!):
  condor_rm -a     <- Removes all jobs / clusters
36. My Jobs Are Idle
› Our scientist runs condor_q and finds all his jobs are idle:
[einstein@submit ~]$ condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID  OWNER    SUBMITTED  RUN_TIME   ST PRI SIZE CMD
 4.0 einstein 4/20 13:22 0+00:00:00 I  0   9.8  cosmos -arg1 -arg2
 5.0 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 0
 5.1 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 1
 5.2 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 2
 5.3 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 3
 5.4 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 4
 5.5 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 5
 5.6 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 6
 5.7 einstein 4/20 12:23 0+00:00:00 I  0   9.8  cosmos -arg1 -n 7
8 jobs; 8 idle, 0 running, 0 held
37. Exercise a Little Patience
› On a busy pool, it can take a while to match and start your jobs
› Wait at least a negotiation cycle or two (typically a few minutes)
38. Check Machines' Status
[einstein@submit ~]$ condor_status
Name               OpSys   Arch   State     Activity LoadAv Mem  ActvtyTime
slot1@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  4599 0+00:10:13
slot2@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  1024 1+19:10:36
slot3@c002.chtc.wi LINUX   X86_64 Claimed   Busy     0.990  1024 1+22:42:20
slot4@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  1024 0+03:22:10
slot5@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  1024 0+03:17:00
slot6@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  1024 0+03:09:14
slot7@c002.chtc.wi LINUX   X86_64 Claimed   Busy     1.000  1024 0+19:13:49
...
vm1@INFOLABS-SML65 WINNT51 INTEL  Owner     Idle     0.000  511  [Unknown]
vm2@INFOLABS-SML65 WINNT51 INTEL  Owner     Idle     0.030  511  [Unknown]
vm1@INFOLABS-SML66 WINNT51 INTEL  Unclaimed Idle     0.000  511  [Unknown]
vm2@INFOLABS-SML66 WINNT51 INTEL  Unclaimed Idle     0.010  511  [Unknown]
vm1@infolabs-smlde WINNT51 INTEL  Claimed   Busy     1.130  511  [Unknown]
vm2@infolabs-smlde WINNT51 INTEL  Claimed   Busy     1.090  511  [Unknown]
              Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/WINNT51   104    78      16        10       0          0        0
X86_64/LINUX    759   170     587         0       0          1        0
        Total   863   248     603        10       0          1        0
39. Not Matching at All? condor_q -analyze
[einstein@submit ~]$ condor_q -analyze 29
The Requirements expression for your job is:
( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                    Machines Matched  Suggestion
    ---------                                    ----------------  ----------
1   ( ( target.Memory > 8192 ) )                 0                 MODIFY TO 4000
2   ( TARGET.FileSystemDomain == "cs.wisc.edu" ) 584
3   ( target.Arch == "INTEL" )                   1078
4   ( target.OpSys == "LINUX" )                  1100
5   ( target.Disk >= 13 )                        1243
40. Learn About Available Resources
[einstein@submit ~]$ condor_status -const 'Memory > 8192'
(no output means no matches)
[einstein@submit ~]$ condor_status -const 'Memory > 4096'
Name          OpSys Arch   State     Activ LoadAv Mem  ActvtyTime
vm1@s0-03.cs. LINUX X86_64 Unclaimed Idle  0.000  5980 1+05:35:05
vm2@s0-03.cs. LINUX X86_64 Unclaimed Idle  0.000  5980 13+05:37:03
vm1@s0-04.cs. LINUX X86_64 Unclaimed Idle  0.000  7988 1+06:00:05
vm2@s0-04.cs. LINUX X86_64 Unclaimed Idle  0.000  7988 13+06:03:47
             Total Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX     4     0       0         4       0          0
       Total     4     0       0         4       0          0
41. Submit a Job That Won't Run
Universe   = vanilla
Executable = /bin/date
Output     = date.out
Error      = date.err
# Our machine doesn't have this much memory
Requirements = Memory > 8192
Log        = date.log
Queue
42. Submit and Run condor_q -analyze
-- Submitter: test17.epikh : : test17.epikh
---
009.000: Run analysis summary. Of 4 machines,
    4 are rejected by your job's requirements
    0 reject your job because of their own requirements
    0 match but are serving users with a better priority in the pool
    0 match but reject the job for unknown reasons
    0 match but will not currently preempt their existing job
    0 match but are currently offline
    0 are available to run your job
WARNING: Be advised: No resources matched request's constraints
The Requirements expression for your job is:
( target.Memory > 8192 ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                     Machines Matched  Suggestion
    ---------                                     ----------------  ----------
1   ( target.Memory > 8192 )                      0                 MODIFY TO 191
2   ( TARGET.Arch == "X86_64" )                   4
3   ( TARGET.OpSys == "LINUX" )                   4
4   ( TARGET.Disk >= 1 )                          4
5   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,9.765625000000000E-04)) ) >= 1 )  4
6   ( TARGET.FileSystemDomain == "test17.epikh" ) 4
43. Held Jobs
› Condor may place your jobs on hold if there's a problem running them:
[einstein@submit ~]$ condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID  OWNER    SUBMITTED  RUN_TIME   ST PRI SIZE CMD
 4.0 einstein 4/20 13:22 0+00:00:00 H  0   9.8  cosmos -arg1 -arg2
 5.0 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 0
 5.1 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 1
 5.2 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 2
 5.3 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 3
 5.4 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 4
 5.5 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 5
 5.6 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 6
 5.7 einstein 4/20 12:23 0+00:00:00 H  0   9.8  cosmos -arg1 -n 7
8 jobs; 0 idle, 0 running, 8 held
44. Look at Jobs on Hold
[einstein@submit ~]$ condor_q -hold
-- Submitter: submit.chtc.wisc.edu : : submit.chtc.wisc.edu
 ID  OWNER    HELD_SINCE HOLD_REASON
 6.0 einstein 4/20 13:23 Error from starter on skywalker.cs.wisc.edu
9 jobs; 8 idle, 0 running, 1 held
Or see full details for a job:
[einstein@submit ~]$ condor_q -l 6.0
...
HoldReason = "Error from starter"
...
45. Look in the Job Log
› The job log will likely contain clues:
[einstein@submit ~]$ cat cosmos.log
000 (031.000.000) 04/20 14:47:31 Job submitted from host:...
007 (031.000.000) 04/20 15:02:00 Shadow exception!
    Error from starter on gig06.stat.wisc.edu: Failed to open
    '/scratch.1/einstein/workspace/v67/condor-test/test3/run_0/cosmos.in'
    as standard input: No such file or directory (errno 2)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
...
46. Holding Jobs
› You can put jobs in the HELD state yourself, using condor_hold
  Same syntax and rules as condor_rm
› You can take jobs out of the HELD state with the condor_release command
  Again, same syntax and rules as condor_rm (examples below)
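For example, reusing the Job IDs from the earlier slides:

  # Hold one job, then an entire cluster
  condor_hold 6.0
  condor_hold 5
  # Release them again
  condor_release 6.0
  condor_release 5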
47. Configuration Files
› "amp wiring" by "fbz_" © 2005, licensed under the Creative Commons Attribution 2.0 license
  http://www.flickr.com/photos/fbz/114422787/
48. Configuration File
› Found in the file pointed to by the CONDOR_CONFIG environment variable, in /etc/condor/condor_config, or in ~condor/condor_config (see the check below)
› All settings can be in this one file
› You might want to share it between all machines (NFS, automated copies, Wallaby, etc.)
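To confirm which file a given host is actually using, inspect the environment variable and ask Condor itself (condor_config_val -config is covered in more detail on slide 56):

  % echo $CONDOR_CONFIG
  % condor_config_val -config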
49. Other Configuration Files
› LOCAL_CONFIG_FILE setting
  Comma separated, processed in order
  LOCAL_CONFIG_FILE = \
      /var/condor/config.local,\
      /var/condor/policy.local,\
      /shared/condor/config.$(HOSTNAME),\
      /shared/condor/config.$(OPSYS)
50. Configuration File Syntax
# I'm a comment!
CREATE_CORE_FILES=TRUE
MAX_JOBS_RUNNING = 50
# Condor ignores case:
log=/var/log/condor
# Long entries:
collector_host=condor.cs.wisc.edu,\
    secondary.cs.wisc.edu
51. Configuration File Macros
› You reference other macros (settings) with $(NAME):
  SBIN = /usr/sbin
  SCHEDD = $(SBIN)/condor_schedd
› You can create additional macros for organizational purposes (expansion can be checked as shown below)
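Macro expansion can be verified with condor_config_val (a sketch assuming the two settings above are in effect):

  % condor_config_val SCHEDD
  /usr/sbin/condor_schedd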
52. Tools
› "Tools" by "batega" © 2007, licensed under the Creative Commons Attribution 2.0 license
  http://www.flickr.com/photos/batega/1596898776/
  http://www.webcitation.org/5XIj1E1Y1
53. Administrator Commands
› condor_vacate: leave a machine now
› condor_on: start Condor
› condor_off: stop Condor
› condor_reconfig: reconfigure on-the-fly
› condor_config_val: view/set configuration
› condor_userprio: user priorities
› condor_stats: view detailed usage accounting stats
54. condor_config_val
› Find current configuration values:
  % condor_config_val MASTER_LOG
  /var/condor/logs/MasterLog
  % cd `condor_config_val LOG`
55. condor_config_val -v
› Can identify the source of a setting:
  % condor_config_val -v CONDOR_HOST
  CONDOR_HOST: condor.cs.wisc.edu
  Defined in '/etc/condor_config.hosts', line 6
56. condor_config_val -config
› What configuration files are being used?
  % condor_config_val -config
  Config source:
      /var/home/condor/condor_config
  Local config sources:
      /unsup/condor/etc/condor_config.hosts
      /unsup/condor/etc/condor_config.global
      /unsup/condor/etc/condor_config.policy
      /unsup/condor-test/etc/hosts/puffin.local
57. condor_fetchlog
› Retrieve daemon logs remotely:
  condor_fetchlog beak.cs.wisc.edu Master
58. Querying Daemons: condor_status
› Queries the collector for information about daemons in your pool
› Defaults to finding condor_startd daemons
› condor_status -schedd summarizes all job queues
› condor_status -master returns a list of all condor_master daemons
59. condor_status
› -long displays the full ClassAd
› Optionally specify a machine name to limit results to a single host:
  condor_status -l node4.cs.wisc.edu
60. condor_status -constraint
› Only returns ClassAds that match an expression you specify
› Show me idle machines with 1 GB or more of memory:
  condor_status -constraint 'Memory >= 1024 && Activity == "Idle"'
61. condor_status -format
› Controls the format of output
› Useful for writing scripts
› Uses C printf-style formats, one field per argument
› "slanting" by Stefano Mortellaro ("fazen") © 2005, licensed under the Creative Commons Attribution 2.0 license
  http://www.flickr.com/photos/fazen/17200735/
  http://www.webcitation.org/5XIhNWC7Y
62. condor_status -format
› Census of systems in your pool:
  % condor_status -format '%s ' Arch -format '%s\n' OpSys | sort | uniq -c
  797 INTEL LINUX
  118 INTEL WINNT50
  108 SUN4u SOLARIS28
    6 SUN4x SOLARIS28
63. Examining Queues: condor_q
› Views the job queue
› The -long option is useful to see the entire ClassAd for a given job
› Supports -constraint and -format
› Can view job queues on remote machines with the -name option
64. condor_q -format
› Census of jobs per user:
  % condor_q -format '%s ' Owner -format '%s\n' Cmd | sort | uniq -c
  64 adesmet /scratch/submit/a.out
   2 adesmet /home/bin/run_events
   4 smith /nfs/sim1/em2d3d
   4 smith /nfs/sim2/em2d3d
65. condor_q -analyze
› condor_q will try to figure out why the job isn't running
› Good at determining that no machine matches the job's Requirements expression
66. condor_q -analyze
› Typical intro:
  % condor_q -analyze 471216
  471216.000: Run analysis summary. Of 820 machines,
      458 are rejected by your job's requirements
      25 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      6 match, but will not currently preempt their existing job
      327 are available to run your job
  Last successful match: Sun Apr 27 14:32:07 2008
67. condor_q -analyze
› Continued, and heavily truncated:
  The Requirements expression for your job is:
  ( ( target.Arch == "SUN4u" ) && ( target.OpSys == "WINNT50" ) && [snip]
      Condition                    Machines  Suggestion
  1   (target.Disk > 100000000)    0         MODIFY TO 14223201
  2   (target.Memory > 10000)      0         MODIFY TO 2047
  3   (target.Arch == "SUN4u")     106
  4   (target.OpSys == "WINNT50")  110       MOD TO "SOLARIS28"
  Conflicts: conditions: 3, 4
68. Adding Machines to Your Pool
› Install Condor on the new machines
› Modify security settings on all machines to trust each other
› Modify condor_config.local on the new machines
  DAEMON_LIST: remove unwanted daemons
  CONDOR_HOST: set to the hostname of the central manager
› Start Condor on the new machines
69. Let's Make a Big Pool
› Edit /etc/condor/condor_config.local:
  DAEMON_LIST = MASTER, SCHEDD, STARTD
  CONDOR_HOST = test17.epikh
  ALLOW_WRITE = 10.1.1.*
  ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), \
      $(CONDOR_HOST)
  NUM_CPUS = 4
› Run condor_restart -master
› condor_status should show more machines
  May take a couple of minutes
70. Security
› We're using host-based security
  Trust all packets from given IP addresses
  Only OK on a private network
› Stronger security options:
  Pool password (a configuration sketch follows)
  OpenSSL
  GSI (with optional VOMS)
  Kerberos
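As an illustration, pool-password authentication is enabled through configuration along these lines (a hedged sketch: these are real Condor settings, but check the manual for your version before relying on the details):

  # Store the shared pool password (prompts for it):
  condor_store_cred -c add

  # In the configuration of every machine in the pool:
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  SEC_DAEMON_AUTHENTICATION = REQUIRED
  SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
  ALLOW_DAEMON = condor_pool@*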
71. File Transfer
› If your job needs data files, you'll need to have Condor transfer them for you
› Likewise, Condor can transfer result files back for you
› You need to place your data files where Condor can access them
› Sounds great! What do I need to do?
72. Specify File Transfer Lists
In your submit file:
› Transfer_Input_Files
  List of files for Condor to transfer from the submit machine to the execute machine
› Transfer_Output_Files
  List of files for Condor to transfer back from the execute machine to the submit machine
  If not specified, Condor will transfer back all "new" files in the execute directory
73. Condor File Transfer Controls
› Should_Transfer_Files
  YES: always transfer files to the execution site
  NO: always rely on a shared file system
  IF_NEEDED: Condor transfers the files automatically if the submit and execute machines are not in the same FileSystemDomain (translation: use the shared file system if available; a sketch follows)
› When_To_Transfer_Output
  ON_EXIT: transfer the job's output files back to the submitting machine only when the job completes
  ON_EXIT_OR_EVICT: like above, but also when the job is evicted
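For instance, a submit-file fragment that prefers a shared file system but falls back to transfer might look like this (a sketch; contrast it with the explicit YES on the next slide):

  Should_Transfer_Files   = IF_NEEDED
  When_To_Transfer_Output = ON_EXIT
  Transfer_Input_Files    = cosmos.dat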
74. File Transfer Example
# Example using file transfer
Universe   = vanilla
Executable = cosmos
Log        = cosmos.log
Should_Transfer_Files   = YES
Transfer_Input_Files    = cosmos.dat
Transfer_Output_Files   = results.dat
When_To_Transfer_Output = ON_EXIT
Queue
75. Create a Job That Uses Input and Output Files
› Sample script:
  #!/bin/sh
  echo Directory listing
  /bin/ls -l
  echo Here is my input file
  cat $1
  sleep 5
› Sample input file:
  I am the job's input!
76. Submit Your New Job
› Submit description file:
  universe = vanilla
  executable = test.sh
  arguments = test.input
  output = out.$(cluster).$(process)
  error = err.$(cluster).$(process)
  transfer_input_files = test.input
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  queue 10
77. More Information
› http://www.cs.wisc.edu/condor
› http://www.cs.wisc.edu/condor/manual/v7.6
› https://condor-wiki.cs.wisc.edu/index.cgi/wiki
› condor-users mailing list
› condor-admin@cs.wisc.edu support email