1
Condor by Example
Douglas Thain
Computer Sciences Department, University of Wisconsin-Madison
(In Bologna for June 2000)
thain@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
Outline
› Overview
› Submitting Jobs, Getting Feedback
› Setting Requirements with ClassAds
› Using LOTS of Machines
› Which Universe?
› Conclusion
3
What is Condor?
› Condor converts a collection of unrelated workstations into a high-throughput computing facility.
› Condor uses matchmaking to make sure that everyone is happy.
4
What is High-Throughput Computing?
› High-performance: CPU cycles/second under ideal circumstances.
    “How fast can I run simulation X on this machine?”
› High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
    “How many times can I run simulation X in the next week using all available machines?”
5
What is High-Throughput Computing?
› Condor does whatever it takes to run your jobs, even if some machines…
    Crash!
    Are disconnected
    Run out of disk space
    Are removed from or added to the pool
    Are put to other uses
6
What is Matchmaking?
› Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners.
› Users (jobs) have constraints:
    “I need an Alpha with 256 MB RAM”
› Owners (machines) have constraints:
    “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
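The two-sided nature of matchmaking can be sketched in a few lines of Python. This is only an illustration of the idea, not how Condor is implemented: real Condor matches ClassAds, and the machine names, the `OwnerAway` attribute, and the user names below are all invented for the example.

```python
# Illustrative sketch only: real Condor matches ClassAds, not Python dicts.
# Machine names, the OwnerAway attribute, and user names are hypothetical.

def job_accepts(machine):
    """User constraint: 'I need an Alpha with 256 MB RAM'."""
    return machine["Arch"] == "ALPHA" and machine["Memory"] >= 256

def machine_accepts(machine, job):
    """Owner constraint: only while away from the desk, and never for Bob."""
    return machine["OwnerAway"] and job["Owner"] != "bob"

def matches(job, machines):
    """Return the names of machines where BOTH sides are satisfied."""
    return [m["Name"] for m in machines
            if job_accepts(m) and machine_accepts(m, job)]

machines = [
    {"Name": "alpha-big",   "Arch": "ALPHA", "Memory": 256, "OwnerAway": True},
    {"Name": "alpha-small", "Arch": "ALPHA", "Memory": 96,  "OwnerAway": True},
    {"Name": "alpha-busy",  "Arch": "ALPHA", "Memory": 512, "OwnerAway": False},
    {"Name": "intel-big",   "Arch": "INTEL", "Memory": 512, "OwnerAway": True},
]

print(matches({"Owner": "alice"}, machines))  # ['alpha-big']
print(matches({"Owner": "bob"}, machines))    # []
```

A match requires both predicates to hold, which is why a large, idle machine can still be unavailable to a particular user.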
7
“What can Condor do for me?”
Condor can…
› …increase your throughput.
› …do your housekeeping.
› …improve reliability.
› …give performance feedback.
8
The INFN Condor Pool
9
How many machines now?
› The map is out of date!
› The system is always changing.
› First example: What machines (and of what kind) are in the pool now?
10
First Things First
› Add Condor to your path:
    setenv PATH /library/condor_nfs/XXX/bin:${PATH}
› XXX should be your system: OSF1, LINUX, SOLARIS26, HPUX10 …
11
How Many Machines?
    % condor_status
    Name          OpSys       Arch  State     Activity LoadAv Mem
    lxpc1.na.infn LINUX-GLIBC INTEL Unclaimed Idle     0.000   30
    axpd21.pd.inf OSF1        ALPHA Owner     Idle     0.266   96
    vlsi11.pd.inf SOLARIS26   SUN4u Claimed   Busy     0.000  256
    ...

                      Machines Owner Claimed Unclaimed Matched Preempting
    ALPHA/OSF1             115    67      46         1       0          1
    INTEL/LINUX             53    18       0        35       0          0
    INTEL/LINUX-GLIBC       16     7       0         9       0          0
    SUN4u/SOLARIS251         1     1       0         0       0          0
    SUN4u/SOLARIS26          6     2       0         4       0          0
    SUN4u/SOLARIS27          1     1       0         0       0          0
    SUN4x/SOLARIS26          2     1       0         1       0          0
    Total                  194    97      46        50       0          1
12
Machine States
› Most machines will be:
    Owner: The machine’s owner is busy at the console, so no Condor jobs may run.
    Claimed: Condor has selected the machine to run jobs for other users.
13
Machine States
› Only a few should be:
    Unclaimed: The owner is gone, but Condor has not yet selected the machine.
    Matched: Between claimed and unclaimed.
    Preempting: Condor is busy removing a job.
14
More Examples
    % condor_status -help
    % condor_status -avail
    % condor_status -run
    % condor_status -total
    % condor_status -pool condor.cs.wisc.edu
15
Submitting Jobs
16
Steps to Running a Job
› Re-link for Condor.
› Submit the job.
› Watch the progress.
› Receive email when done.
17
Example Job
Compute the nth Fibonacci number. Fib(40) takes about one minute to compute on an Alpha.
    % ./fib 40
    fib(40) = 102334155
18
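    #include <stdio.h>   /* printf, fprintf */
    #include <stdlib.h>  /* atoi */

    /* Naive recursive Fibonacci: exponential time, so fib(40) takes a while. */
    int fib( int x )
    {
        if( x<=0 ) return 0;
        if( x==1 ) return 1;
        return fib(x-1) + fib(x-2);
    }

    int main(int argc, char *argv[])
    {
        int n;
        if( argc<2 ) {   /* avoid reading a missing argument */
            fprintf(stderr, "usage: %s n\n", argv[0]);
            return 1;
        }
        n = atoi(argv[1]);
        printf("fib(%d) = %d\n", n, fib(n));
        return 0;
    }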
19
Re-link for Condor
› Normal compile:
    gcc -c fib.c -o fib.o
› Normal link:
    gcc fib.o -o fib
› Use the same link command, but add condor_compile:
    condor_compile gcc fib.o -o fib
20
Submit the Job
› Create a submit file:
    vi fib.submit

    Executable = fib
    Arguments = 40
    Output = fib.out
    Log = fib.log
    queue
› Submit the job:
    condor_submit fib.submit
21
Watch the Progress
    % condor_q
    -- Submitter: axpbo8.bo.infn.it : :
     ID    OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
     5.0   thain  6/21 12:40  0+00:00:15 R  0   2.5  fib 40
› ID: Each job gets a unique number.
› ST: Status is Unexpanded, Running or Idle.
› SIZE: Size of program image (MB).
22
Receive E-mail When Done
    This is an automated email from the Condor system
    on machine "axpbo8.bo.infn.it". Do not reply.

    Your condor job
        /tmp_mnt/usr/users/ccl/thain/test/fib 40
    exited with status 0.

    Submitted at:   Wed Jun 21 14:24:42 2000
    Completed at:   Wed Jun 21 14:36:36 2000
    Real Time:      0 00:11:54
    Run Time:       0 00:06:52
    Committed Time: 0 00:01:37
    ...
23
Running Many Processes
› 100 processes are almost as easy as 1.
› Each condor_submit makes one cluster of one or more processes.
› Add the number of processes to run to the Queue statement.
› Use the $(PROCESS) variable to give each process slightly different instructions.
24
Running Many Processes
› Compute Fib(0) through Fib(49): $(PROCESS) counts from 0.
› Output goes in fib.out.0, fib.out.1, and so on…
    Executable = fib
    Arguments = $(PROCESS)
    Output = fib.out.$(PROCESS)
    Log = fib.log
    Queue 50
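To see what a `Queue 50` statement produces, the macro expansion can be mimicked in Python. This is a sketch of the substitution only, assuming (as Condor does) that process numbers start at 0; the helper names `expand` and `expand_cluster` are made up for this example and are not part of Condor.

```python
import re

def expand(template, process):
    """Replace every $(PROCESS) in a submit-file value with the process number."""
    return re.sub(r"\$\(PROCESS\)", str(process), template, flags=re.IGNORECASE)

def expand_cluster(count, arguments, output):
    """Yield (process, arguments, output) for 'Queue count', numbering from 0."""
    for process in range(count):
        yield process, expand(arguments, process), expand(output, process)

jobs = list(expand_cluster(50, "$(PROCESS)", "fib.out.$(PROCESS)"))
print(jobs[0])    # (0, '0', 'fib.out.0')
print(jobs[-1])   # (49, '49', 'fib.out.49')
```

The first process is number 0 and the last is number 49, so the cluster covers fib 0 through fib 49.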
25
Running Many Processes
› Another approach: Each process gets its own directory (dir0, dir1, …) and sends the output to dirX/fib.out.
    Executable = fib
    Arguments = $(PROCESS)
    Initial_Dir = dir$(PROCESS)
    Output = fib.out
    Log = fib.log
    Queue 50
26
Running Many Processes
    % condor_q
    -- Submitter: axpbo8.bo.infn.it :
     ID    OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
     9.3   thain  6/23 10:47  0+00:05:40 R  0   2.5  fib 3
     9.6   thain  6/23 10:47  0+00:05:11 R  0   2.5  fib 6
     9.7   thain  6/23 10:47  0+00:05:09 R  0   2.5  fib 7
    ...
    21 jobs; 2 idle, 19 running, 0 held
› In an ID such as 9.3, the first part (9) is the cluster number and the second part (3) is the process number.
27
Where Are They Running?
    % condor_q -run
    -- Submitter: axpbo8.bo.infn.it : :
     ID    OWNER  SUBMITTED     RUN_TIME HOST(S)
     9.47  thain  6/23 10:47  0+00:07:03 ax4bbt.bo.infn.it
     9.48  thain  6/23 10:47  0+00:06:51 pewobo1.bo.infn.it
     9.49  thain  6/23 10:47  0+00:06:30 osde01.pd.infn.it
› HOST(S): Current location.
28
Help! I’m buried in Email!
› By default, Condor sends one email for each completed process.
› Add one of these to your submit file:
    notification = error
    notification = never
› To send it to someone else:
    notify_user = mazzanti@bo.infn.it
29
Removing Processes
› Remove one process:
    condor_rm 9.47
› Remove a whole cluster:
    condor_rm 9
› Remove everything!
    condor_rm -a
30
Getting Feedback
31
What have I done?
› The user log file (fib.log) shows a chronological list of everything important that happened to a job.
    001 (007.035.000) 06/21 17:03:44 Job executing on host:
    004 (007.035.000) 06/21 17:04:58 Job was evicted.
    009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
32
What have I done?
    % condor_history
     ID    OWNER  SUBMITTED    CPU_USAGE ST COMPLETED   CMD
     9.3   thain  6/23 10:47  0+00:00:00 C  6/23 10:58  fib 3
     9.40  thain  6/23 10:47  0+00:00:24 C  6/23 10:59  fib 40
     9.10  thain  6/23 10:47  0+00:00:00 C  6/23 11:01  fib 10
     9.47  thain  6/23 10:47  0+00:05:45 C  6/23 11:01  fib 47
     9.7   thain  6/23 10:47  0+00:00:00 C  6/23 11:01  fib 7
33
Brief I/O Summary
    % condor_q -io
    -- Schedd: c01.cs.wisc.edu :
     ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
     756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s   512.0 KB  32.0 KB
     758.24  joe    198.8 KB  219.5 KB    78  45.0 B /s  512.0 KB  32.0 KB
     758.26  joe     44.7 KB   22.1 KB  2727  13.0 B /s  512.0 KB  32.0 KB
    3 jobs; 0 idle, 3 running, 0 held
34
Complete I/O Summary in Email
    Your condor job "/usr/joe/records.remote input output" exited with status 0.

    Total I/O:
        104.2 KB/s effective throughput
        5 files opened
        104 reads totaling 411.0 KB
        316 writes totaling 1.2 MB
        102 seeks

    I/O by File:
        buffered file /usr/joe/input
            opened 2 times
            100 reads totaling 398.6 KB
            311 write totaling 1.2 MB
            101 seeks
(Only since Condor Version 6.1.11)
35
Complete I/O Summary in Email
› The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.
36
Complete I/O Summary in Email
› Example: CMSSIM, a physics analysis program.
    “Why is this job so slow?”
    Data summary: read 250 MB from a 20 MB file.
    Very high SEEK total -> random access.
    Solution: Increase buffer to 20 MB.
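Why a buffer as large as the file helps with random access can be shown with a toy cache simulation. This is a sketch under loud assumptions: an LRU block cache, hypothetical 1 MB blocks, and a seeded random access pattern stand in for whatever CMSSIM and Condor's remote I/O buffering actually do.

```python
import random
from collections import OrderedDict

def blocks_fetched(accesses, cache_blocks):
    """Count blocks fetched from the file server with an LRU cache of cache_blocks blocks."""
    cache = OrderedDict()
    fetched = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # cache hit: mark as most recently used
        else:
            fetched += 1                   # cache miss: fetch from the server
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return fetched

rng = random.Random(0)                     # fixed seed so the run is repeatable
FILE_BLOCKS = 20                           # a 20 MB file in hypothetical 1 MB blocks
accesses = [rng.randrange(FILE_BLOCKS) for _ in range(250)]  # 250 MB of random reads

small = blocks_fetched(accesses, cache_blocks=1)
large = blocks_fetched(accesses, cache_blocks=FILE_BLOCKS)
print("fetched with a 1-block buffer: ", small)
print("fetched with a 20-block buffer:", large)
```

With a buffer that holds the whole file, each block is fetched at most once, so the traffic to the file server can never exceed the file size no matter how often the program seeks.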
37
Who Uses Condor?
    % condor_q -global
    -- Schedd: to02xd.to.infn.it :
     ID     OWNER     SUBMITTED      RUN_TIME ST PRI SIZE CMD
     127.0  garzelli  6/21 18:45   1+14:18:16 R  0   17.2 tosti2trisdn

    -- Schedd: quark.ts.infn.it :
     ID     OWNER     SUBMITTED      RUN_TIME ST PRI SIZE CMD
     600.0  dellaric  4/10 14:57  55+09:20:31 R  0    9.1 john p2.dat
     665.0  dellaric  6/2  11:14  20+03:27:30 R  0    9.2 john p1.dat
     788.0  pamela    6/20 09:27   3+04:41:43 R  0   15.4 montepamela
38
Who Uses Condor?
    % condor_status -submitters
    Name                 Machine     Running IdleJobs MaxJobsRunning
    rebuzzin@pv.infn.it  decux1.pv.       22       34            200
    pamela@ts.infn.it    quark.ts.i        6        1            200
    giunti@to.infn.it    to05xd.to.       21       49            200
    ...

                         RunningJobs IdleJobs
    cattaneo@pv.infn.it            0        1
    pamela@ts.infn.it              6        1
    rebuzzin@pv.infn.it           22       34
    Total                         59       86
39
Who Uses Condor?
    % condor_userprio
    Last Priority Update: 6/23 16:27
                                   Effective
    User Name                      Priority
    ------------------------------ ---------
    meucci@pv.infn.it                   0.50
    longof@ts.infn.it                   0.50
    thain@bo.infn.it                    0.50
    dellaric@ts.infn.it                 2.00
    clueoff@pd.infn.it                  3.00
    pamela@ts.infn.it                   5.81
    rebuzzin@pv.infn.it                18.18
    giunti@to.infn.it                  19.72
    ------------------------------ ---------
    Number of users shown: 8
40
Who Uses Condor?
› The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter.
› Lighter users receive a lower priority value: they will be allocated CPUs before heavy users.
› Users consuming the same amount of CPU will be allocated an equal amount.
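The effect of the priority value can be illustrated with a few of the numbers from the condor_userprio listing. The ordering rule (lower value first) comes from the slide; the share formula, which divides CPUs in inverse proportion to the priority value, is a simplifying assumption made for this sketch, and `allocation_order` and `fair_shares` are hypothetical helper names.

```python
# Priority values taken from the condor_userprio listing; lower is better.
priorities = {
    "giunti@to.infn.it": 19.72,
    "thain@bo.infn.it": 0.50,
    "pamela@ts.infn.it": 5.81,
    "dellaric@ts.infn.it": 2.00,
}

def allocation_order(priorities):
    """Users in the order they would be offered CPUs: lowest value first."""
    return sorted(priorities, key=lambda user: priorities[user])

def fair_shares(priorities, cpus):
    """ASSUMPTION: split cpus in inverse proportion to each priority value."""
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total = sum(weights.values())
    return {user: cpus * weight / total for user, weight in weights.items()}

print(allocation_order(priorities))
for user, share in fair_shares(priorities, cpus=100).items():
    print(f"{user:22s} {share:6.1f} CPUs")
```

Under this assumption a user with priority 0.50 receives four times the share of a user with priority 2.00, and two users with equal values receive equal shares.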
41
Measuring Goodput
› Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor.
› This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput
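As a rough illustration of the definition, goodput can be expressed as the percentage of run time whose progress was kept. This is a sketch, not Condor's actual accounting: the `(seconds, progress_kept)` representation and the sample history are invented for the example.

```python
def goodput_percent(stretches):
    """stretches: list of (seconds_run, progress_kept) pairs for one job.

    A stretch whose progress was lost (for example, an eviction with no
    checkpoint) counts toward run time but not toward goodput.
    """
    total = sum(seconds for seconds, _ in stretches)
    if total == 0:
        return 0.0
    useful = sum(seconds for seconds, kept in stretches if kept)
    return 100.0 * useful / total

# Hypothetical history: 10 minutes kept, 5 minutes lost, 15 minutes kept.
history = [(600, True), (300, False), (900, True)]
print(f"{goodput_percent(history):.1f}%")
```

A job that never loses work scores 100%, while every stretch that has to be repeated pulls the figure down.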
42
Measuring Goodput
    % condor_q -goodput
    -- Submitter: coral.cs.wisc.edu : : coral.cs.wisc.edu
     ID      OWNER  SUBMITTED     RUN_TIME GOODPUT CPU_UTIL Mb/s
     719.74  thain  6/23 07:35  2+20:47:59  100.0%    87.6% 0.00
     719.75  thain  6/23 07:35  2+20:38:45   40.5%    99.8% 0.00
     719.76  thain  6/23 07:35  2+20:38:16   96.9%    98.7% 0.00
     719.77  thain  6/23 07:35  2+21:10:06  100.0%    99.8% 0.00
43
Setting Requirements with ClassAds
44
Setting Requirements
› We believe that Condor must allow both users (jobs) and owners (machines) to set requirements.
› This is an absolute necessity in order to convince people to participate in the community.
45
ClassAds
› ClassAds are a simple language for describing both the properties and the requirements of jobs and machines.
› Condor stores nearly everything in ClassAds: use the -l option to condor_q and condor_status to get the full details.
46
ClassAd for a Machine
    % condor_status -l axpbo8
    MyType = "Machine"
    TargetType = "Job"
    Name = "axpbo8.bo.infn.it"
    START = TRUE
    VirtualMemory = 342696
    Disk = 28728536
    Memory = 160
    Cpus = 1
    Arch = "ALPHA"
    OpSys = "OSF1"
47
ClassAd for a Job
    % condor_q -l 9.49
    MyType = "Job"
    TargetType = "Machine"
    Owner = "thain"
    Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
    Out = "fib.out.49"
    Args = "49"
    ImageSize = 2544
    DiskUsage = 2544
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
48
Default Requirements
› By default, Condor assumes the requirements for your job are:
    “I need a machine with…”
    The same operating system and architecture as my workstation.
    Enough disk to store the program.
    Enough virtual memory to run the program.
49
Default Requirements
› Expressed in ClassAds as:
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
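The default expression mixes machine attributes (Arch, OpSys, Disk, VirtualMemory) with job attributes (DiskUsage, ImageSize). A worked evaluation in Python, using the values from the axpbo8 machine ad and the job 9.49 ad shown earlier, makes this concrete. It is a hand translation for illustration only, not a ClassAd evaluator, and the second machine is hypothetical.

```python
# Values copied from the ClassAds for machine axpbo8 and job 9.49.
machine = {"Arch": "ALPHA", "OpSys": "OSF1",
           "Disk": 28728536, "VirtualMemory": 342696}
job = {"DiskUsage": 2544, "ImageSize": 2544}

def default_requirements(job, machine, arch="ALPHA", opsys="OSF1"):
    """Hand-written equivalent of the default Requirements expression."""
    return (machine["Arch"] == arch
            and machine["OpSys"] == opsys
            and machine["Disk"] >= job["DiskUsage"]
            and machine["VirtualMemory"] >= job["ImageSize"])

# A hypothetical Intel/Linux machine with plenty of disk and memory.
linux_box = {"Arch": "INTEL", "OpSys": "LINUX",
             "Disk": 50000000, "VirtualMemory": 500000}

print(default_requirements(job, machine))    # True
print(default_requirements(job, linux_box))  # False
```

The Linux machine has more than enough resources but still fails, because the default requirements pin the job to the submitting workstation's architecture and operating system.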
50
ClassAd Requirements
› Similar to C/C++/Java expressions:
    Symbols: Arch, OpSys, Memory, Mips
    Values: 15, 6.5, "LINUX"
    Operators: ==, !=, <, >, <=, >=, &&, ||, ( )
51
Adding Requirements
› In the submit file, add a line beginning with "Requirements =":
    Executable = fib
    Arguments = 40
    Output = fib.out
    Log = fib.log
    Requirements = (Memory > 64)
    queue
52
Example Requirements
› (Memory > 64)
› (Machine == "axpbo3.bo.infn.it")
› (Mips > 100) || (Kflops > 10000)
› (Subnet != "131.154.10") && (Disk > 20000000)
53
Are the Requirements Reasonable?
› Two ways to find out:
    Before running, use condor_status to list all machines matching certain requirements.
    While running, use condor_q -analyze to see if a match is possible.
54
Are the Requirements Reasonable?
    % condor_status -constraint '(Memory>640)'
  Only axpd30.
    % condor_status -constraint '(Memory>512)'
  Five machines: ax4mcs, axpd30, axppv3, axzds0, and stonehenge.
55
Are the Requirements Reasonable?
› Suppose that I submit a job like this:
    Executable = fib
    Arguments = 40
    Output = fib.out
    Requirements = ( Mips > 5000 )
    queue
› “My job isn’t running. Why?”
56
Are the Requirements Reasonable?
    % condor_q -analyze
    WARNING: Be advised:
        No resources matched request's constraints
        Check the Requirements expression below:

    Requirements = ((Mips > 5000)) && (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
57
Preferences
› Condor assumes that any machines that match your requirements are suitable.
› However, you may prefer some machines over others. (100 Mips is better than 10.)
› To indicate a preference, you may provide a ClassAd expression which ranks all matches.
58
Rank
› The rank expression is evaluated into a number for every potential matching machine.
› A machine with a higher number will be preferred over a machine with a lower number.
59
Rank Examples
› Prefer machines with more Mips:
    Rank = Mips
› Prefer more memory, but add 100 to the rank if the machine is Solaris 2.7:
    Rank = Memory + 100*(OpSys=="SOLARIS27")
› Prefer machines with a high ratio of memory to CPU performance:
    Rank = Memory/Mips
› Prefer machines that will checkpoint in Bologna:
    Rank = (CkptServer=="ckpt.bo.infn.it")
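The second rank expression relies on a comparison evaluating to 1 when true and 0 when false. Python's booleans behave the same way in arithmetic, so the expression can be tried out directly. The three machines below are hypothetical, and the sketch only imitates how a rank orders candidates that already satisfy the requirements.

```python
def rank(machine):
    """Rank = Memory + 100*(OpSys=="SOLARIS27"); True counts as 1, False as 0."""
    return machine["Memory"] + 100 * (machine["OpSys"] == "SOLARIS27")

# Hypothetical machines that already satisfy the job's requirements.
candidates = [
    {"Name": "sun-a", "Memory": 256, "OpSys": "SOLARIS26"},
    {"Name": "sun-b", "Memory": 192, "OpSys": "SOLARIS27"},
    {"Name": "sun-c", "Memory": 128, "OpSys": "SOLARIS27"},
]

for machine in sorted(candidates, key=rank, reverse=True):
    print(machine["Name"], rank(machine))
# sun-b 292
# sun-a 256
# sun-c 228
```

The Solaris 2.7 bonus lets sun-b overtake sun-a despite having less memory, but it is not large enough to lift sun-c above sun-a.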
60
Use MORE Machines!
› The Condor pool has several architectures:
    115 Alpha/OSF1
    62 Intel/Linux
    11 Sun4u/Solaris
› To get maximum throughput, you must use all that are available. Be greedy!
61
Compile for Each System
› Make an executable for each kind of system you wish to use. Give each a unique name.
› On an Alpha/OSF1:
    condor_compile gcc fib.c -o fib.ALPHA.OSF1
› On an Intel/Linux:
    condor_compile gcc fib.c -o fib.INTEL.LINUX
62
Change the Submit File
    Executable = fib.$$(Arch).$$(OpSys)
    Requirements = ( ((Arch=="ALPHA") && (OpSys=="OSF1")) || ((Arch=="INTEL") && (OpSys=="LINUX")) )
› Executable: Make the executable name a function of the machine selected.
› Requirements: Allow either ALPHA/OSF1 or INTEL/LINUX machines to be selected.
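The late substitution of machine attributes into the executable name can be imitated in Python. This sketch assumes the parenthesized macro form `$$(Arch)` and treats attribute names as case-insensitive; the function name `substitute` and the dictionary machine ads are made up for the illustration.

```python
import re

def substitute(template, machine_ad):
    """Fill each $$(Attr) in template from the matched machine's attributes."""
    lowered = {name.lower(): value for name, value in machine_ad.items()}

    def lookup(match):
        # Attribute names are compared case-insensitively in this sketch.
        return str(lowered[match.group(1).lower()])

    return re.sub(r"\$\$\((\w+)\)", lookup, template)

template = "fib.$$(Arch).$$(OpSys)"
print(substitute(template, {"Arch": "ALPHA", "OpSys": "OSF1"}))   # fib.ALPHA.OSF1
print(substitute(template, {"Arch": "INTEL", "OpSys": "LINUX"}))  # fib.INTEL.LINUX
```

Because the substitution uses the attributes of whichever machine was matched, the same submit file picks the right binary for each architecture.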
63
Condor Will Decide at the Last Minute
[Diagram: you submit fib.$$(Arch).$$(OpSys); Condor runs fib.ALPHA.OSF1 on an Alpha/OSF1 machine or fib.INTEL.LINUX on an Intel/Linux machine.]
64
Standard or Vanilla?
65
Which Universe?
› Each Condor universe provides different services to different kinds of programs:
    Standard: Relinked UNIX programs
    Vanilla: Unmodified UNIX programs
    PVM, Scheduler, Globus (not described here)
66
Which Universe?
[Diagram: a cluster and a file server, drawn once for the VANILLA universe and once for the STANDARD universe.]
67
Standard Universe
› Submit a specially-linked UNIX application to the Condor system.
› Advantages:
    Checkpointing for fault tolerance.
    Remote I/O services:
        Friendly environment anywhere in the world.
        Data buffering and staging.
        I/O performance feedback.
        User remapping of data sources.
68
Standard Universe
› Disadvantages:
    Must statically link with Condor library.
    Limited class of applications:
        Single-process UNIX binaries.
        Certain system calls prohibited.
69
System Call Limitations
› Standard universe does not allow:
    Multiple processes: fork(), exec(), system()
    Inter-process communication: semaphores, messages, shared memory
    Complex I/O: mmap(), select(), poll(), non-blocking I/O, …
    Kernel-level threads (user-level threads are OK)
70
System Call Limitations
› Too restrictive? Use the vanilla universe.
71
Vanilla Universe
› Submit any sort of UNIX program to the Condor system.
› Advantages:
    No relinking required.
    Any program at all, including:
        Binaries
        Shell scripts
        Interpreted programs (java, perl)
        Multiple processes
72
Vanilla Universe
› Disadvantages:
    No checkpointing.
    Very limited remote I/O services:
        Specify input files explicitly.
        Specify output files explicitly.
    Condor will refuse to start a vanilla job on a machine that is unfriendly.
        ClassAds: FilesystemDomain and UIDDomain
73
Which Universe?
› Standard: Good for mixed Condor pools, flocked pools, and the Grid at large.
› Vanilla: Good for a Condor pool of identical machines.
74
Conclusion
75
Conclusion
› Condor expands your reach to many CPUs, even those you cannot log in to.
› Condor makes it easy to run and manage large numbers of jobs.
› Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O.
› Too restrictive? Use the vanilla universe, but expect fewer available machines.
76
Conclusion
› Need more info?
› Douglas Thain (thain@cs.wisc.edu)
› INFN CCL (ccl@bo.infn.it)
› Condor Web Page (http://www.cs.wisc.edu/condor)
› This talk: (http://www.cs.wisc.edu/~thain)