1
Condor Tutorial for Users INFN-Bologna, 6/29/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison wright@cs.wisc.edu
2
2 Conventions Used In This Presentation A slide with an all-yellow background is the beginning of a new “chapter” The slides after it will describe each entry on the yellow slide in great detail A Condor tool that users would use will be in red italics A ClassAd attribute name will be in blue A UNIX shell command or file name will be in courier font
3
3 What is Condor? A system for “High-Throughput Computing” Lots of jobs over a long period of time, not a short burst of “high-performance” Condor manages both resources (machines) and resource requests (jobs) Supports additional features for jobs that are re-linked with Condor libraries: checkpointing remote system calls
4
4 What’s Condor Good For? Managing a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. Condor can handle inter-job dependencies (DAGMan)
5
5 What’s Condor Good For? (cont’d) Robustness Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion If an execute machine crashes, you only lose work done since the last checkpoint Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover
6
6 What’s Condor Good For? (cont’d) Giving you access to more computing resources Checkpointing allows your job to run on “opportunistic resources” (not dedicated) Checkpointing also provides “migration” - if a machine is no longer available, move! With remote system calls, you don’t even need an account on a machine where your job executes
7
7 What is a Condor Pool? “Pool” can be a single machine, or a group of machines Determined by a “central manager” - the matchmaker and centralized information repository Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself
8
8 What Kind of Job Do You Have? You must know some things about your job to decide if and how it will work with Condor: What kind of I/O does it do? Does it use TCP/IP? (network sockets) Can the job be resumed? Is the job multi-process (fork(), pvm_addhost(), etc.)
9
9 What Kind of I/O Does Your Job Do? Interactive TTY “Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files) X Windows NFS, AFS, or another network file system Local file system TCP/IP
10
10 What Does Condor Support? Condor can support various combinations of these features in different “Universes” Different Universes provide different functionality for your job: Vanilla Standard Scheduler PVM
11
11 What Does Condor Support?
12
12 Condor Universes A Universe specifies a Condor runtime environment: STANDARD –Supports Checkpointing –Supports Remote System Calls –Has some limitations ( no fork(), socket(), etc.) VANILLA –Any Unix executable (shell scripts, etc) –No Condor Checkpointing or Remote I/O
13
13 Condor Universes (cont’d) PVM (Parallel Virtual Machine) –Allows you to run parallel jobs in Condor (more on this later) SCHEDULER –Special kind of Condor job: the job is run on the submit machine, not a remote execute machine –Job is automatically restarted if the condor_schedd is shut down –Used to schedule jobs (e.g. DAGMan)
14
14 Submitting Jobs to Condor Choosing a “Universe” for your job (already covered this) Preparing your job Making it “batch-ready” Re-linking if checkpointing and remote system calls are desired (condor_compile) Creating a submit description file Running condor_submit Sends your request to the User Agent (condor_schedd)
15
15 Preparing Your Job Making your job “batch-ready” Must be able to run in the background: no interactive input, windows, GUI, etc. Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices If your job expects input from the keyboard, you have to put the input you want into a file
16
16 Preparing Your Job (cont’d) If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s special libraries To do this, you use condor_compile Place “condor_compile” in front of the command you normally use to link your job: condor_compile gcc -o myjob myjob.c
17
17 Creating a Submit Description File A plain ASCII text file Tells Condor about your job: Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
18
18 Example Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
19
19 Example Submit Description File Described Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command-line arguments, and specifies the directory the job should be run in
Equivalent to (outside of Condor):
% cd /home/wright/condor/run_1
% /home/wright/condor/my_job.condor -arg1 -arg2 \
  > my_job.stdout 2> my_job.stderr \
  < my_job.stdin
20
20 “Clusters” and “Processes” If your submit file describes multiple jobs, we call this a “cluster” Each job within a cluster is called a “process” or “proc” If you only specify one job, you still get a cluster, but it has only one process A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) Process numbers always start at 0
21
21 Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500
22
22 Example Submit Description File for a Cluster - Described Now, the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 500” to submit 500 jobs at once $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we’ll have “run_0”, “run_1”, … “run_499” directories All the input/output files will be in different directories!
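Condor does not create InitialDir for you, so the run_0 … run_499 directories (and each job’s input files) must exist before condor_submit is run. A minimal Python sketch of that setup step (the directory names and count come from the slide’s example; this is preparation you do yourself, not something Condor performs):

```python
import os
import tempfile

def make_run_dirs(base, count):
    """Create base/run_0 .. base/run_<count-1> ahead of condor_submit."""
    for proc in range(count):
        os.makedirs(os.path.join(base, "run_%d" % proc))

# A throwaway location for the demonstration; the slide's example used 500.
base = tempfile.mkdtemp()
make_run_dirs(base, 5)
```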
23
23 Running condor_submit You give condor_submit the name of the submit file you have created condor_submit parses the file and creates a “ClassAd” that describes your job(s) Creates the files you specified for STDOUT and STDERR Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue
24
24 Monitoring Your Jobs Using condor_q Using a “User Log” file Using condor_status Using condor_rm Getting email from Condor Once they complete, you can use condor_history to examine them
25
25 Using condor_q To view the jobs you have submitted, you use condor_q Displays the status of your job, how much compute time it has accumulated, etc. Many different options: A single job, a single cluster, all jobs that match a certain constraint, or all jobs Can view remote job queues (either individual queues, or “-global”)
26
26 Using a “User Log” file A UserLog must be specified in your submit file: Log = filename You get a log entry for everything that happens to your job: When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc. Very useful! Highly recommended!
27
27 Using condor_status To view the status of the whole Condor pool, you use condor_status Can use the “-run” option to see which machines are running jobs, as well as: The user who submitted each job The machine they submitted from Can also view the status of various submitters with “-submitter ”
28
28 Using condor_rm If you want to remove a job from the Condor queue, you use condor_rm You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root) You can give specific job ID’s (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.
29
29 Getting Email from Condor By default, Condor will send you email when your job completes If you don’t want this email, put this in your submit file: notification = never If you want email every time something happens to your job (checkpoint, exit, etc), use this: notification = always
30
30 Getting Email from Condor (cont’d) If you only want email if your job exits with an error, use this: notification = error By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this: notify_user = email@address.here
31
31 Using condor_history Once your job completes, it will no longer show up in condor_q Now, you must use condor_history to view the job’s ClassAd The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm
32
32 Any questions? Nothing is too basic If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing
33
Hands-On Exercise #1 Submitting and Monitoring a Simple Test Job
34
34 Hands-On Exercise #1 Login to your machine as user “condor” You will see two windows: Netscape, with instructions An xterm, where you execute commands To begin, click on Simple Test Job Please follow the directions carefully Any lines beginning with % are commands that you should execute in your xterm If you accidentally exit Netscape, click on “Tutorial” in the Start menu
35
Lunch break Please be back by 13:30
36
Welcome Back
37
37 Classified Advertisements ClassAds Language for expressing attributes Semantics for evaluating them Intuitively, a ClassAd is a set of named expressions Each named expression is an attribute Expressions are similar to C … Constants, attribute references, operators
38
38 Classified Advertisements: Example
MyType = "Machine"
TargetType = "Job"
Name = "froth.cs.wisc.edu"
StartdIpAddr = " "
Arch = "INTEL"
OpSys = "SOLARIS26"
VirtualMemory = 225312
Disk = 35957
KFlops = 21058
Mips = 103
LoadAvg = 0.011719
KeyboardIdle = 12
Cpus = 1
Memory = 128
Requirements = LoadAvg <= 0.3 && KeyboardIdle > 15 * 60
Rank = 0
39
39 Classified Advertisements: Matching ClassAds are always considered in pairs: Does ClassAd A match ClassAd B (and vice versa)? This is called “2-way matching” If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name
40
40 Classified Advertisements: Examples
ClassAd A
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150
ClassAd B
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
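The two ads above match because each one’s Requirements evaluates to True against the other. A toy Python model of that 2-way evaluation (not the real ClassAd evaluator — ads are plain dicts and each Requirements is a function taking `my` and `target`, mirroring the MY./TARGET. scoping):

```python
def two_way_match(ad_a, ad_b):
    """Condor-style 2-way matching: BOTH Requirements must hold."""
    return (ad_a["Requirements"](my=ad_a, target=ad_b) and
            ad_b["Requirements"](my=ad_b, target=ad_a))

apartment = {
    "RentOffer": 1000, "OnBusLine": True, "SquareArea": 3500,
    # MY.RentOffer - TARGET.RentOffer < 150
    "Requirements": lambda my, target: my["RentOffer"] - target["RentOffer"] < 150,
}
renter = {
    "RentOffer": 900, "UnderGrad": False,
    # OnBusLine && SquareArea > 2700 (attributes resolved in the other ad)
    "Requirements": lambda my, target: target["OnBusLine"] and target["SquareArea"] > 2700,
}
```

With the slide’s numbers (offer 900 against rent 1000), both sides are satisfied and the pair matches.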
41
41 ClassAds in the Condor System ClassAds allow Condor to be a general system Constraints and ranks on matches expressed by the entities themselves Only priority logic integrated into the Match-Maker All principal entities in the Condor system are represented by ClassAds Machines, Jobs, Submitters
42
42 ClassAds in Condor: Requirements and Rank (Example for Machines)
Friend = Owner == "tannenba" || Owner == "wright"
ResearchGroup = Owner == "jbasney" || Owner == "raman"
Trusted = Owner != "rival" && Owner != "riffraff"
Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
Rank = Friend + ResearchGroup*10
43
43 Requirements for Machine Example Described Machine will never start a job submitted by “rival” or “riffraff” If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run, regardless of keyboard activity or load average If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3
44
44 Machine Rank Example Described If the machine is running a job submitted by owner “foo”, it will give this a Rank of 0, since foo is neither a friend nor in the same research group If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0) If “raman” or “jbasney” submits a job, it will have a rank of 10 While a machine is running a job, that job can be preempted in favor of a higher-ranked job
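The arithmetic in the machine’s Rank expression relies on ClassAd booleans coercing to 0/1, which is what makes Friend + ResearchGroup*10 work. A small Python model of it (a sketch, not Condor code):

```python
def machine_rank(owner):
    """Model of: Rank = Friend + ResearchGroup*10 from the machine example."""
    friend = owner in ("tannenba", "wright")      # the Friend expression
    research = owner in ("jbasney", "raman")      # the ResearchGroup expression
    # Booleans coerce to 0/1, exactly as in the ClassAd expression.
    return int(friend) + int(research) * 10

# "foo" -> 0, "wright" -> 1, "raman" -> 10, matching the slide's description.
```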
45
45 ClassAds in Condor: Requirements and Rank (Example for Jobs)
Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )
46
46 Job Example Described The job must run on an Intel CPU, running Linux, with at least 20 megs of RAM All machines with 32 megs of RAM or less are Ranked at 0 Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second
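The job’s Rank formula can be checked the same way: the (Memory > 32) factor zeroes out every machine at or below 32 megs, and above that the machine’s RAM, dedication, and MIPS all add in. A Python sketch of the evaluation (not Condor code; the sample numbers are illustrative):

```python
def job_rank(memory, is_dedicated, mips):
    """Model of: Rank = (Memory > 32) * ((Memory*100) + (IsDedicated*10000) + Mips)"""
    return int(memory > 32) * (memory * 100 + int(is_dedicated) * 10000 + mips)

# A 32-meg machine ranks 0; a dedicated 64-meg, 103-MIPS machine ranks
# 64*100 + 10000 + 103 = 16503.
```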
47
47 Finding and Using the ClassAd Attributes in your Pool Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”) To see if machines in your pool have other attributes defined, use: condor_status -long A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”
48
48 ClassAd “Meta-Operators” Meta operators allow you to compare against “UNDEFINED” as if it were a real value: =?= is “meta-equal-to” =!= is “meta-not-equal-to” Color != “Red” (non-meta) would evaluate to UNDEFINED if Color is not defined Color =!= “Red” would evaluate to True if Color is not defined, since UNDEFINED is not “Red”
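The difference between != and =!= is three-valued logic: the ordinary operator propagates UNDEFINED, while the meta-operator treats UNDEFINED as an ordinary value and always yields a boolean. A sketch in Python (the UNDEFINED sentinel is a stand-in for the ClassAd value, not Condor code):

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value

def neq(a, b):
    """ClassAd '!=': yields UNDEFINED if either operand is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b

def meta_neq(a, b):
    """ClassAd '=!=': compares UNDEFINED like a real value, always boolean."""
    if a is UNDEFINED or b is UNDEFINED:
        # Unequal unless BOTH sides are UNDEFINED.
        return not (a is UNDEFINED and b is UNDEFINED)
    return a != b

# Color != "Red" with Color undefined  -> UNDEFINED
# Color =!= "Red" with Color undefined -> True (UNDEFINED is not "Red")
```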
49
Hands-On Exercise #2 Submitting Jobs with Requirements and Rank
50
50 Hands-On Exercise #2 Please point your browser to the new instructions: Go back to the tutorial homepage Click on Requirements and Rank Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm If you exited Netscape, just click on “Tutorial” from your Start menu
51
51 Priorities In Condor Two kinds of priorities: User Priorities –Priorities between users in the pool to ensure fairness –The lower the value, the better the priority Job Priorities –Priorities that users give to their own jobs to determine the order in which they will run –The higher the value, the better the priority –Only matters within a given user’s jobs
52
52 User Priorities in Condor Each active user in the pool has a user priority Viewed or changed with condor_userprio The lower the number, the better A given user’s share of available machines is inversely related to the ratio between user priorities. Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
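The “inversely related” rule from the slide can be modeled in a few lines of Python. This is a simplified steady-state allocation model (it ignores Condor’s time-averaging, preemption, and integer machine counts), just to show the ratio arithmetic:

```python
def shares(priorities, total_machines):
    """Split machines among users inversely proportional to priority value."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: total_machines * w / total for user, w in weights.items()}

# Fred at priority 10 vs Joe at 20: Fred gets twice as many machines.
allocation = shares({"fred": 10, "joe": 20}, 30)
```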
53
53 User Priorities in Condor, cont. Condor continuously adjusts user priorities over time: if a user holds more machines than their priority share, their priority value worsens; if fewer, it improves Priority Preemption Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…) Starvation is prevented Priority “thrashing” is prevented
54
54 Job Priorities in Condor Can be set at submit-time in your description file with: prio = Can be viewed with condor_q Can be changed at any time with condor_prio The higher the number, the more likely the job will run (only among the jobs of an individual user)
55
55 Managing a Large Cluster of Jobs Condor can manage huge numbers of jobs Special features of the submit description file make this easier Condor can also manage inter-job dependencies with condor_dagman For example: job A should run first, then, run jobs B and C, when those finish, submit D, etc… We’ll discuss DAGMan later
56
56 Submitting a Large Cluster Anywhere in your submit file, if you use $(Process), that will expand to the process number of each job in the cluster: input = my_input.$(process) arguments = $(process) It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory: InitialDir = dir.$(process)
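$(Process) substitution is performed by condor_submit itself; this little Python sketch only mimics what the expansion does to each line (the case-insensitive match reflects the slide’s mixed use of $(Process) and $(process)):

```python
import re

def expand_process(line, proc):
    """Mimic condor_submit's case-insensitive $(Process) expansion."""
    return re.sub(r"\$\(process\)", str(proc), line, flags=re.IGNORECASE)

# expand_process("InitialDir = dir.$(Process)", 3) gives "InitialDir = dir.3"
```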
57
57 Submitting a Large Cluster (cont’d) Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit: Queue 1000 A cluster is more efficient: Your jobs will run faster, and they’ll use less space Can only have one executable per cluster: Different executables must be different clusters!
58
Hands-On Exercise #3 Submitting a Large Cluster of Jobs
59
59 Hands-On Exercise #3 Please point your browser to the new instructions: Go back to the tutorial homepage Click on Large Clusters Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm If you exited Netscape, just click on “Tutorial” from your Start menu
60
10 Minute Break Questions are welcome….
61
61 Inter-Job Dependencies with DAGMan DAGMan can be used to handle a set of jobs that must be run in a certain order Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes Robust: handles errors and submit-machine crashes
62
62 Using DAGMan You define a DAG description file, which is similar in function to the submit file you give to condor_submit DAGMan restrictions: Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions) All jobs in the DAG must have a User Log and must share the same file
63
63 Format of the DAGMan Description File # is a comment First section names the jobs in your DAG and associates a submit description file with each job Second (optional) section defines PRE and POST scripts to run Final section defines the job dependencies
64
64 Example DAGMan Description File
# Example DAGMan input file
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
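The dependency structure this file expresses (A first, then B and C, then D) amounts to a topological ordering: a job may run only once all of its parents have finished. A tiny Python sketch of that rule — not DAGMan itself, just the ordering logic it enforces:

```python
def run_order(parents):
    """Return a run order where each job follows all of its parents.

    parents maps each job name to the list of jobs it depends on.
    (Assumes the dependencies form a DAG, as DAGMan requires.)
    """
    done, order = set(), []
    while len(done) < len(parents):
        ready = [job for job, deps in parents.items()
                 if job not in done and all(d in done for d in deps)]
        order.extend(sorted(ready))   # sorted only to make the order deterministic
        done.update(ready)
    return order

# The example DAG: A -> {B, C} -> D
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
```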
65
65 Setting up a DAG for Condor Must create the DAG description file Must create all the submit description files for the individual jobs Must prepare any executables you plan to use If you want, you can have a mix of Vanilla and Standard jobs Must setup any PRE/POST commands or scripts you wish to use
66
66 Submitting a DAG to Condor Once you have everything in place, to submit a DAG, you use condor_submit_dag and give it the name of your DAG description file This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments
67
67 Removing a DAG Removing a DAG is easy: Just use condor_rm on the scheduler universe job (condor_dagman) On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue
68
Hands-On Exercise #4 Using DAGMan
69
69 Hands-On Exercise #4 Please point your browser to the new instructions: Go back to the tutorial homepage Click on Using_DAGMan Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm If you exited Netscape, just click on “Tutorial” from your Start menu
70
70 What’s Wrong with my Vanilla Job? Special requirements expressions for vanilla jobs You didn’t submit it from a directory that is shared Condor isn’t running as root (more on this later) You don’t have your file permissions setup correctly (more on this later)
71
71 Special Requirements Expressions for Vanilla Jobs When you submit a vanilla job, Condor automatically appends two extra Requirements, constraining UID_DOMAIN and FILESYSTEM_DOMAIN to match the values on your submit machine Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files
72
72 Special Requirements Expressions for Vanilla Jobs By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system If you don’t have an account on the remote system, Vanilla jobs won’t work
73
73 Shared File Systems for Vanilla Jobs Just because you have AFS or NFS doesn’t mean ALL files are shared Initialdir = /tmp will probably cause trouble for Vanilla jobs! You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs
74
74 Why Don’t My Jobs Run? Try using condor_q -analyze Try specifying a User Log for your job Look at condor_userprio: maybe you have a bad priority and higher priority users are being served Problems with file permissions or network file systems Look at the SchedLog
75
75 Using condor_q -analyze condor_q -analyze will analyze your job’s ClassAd, get all the ClassAds of the machines in the pool, and tell you what’s going on: Will report errors in your Requirements expression (impossible to match, etc.) Will tell you about user priorities in the pool (other people have better priority)
76
76 Looking at condor_userprio You can look at condor_userprio yourself If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool
77
77 File Permissions in Condor If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”) You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)
78
78 File Permissions in Condor (cont’d) Often, there will be a “condor” group and you can make your files owned and writable by this group For vanilla jobs, even if the UID_DOMAIN setting is correct, and it matches for your submit and execute machines, if Condor isn’t running as root, your job will be started as user “condor”, not as you!
79
79 Problems with NFS in Condor For NFS, sometimes the administrators will setup read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)
80
80 Problems with NFS in Condor (cont’d) If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine E.g. you’re in /mnt/tmp/home/me/... With automounting, you always need to specify InitialDir explicitly InitialDir = /home/me/...
81
81 Problems with AFS in Condor If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token You must grant an unauthenticated AFS user the appropriate access to your files Some sites provide a better alternative to world-writable files –Host ACLs –Network-specific ACLs
82
82 Looking at the SchedLog Looking at the log file of the condor_schedd, the “SchedLog” file can possibly give you a clue if there are problems Find it with: condor_config_val schedd_log You might need your pool administrator to turn on a higher “debugging level” to see more verbose output
83
83 Other User Features Submit-Only installation Heterogeneous Submit PVM jobs
84
84 Submit-Only Installation Can install just a condor_master and condor_schedd on your machine Can submit jobs into a remote pool Special option to condor_install
85
85 Heterogeneous Submit The job you submit doesn’t have to be the same platform as the machine you submit from Maybe you have access to a pool that’s full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1
86
86 Parallel Jobs in Condor Condor can run parallel applications Written to the popular PVM message passing library Future work includes support for MPI Master-Worker Paradigm What does Condor-PVM do? How to compile and submit Condor-PVM jobs
87
87 Master-Worker Paradigm Condor-PVM is designed to run PVM applications which follow the master-worker paradigm. Master has a pool of work, sends pieces of work to the workers, manages the work and the workers Worker gets a piece of work, does the computation, sends the result back
88
88 What does Condor-PVM do? Condor acts as the PVM resource manager. All pvm_addhost requests get re-mapped to Condor. Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines. When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.
89
89 How to compile and submit Condor-PVM jobs Binary Compatible Compile and link with PVM library just as normal PVM applications. No need to link with Condor. Submit In the submit description file, set:
universe = PVM
machine_count = ..
90
90 Obtaining Condor Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor Complete Users and Administrators manual available http://www.cs.wisc.edu/condor/manual Contracted Support is available Questions? Email: condor-admin@cs.wisc.edu