Condor Tutorial for Administrators INFN-Bologna, 6/28/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison


1 Condor Tutorial for Administrators INFN-Bologna, 6/28/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison wright@cs.wisc.edu

2 Conventions Used In This Presentation
- A slide with an all-yellow background is the beginning of a new “chapter”
  - The slides after it will describe each entry on the yellow slide in great detail
- A Condor tool that users would use will be in red italics
- A ClassAd attribute name will be in blue
- A UNIX shell command or file name will be in courier font

3 What is Condor?
- A system for “High-Throughput Computing”
- Lots of jobs over a long period of time, not a short burst of “high-performance”
- Condor manages both resources (machines) and resource requests (jobs)
- Supports additional features for jobs that are re-linked with Condor libraries:
  - checkpointing
  - remote system calls

4 What’s Condor Good For?
- Managing a large number of jobs
  - You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
  - Condor maintains a persistent job queue
  - Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion

5 What’s Condor Good For? (continued)
- Managing a large number of machines
  - Condor daemons run on all the machines in your pool and are constantly monitoring machine state
  - You can query Condor for information about your machines
  - Condor handles all background jobs in your pool with minimal impact on your machine owners

6 What is a Condor Pool?
- “Pool” can be a single machine, or a group of machines
- Determined by a “central manager” - the matchmaker and centralized information repository
- Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

7 The Condor Daemons
- condor_master (controls everything else)
- condor_startd (executing jobs)
  - condor_starter (helper for starting jobs)
- condor_schedd (submitting jobs)
  - condor_shadow (submit-side helper)
- condor_collector (only on Central Manager)
- condor_negotiator (only on CM)
- You only have to run the daemon(s) for the service(s) you want to provide

8 condor_master
- Starts up all other Condor daemons
- If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
- Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version

9 condor_master (cont’d)
- Provides access to many remote administration commands:
  - condor_reconfig, condor_restart, condor_off, condor_on
- Default server for many other commands:
  - condor_config_val, etc.
- Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

10 condor_startd
- Represents a machine to the Condor pool
- Enforces the wishes of the machine owner (the owner’s “policy”)
- Responsible for starting, suspending, and stopping jobs
- Spawns the appropriate condor_starter, depending on the type of job
- Provides other administrative commands (for example, condor_vacate)

11 condor_starter
- Spawned by the condor_startd to handle all the details of starting and managing the job (for example, transferring the job’s binary to the executing machine or sending back exit status)
- On SMP machines, you get one condor_starter per CPU
- For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)

12 condor_schedd
- Represents users to the Condor pool
- Maintains persistent queue of jobs
- Responsible for contacting available machines and spawning waiting jobs
- Services most user commands:
  - condor_submit
  - condor_rm
  - condor_q

13 condor_shadow
- Represents the job on the submit machine
- Services requests from “standard” jobs for “remote system calls”, including all file I/O
- Is responsible for making decisions on behalf of the job (for example, where to store the checkpoint file)
- There will be one condor_shadow process running on your submit machine for each currently running Condor job

14 condor_shadow (cont’d)
- The shadow doesn’t put much load on your submit machine:
  - Almost always blocked waiting for requests from the job or doing I/O
  - Relatively small memory footprint
- Still, you can limit the impact of the shadows on a given submit machine:
  - They can be started by Condor with a “nice-level” that you configure (renice)
  - Can put a limit on the total number of shadows running on a machine

15 condor_collector
- Collects information from all other Condor daemons in the pool
- Each daemon sends a periodic update called a “ClassAd” to the collector
- Services queries for information:
  - Queries from other Condor daemons
  - Queries from users (condor_status)

16 condor_negotiator
- Performs “matchmaking” in Condor
- Gets information from the collector about all available machines and all idle jobs
- Tries to match jobs with machines that will serve them
- Both the job and the machine must satisfy each other’s requirements (this is called “2-way matching”)

17 Layout of a Personal Condor Pool
[Figure: a single Central Manager machine running master, collector, negotiator, schedd and startd. Legend: ClassAd communication pathway; process spawned.]

18 Layout of a General Condor Pool
[Figure: a Central Manager (master, collector, negotiator, schedd, startd), one Submit-Only machine (master, schedd), two Execute-Only machines (master, startd) and two Regular Nodes (master, schedd, startd). Legend: ClassAd communication pathway; process spawned.]

19 What happens when you submit a job to Condor?
- condor_submit contacts condor_schedd and adds job to the job queue
- condor_schedd sends ClassAd to the condor_collector requesting a machine
- condor_negotiator matches the request with an available machine
- condor_schedd “claims” the machine and spawns a condor_shadow

20 What happens when you submit a job to Condor? (part 2)
- condor_shadow contacts condor_startd and requests the appropriate condor_starter
- condor_starter actually spawns the application, and connects it to the condor_shadow
- condor_startd monitors the machine and waits for commands
- Either the application completes, or the condor_startd forces it to either suspend or vacate

21 Condor System Structure

22 Considerations for Installing a Condor Pool
- What machine should be your central manager?
- Does your pool have a shared file system?
- Where should you install your Condor binaries and configuration files?
- Where should you put the local directories for each machine?
- Will you start the daemons as root or as some other user?

23 What machine should be your central manager?
- The central manager (CM) is very important for the proper functioning of your pool
- You want a machine that will be online all the time, or will be rebooted quickly if there is a problem
- If the CM crashes, jobs that are currently matched will continue to run, but new jobs will not be matched
- A good network connection helps

24 Does your pool have a shared file system?
- A shared file system is essential if you wish to run “vanilla” jobs
- It can also make administration of a large pool easier
- NFS works better with Condor than AFS, since Condor does not manage AFS tokens (yet), though either one will work

25 Where should you install your binaries and configuration files?
- Putting the config files on a shared file system makes administration much easier
- Putting the binaries on a shared file system makes installing a new version easier, but it can be less stable (since problems with the network can cause daemons to crash)
- Keeping the condor_master binary on the local disk is a good compromise

26 Where should you put the local directories for each machine?
- You need a fair amount of disk space in the spool directory for each condor_schedd (to hold the job queue and the binaries for each job submitted)
- The execute directory is used by the condor_starter to hold the binary for any Condor job running on a machine
- The log directory is used by all daemons… more space = more saved info

27 Will you start the daemons as root or some other user?
- If you have root access, we recommend you start the daemons as root
  - More secure
  - Less confusion for users
- If you don’t have root access, Condor will still work; users just have to take some extra steps to submit jobs
- Can have “personal Condor” installed - only you can submit jobs

28 Basic Installation Procedure
- 1) Decide what version and parts of Condor to install and download them
- 2) Install the “release directory” - all the Condor binaries and libraries
- 3) Set up the Central Manager
- 4) (optional) Set up Condor on any other machines you wish to add to the pool
- 5) Spawn the Condor daemons

29 The Different Versions of Condor
- We distribute two versions of Condor:
  - Stable Series
    - Heavily tested, recommended for use
    - 2nd number of version string is even (6.0.3)
  - Development Series
    - Latest features, not necessarily well-tested
    - 2nd number of version string is odd (6.1.8)
    - Not recommended unless you know what you are doing and/or need a new feature
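The even/odd numbering rule on this slide can be checked mechanically. The following is a minimal Python sketch; the function name `condor_series` is made up for illustration and is not part of Condor.

```python
def condor_series(version: str) -> str:
    """Classify a Condor version string by the parity of its second number."""
    fields = version.split(".")
    if len(fields) < 2:
        raise ValueError(f"expected a version like 6.0.3, got {version!r}")
    minor = int(fields[1])
    # Even second number: stable series. Odd second number: development series.
    return "stable" if minor % 2 == 0 else "development"


if __name__ == "__main__":
    for v in ("6.0.3", "6.1.8"):
        print(v, condor_series(v))
```

The two version strings are the ones quoted on the slide.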

30 Condor Versions (cont’d)
- All daemons advertise a CondorVersion attribute in the ClassAd they publish
- You can also view the version string by running ident on any Condor binary
- All parts of Condor on a single machine should run the same version!
- Machines in a pool can usually run different versions and communicate with each other
- It will be made very clear when a version is incompatible with older versions

31 Downloading Condor
- Go to http://www.cs.wisc.edu/condor/
- Fill out the form and download the different pieces you need
- Normally, you want the full stable release
- There are also “contrib” modules for non-standard parts of Condor, or individual pieces of the development release that you might need (e.g. SMP support)
- Distributed as compressed “tar” files
- Once you download, unpack them

32 Install the Release Directory
- In the directory where you unpacked the tar file, you’ll find a release.tar file with all the binaries and libraries
- condor_install will install this as the release directory for you
- In a pool with a shared release directory, you should run condor_install somewhere with write access to the shared directory
- You need a separate release directory for each platform!

33 Set up the Central Manager
- You must configure Condor specially on your central manager, so that it knows it needs to spawn the additional daemons
- Easiest way to do this is by using condor_install
- There’s a special option for setting up a central manager

34 Set up any other machines you wish to add to the pool
- If you have a shared file system, once you run condor_install on your file server (and again on your central manager if it’s a separate machine) you can just run condor_init on any other machine you wish to add to your pool
- Without a shared file system, you must run condor_install on each host

35 Spawn the Condor daemons
- Once Condor is configured and set up, you just have to spawn the condor_master on each host to “start” Condor
- You should start up Condor on the Central Manager first
- The user you spawn the condor_master as makes a big difference:
  - root vs. “condor” vs. another user

36 Introduction to Condor’s Configuration Files
- Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions
- Layout and purpose of the different files:
  - Global config file
  - Other shared files
  - Local config file
  - Root config file (optional)

37 Global Config File
- All shared settings across your entire pool
- Found in the file pointed to by the CONDOR_CONFIG environment variable, in /etc/condor/condor_config, or in the home directory of the “condor” user
- Most settings can be in this file
- Only works as a “global” file if it is on a shared file system

38 Other shared files
- You can configure a number of other shared config files:
  - files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)
  - platform-specific config files

39 Local config file
- Any machine-specific settings
  - local policy settings for a given owner
  - different daemons to run (for example, on the Central Manager!)
- Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname

40 Root config file (optional)
- You can specify a “root” config file, which is always processed after all other files
- This allows root to specify certain settings which cannot be changed by another user (like the path to the Condor daemons)
- Only useful if daemons are started as root but someone else has access to edit Condor’s config files
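The layering described on the last few slides (global file first, then the local file, then the optional root file, with later definitions winning) can be pictured with a small Python sketch. This models the idea only and is not how Condor reads its files. The dictionaries stand in for config files, and every host name, path and value in them is invented for illustration.

```python
def layer_config(*files):
    """Merge config 'files' (plain dicts) in order; later definitions win."""
    merged = {}
    for settings in files:
        for name, value in settings.items():
            # Macro names are case insensitive, so normalize them.
            merged[name.upper()] = value
    return merged


# Hypothetical contents, processed in the order global -> local -> root.
global_file = {"CONDOR_HOST": "cm.example.org",
               "DAEMON_LIST": "MASTER, STARTD, SCHEDD"}
local_file = {"daemon_list": "MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD"}
root_file = {"SBIN": "/usr/local/condor/sbin"}

config = layer_config(global_file, local_file, root_file)

if __name__ == "__main__":
    for name in sorted(config):
        print(name, "=", config[name])
```

Here the local file replaces the pool-wide daemon list, which is the central manager case mentioned on the local config file slide, while the untouched global setting survives.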

41 Basic syntax
- # is a comment
- A “\” at the end of a line is a line-continuation, so both lines are treated as one big entry
- All names are case insensitive
- “Macros” have the form:
  - Attribute_Name = value
- You reference other macros with:
  - A = $(B)
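The syntax rules on this slide are enough to write a toy reader for macro definitions. The Python sketch below is an illustration under simplifying assumptions, not Condor's parser: it only recognizes whole-line comments, it skips the “Name : value” expression form, it expands an undefined macro to an empty string (an assumption), and it does not support a macro that refers to itself. The names `parse_macros`, `expand` and the sample text are hypothetical.

```python
import re

MACRO_REF = re.compile(r"\$\(([^)]+)\)")


def parse_macros(text):
    """Return {NAME: raw value} for every 'Name = value' entry in text."""
    macros = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and line.startswith("#"):
            continue  # whole-line comment
        if line.endswith("\\"):
            # Line continuation: join with the next line into one entry.
            pending += line[:-1].rstrip() + " "
            continue
        entry, pending = pending + line, ""
        name, sep, value = entry.partition("=")
        if sep and name.strip() and ":" not in name:
            macros[name.strip().upper()] = value.strip()
    return macros


def expand(macros, name, _depth=0):
    """Replace every $(OTHER) reference in the named macro, recursively."""
    if _depth > 20:
        raise ValueError(f"macro {name!r} is nested too deeply")
    value = macros.get(name.strip().upper(), "")
    return MACRO_REF.sub(lambda m: expand(macros, m.group(1), _depth + 1), value)


SAMPLE = r"""
# Times are in seconds
MINUTE = 60
KeyboardLimit = (15 * $(minute))
Colors = red, \
    green
"""

if __name__ == "__main__":
    macros = parse_macros(SAMPLE)
    print(expand(macros, "KeyboardLimit"))
    print(expand(macros, "colors"))
```

Names are upper-cased on the way in and on lookup, which is one simple way to get the case-insensitive behavior the slide describes.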

42 Macros vs. Expressions
- “Expressions” have the form:
  - Expression_Name : value
- Expressions have special meaning to Condor (in particular, the policy expressions)
- You will never need to add new expressions
- You might want to add macros to advertise new properties about your machines to the Condor pool (for example, NetworkBandwidth)

43 Any questions?
- Nothing is too basic
- If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing

44 Hands-On Exercise #1 Installing a “Personal” Condor Pool

45 Hands-On Exercise #1
- Login to your machine as user “condor”
- You will see two windows:
  - Netscape, with instructions
  - An xterm, where you execute commands
- To begin, click on Personal Condor
- Please follow the directions carefully
- Any lines beginning with % are commands that you should execute in your xterm
- If you accidentally exit Netscape, click on “Tutorial” in the Start menu

46 Lunch break Please be back by 13:30

47 Welcome Back

48 Host/IP Security in Condor
- You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
  - “read” access - querying information
    - condor_status, condor_q, etc.
  - “write” access - updating information
    - condor_submit, adding a node to the pool, etc.
  - “administrator” access
    - condor_on, condor_off, condor_reconfig, condor_restart...
  - “owner” access
    - Things a machine owner can do (vacate)

49 Setting up Host/IP-address Security in Condor (part 1)
- To configure, you list what hosts are allowed or denied to perform each action
  - If you list hosts that are allowed, everything else is denied
  - If you list hosts that are denied, everything else is allowed
  - If you list both, only hosts that are listed in “allow” but not in “deny” are allowed

50 Setting up Host/IP-address Security in Condor (part 2)
- There are many possibilities for specifying which hosts are allowed or denied:
  - Host names, domain names
  - IP addresses, subnets
  - Wildcards
    - ‘*’ can be used anywhere (once) in a host name (for example, “infn-corsi*.corsi.infn.it”)
    - ‘*’ can be used at the end of any IP address (e.g. “128.105.101.*” or “128.105.*”)

51 Setting up Host/IP-address Security in Condor (part 3)
- Can define values that affect all daemons:
  - HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc.
- Can define daemon-specific settings:
  - HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc.
- Write access doesn’t automatically provide read access: you must grant both!

52 Example Host/IP Security Settings
    HOSTALLOW_WRITE = *.infn.it
    HOSTALLOW_ADMINISTRATOR = infn-corsi1*, \
        $(CONDOR_HOST), axpb07.bo.infn.it, \
        $(FULL_HOSTNAME)
    HOSTDENY_ADMINISTRATOR = infn-corsi15
    HOSTDENY_READ = *.gov, *.mil
    HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *
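The allow/deny rules from part 1 and the wildcard forms from part 2 can be combined into one decision function. The Python sketch below is a model of those rules only, not Condor's implementation. It relies on a few assumptions: `fnmatchcase` from the standard library stands in for Condor's wildcard matching (it accepts more pattern forms than the slides describe), a host with neither list configured is treated as allowed, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(host, patterns):
    """True if host matches at least one pattern such as '*.infn.it'."""
    host = host.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in patterns)


def is_permitted(host, allow=None, deny=None):
    """Apply the allow/deny rules for one access level to one host."""
    if deny and matches_any(host, deny):
        return False  # a listed deny always wins
    if allow:
        return matches_any(host, allow)  # allow list given: all else denied
    return True  # only a deny list (or nothing) given: all else allowed


if __name__ == "__main__":
    # Patterns taken from the example settings on the slide.
    print(is_permitted("axpb07.bo.infn.it", allow=["*.infn.it"]))
    print(is_permitted("host.example.gov", deny=["*.gov", "*.mil"]))
    print(is_permitted("infn-corsi15", allow=["infn-corsi1*"],
                       deny=["infn-corsi15"]))
```

The last call shows the “both lists” case: infn-corsi15 matches the allow pattern but is also named in the deny list, so it is refused.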

53 Hands-On Exercise #2 Configuring Host/IP Security

54 Hands-On Exercise #2
- Please point your browser to the new instructions:
  - Go back to the tutorial homepage
  - Click on Host Security
- Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
- If you exited Netscape, just click on “Tutorial” from your Start menu

55 Administering a Real Pool
- Having a shared release directory is key
- Viewing things with condor_status
- Viewing things with condor_q

56 Having a shared release directory is key
- Keep all of your config files in one place
  - Allows you to have a real global config file, with common values across the whole pool
  - Much easier to make changes (even for “local” config files in one shared directory)
- Keep all of your binaries in one place
  - Prevents having different versions accidentally left on different machines
  - Easier to upgrade

57 Viewing things with condor_status
- condor_status has lots of different options to display various kinds of info
- Supports “-constraint” so you can view only the ClassAds that match an expression you specify
- Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts)
- View any kind of daemon ClassAd

58 Viewing things with condor_q
- View the job queue
- The “-long” option is useful to see the entire ClassAd for a given job
- Also supports the “-constraint” option
- Can view job queues on remote machines with the “-name” option

59 Hands-On Exercise #3 Setting up a Pool with a Shared Release Directory

60 Hands-On Exercise #3
- Please point your browser to the new instructions:
  - Go back to the tutorial homepage
  - Click on Shared Release Directory
- Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
- If you exited Netscape, just click on “Tutorial” from your Start menu

61 Advanced Installation Options
- Spawning the Condor daemons automatically at reboot
- “Full installation” of condor_compile
- Advertising your own attributes in the machine ClassAd
- Setting up Host/IP security in Condor (which we already talked about)
- Customizing the startd policy

62 Spawning the Condor Daemons automatically at reboot
- If you are running Condor as root, you probably want to have your boot scripts start the condor_master automatically
- Provides more robust service, less manual work for the administrators
- We provide a “SysV-style” init script: /etc/examples/condor.boot
- Exact details depend on your operating system platform

63 Why Perform a “Full Installation” of condor_compile?
- condor_compile is used to re-link user jobs with the Condor libraries so they become “standard” jobs
- By default, condor_compile only works with certain commands (gcc, g++, g77, cc, CC, f77, f90, ld)
- With a “full installation”, condor_compile will work with any command (in particular, “make”)

64 How to Perform a Full Installation of condor_compile:
- Move your real ld binary, the “linker”, to “ld.real”
  - The path to “ld” varies from platform to platform… though it’s usually “/bin/ld”
- Install Condor’s “ld” script in its place
- If condor_compile is used, our ld will do the Condor-specific magic
- If not, our ld will just call the real ld and everything will work like normal

65 Advertising Your Own Attributes in the Machine ClassAd
- Add new macro(s) to the config file
  - This is usually done in the local config file
  - Can name the macros anything, so long as the names don’t conflict with existing ones
- Tell the condor_startd to include these other macros in the ClassAd it sends out
  - Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)
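The two steps on this slide (define a macro, then name it in STARTD_EXPRS) amount to a simple selection: only macros listed in STARTD_EXPRS end up in the machine ClassAd. The Python sketch below illustrates that selection under stated assumptions. The dictionary stands in for an already-parsed config, `HasTapeDrive` and all values are invented, and the function name `advertised` is hypothetical; NetworkBandwidth is the example attribute from the earlier slide on macros.

```python
def advertised(macros):
    """Pick out the macros that STARTD_EXPRS asks the startd to publish."""
    listed = macros.get("STARTD_EXPRS", "").split(",")
    names = [name.strip() for name in listed if name.strip()]
    # Config names are case insensitive, so look them up in upper case.
    return {name: macros[name.upper()]
            for name in names if name.upper() in macros}


# Hypothetical parsed config: keys are already upper-cased.
macros = {
    "NETWORKBANDWIDTH": "100",
    "HASTAPEDRIVE": "True",
    "SCRATCH_NOTE": "not advertised",
    "STARTD_EXPRS": "NetworkBandwidth, HasTapeDrive",
}

if __name__ == "__main__":
    print(advertised(macros))
```

A macro that is defined but not listed, like `SCRATCH_NOTE` here, stays local to the config and is never published.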

66 Hands-On Exercise #4 Defining Your Own Attributes in the Startd Classad

67 Hands-On Exercise #4
- Please point your browser to the new instructions:
  - Go back to the tutorial homepage
  - Click on Local Startd Attributes
- Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
- If you exited Netscape, just click on “Tutorial” from your Start menu

68 10 Minute Break Questions are welcome….

69 Configuring the Startd Policy
- Gives administrators or machine owners the power to control when and if Condor starts and stops jobs on a machine
- Lots of flexibility: can base any policy expression on any attributes in the startd’s ClassAd, or the ClassAd of the currently running job
- Many mechanisms available: suspending, checkpointing, hard kill, etc.

70 Basic Progression a Job Can Pass Through When an Owner Returns
- No owner: job is running
- The owner returns: job is suspended
  - If the owner leaves again shortly, the job is resumed
- The owner is still there: job is vacated
  - soft-kill... do a checkpoint if possible
- The vacate is taking too long: job is hard-killed (kill -9)

71 Introduction to the Policy Expressions
- The policy expressions control the transitions between various “states” and “activities” a machine can be in
- All expressions use boolean logic
- It is common to define macros for complicated terms in your expressions to make them easier to read
- Often, you only need to edit these macros to customize your policy

72 Machine States
[Figure: state diagram with the states Owner, Unclaimed, Matched, Claimed and Preempting; an arrow marked “begin” shows the starting state.]

73 Machine Activities
[Figure: the same state diagram (Owner, Unclaimed, Matched, Claimed, Preempting) annotated with the activities Idle, Benchmarking, Busy, Suspended, Vacating and Killing.]

74 The Policy Expressions
- START
- RANK
- WANT_SUSPEND
- SUSPEND
- CONTINUE
- PREEMPT
- WANT_VACATE
- KILL

75 Road Map of the Policy Expressions
[Figure: flowchart linking the expressions START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE and KILL to the activities Vacating and Killing, with branches labeled True and False. Legend: expression; activity.]

76 The START expression
- The most important policy expression
- This is the “requirements” expression for machines
- Controls when Condor will start jobs
- Can reference attributes of the job (such as its size or the user who submitted it)
- A machine will only leave the Owner state if START evaluates to True (or Undefined)

77 Example Start Expressions
    KeyboardIsIdle = (KeyboardIdle > (15 * $(MINUTE)))
    CPUIsIdle = (LoadAvg - CondorLoadAvg < 0.3)
    START : $(KeyboardIsIdle) && $(CPUIsIdle)
or
    START : Owner == "wright" || Owner == "condor" || \
        ($(KeyboardIsIdle) && $(CPUIsIdle))
or
    START : True
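To see how the first two START expressions behave, they can be transcribed into Python and fed sample machine values. This is a sketch only: Condor evaluates these as ClassAd expressions, not Python. It assumes MINUTE is 60 and KeyboardIdle is in seconds, and the function names and sample numbers are invented.

```python
MINUTE = 60  # assumption: MINUTE is defined as 60 seconds


def keyboard_is_idle(keyboard_idle):
    # KeyboardIsIdle = (KeyboardIdle > (15 * $(MINUTE)))
    return keyboard_idle > 15 * MINUTE


def cpu_is_idle(load_avg, condor_load_avg):
    # CPUIsIdle = (LoadAvg - CondorLoadAvg < 0.3)
    return load_avg - condor_load_avg < 0.3


def start_when_idle(keyboard_idle, load_avg, condor_load_avg):
    # START : $(KeyboardIsIdle) && $(CPUIsIdle)
    return keyboard_is_idle(keyboard_idle) and cpu_is_idle(load_avg, condor_load_avg)


def start_with_owner_override(owner, keyboard_idle, load_avg, condor_load_avg):
    # START : Owner == "wright" || Owner == "condor" || (idle terms)
    return (owner in ("wright", "condor")
            or start_when_idle(keyboard_idle, load_avg, condor_load_avg))


if __name__ == "__main__":
    print(start_when_idle(1000, 0.1, 0.0))                      # idle machine
    print(start_when_idle(100, 0.1, 0.0))                       # keyboard recently used
    print(start_with_owner_override("wright", 100, 2.0, 0.0))   # owner's own job
```

The second form lets jobs from “wright” or “condor” start even on a busy machine, while everyone else still needs an idle keyboard and CPU.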

78 The RANK Expressions
- Both machines and jobs can “rank” what they’re looking for
- If a machine is claimed, it still advertises that it’s available
  - Always looking for a higher-ranked job
  - Will preempt the current job if a better one is available
- Jobs can rank machines - in a large pool, users can prefer certain hosts

79 Using Machine RANK Expressions
- The expression is a floating point number
  - Use “+” instead of “&&”
  - (X == Y) evaluates to 0 or 1
  - Allows unlimited flexibility
- Often used in large pools made up of individual groups, so that machines owned by one group will always run jobs submitted by the users in that group

80 Example Rank Expression
    MachineOwner = (Owner == "wright")
    Friend = (Owner == "tannenba" || \
        Owner == "ballard")
    ResearchGroup = (Owner == "jbasney" || \
        Owner == "raman")
    Rank : Friend + ResearchGroup*10 + \
        MachineOwner*20
    Startd_Exprs = $(Startd_Exprs), Friend, \
        MachineOwner, ResearchGroup

81 Example Rank Expression Explained
- First, we define different groups of people that we’re interested in (Friend, ResearchGroup and MachineOwner)
- Then, we define the Rank (it’s an expression, so we need to use “:”) to give different weights to each group
- Finally, we add these new attributes to the list of attributes we publish so that Rank can be evaluated remotely
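The arithmetic behind the example Rank expression is easy to work through by hand, and a short Python transcription makes the resulting ordering explicit. This is an illustration, not Condor code; the function name `machine_rank` is hypothetical. Python booleans behave as 0 or 1 in arithmetic, which mirrors the “(X == Y) evaluates to 0 or 1” rule.

```python
def machine_rank(owner):
    """Rank a job by its Owner, following the example expression."""
    machine_owner = owner == "wright"
    friend = owner in ("tannenba", "ballard")
    research_group = owner in ("jbasney", "raman")
    # Rank : Friend + ResearchGroup*10 + MachineOwner*20
    return friend + research_group * 10 + machine_owner * 20


if __name__ == "__main__":
    for owner in ("wright", "raman", "ballard", "stranger"):
        print(owner, machine_rank(owner))
```

With these weights the machine owner's jobs (20) beat the research group's (10), which beat a friend's (1), and any other user ranks 0.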

82 WANT_SUSPEND vs. SUSPEND
- WANT_SUSPEND determines if the startd should even consider entering the Suspended activity:
  - If WANT_SUSPEND is True, while a job is running, SUSPEND is checked, and if it evaluates to True, the job is suspended
  - If WANT_SUSPEND is False, SUSPEND is never evaluated, and while the job is running, PREEMPT is checked

83 CONTINUE
- Only evaluated while in the Suspended activity (WANT_SUSPEND must therefore be True)
- If CONTINUE evaluates to True, the job is resumed and the machine goes back to the Busy activity

84 PREEMPT
- Specifies when a machine enters the Preempting state
- Must handle two cases (and usually has two separate terms in the expression):
  - WANT_SUSPEND is True, and the job has been suspended longer than the owner wants
  - WANT_SUSPEND is False, and the owner is using the machine again

85 WANT_VACATE vs. KILL
- WANT_VACATE is only evaluated when PREEMPT is True and the machine is entering the Preempting state
- Determines if a vacate (checkpoint) is wanted, or if the job should be immediately hard killed
- KILL is only evaluated if the job is checkpointing (WANT_VACATE was True)
- If True, the job is hard-killed
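The rules on the last four slides can be collected into one small decision function. The Python sketch below models only what those slides state and is not the startd's real logic. It takes the policy expressions as already-evaluated booleans, and it assumes that CONTINUE is checked before PREEMPT while a job is suspended, which the slides do not specify. The function names are hypothetical.

```python
def enter_preempting(policy):
    """On entering Preempting: checkpoint first, or hard-kill right away."""
    return "Vacating" if policy["WANT_VACATE"] else "Killing"


def next_activity(activity, policy):
    """Return the activity that follows, given already-evaluated expressions."""
    if activity == "Busy":
        if policy["WANT_SUSPEND"]:
            # SUSPEND is only consulted when WANT_SUSPEND is True.
            return "Suspended" if policy["SUSPEND"] else "Busy"
        return enter_preempting(policy) if policy["PREEMPT"] else "Busy"
    if activity == "Suspended":
        if policy["CONTINUE"]:
            return "Busy"  # the job is resumed
        return enter_preempting(policy) if policy["PREEMPT"] else "Suspended"
    if activity == "Vacating":
        # KILL is only evaluated while the job is checkpointing.
        return "Killing" if policy["KILL"] else "Vacating"
    return activity


if __name__ == "__main__":
    owner_back = dict(WANT_SUSPEND=True, SUSPEND=True, CONTINUE=False,
                      PREEMPT=True, WANT_VACATE=True, KILL=True)
    activity = "Busy"
    for _ in range(3):
        activity = next_activity(activity, owner_back)
        print(activity)
```

With every expression forcing the job out, the loop walks through the progression described earlier in the talk: suspend, then vacate, then hard-kill.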

86 Every Possible Transition
[Figure: diagram of every possible transition, labeled with the policy expressions START, WANT_SUSPEND, SUSPEND, CONTINUE, PREEMPT, WANT_VACATE and KILL.]

87 Final Notes on Startd Policy
- Please read the Administrator’s Manual to Condor for a complete explanation of the previous diagram
  - See the chapter on “Configuring the Startd Policy”
- This is all pretty confusing and complex:
  - If you have questions, please send them to condor-admin@cs.wisc.edu
  - We can try to translate an English explanation of the policy you want into expressions for Condor

88 Hands-On Exercise #5 Customizing the Startd Policy Expressions

89 Hands-On Exercise #5
- Please point your browser to the new instructions:
  - Go back to the tutorial homepage
  - Click on Startd Policy
- Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
- If you exited Netscape, just click on “Tutorial” from your Start menu

90 When something goes wrong...
- Looking at “condor_q -analyze”
- Looking at the “UserLog”
- Looking at the “ShadowLog”
- Looking at the other daemons’ log files
- Condor is a large, distributed system, so analyzing problems can be very difficult:
  - We’ll give you the basics of where to begin
  - If you can’t figure it out, send us email and we’ll be able to help you

91 Looking at condor_q -analyze
- You specify a job or set of jobs you want to analyze
- condor_q will try to figure out why the job isn’t running
- The output is not as user-friendly as we’d like (though we’re working on it)
- Good at finding errors in Requirements expressions set by users

92 Looking at the UserLog
- When users submit a job, they can specify a “UserLog” in their submit file
- This will contain a record of if and where the job ran, if it checkpointed, if it was kicked off without a checkpoint, etc.
- Very useful in figuring out where a job was running when it was having problems, and to monitor the progress of the job
- Required by DAGMan and others

93 Looking at the ShadowLog
- Of the log files generated by the Condor daemons, the ShadowLog usually has the most useful information when debugging a problem with a job
- You often want to increase the “Debug Level” of the Shadow and increase the maximum size of the file to get more info:
    SHADOW_DEBUG = D_SYSCALLS D_FULLDEBUG
    MAX_SHADOW_LOG = 1000000

94 Analyzing the ShadowLog
- Incorrect file permissions or files that were removed are the most common errors
- Often useful to grep for a certain job ID
    grep "25\.3" ShadowLog | less
- At the end of the log, you might find an entry that looks something like “ERROR:”
  - While not always the most clear, these entries usually give a very good indication of the problem
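The same job-ID search can be done from a script. The Python sketch below is one possible approach, with two caveats: the sample lines are invented and do not reproduce the real ShadowLog layout, and the function name `lines_for_job` is hypothetical. Unlike the plain grep pattern, the regular expression here refuses a match when the ID is surrounded by further digits, so job 25.3 does not also pick up 125.3 or 25.30.

```python
import re


def lines_for_job(lines, cluster, proc):
    """Return the log lines that mention job ID '<cluster>.<proc>'."""
    # (?<!\d) and (?!\d) stop 25.3 from matching inside 125.3 or 25.30.
    pattern = re.compile(rf"(?<!\d){cluster}\.{proc}(?!\d)")
    return [line for line in lines if pattern.search(line)]


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "(25.3) shadow started",
    "(125.3) shadow started",
    "(25.30) shadow started",
    "(25.3) ERROR: cannot open input file",
]

if __name__ == "__main__":
    for line in lines_for_job(SAMPLE_LOG, 25, 3):
        print(line)
```

To run it against a real log, pass the lines of the opened ShadowLog file in place of `SAMPLE_LOG`.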

95 Looking at the Other Daemons’ Log Files
- If there is no ShadowLog, or no ShadowLog entries for the job with problems, you might have a problem even finding a match for the job
- Look in the SchedLog to see if there are errors communicating with the Negotiator
- Check for host permission problems
- Look at the NegotiatorLog on the CM: is it even negotiating jobs at all?

96 Analyzing the Logs
- All daemons can display more debugging: D_FULLDEBUG and D_COMMAND
- With D_SECONDS, you can also get timestamps in the logs that include seconds, which can help pinpoint a problem
- Logs will often rotate quickly with heavy debugging output, so increase MAX_*_LOG as much as your disk space allows
- Unfortunately, Condor’s logs are still primarily useful only to the developers
- We’re working on changing that

97 Hands-On Exercise #6 Examining the Logs

98 Hands-On Exercise #6
- Please point your browser to the new instructions:
  - Go back to the tutorial homepage
  - Click on Examining the Logs
- Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
- If you exited Netscape, just click on “Tutorial” from your Start menu

99 Obtaining Condor
- Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor
- Complete Users and Administrators manual available at: http://www.cs.wisc.edu/condor/manual
- Contracted Support is available
- Questions? Email: condor-admin@cs.wisc.edu

