Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor Administration European Condor Week 2006 INFN Milan Milan, Italy - June 27, 2006
2 Outline › Condor Daemons Job Startup › Configuration Files › Policy Expressions Startd (Machine) Negotiator › Priorities › Useful Tools › Log Files › Debugging Jobs › Security
3 Installation
4 Considerations for Installing a Condor Pool › What machine should be your central manager? › Does your pool have a shared file system? › Where to install Condor binaries and configuration files? › Where should you put each machine’s local directories? › Start the daemons as root or as some other user?
5 What machine should be your central manager? › The central manager is very important for the proper functioning of your pool › If the central manager crashes, jobs that are currently matched will continue to run, but new jobs will not be matched
6 Central Manager › Want assurances of high uptime or prompt reboots Or designate “hot-spares” › A good network connection helps › Speed helps
7 Does your pool have a shared file system? › It is easier to run vanilla universe jobs if so, but one is not required › Shared location for configuration files can ease administration of a pool › AFS can work, but Condor does not yet manage AFS tokens
8 Where to install binaries and configuration files? › Shared location for configuration files can ease administration of a pool › Binaries on a shared file system makes upgrading easier, but can be less stable if there are network problems › condor_master on the local disk is a good compromise
9 Where should you put each machine’s local directories? › You need a fair amount of disk space in the spool directory for Log files Submit machine: the job queue and any spooled data files for each job submitted Execute machine: the binary for any Condor job running on a machine, plus any staged data files
10 Hostnames › Any two machines that will be communicating must know each other's names Proper /etc/hosts or DNS, including inverse (reverse) lookups
11 Start the daemons as root or some other user? › If possible, we recommend starting the daemons as root Condor better able to enforce policy More secure (!?!?!) Less confusion for users Condor daemons will run as the user “condor” whenever possible, and only switch to root when needed
12 Running Daemons as Non-Root › More secure › Condor will still work, users just have to take some extra steps to submit jobs › Can have “personal Condor” installed - only you can submit jobs
13 Basic Installation Procedure › 1. Decide what version and parts of Condor to install and download them › 2. Install the “release directory” - all the Condor binaries and libraries › 3. Setup the Central Manager › 4. (optional) Setup Condor on any other machines you wish to add to the pool › 5. Spawn the Condor daemons
14 Condor Version Series › We distribute two versions of Condor Stable Series Development Series
15 Stable Series › Heavily tested › Recommended for general use › 2nd number of version string is even (6.6.3)
16 Development Series › Latest features, not necessarily well-tested › Not recommended unless you’re willing to work with beta code or need new features › 2nd number of version string is odd (6.7.1)
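The even/odd rule from the two slides above can be checked mechanically. The following is a minimal sketch, not part of Condor; the function name `condor_series` is hypothetical, and it assumes the version string has the dotted form used in the slides (6.6.3, 6.7.1).

```python
def condor_series(version: str) -> str:
    """Classify a dotted version string by its second number.

    Even second number -> stable series, odd -> development series.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least MAJOR.MINOR, got {version!r}")
    minor = int(parts[1])
    return "stable" if minor % 2 == 0 else "development"


if __name__ == "__main__":
    for v in ("6.6.3", "6.7.1"):
        print(v, "->", condor_series(v))
```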
17 Condor Versions › What am I running? › All daemons advertise a CondorVersion attribute in the ClassAd they publish › You can also view the version string by: Running “condor_version” Using the “-v” option Running ident on any Condor binary
18 Condor Versions › All parts of Condor on a single machine should run the same version! › Machines in a pool can usually run different versions and communicate with each other › Documentation will specify when a version is incompatible with older versions
19 Downloading Condor › Go to › Fill out the form and download the different pieces you need Normally, you want the full stable release › There are also “contrib” modules for non-standard parts of Condor For example, the View Server
20 Downloading Condor › Unix Distributed as compressed “tar” files Once you download, unpack them What about distribution XXX of Linux? See › Windows MSI installer or ZIP file
21 Install the Release Directory › In the directory where you unpacked the tar file, you’ll find a release.tar file with all the binaries and libraries › Use condor_install or condor_configure › condor_install will install this as the release directory for you
22 condor_install › Our old installation script › Interactive › Overly complex
23 condor_configure › New script › Handles installation and reconfiguration condor_configure --install --install-dir=/nfs/opt/condor --local-dir=/var/condor --owner=condor
24 Setup the Central Manager › Central manager needs specific configuration to start the condor_collector and condor_negotiator condor_configure --type=manager
25 Setup Additional Machines › If you have a shared file system, just run condor_init on any other machine you wish to add to your pool › OR, run condor_configure --central-manager=
26 Spawn the Condor daemons › Run condor_master to start Condor Remember to start as root if desired › Start Condor on the central manager first › Add Condor to your boot scripts? We provide a “SysV-style” init script ( /etc/examples/condor.boot )
27 Condor-G Special Notes › Condor-G should work out of the box › Globus can push several limits, consider increasing: /proc/sys/fs/file-max /proc/sys/net/ipv4/ip_local_port_range Per process file descriptor limits
28 Other Unix Installation Options › Use RPM package (from Condor web site) › VDT – Virtual Data Toolkit PacMan installer Includes other Grid software
29 Windows Installation › v6.6.x : InstallShield installation program Interactive GUI or via command-line with “saved answers file” › v6.7.x : MSI installation program Interactive GUI or MSI unattended › ZIP file
30 Other Sources › Condor Manual › Condor Web Site › condor-users mailing list
31 Condor Daemons Title unknown, by Hans Holbein the Younger, from Historiarum Veteris Testamenti icones, 1543
32 What Condor Daemons are running on my machine, and what do they do?
33 Condor Daemon Layout Personal Condor / Central Manager [Diagram: master, collector, negotiator, schedd, startd; arrows = Process Spawned]
34 condor_master › Starts up all other Condor daemons › If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator › Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version [Diagram: master]
35 condor_master (cont’d) › Acts as the server for many Condor remote administration commands: condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
36 condor_startd › Represents a machine to the Condor system › Responsible for starting, suspending, and stopping jobs › Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon) [Diagram: master, startd]
37 condor_schedd › Represents users to the Condor system › Maintains the persistent queue of jobs › Responsible for contacting available machines and sending them jobs › Services user commands which manipulate the job queue: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, … [Diagram: master, schedd, startd]
38 condor_collector › Collects information from all other Condor daemons in the pool “Directory Service” / Database for a Condor pool › Each daemon sends a periodic update called a “ClassAd” to the collector › Services queries for information: Queries from other Condor daemons Queries from users (condor_status) [Diagram: master, collector, schedd, startd]
39 condor_negotiator › Performs “matchmaking” in Condor › Gets information from the collector about all available machines and all idle jobs › Tries to match jobs with machines that will serve them › Both the job and the machine must satisfy each other’s requirements [Diagram: master, collector, negotiator, schedd, startd]
40 Central Manager › The Central Manager is the machine running the collector and negotiator DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR › Defines a Condor pool. CONDOR_HOST = centralmanager.example.com
41 Typical Condor Pool Central Manager master collector negotiator schedd startd = ClassAd Communication Pathway = Process Spawned Submit-Only master schedd Execute-Only master startd Regular Node schedd startd master Regular Node schedd startd master Execute-Only master startd
42 Job Startup “LUNAR Launch” by “jurvetson” ©
43 Job Startup [Diagram: Submit Machine: Submit, Schedd, Shadow. Execute Machine: Startd, Starter, Job, Condor Syscall Lib. Central Manager: Collector, Negotiator]
44 condor_starter › Spawned by the condor_startd to handle all the details of starting and managing the job Transfer job’s binary to execute machine Send back exit status Etc.
45 condor_starter › On multi-processor machines, you get one condor_starter per CPU Actually one per running job Can configure to run more (or less) jobs than CPUs › For PVM jobs, the starter also spawns a PVM daemon ( condor_pvmd )
46 condor_shadow › Represents job on the submit machine › Services requests from standard universe jobs for remote system calls including all file I/O › Makes decisions on behalf of the job for example: where to store the checkpoint file
47 condor_shadow Impact › One condor_shadow running on submit machine for each actively running Condor job › Minimal load on submit machine Usually blocked waiting for requests from the job or doing I/O Relatively small memory footprint
48 Limiting condor_shadow › Still, you can limit the impact of the shadows on a given submit machine: They can be started by Condor with a “nice-level” that you configure ( SHADOW_RENICE_INCREMENT ) Can limit total number of shadows running on a machine ( MAX_JOBS_RUNNING )
49 Configuration Files “amp wiring” by “fabz” ©
50 Configuration Files › Multiple files concatenated Definitions in later files overwrite previous definitions › Order of files: Global config file Local config files Root config files
51 Global Config File › Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or ~condor/condor_config › Most settings can be in this file › Typically kept on a shared file system
52 Other Shared Files › LOCAL_CONFIG_FILE macro Comma separated, processed in order › You can configure a number of other shared config files: Organize common settings (for example, all policy expressions) platform-specific config files
53 Local Config File › Specify in LOCAL_CONFIG_FILE macro Tip: can use $(HOSTNAME) › Machine-specific settings local policy settings for a given owner different daemons to run (for example, on the Central Manager!)
54 Local Config File, cont. › Can be on local disk of each machine: /var/adm/condor/condor_config.local › Can be in a shared directory: /shared/condor/condor_config.$(HOSTNAME) /shared/condor/hosts/$(HOSTNAME)/condor_config.local
55 Root Config File › Always processed last › Allows root to specify settings which cannot be changed by other users For example, the path to Condor daemons › Useful if daemons are started as root but someone else has write access to config files
56 Root Config File › /etc/condor/condor_config.root or ~condor/condor_config.root › Then loads any files specified in ROOT_CONFIG_FILE_LOCAL
57 Configuration File Syntax › # at start of line is a comment # is not allowed in names; it confuses Condor › \ at the end of line is a line-continuation Both lines are treated as one big entry Works in comments! # This comment eats the next line \ EXAMPLE_SETTING=TRUE
58 Configuration File Macros › Macros have the form: Attribute_Name = value Names are case insensitive Values are case sensitive › You reference other macros with: A = $(B) › Can create additional macros for organizational purposes
59 Configuration File Macros › Can append to macros: A=abc A=$(A),def › Don’t let macros recursively define each other! A=$(B) B=$(A)
60 Configuration File Macros › Later macros in a file overwrite earlier ones B will evaluate to 2: A=1 B=$(A) A=2
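The macro rules on the last three slides (case-insensitive names, appending with A=$(A),def, later definitions winning, and the danger of mutual recursion) can be modelled in a few lines. This is only an illustrative sketch, not Condor's parser: `parse_config` and `expand` are hypothetical names, line continuation is not handled, and treating an undefined macro as an empty string is an assumption.

```python
import re

MACRO_REF = re.compile(r"\$\(([^)]+)\)")


def parse_config(text):
    """Read NAME = value lines; a later definition replaces an earlier one."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip().lower()  # macro names are case insensitive
        previous = table.get(name, "")

        def keep_or_append(match, name=name, previous=previous):
            # A = $(A),def appends: substitute the value defined so far.
            if match.group(1).strip().lower() == name:
                return previous
            return match.group(0)  # other references are resolved later

        table[name] = MACRO_REF.sub(keep_or_append, value.strip())
    return table


def expand(table, name, _seen=()):
    """Resolve $(NAME) references using the final table, detecting cycles."""
    key = name.strip().lower()
    if key in _seen:
        chain = " -> ".join(_seen + (key,))
        raise ValueError("macros recursively define each other: " + chain)
    value = table.get(key, "")  # assumption: undefined macros expand to ""
    return MACRO_REF.sub(
        lambda m: expand(table, m.group(1), _seen + (key,)), value
    )


if __name__ == "__main__":
    table = parse_config("A=1\nB=$(A)\nA=2")
    print("B =", expand(table, "B"))  # references see the last definition of A
```

Because references are resolved only after the whole file has been read, B picks up the last value assigned to A, which is the behaviour the slide describes.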
61 ClassAds › Set of key-value pairs › Can be matched against each other Requirements and Rank MY.name – Looks for “name” in local ClassAd TARGET.name – Looks for “name” in the other ClassAd Name – Looks for “name” in the local ClassAd, then the other ClassAd
62 ClassAd Expressions › Some configuration file macros specify expressions for the Machine’s ClassAd Notably START, RANK, SUSPEND, CONTINUE, PREEMPT, KILL › Can contain a mixture of macros and ClassAd references › Notable: UNDEFINED, ERROR
63 ClassAd Expressions › +, -, *, /, <, <=, >, >=, ==, !=, &&, and || all work as expected › TRUE==1 and FALSE==0 (guaranteed)
64 ClassAd Expressions: UNDEFINED and ERROR › Special values › Passed through most operators Anything == UNDEFINED is UNDEFINED › && and || eliminate if possible. UNDEFINED && FALSE is FALSE UNDEFINED && TRUE is UNDEFINED
65 ClassAd Expressions: =?= and =!= =?= and =!= are similar to == and != =?= tests if operands have the same type and the same value. 10 == UNDEFINED -> UNDEFINED UNDEFINED == UNDEFINED -> UNDEFINED 10 =?= UNDEFINED -> FALSE UNDEFINED =?= UNDEFINED -> TRUE =!= inverts =?=
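The UNDEFINED rules on the two slides above amount to a small three-valued logic. The sketch below models them in Python for illustration only; it is not Condor's evaluator, the helper names (`eq`, `is_identical`, `and_`, `or_`) are hypothetical, the boolean helpers expect only True, False, or UNDEFINED, and ERROR is not modelled.

```python
class _Undefined:
    """Singleton standing in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def eq(a, b):
    """== : the result is UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def is_identical(a, b):
    """=?= : same type and same value; always TRUE or FALSE."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def is_not_identical(a, b):
    """=!= : inverts =?=."""
    return not is_identical(a, b)


def and_(a, b):
    """&& : FALSE wins over UNDEFINED; otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def or_(a, b):
    """|| : TRUE wins over UNDEFINED; otherwise UNDEFINED propagates."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


if __name__ == "__main__":
    print(eq(10, UNDEFINED), is_identical(10, UNDEFINED))
    print(and_(UNDEFINED, False), and_(UNDEFINED, True))
```

Guarding a comparison with =!= UNDEFINED before && turns an UNDEFINED result into a definite FALSE, which is why policy expressions often test for UNDEFINED first.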
66 ClassAd Functions in Condor! › Conditionals IfThenElse(condition,then,else) › String functions Strcat(), strcmp(), toUpper(), etc. › StringList functions Example of a “string list” (CSV style) Mylist = “Joe, Jon, Jeff, Jim, Jake” StrListContains(), StrListAppend(), StrListRemove(), etc. › Others Regular expressions, arithmetic, etc…
67 ClassAd Expressions › Further information: Section 4.1, “Condor's ClassAd Mechanism,” in the Condor Manual.
68 Policy › Allow machine owners to specify job priorities, restrict access, and implement other local policies “Don't even think about it” by “tyger_lyllie” ©
69 Policy Expressions › Specified in condor_config › Policy evaluates both a machine ClassAd and a job ClassAd together Policy can reference items in either ClassAd (See manual for list), ~140 attributes. › Can reference condor_config macros: $(MACRONAME)
70 Machine (Startd) Policy Expression Summary › START – When is this machine willing to start a job Typically used to restrict access when the machine is being used directly › RANK - Job preferences
71 Machine (Startd) Policy Expression Summary › SUSPEND - When to suspend a job › CONTINUE - When to continue a suspended job › PREEMPT – When to nicely stop running a job › KILL - When to immediately kill a preempting job
72 START › START is the primary policy › When FALSE the machine enters the Owner state and will not run jobs › Acts as the Requirements expression for the machine, the job must satisfy START Can reference job ClassAd values including Owner and ImageSize
73 RANK › Indicates which jobs a machine prefers Jobs can also specify a rank › Floating point number Larger numbers are higher ranked Typically evaluate attributes in the Job ClassAd Typically use + instead of &&
74 RANK › Often used to give priority to owner of a particular group of machines › Claimed machines still advertise looking for higher ranked job to preempt the current job
75 SUSPEND and CONTINUE › When SUSPEND becomes true, the job is suspended › When CONTINUE becomes true a suspended job is released
76 PREEMPT and KILL › When PREEMPT becomes true, the job will be politely shut down Vanilla universe jobs get SIGTERM Or user requested signal Standard universe jobs checkpoint › When KILL becomes true, the job is sent SIGKILL Checkpointing is aborted if started
77 WANT_SUSPEND and WANT_VACATE › Typically leave both to TRUE › WANT_SUSPEND - If false, skip SUSPEND test, jump to PREEMPT › WANT_VACATE If true, gives job time to vacate cleanly (until KILL becomes true) If false, job is immediately killed ( KILL is ignored)
78 Road Map of the Policy Expressions [Flowchart: START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, Vacating, KILL, Killing; edges labeled True / False; legend: Expression, Activity]
79 Minimal Settings › Always runs jobs START = True RANK = SUSPEND = False CONTINUE = True PREEMPT = False KILL = False “Lonely at the top” by “gumuz” ©
80 (Boss Fat Cat) › I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes Policy Configuration
81 New Settings for the Chemistry nodes › Prefer Chemistry jobs START = True RANK = Department == "Chemistry" SUSPEND = False CONTINUE = True PREEMPT = False KILL = False
82 Submit file with Custom Attribute › Prefix an entry with “+” to add to job ClassAd Executable = charm-run Universe = standard +Department = "Chemistry" queue
83 What if “Department” not specified? START = True RANK = Department =!= UNDEFINED && Department == "Chemistry" SUSPEND = False CONTINUE = True PREEMPT = False KILL = False
84 More Complex RANK › Give the machine’s owners (adesmet and roy) highest priority, followed by the Chemistry department, followed by the Physics department, followed by everyone else.
85 More Complex RANK IsOwner = (Owner == "adesmet" || Owner == "roy") IsChem =(Department =!= UNDEFINED && Department == "Chemistry") IsPhys =(Department =!= UNDEFINED && Department == "Physics") RANK = $(IsOwner)*20 + $(IsChem)*10 + $(IsPhys)
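To see how the weights in this RANK expression order jobs, the sketch below re-implements the arithmetic in Python. It is an illustration only: the function name `rank` and the job dictionaries are hypothetical, and a missing Department key stands in for an UNDEFINED attribute.

```python
def rank(job):
    """Mirror RANK = IsOwner*20 + IsChem*10 + IsPhys for one job ad."""
    owner = job.get("Owner")
    dept = job.get("Department")  # None models an UNDEFINED attribute
    is_owner = owner in ("adesmet", "roy")
    is_chem = dept is not None and dept == "Chemistry"
    is_phys = dept is not None and dept == "Physics"
    # TRUE counts as 1 and FALSE as 0, as in ClassAd expressions.
    return int(is_owner) * 20 + int(is_chem) * 10 + int(is_phys)


if __name__ == "__main__":
    jobs = [
        {"Owner": "smith"},
        {"Owner": "jones", "Department": "Physics"},
        {"Owner": "lee", "Department": "Chemistry"},
        {"Owner": "roy", "Department": "Physics"},
    ]
    for job in sorted(jobs, key=rank, reverse=True):
        print(rank(job), job)
```

Using addition rather than && lets several partial preferences combine into one number, so an owner always outranks a non-owner regardless of department.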
86 › Cluster is okay, but... Condor can only use the desktops when they would otherwise be idle Policy Configuration (Boss Fat Cat)
87 Defining Idle › One possible definition: No keyboard or mouse activity for 5 minutes Load average below 0.3
88 Desktops should › START jobs when the machine becomes idle › SUSPEND jobs as soon as activity is detected › PREEMPT jobs if the activity continues for 5 minutes or more › KILL jobs if they take more than 5 minutes to preempt
89 Macros in the Config File NonCondorLoadAvg = (LoadAvg - CondorLoadAvg) BgndLoad = 0.3 CPU_Busy = ($(NonCondorLoadAvg) >= $(BgndLoad)) CPU_Idle = ($(NonCondorLoadAvg) < $(BgndLoad)) KeyboardBusy = (KeyboardIdle < 10) MachineBusy = ($(CPU_Busy) || $(KeyboardBusy)) ActivityTimer = \ (CurrentTime - EnteredCurrentActivity)
90 Desktop Machine Policy START = $(CPU_Idle) && KeyboardIdle > 300 SUSPEND= $(MachineBusy) CONTINUE = $(CPU_Idle) && KeyboardIdle > 120 PREEMPT= (Activity == "Suspended") && \ $(ActivityTimer) > 300 KILL = $(ActivityTimer) > 300
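The desktop policy can be checked against a few machine snapshots before it is deployed. The sketch below evaluates the same expressions in Python; it is not how the startd evaluates them, the function name `desktop_policy` is hypothetical, and its arguments stand for the LoadAvg, CondorLoadAvg, KeyboardIdle, and Activity attributes plus the ActivityTimer macro (seconds).

```python
BGND_LOAD = 0.3  # BgndLoad from the config macros


def desktop_policy(load_avg, condor_load_avg, keyboard_idle,
                   activity, activity_timer):
    """Evaluate the desktop policy expressions for one snapshot."""
    non_condor_load = load_avg - condor_load_avg
    cpu_busy = non_condor_load >= BGND_LOAD
    cpu_idle = non_condor_load < BGND_LOAD
    keyboard_busy = keyboard_idle < 10
    machine_busy = cpu_busy or keyboard_busy
    return {
        "START": cpu_idle and keyboard_idle > 300,
        "SUSPEND": machine_busy,
        "CONTINUE": cpu_idle and keyboard_idle > 120,
        "PREEMPT": activity == "Suspended" and activity_timer > 300,
        "KILL": activity_timer > 300,
    }


if __name__ == "__main__":
    # The owner has just touched the keyboard while a Condor job is running.
    print(desktop_policy(1.2, 1.0, 2, "Busy", 50))
```

In this snapshot the load comes almost entirely from the Condor job, so the CPU counts as idle, but the recent keystroke makes SUSPEND true.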
91 Mission Accomplished › Our policy is in place and everyone is happy “Autumn and Blue Eyes” by “PJLewis” ©
92 Machine States PREEMPTING CLAIMED UNCLAIMED OWNER MATCHED begin
93 Machine Activities PREEMPTING CLAIMED UNCLAIMED OWNER MATCHED Benchmarking begin Idle Suspended Busy Idle Killing Vacating
94 Machine Activities PREEMPTING CLAIMED UNCLAIMED OWNER MATCHED Benchmarking begin Idle Suspended Busy Idle Killing Vacating Idle Ignores new Backfill state! See the manual for the gory details (Section 3.6: Startd Policy Configuration)
95 Custom Machine Attributes › Custom attributes added to job ClassAds via ‘+’ in job submit file › Can add attributes to a machine’s ClassAd, typically done in the local config file INSTRUCTIONAL=TRUE NETWORK_SPEED=100 STARTD_EXPRS=INSTRUCTIONAL,NETWORK_SPEED
96 Custom Machine Attributes › Jobs can now specify Rank and Requirements using new attributes: Requirements = (INSTRUCTIONAL=?=UNDEFINED || INSTRUCTIONAL==FALSE) Rank = NETWORK_SPEED
97 Policy Review › Users submitting jobs can specify Requirements and Rank expressions › Administrators can specify Startd policy expressions individually for each machine › Custom attributes easily added › You can enforce almost any policy!
98 Further Machine Policy Information › For further information, see section 3.6 “Startd Policy Configuration” in the Condor manual › condor-users mailing list
99 Priorities “IMG_2476” by “Joanne and Matt” ©
100 Job Priority › Set with condor_prio › Integers, larger numbers are higher priority › Only impacts order between jobs for a single user on a single schedd
101 User Priority › Determines allocation of machines to waiting users › View with condor_userprio › Inversely related to machines allocated A user with priority of 10 will be able to claim twice as many machines as a user with priority 20
102 User Priority › Effective User Priority is determined by multiplying two factors Real Priority Priority Factor
103 Real Priority › Based on actual usage › Defaults to 0.5 › Approaches actual number of machines used over time Configuration setting PRIORITY_HALFLIFE
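The slide does not give the update rule, so the following sketch assumes simple exponential smoothing in which the gap between the real priority and the current usage halves every PRIORITY_HALFLIFE seconds. The function name, the update interval, the half-life of 86400 seconds, and the floor of 0.5 are all illustrative assumptions.

```python
def update_real_priority(priority, machines_in_use, dt, half_life):
    """Move the real priority toward the number of machines in use.

    Assumed model: after `half_life` seconds of constant usage, the
    distance to `machines_in_use` has halved. The result never drops
    below the starting value of 0.5 (an assumption of this sketch).
    """
    beta = 0.5 ** (dt / half_life)
    updated = beta * priority + (1.0 - beta) * machines_in_use
    return max(0.5, updated)


if __name__ == "__main__":
    p = 0.5
    for hour in range(1, 49):
        p = update_real_priority(p, 10, dt=3600, half_life=86400)
        if hour % 24 == 0:
            print(f"after {hour} h: {p:.3f}")
```

Under this model a user who holds 10 machines continuously sees the real priority climb toward 10, and it decays back toward 0.5 once the machines are released.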
104 Priority Factor › Assigned by administrator Set with condor_userprio › Defaults to 1 ( DEFAULT_PRIO_FACTOR ) › Nice users default to 1,000,000 ( NICE_USER_PRIO_FACTOR ) Used for true bottom feeding jobs Add “ nice_user=true ” to your submit file
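Putting the last three slides together: effective priority is the product of real priority and priority factor, and machines are shared in inverse proportion to it. The sketch below works through that arithmetic; the function names are hypothetical and the proportional split is an idealisation of what the negotiator aims for, not its actual algorithm.

```python
def effective_priority(real_priority, priority_factor=1.0):
    """Effective user priority = real priority * priority factor."""
    return real_priority * priority_factor


def machine_shares(priorities, total_machines):
    """Split machines in inverse proportion to effective priority."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: total_machines * w / total for user, w in weights.items()}


if __name__ == "__main__":
    prios = {
        "alice": effective_priority(10.0),
        "bob": effective_priority(20.0),
    }
    for user, share in machine_shares(prios, 30).items():
        print(f"{user}: {share:.1f} machines")
```

A lower number is a better priority, which is why a nice user with a factor of 1,000,000 only receives machines nobody else wants.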
105 Negotiator Policy Expressions › PREEMPTION_REQUIREMENTS and PREEMPTION_RANK › Evaluated when condor_negotiator considers replacing a lower priority job with a higher priority job › Completely unrelated to the PREEMPT expression
106 PREEMPTION_REQUIREMENTS › If false will not preempt machine Typically used to avoid pool thrashing PREEMPTION_REQUIREMENTS = \ $(StateTimer) > (1 * $(HOUR)) \ && RemoteUserPrio > SubmittorPrio * 1.2 Only replace jobs running for at least one hour and 20% lower priority
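The example PREEMPTION_REQUIREMENTS expression can be read as a two-part test, sketched below in Python with illustrative numbers. The function name `may_preempt` is hypothetical; its arguments stand for the StateTimer macro (seconds in the current state), RemoteUserPrio (the user currently running on the machine), and SubmittorPrio (the user who wants the machine).

```python
HOUR = 3600  # seconds


def may_preempt(state_timer, remote_user_prio, submittor_prio):
    """True only if the running job has had at least an hour and the
    running user's priority value is more than 20% worse (larger)."""
    ran_long_enough = state_timer > 1 * HOUR
    clearly_worse = remote_user_prio > submittor_prio * 1.2
    return ran_long_enough and clearly_worse


if __name__ == "__main__":
    print(may_preempt(2 * HOUR, 30.0, 20.0))   # long-running, 50% worse
    print(may_preempt(HOUR // 2, 30.0, 20.0))  # too recent
    print(may_preempt(2 * HOUR, 22.0, 20.0))   # only 10% worse
```

The 20% margin keeps two users with nearly equal priorities from repeatedly evicting each other as their priorities drift.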
107 PREEMPTION_RANK › Picks which already claimed machine to reclaim PREEMPTION_RANK = \ (RemoteUserPrio * )\ - ImageSize Strongly prefers preempting jobs with a large (bad) priority and a small image size
108 Tools “Tools” by “book_slut73” ©
109 condor_config_val › Find current configuration values % condor_config_val MASTER_LOG /var/condor/logs/MasterLog
110 condor_config_val -v › Can identify source % condor_config_val -v CONDOR_HOST CONDOR_HOST: condor.cs.wisc.edu Defined in '/etc/condor_config.hosts', line 6
111 condor_fetchlog › Retrieve logs remotely condor_fetchlog beak.cs.wisc.edu Master
112 Querying daemons condor_status › Queries the collector for information about daemons in your pool › Defaults to finding condor_startds › condor_status -schedd summarizes all job queues › condor_status -master returns list of all condor_masters
113 condor_status › -long displays the full ClassAd › Specify a machine name to limit results to a single host condor_status -l node4.cs.wisc.edu
114 condor_status -constraint › Only return ClassAds that match an expression you specify › Show me idle machines with 1GB or more memory condor_status -constraint 'Memory >= 1024 && Activity == "Idle"'
115 condor_status -format › Controls format of output › Useful for writing scripts › Uses C printf style formats One field per argument
116 condor_status -format › Census of systems in your pool: % condor_status -format '%s ' Arch -format '%s\n' OpSys | sort | uniq -c 797 INTEL LINUX 118 INTEL WINNT SUN4u SOLARIS28 6 SUN4x SOLARIS28
117 Examining Queues condor_q › View the job queue › The “ -long ” option is useful to see the entire ClassAd for a given job › Supports -constraint and -format › Can view job queues on remote machines with the “ -name ” option
118 condor_q -format › Census of jobs per user % condor_q -format '%8s ' Owner -format '%s\n' Cmd | sort | uniq -c 64 adesmet /scratch/submit/a.out 2 adesmet /home/bin/run_events 4 smith /nfs/sim1/em2d3d 4 smith /nfs/sim2/em2d3d
119 condor_q -analyze › condor_q will try to figure out why the job isn’t running › Good at determining that no machine matches the job Requirements expressions
120 condor_q -analyze › Typical results: : Run analysis summary. Of 820 machines, 458 are rejected by your job's requirements 25 reject your job because of their own requirements 0 match, but are serving users with a better priority in the pool 4 match, but reject the job for unknown reasons 6 match, but will not currently preempt their existing job 327 are available to run your job
121 condor_q -better-analyze › In current 6.7 releases › Breaks down the job’s requirements and suggests modifications › Very slow.
122 condor_q -better-analyze › (Heavily truncated output) The Requirements expression for your job is: ( ( target.Arch == "SUN4u" ) && ( target.OpSys == "WINNT50" ) && [snip] Condition Machines Suggestion 1 (target.Disk > ) 0 MODIFY TO (target.Memory > 10000) 0 MODIFY TO (target.Arch == "SUN4u") (target.OpSys == "WINNT50") 110 MOD TO "SOLARIS28" Conflicts: conditions: 3, 4
123 Condor’s Log Files › Condor maintains one log file per daemon “Ready for the Winter” by “bcmom” ©
124 Condor’s Log Files › Can increase verbosity of logs on a per daemon basis SHADOW_DEBUG, SCHEDD_DEBUG, and others Space separated list
125 Useful Debug Levels › D_FULLDEBUG dramatically increases information logged › D_COMMAND adds information about commands received SHADOW_DEBUG = \ D_FULLDEBUG D_COMMAND
126 Condor’s Log Files › Log files are automatically rolled over when a size limit is reached Defaults to 1,000,000 bytes Old versions were 64,000 bytes, may want to increase Rolls over quickly with D_FULLDEBUG MAX_*_LOG, one setting per daemon MAX_SHADOW_LOG, MAX_SCHEDD_LOG, and others
127 Condor’s Log Files › Many log file entries are primarily useful to Condor developers Especially if D_FULLDEBUG is on Minor errors are often logged but corrected Take them with a grain of salt
128 Debugging Jobs “Wanna buy a Beetle?” by “Kevin” ©
129 Debugging Jobs: condor_q › Examine the job with condor_q especially -long and -analyze Compare with condor_status -long for a machine you expected to match
130 Debugging Jobs: User Log › Examine the job’s user log Can find with: condor_q -format '%s\n' UserLog 17.0 Set with “log” in the submit file › Contains the life history of the job › Often contains details on problems
131 Debugging Jobs: ShadowLog › Examine ShadowLog on the submit machine Note any machines the job tried to execute on There is often an “ERROR” entry that can give a good indication of what failed
132 Debugging Jobs: Matching Problems › No ShadowLog entries? Possible problem matching the job. Examine ScheddLog on the submit machine Examine NegotiatorLog on the central manager
133 Debugging Jobs: Local Problems › ShadowLog entries suggest an error but aren’t specific? Examine StartLog and StarterLog on the execute machine
134 Debugging Jobs: Reading Log Files › Condor logs will note the job ID each entry is for Useful if multiple jobs are being processed simultaneously grepping for the job ID will make it easy to find relevant entries
135 Debugging Jobs: What Next? › If necessary add “ D_FULLDEBUG D_COMMAND ” to the DAEMONNAME_DEBUG setting (for example, SHADOW_DEBUG) for additional log information › Increase MAX_DAEMONNAME_LOG if logs are rolling over too quickly › If all else fails, email us
136 Security “Padlock” by “Peter Ford” ©
137 Default Security › Out of the box Relatively open, hostname and IP address based Limit to your local IP addresses or get behind a firewall! “Mining museum door” by “byrdiegyrl” ©
138 Strong Security Capabilities › Strong authentication of users and other daemons › Encryption of transmitted data › Integrity validation of transmitted data Untitled, by Brian De Smet, ©
139 Security Options › Heavily configurable Many authentication systems supported –GSI (Globus), Kerberos, Windows NT authentication, public key, shared password, filesystem, and more High granularity Per access level: read, write, daemon-to-daemon, administrator, machine owner, etc. Per daemon
140 Problems With Default Installation › Host-based granularity is too big Any user who can login to central manager has “Administrator” privileges Any user on a machine can evict the job on that machine via condor_vacate
141 Problems With Default Installation › Most connections are NOT authenticated Queue management commands are. But other daemon-to-daemon commands are not. It is possible to send false information to the collector and other denials of service
142 Problems With Default Installation › Traffic is not private (encrypted) or checked for integrity Possibility of someone eavesdropping on your traffic, including files transferred to or from execute machine Possibility of someone modifying your traffic without detection
143 Security Policy Configuration › Condor provides many mechanisms to address the previous shortcomings: Many authentication methods Strong encryption Signed checksums for integrity
144 Security Policy Configuration › Condor will negotiate security requirements and supported methods client server I want to submit a job You must authenticate w/ kerberos KERBEROS normal submit protocol
145 Security Policy Configuration Default Policy SEC_DEFAULT_ENCRYPTION = OPTIONAL SEC_DEFAULT_INTEGRITY = OPTIONAL SEC_DEFAULT_AUTHENTICATION = OPTIONAL SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, KERBEROS, SSL, PASSWORD # UNIX SEC_DEFAULT_AUTHENTICATION_METHODS = NTSSPI, KERBEROS, SSL, PASSWORD # WIN32
146 Security Policy Configuration Default Policy Possible Policy Values NEVER: do not allow this to happen OPTIONAL: do not request it, but allow it PREFERRED: request it, but do not require it REQUIRED: this is mandatory SEC_DEFAULT_ENCRYPTION = OPTIONAL SEC_DEFAULT_INTEGRITY = OPTIONAL SEC_DEFAULT_AUTHENTICATION = OPTIONAL SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, KERBEROS, SSL, PASSWORD # UNIX SEC_DEFAULT_AUTHENTICATION_METHODS = NTSSPI, KERBEROS, SSL, PASSWORD # WIN32
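147 Security Policy Configuration Policy Reconciliation
(rows: client policy; columns: server policy; Y = feature used, N = feature not used, X = connection fails)
             Required  Preferred  Optional  Never
Required        Y         Y          Y        X
Preferred       Y         Y          Y        N
Optional        Y         Y          N        N
Never           X         N          N        N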
148 Security Policy Configuration Very Secure Policy SEC_DEFAULT_ENCRYPTION = REQUIRED SEC_DEFAULT_INTEGRITY = REQUIRED SEC_DEFAULT_AUTHENTICATION = REQUIRED SEC_DEFAULT_AUTHENTICATION_METHODS = SSL
149 Security Policy Configuration Policy Reconciliation Example 1 CLIENT POLICY SEC_DEFAULT_ENCRYPTION = OPTIONAL SEC_DEFAULT_INTEGRITY = OPTIONAL SEC_DEFAULT_AUTHENTICATION = OPTIONAL SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, KERBEROS, SSL, PASSWORD SERVER POLICY SEC_DEFAULT_ENCRYPTION = REQUIRED SEC_DEFAULT_INTEGRITY = REQUIRED SEC_DEFAULT_AUTHENTICATION = REQUIRED SEC_DEFAULT_AUTHENTICATION_METHODS = SSL RECONCILED POLICY ENCRYPTION = YES INTEGRITY = YES AUTHENTICATION = YES METHODS = SSL
150 Security Policy Configuration Policy Reconciliation Example 2 CLIENT POLICY SEC_DEFAULT_ENCRYPTION = PREFERRED SEC_DEFAULT_INTEGRITY = OPTIONAL SEC_DEFAULT_AUTHENTICATION = REQUIRED SEC_DEFAULT_AUTHENTICATION_METHODS = SSL, GSI, KERBEROS, PASSWORD SERVER POLICY SEC_DEFAULT_ENCRYPTION = NEVER SEC_DEFAULT_INTEGRITY = PREFERRED SEC_DEFAULT_AUTHENTICATION = OPTIONAL SEC_DEFAULT_AUTHENTICATION_METHODS = FS,GSI,SSL RECONCILED POLICY ENCRYPTION = NO INTEGRITY = YES AUTHENTICATION = YES METHODS = GSI,SSL
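The reconciliation table and the two examples above can be reproduced with a short function. The sketch below is illustrative and is not Condor's negotiation code: the function names are hypothetical, the policies are plain dictionaries rather than SEC_* settings, and listing the agreed methods in the server's order is an assumption chosen to match Example 2.

```python
LEVELS = ("REQUIRED", "PREFERRED", "OPTIONAL", "NEVER")


def reconcile_feature(client, server):
    """Return True/False for one feature, or raise if the sides conflict."""
    for level in (client, server):
        if level not in LEVELS:
            raise ValueError(f"unknown policy value: {level!r}")
    pair = {client, server}
    if pair == {"REQUIRED", "NEVER"}:
        raise ValueError("cannot reconcile REQUIRED with NEVER")
    if "REQUIRED" in pair:
        return True
    if "NEVER" in pair:
        return False
    # Remaining cases: PREFERRED and OPTIONAL only.
    return "PREFERRED" in pair


def reconcile(client, server):
    """Reconcile encryption, integrity, authentication, and methods."""
    result = {
        feature: reconcile_feature(client[feature], server[feature])
        for feature in ("ENCRYPTION", "INTEGRITY", "AUTHENTICATION")
    }
    # Assumption: keep methods both sides support, in the server's order.
    result["METHODS"] = [m for m in server["METHODS"] if m in client["METHODS"]]
    return result


if __name__ == "__main__":
    client = {"ENCRYPTION": "PREFERRED", "INTEGRITY": "OPTIONAL",
              "AUTHENTICATION": "REQUIRED",
              "METHODS": ["SSL", "GSI", "KERBEROS", "PASSWORD"]}
    server = {"ENCRYPTION": "NEVER", "INTEGRITY": "PREFERRED",
              "AUTHENTICATION": "OPTIONAL",
              "METHODS": ["FS", "GSI", "SSL"]}
    print(reconcile(client, server))
```

Note that two OPTIONAL settings reconcile to "not used": neither side asks for the feature, which is why the default policy leaves most traffic unauthenticated and unencrypted.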
151 Security Policy Configuration › As you can see, you may specify a different policy for each authorization level Another Example Policy SEC_DEFAULT_ENCRYPTION = OPTIONAL SEC_DEFAULT_INTEGRITY = PREFERRED SEC_DEFAULT_AUTHENTICATION = REQUIRED SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, KERBEROS, PASSWORD, SSL SEC_READ_INTEGRITY = OPTIONAL SEC_READ_AUTHENTICATION = OPTIONAL SEC_CLIENT_INTEGRITY = OPTIONAL SEC_CLIENT_AUTHENTICATION = OPTIONAL
152 Security Policy Configuration › Possible values for authorization levels: READ WRITE CONFIG ADMINISTRATOR OWNER DAEMON NEGOTIATOR SOAP › Special values: CLIENT, DEFAULT
153 Security Policy Configuration There is now a hierarchy of authorization: [Diagram: READ, WRITE, DAEMON, ADMINISTRATOR, CONFIG, OWNER, SOAP, NEGOTIATOR]
154 Security Policy Configuration Once you have authenticated users, you may use a more fine-grained authorization list: ALLOW_WRITE = ALLOW_WRITE = ALLOW_WRITE =
155 Security Policy Configuration › Format of canonical username: › One wildcard allowed in the portion, and one allowed in the host portion
156 Security Policy Configuration › Usernames can now be mapped with regular expressions. Old GSI mapfile: “/C=US/ST=Wisconsin/L=Madison/O=University of Wisconsin -- Madison/O=Computer Sciences Department/OU=Condor Project/CN=Zach “/C=US/ST=Wisconsin/L=Madison/O=University of Wisconsin -- Madison/O=Computer Sciences Department/OU=Condor Project/CN=Todd Etc.
157 Security Policy Configuration › New mapfile applies to all types of authentication › In your condor_config: SEC_CANONICAL_MAPFILE = /path/to/mapfile › Each line is a mapping rule
158 Security Policy Configuration › Sample map file: CLAIMTOBE.*nobody FS(.*)\1 GSI =(.*)\1
159 Security Policy Configuration › Review: The security policy determines which types of security mechanism will be used for a given type of command: SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = GSI
160 Security Policy Configuration › Review: The map file gives you a canonical name from the authenticated user: GSI =(.*) \1 “/C=US/ST=Wisconsin/L=Madison/O=University of Wisconsin -- Madison/O=Computer Sciences Department/OU=Condor Project/CN=Zach
161 Security Policy Configuration › Review: The ALLOW_ and DENY_ lists control the authorization of the canonical users: ALLOW_WRITE = DENY_WRITE = *.evildomain.net
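The interplay of ALLOW_ and DENY_ lists can be sketched as follows. This is a simplified illustration, not Condor's matching code: `is_authorized` is a hypothetical helper, the allow entry in the usage lines is an invented example, an entry without "@" is assumed to be a host pattern that applies to any user, and a matching deny entry is assumed to override any allow entry. Python's `fnmatch` also understands `?` and `[...]`, which goes beyond the single `*` wildcard described earlier.

```python
from fnmatch import fnmatchcase


def _matches(identity, pattern):
    """Match a canonical user@host identity against one list entry."""
    user, _, host = identity.partition("@")
    if "@" in pattern:
        pat_user, _, pat_host = pattern.partition("@")
    else:
        # Assumption: a bare entry is a host pattern for any user.
        pat_user, pat_host = "*", pattern
    return (fnmatchcase(user, pat_user)
            and fnmatchcase(host.lower(), pat_host.lower()))


def is_authorized(identity, allow, deny):
    """Assumed rule: deny entries win; otherwise an allow entry must match."""
    if any(_matches(identity, p) for p in deny):
        return False
    return any(_matches(identity, p) for p in allow)


# Hypothetical lists; only *.evildomain.net comes from the slide.
allow_write = ["*@cs.wisc.edu"]
deny_write = ["*.evildomain.net"]

print(is_authorized("zmiller@cs.wisc.edu", allow_write, deny_write))
print(is_authorized("mallory@host.evildomain.net", allow_write, deny_write))
```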
162 Configuring Authentication Methods CLAIMTOBE is not configurable, and is meant for testing purposes only. ANONYMOUS is also not configurable, and always sends the username “CONDOR_ANONYMOUS”
163 Configuring Authentication Methods FS works by creating a file in /tmp, and checking to see who the owner of that file is. FS_REMOTE is similar, but allows you to specify where to create the file, allowing you to use a directory on NFS or some other shared storage. This is specified like so: FS_REMOTE_DIR = /shared/temp/space
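The idea behind FS can be sketched in Python for a UNIX system. This is a single-process illustration rather than Condor's protocol: the two function names are hypothetical, and in a real exchange the client and the server would be separate processes that only share the file system.

```python
import os
import tempfile


def client_create_proof(directory="/tmp"):
    """Client side: create a file in the agreed directory and return its path."""
    fd, path = tempfile.mkstemp(prefix="fs_auth_", dir=directory)
    os.close(fd)
    return path


def server_check_owner(path):
    """Server side: the file's owner uid is taken as the client's identity."""
    # lstat() avoids following a symbolic link planted by someone else.
    return os.lstat(path).st_uid


proof_path = client_create_proof()
try:
    owner_uid = server_check_owner(proof_path)
finally:
    os.unlink(proof_path)

print(owner_uid == os.getuid())
```

Passing a directory on shared storage as `directory` corresponds to what FS_REMOTE_DIR selects for FS_REMOTE.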
164 Configuring Authentication Methods NTSSPI uses Windows’ SSPI authentication. This requires that the username and password be the same on two machines that are authenticating to each other. There are no configuration options for this method.
165 Configuring Authentication Methods GSI is the Globus Toolkit’s implementation of X.509. There are quite a number of configuration options for this which essentially specify which credentials to use, which to expect, and who is allowed to sign the credentials.
166 Configuring Authentication Methods When using GSI, a user typically has a public key (certificate) and a private key that is encrypted with a password. A temporary certificate, called a proxy, is created for use with Condor: % grid-proxy-init -cert zmiller.crt -key zmiller.key Your identity: /C=US/ST=Wisconsin/L=Madison/O=University of Wisconsin -- Madison/O=Computer Sciences Department/OU=Condor Project/CN=Zach Enter GRID pass phrase for this identity: Creating proxy Done
167 Configuring Authentication Methods For the Condor daemons to use GSI authentication, however, they will need either a proxy or a private key with no password. Proxies would need to be valid for a considerable length of time to be useful in this context, so usually a certificate and private key pair is used.
168 Configuring Authentication Methods GSI_DAEMON_CERT = condor_cred.crt GSI_DAEMON_KEY = condor_cred.key Also, Condor needs to know the public key of any CA that signs certificates: GSI_DAEMON_TRUSTED_CA_DIR=/path/to/keys
169 Configuring Authentication Methods Typically, there is a “host credential” which is owned by root in: /etc/grid-security/host.crt /etc/grid-security/host.key And also a directory of signing keys in /etc/grid-security/certificates
170 Configuring Authentication Methods Condor will use those defaults without any additional configuration, if it has root privilege to read the host key. Alternatively, you can specify your own directory to hold the key files and the signing certificates by using: GSI_DAEMON_DIRECTORY=~/globus/
171 Configuring Authentication Methods Since GSI is currently available only on UNIX systems, we have also added support for OpenSSL. It needs essentially the same information as GSI, which is specified in this case with a different set of configuration parameters: AUTH_SSL_SERVER_CAFILE AUTH_SSL_SERVER_CADIR AUTH_SSL_SERVER_CERTFILE AUTH_SSL_SERVER_KEYFILE AUTH_SSL_CLIENT_CAFILE AUTH_SSL_CLIENT_CADIR AUTH_SSL_CLIENT_CERTFILE AUTH_SSL_CLIENT_KEYFILE
172 Configuring Authentication Methods For both GSI and OpenSSL, you can use the openssl command line tool to create all the necessary keys and files. For more details: › On Condor web site: › Forthcoming in the Condor Manual… › Google.
173 Configuring Authentication Methods KERBEROS is now available on both UNIX and Windows platforms. On UNIX, it authenticates against a KDC (Key Distribution Center). On Windows, the Active Directory server fills this role.
174 Configuring Authentication Methods For Kerberos, a user may need to run kinit if that is not part of the normal login process: % kinit Password for % klist Ticket cache: FILE:/var/adm/krb5/tmp/tkt/tkt_24842_pid14957 Default principal: Valid starting Expires Service principal 04/24/06 12:01:07 05/24/06
175 Configuring Authentication Methods The Condor daemons can use the “host credential” if they are running as root. The default keytab file is typically in /etc/v5srvtab Alternatively, you can specify your own keytab file and server principal to use: KERBEROS_SERVER_KEYTAB = /scratch/zmiller/keytab.zmiller KERBEROS_SERVER_PRINCIPAL =
176 Configuring Authentication Methods PASSWORD authentication also works on both UNIX and Windows, although the interface is somewhat different. On Windows, the password is stored in a secure part of the registry. On UNIX, the password is stored in a file specified by: SEC_PASSWORD_FILE = /path/to/file
177 Configuring Authentication Methods To set the pool password, use the condor_store_cred command: condor_store_cred -c add
178 PASSWORD protect pools › PASSWORD authentication currently only for daemons to authenticate with each other as id SEC_PASSWORD_FILE = /var/condor/.pool_password SEC_DEFAULT_AUTHENTICATION = REQUIRED SEC_READ_AUTHENTICATION = OPTIONAL SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD ALLOW_CONFIG = ALLOW_DAEMON = ALLOW_NEGOTIATOR =
179 Security Conclusions Now you have seen a general overview of the different methods that Condor supports. It is very flexible, but the downside is complexity. Our plan: “recipes” in the manual and automation in the installation programs.
Condor on Windows
181 Overview › Latest features › Running jobs as submitting user › Cross-platform authentication methods (Kerberos, SSL, Password) › Running Condor in an unprivileged account
182 Running Jobs as the Submitting User
183 condor_store_cred add C:\>condor_store_cred add Account: Enter password: Operation succeeded. › Contacts local schedd and asks it to securely store a user’s password › Password is stored encrypted in a registry location
184 condor_store_cred query C:\>condor_store_cred query Account: A credential is stored and is valid. › Checks if password is stored for your user name › Also makes sure password is up to date (by making sure it can be used to log in)
185 condor_store_cred delete C:\>condor_store_cred delete Account: Enter password: Operation succeeded. › Removes password from secure password store
186 Job Execution: Submit Side [Diagram: submit, schedd, and shadow, with the Secure Password Store holding the user’s password]
187 Job Execution: Execute Side › Jobs run using a Condor-specific account with minimal privileges. [Diagram: starter running condor_exec.exe under the account condor-reuse-vm1]
188 Job Execution: Execute Side starter condor_exec.exe y0urs myp4sswd schedd VM1_USER = CROW\gquinn VM2_USER = CROW\gquinn
189 It’d be nice if… › My jobs could access my files just like the condor_shadow can › I didn’t have to tie my execute machines to a single account › I didn’t have to run condor_store_cred from every machine where my credential is needed
190 The Windows CredD › A centralized repository for user passwords C:\>condor_store_cred add Account: Enter password: Operation succeeded. [Diagram: condor_store_cred sends a “store password” request to the credd]
191 The Windows CredD › Submit machines can use the CredD to impersonate the user in the shadow [Diagram: schedd and shadow send a “fetch password” request]
192 The Windows CredD › Execute machines can use the CredD to run jobs as the submitting user! [Diagram: starter running condor_exec.exe sends a “fetch password” request]
193 Running Jobs as Submitting User › Example submit file: universe = vanilla executable = whoami.exe log = whoami.log output = whoami.out run_as_owner = true queue
194 Running Jobs as Submitting User › In config file on submit and execute nodes: CREDD_HOST = vault.cs.wisc.edu STARTER_ALLOW_RUNAS_OWNER = True CREDD_CACHE_LOCALLY = True SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
195 Running Jobs as Submitting User › See example config file included with Condor: condor_config.local.credd # Set security settings so that full security to the credd is required CREDD.SEC_DEFAULT_AUTHENTICATION = REQUIRED CREDD.SEC_DEFAULT_ENCRYPTION = REQUIRED CREDD.SEC_DEFAULT_INTEGRITY = REQUIRED CREDD.SEC_DEFAULT_NEGOTIATION = REQUIRED # Require PASSWORD auth for password fetching CREDD.SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD # Only honor password fetch requests to the trusted "condor_pool" user CREDD.ALLOW_DAEMON =
196 Securing the CredD › NTSSPI can be used to authenticate to CredD and send the password encrypted over the network C:\>condor_store_cred add Account: Enter password: Operation succeeded. [Diagram: condor_store_cred sends a “store password” request to the credd]
197 Securing the CredD › Condor normally runs as SYSTEM, and therefore can’t use NTSSPI [Diagram: starter running condor_exec.exe sends a “fetch password” request]
198 Securing the CredD › Options for securing password fetch operations: Kerberos / SSL authentication; Password authentication; Run the Condor service as a normal account and use NTSSPI
199 Password Authentication
200 Password Authentication › Mutual authentication of Condor daemons possessing a shared “pool password” › Good for small pools where more heavyweight methods aren’t desirable
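Mutual authentication from a shared secret can be illustrated with a generic HMAC challenge-response exchange in Python. This is not the protocol Condor uses for PASSWORD authentication; it is a swapped-in textbook technique, with hypothetical function names, meant only to show how two parties can each prove knowledge of a pool password without sending it over the network.

```python
import hashlib
import hmac
import secrets


def respond(pool_password: bytes, challenge: bytes) -> bytes:
    """Prove knowledge of the password by keying an HMAC over the challenge."""
    return hmac.new(pool_password, challenge, hashlib.sha256).digest()


def verify(pool_password: bytes, challenge: bytes, response: bytes) -> bool:
    """Check a peer's response using a constant-time comparison."""
    expected = respond(pool_password, challenge)
    return hmac.compare_digest(expected, response)


def mutual_auth(password_a: bytes, password_b: bytes) -> bool:
    """Simulate daemons A and B challenging each other in one process."""
    challenge_from_a = secrets.token_bytes(32)
    challenge_from_b = secrets.token_bytes(32)
    # B answers A's challenge; A verifies it with its own copy of the password.
    b_proved = verify(password_a, challenge_from_a,
                      respond(password_b, challenge_from_a))
    # A answers B's challenge; B verifies it the same way.
    a_proved = verify(password_b, challenge_from_b,
                      respond(password_a, challenge_from_b))
    return a_proved and b_proved


print(mutual_auth(b"pool-secret", b"pool-secret"))
print(mutual_auth(b"pool-secret", b"wrong-secret"))
```

Each side issues a fresh random challenge, so a recorded response cannot simply be replayed in a later session.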
201 Password Authentication › Pool password can be stored with the new “-c” argument to condor_store_cred › Can also be done remotely with the “-n” argument C:\> condor_store_cred -c add C:\> condor_store_cred -n crow.cs.wisc.edu -c add
202 Using an Unprivileged Account for Condor
203 Personal Condor › Allows creating a 1-machine Condor pool as any user C:\> SET CONDOR_CONFIG=c:\condor\condor_config C:\> condor_master -f
204 Unprivileged Service › Condor still runs using the Service Control Manager (SCM)
205 More Information › Condor experts here at Condor Week › Condor Manual › Condor Web Site › Condor mailing lists, esp condor-users › Univ of Wisconsin Condor Team at
206 Thank You!