Published by Ashley Ethan Hancock. Modified over 6 years ago.
1
Condor Introduction and Architecture for Vanilla Jobs CERN Feb 14 2011
2
The Condor Project (Established ‘85)
Research and Development in the Distributed High Throughput Computing field
Team of ~35 faculty, full time staff and students
  Face software engineering challenges in a distributed UNIX/Linux/NT environment
  Are involved in national and international grid collaborations
  Actively interact with academic and commercial entities and users
  Maintain and support large distributed production environments
  Educate and train students
3
The Condor Team
4
What does Condor offer to USCMS community?
Batch System
  Robust feature set, fault tolerant
  Open Source
  Development team dedicated to working closely w/ scientific community as priority #1
  Flexible!!!
Multi-purpose Job Queue / Workflow manager
  Submit jobs to local cluster
  Submit jobs to local “campus”
  Submit jobs to the grid (a.k.a. Condor-G)
Grid Overlay Capable
  Dynamic Condor Glidein pools with the help of a factory (glideinWMS)
5
Outline
Condor ClassAds
  The “lingua franca” of Condor
Matchmaking
Submitting a hello world job
Condor Architecture
  Condor daemons – what they are, what they do
  Who talks to whom
Condor Connection Brokering
6
Job
Condor’s quantum of work
Like a UNIX process
Can be an element of a workflow
7
Jobs Have Wants & Needs
Jobs state their requirements and preferences:
Requirements:
  I require a Linux/x86 platform
Preferences ("Rank"):
  I prefer the machine with the most memory
  I prefer a machine in the chemistry department
8
Machines Do Too!
Machines specify:
Requirements:
  Require that jobs run only when there is no keyboard activity
  Never run jobs belonging to Dr. Heisenberg
Preferences ("Rank"):
  I prefer to run Albert’s jobs
Custom Attributes:
  I am a machine in the physics department
9
Condor brings them together
[Diagram: a Central Manager (collector, negotiator), a Submit Node (schedd) fed by condor_submit, and three Execute Nodes (startd)]
10
Condor ClassAds
11
What are Condor ClassAds?
ClassAds is a language for objects (jobs and machines) to
  Express attributes about themselves
  Express what they require/desire in a match (similar to personal classified ads)
Structure: set of attribute name/value pairs
Value: Literals (string, bool, int, float) or an expression
12
ClassAd Expressions
Similar look to C/C++ or Java: operators, references, functions
Operators: +, -, *, /, <, <=, >, >=, ==, !=, &&, and || all work as expected
Functions: if/then/else, string manipulation, regular expression pattern matching, list operations, dates, randomization, …
References: to other attributes in the same ad, or attributes in an ad that is a candidate for a match
TRUE==1 and FALSE==0 (guaranteed)
  (3 == (2+1)) is identical to 1
  (TRUE*30) is identical to 30
  (3 == 1) is identical to 0
13
ClassAd Types
Condor has many types of ClassAds
  A "Job Ad" represents a job to Condor
  A "Machine Ad" represents a computing resource
  Other types of ads represent instances of other services, users, licenses, etc., to your Condor pool
14
The Magic of Matchmaking
Condor evaluates job ads in the context of candidate machine ads, looking for matches
Requirements and Rank expressions
  Two ads match if both their Requirements expressions evaluate to True
MY.name – Value for attribute “name” in local ClassAd
TARGET.name – Value for attribute “name” in match candidate ClassAd
Name – Looks for “name” in the local ClassAd, then the candidate ClassAd
Requirements is a bool; Rank is a float where higher is preferred
15
Example

Pet Ad
  Type = "Dog"
  Requirements = DogLover =?= True
  Color = "Brown"
  Price = 75
  Sex = "Male"
  AgeWeeks = 8
  Breed = "Saint Bernard"
  Size = "Very Large"
  Weight = 27

Buyer Ad
  AcctBalance = 100
  DogLover = True
  Requirements = (Type == "Dog") && (TARGET.Price <= MY.AcctBalance) && ( Size == "Large" || Size == "Very Large" )
  Rank = (Breed == "Saint Bernard")
  . . .
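The two-way match in the pet example can be mimicked in a few lines of Python. This is only a toy model under stated assumptions: ads are plain dictionaries, Requirements and Rank are Python callables taking (my, target), and the helper names `lookup` and `ads_match` are invented for illustration. None of this is Condor's own ClassAd evaluator.

```python
# Toy model of two-way ClassAd matching (NOT Condor's real evaluator).
# Ads are dicts; "Requirements" and "Rank" are functions of (my, target).

def lookup(name, my, target):
    """Bare attribute reference: look in the local ad first, then the candidate."""
    if name in my:
        return my[name]
    return target.get(name)  # None stands in for an undefined attribute


pet_ad = {
    "Type": "Dog", "Color": "Brown", "Price": 75, "Sex": "Male",
    "AgeWeeks": 8, "Breed": "Saint Bernard", "Size": "Very Large", "Weight": 27,
    # DogLover =?= True: a strict comparison, so an undefined DogLover fails
    "Requirements": lambda my, target: lookup("DogLover", my, target) is True,
}

buyer_ad = {
    "AcctBalance": 100, "DogLover": True,
    "Requirements": lambda my, target: (
        lookup("Type", my, target) == "Dog"
        and target["Price"] <= my["AcctBalance"]
        and lookup("Size", my, target) in ("Large", "Very Large")
    ),
    # Rank is a number; higher is preferred
    "Rank": lambda my, target: float(lookup("Breed", my, target) == "Saint Bernard"),
}


def ads_match(a, b):
    """Two ads match only if BOTH Requirements expressions evaluate to True."""
    return bool(a["Requirements"](a, b)) and bool(b["Requirements"](b, a))


if __name__ == "__main__":
    print(ads_match(pet_ad, buyer_ad))
    print(buyer_ad["Rank"](buyer_ad, pet_ad))
```

Lowering the buyer's AcctBalance below 75, or dropping DogLover from the buyer ad, makes `ads_match` return False, because one side's Requirements no longer holds.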
16
Getting Started: Submitting Jobs to Condor
Get access to submit host
Choose a “Universe” for your job
Make your job “batch-ready”
  Includes making your data available to your job
Create a submit description file
Run condor_submit to put your job(s) in the queue
Relax while Condor manages and watches over your job(s)
17
Choose the job “Universe”
Controls how Condor handles jobs
Condor’s many universes include:
  Vanilla (aka regular serial job)
  Parallel
  Grid
  Java
  VM
  Standard
18
Hello World Submit File
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
# Job's executable
Executable = cosmos
# Job's STDOUT
Output     = cosmos.out
# Job's STDIN
Input      = cosmos.in
# Put the job in the queue!
Queue
19
condor_submit & condor_q
% condor_submit sim.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
% condor_q
-- Submitter: perdita.cs.wisc.edu : < :1027> :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
frieda /16 06: :00:00 I sim.exe
1 jobs; 1 idle, 0 running, 0 held
%
20
View the full ClassAd
% condor_q -long
-- Submitter: perdita.cs.wisc.edu : < :1027> :
MyType = "Job"
TargetType = "Machine"
ClusterId = 1
QDate =
CompletionDate = 0
Owner = "frieda"
RemoteWallClockTime =
LocalUserCpu =
LocalSysCpu =
RemoteUserCpu =
RemoteSysCpu =
ExitStatus = 0
…
21
Logging your Job's Activities
Create a log of job events
Add to submit description file:
  log = cosmos.log
The Life Story of a Job
  Shows all events in the life of a job
Good practice to always have a log file
Libraries to parse them provided
22
Sample Condor User Log
000 ( ) 05/25 19:10:03 Job submitted from host: < :1816>
...
001 ( ) 05/25 19:12:17 Job executing on host: < :1026>
005 ( ) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
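A user log of this shape is easy to scan with a regular expression. The sketch below is not one of the parsing libraries the previous slide mentions; it is a minimal stand-in that assumes each event header has the form "code (cluster.proc.subproc) MM/DD HH:MM:SS message". The job IDs and host addresses in `SAMPLE` are hypothetical placeholders, since the slide's own values did not survive extraction, and `parse_events` is an invented name.

```python
import re

# Event header: 3-digit event code, "(cluster.proc.subproc)", date, time, message.
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)


def parse_events(text):
    """Return one dict per event header line; detail and '...' lines are skipped."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            events.append({
                "code": m.group("code"),
                "job": (int(m.group("cluster")), int(m.group("proc"))),
                "when": m.group("date") + " " + m.group("time"),
                "message": m.group("message"),
            })
    return events


# Hypothetical log text: job ID 1.0 and the 192.0.2.x addresses are made up.
SAMPLE = """\
000 (001.000.000) 05/25 19:10:03 Job submitted from host: <192.0.2.1:1816>
...
001 (001.000.000) 05/25 19:12:17 Job executing on host: <192.0.2.2:1026>
...
005 (001.000.000) 05/25 19:13:06 Job terminated.
\t(1) Normal termination (return value 0)
...
"""

if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event["code"], event["when"], event["message"])
```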
23
condor_status gives information about the pool:
Name          OpSys Arch  State     Activ LoadAv Mem ActvtyTime
perdita.cs.wi LINUX INTEL Owner     Idle             :28:42
coral.cs.wisc LINUX INTEL Claimed   Busy             :27:21
doc.cs.wisc.e LINUX INTEL Unclaimed Idle             :20:04
dsonokwa.cs.w LINUX INTEL Claimed   Busy             :01:45
ferdinand.cs. LINUX INTEL Claimed   Suspe            :00:55

To inspect full ClassAds: condor_status -long
24
Condor File Transfer
Condor will transfer files between submit and execute nodes (eliminating the need for a shared filesystem) if desired:
ShouldTransferFiles
  YES: Always transfer files to execution site
  NO: Always rely on a shared filesystem
  IF_NEEDED: Condor will automatically transfer the files if the submit and execute machine are not in the same FileSystemDomain (Use shared file system if available)
When_To_Transfer_Output
  ON_EXIT: Transfer the job's output files back to the submitting machine only when the job completes
  ON_EXIT_OR_EVICT: Like above, but also when the job is evicted
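The three ShouldTransferFiles settings amount to a small decision rule. The function below restates that rule in Python purely as an illustration; `will_transfer_files` and its parameters are invented names, and comparing the two FileSystemDomain strings for equality is an assumption about how "same domain" is decided.

```python
def will_transfer_files(should_transfer, submit_domain, execute_domain):
    """Decide whether files are transferred, following the slide's description.

    should_transfer: "YES", "NO" or "IF_NEEDED" (case-insensitive here).
    submit_domain / execute_domain: FileSystemDomain of each machine.
    """
    setting = should_transfer.upper()
    if setting == "YES":
        return True    # always transfer to the execution site
    if setting == "NO":
        return False   # always rely on a shared filesystem
    if setting == "IF_NEEDED":
        # transfer only when the machines do not share a FileSystemDomain
        return submit_domain != execute_domain
    raise ValueError("unknown ShouldTransferFiles value: " + should_transfer)


if __name__ == "__main__":
    print(will_transfer_files("IF_NEEDED", "cs.wisc.edu", "cs.wisc.edu"))
    print(will_transfer_files("IF_NEEDED", "cs.wisc.edu", "example.org"))
```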
25
Condor File Transfer, cont
Transfer_Input_Files
  List of files that you want Condor to transfer to the execute machine
Transfer_Output_Files
  List of files that you want Condor to transfer from the execute machine
  If not specified, Condor will transfer back all new or modified files in the execute directory
26
Simple File Transfer Example
# Example submit file using file transfer
Universe = vanilla
Executable = cosmos
Log = cosmos.log
ShouldTransferFiles = YES
Transfer_input_files = cosmos.dat
Transfer_output_files = results.dat
When_To_Transfer_Output = ON_EXIT
Queue
27
General User Commands
condor_status      View Pool Status
condor_q           View Job Queue
condor_submit      Submit new Jobs
condor_run         Submit and block
condor_rm          Remove Jobs
condor_prio        Intra-User Prios
condor_history     Completed Job Info
condor_submit_dag  Submit new DAG
condor_checkpoint  Force a checkpoint
condor_compile     Link Condor library
28
Condor Daemons
[Image: title unknown, by Hans Holbein the Younger, from Historiarum Veteris Testamenti icones, 1543]
29
Condor Daemons – Mix’n Match Components
negotiator collector master shadow schedd procd startd starter kbdd exec
30
condor_master
You start it, it starts up the other Condor daemons
  If a daemon exits unexpectedly, restarts the daemon and e-mails the administrator
  If a daemon binary is updated (timestamp changed), restarts the daemon
31
condor_master
Provides access to many remote administration commands:
  condor_reconfig, condor_restart, condor_off, condor_on, etc.
Default server for many other commands:
  condor_config_val, etc.
32
condor_master
Periodically runs condor_preen to clean up any files Condor might have left on the machine
  E-mails you notification of deleted files
  Backup behavior, the other daemons clean up after themselves
33
condor_procd
Tracks processes
Automatically started as needed
  No DAEMON_LIST entry necessary
Behind the scenes
Part of privilege separation security enhancements
[Image: “IMG 0960” by Eva Schiffer © 2008. Used with permission.]
34
condor_startd
Represents a machine willing to run jobs to the Condor pool
Run it on any machine you want to run jobs on
Enforces the wishes of the machine owner (the owner’s “policy”)
35
condor_startd
Starts, stops, suspends jobs
Spawns the appropriate condor_starter, depending on the type of job
Provides other administrative commands (for example, condor_vacate)
Aided by condor_kbdd
36
condor_starter
Spawned by the condor_startd
  Don’t add to DAEMON_LIST
Handles all the details of starting and managing the job
  Transfer job’s binary to execute machine
  Send back exit status
  Etc.
37
condor_starter
One per running job
The default configuration is willing to run one job per CPU
38
condor_kbdd Monitors physical keyboard and mouse so the condor_startd can make decisions based on local usage.
39
condor_schedd
Represents jobs to the Condor pool
Maintains persistent queue of jobs
  Queue is not strictly first-in-first-out (priority based)
  Each machine running condor_schedd maintains its own independent queue
Run it on any machine you want to submit jobs from
40
condor_schedd
Responsible for contacting available machines and spawning waiting jobs
  When told to by condor_negotiator
Services most user commands:
  condor_submit, condor_rm, condor_q
  Also: condor_hold, condor_release
41
condor_shadow
Represents job on the submit machine
Spawned by condor_schedd
  Don’t add to DAEMON_LIST
Services requests from standard universe jobs for remote system calls
  including all file I/O
Makes decisions on behalf of the job
  for example: where to store the checkpoint file
42
condor_exec.exe A running job.
When user executable binaries are transferred to the execution side, they are renamed condor_exec.exe.
43
condor_collector
Collects information from all other Condor daemons in the pool
  Each daemon sends a periodic update called a ClassAd to the collector
  Old ClassAds removed after a time out
Services queries for information:
  Queries from other Condor daemons
  Queries from users (condor_status)
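The collector's behaviour (store the latest ad per daemon, drop ads that stop being refreshed, answer queries) can be pictured with a small in-memory table. The class below is a toy illustration only: `ToyCollector`, its method names, and the 900-second timeout used in the example are assumptions, and time is passed in explicitly so the behaviour is easy to follow.

```python
class ToyCollector:
    """In-memory stand-in for the collector's ad table (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._ads = {}  # daemon name -> (ad dict, time of last update)

    def update(self, name, ad, now):
        """A daemon's periodic update replaces whatever ad it sent before."""
        self._ads[name] = (dict(ad), now)

    def expire(self, now):
        """Remove ads that have not been refreshed within the timeout."""
        stale = [n for n, (_, t) in self._ads.items() if now - t > self.timeout]
        for n in stale:
            del self._ads[n]

    def query(self, now, constraint=lambda ad: True):
        """Return the live ads that satisfy the constraint."""
        self.expire(now)
        return [ad for ad, _ in self._ads.values() if constraint(ad)]


if __name__ == "__main__":
    collector = ToyCollector(timeout_seconds=900)  # timeout value is made up
    collector.update("doc", {"Name": "doc", "State": "Unclaimed"}, now=0)
    collector.update("coral", {"Name": "coral", "State": "Claimed"}, now=600)
    print(len(collector.query(now=700)))   # both ads are still fresh
    print(len(collector.query(now=1000)))  # "doc" has not updated in time
```

A query with a constraint, such as `lambda ad: ad["State"] == "Unclaimed"`, plays the role of a filtered condor_status request in this model.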
44
condor_negotiator
Performs matchmaking in Condor
  Pulls list of available machines and job queues from condor_collector
  Matches jobs with available machines
  Both the job and the machine must satisfy each other’s requirements (2-way matching)
Handles user priorities and accounting
45
Machine role defined by services launched there
You only have to run the daemons for the services you need to provide
DAEMON_LIST is a comma separated list of daemons to start
  DAEMON_LIST = MASTER, SCHEDD, STARTD
46
Central Manager
The Central Manager is the machine running the collector and negotiator
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
Defines a Condor pool.
47
Sample Condor Pool
[Diagram: six machines; legend: "= Process Spawned", "= ClassAd Communication Pathway"]
  Central Manager: master, collector, negotiator, schedd, startd
  Execute-Only (two machines): master, startd
  Submit-Only: master, schedd
  Regular Node (two machines): master, schedd, startd

Here we have a small pool with six machines. Like most Condor pools, there is a Central Manager, identifiable by the presence of the condor_negotiator and condor_collector. In this case the Central Manager is also allowed to run jobs (presence of condor_startd), and jobs can be submitted from the Central Manager (presence of condor_schedd).

This pool has one node only useful for submitting jobs (it is only running the condor_schedd), and two nodes only used for executing jobs (only running condor_startd). There are two nodes both able to submit and execute jobs (both startd and schedd).
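Since a machine's role follows from the daemons it runs, the roles in the sample pool can be derived mechanically from each DAEMON_LIST. The helper below is a small illustration under that reading; `machine_roles` and the role labels are invented names, not Condor terminology or tooling.

```python
def parse_daemon_list(value):
    """Split a comma separated DAEMON_LIST value into a set of upper-case names."""
    return {d.strip().upper() for d in value.split(",") if d.strip()}


def machine_roles(daemon_list):
    """Name the roles a machine plays, based only on which daemons it starts."""
    daemons = parse_daemon_list(daemon_list)
    roles = []
    if {"COLLECTOR", "NEGOTIATOR"} <= daemons:
        roles.append("central manager")
    if "SCHEDD" in daemons:
        roles.append("submit")
    if "STARTD" in daemons:
        roles.append("execute")
    return roles


if __name__ == "__main__":
    pool = {
        "central manager": "MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD",
        "execute-only": "MASTER, STARTD",
        "submit-only": "MASTER, SCHEDD",
        "regular node": "MASTER, SCHEDD, STARTD",
    }
    for label, daemon_list in pool.items():
        print(label, "->", machine_roles(daemon_list))
```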
48
Job Startup
[Image: “LUNAR Launch” by Steve Jurvetson (“jurvetson”) © 2006. Licensed under the Creative Commons Attribution 2.0 license.]
49
Claiming Protocol
[Diagram: Central Manager (Negotiator, Collector), Submit Machine (Schedd) and Execute Machine (Startd); labels: Submit, CLAIM]

Steps
1. Startd sends collector ClassAd describing itself. (The Schedd does as well, but it has nothing interesting to say yet.)
2. The user calls condor_submit to submit a job. The job is handed off to the schedd and condor_submit returns.
3. The schedd alerts the collector that it now has a job waiting.
4. The negotiator asks the collector for a list of machines able to run jobs and schedd queues with waiting jobs.
5. The negotiator contacts the schedd to learn about the waiting job.
6. The negotiator matches the waiting job with the waiting machine.
7. The negotiator alerts the schedd and the startd that there is a match.
8. The schedd contacts the startd to claim the match.
9. The schedd starts a shadow to monitor the job.
10. The startd starts a starter to start the job.
11. The starter and the shadow contact each other.
12. The starter starts the job.
13. If the job is using the Condor syscall library (typically through being condor_compiled), it will contact the shadow to access necessary files.
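One way to read the claiming steps is as a list of who contacts whom. The sketch below simply transcribes the steps into (sender, receiver, purpose) tuples and derives each party's set of peers; the tuple wording and the `peers` helper are illustrative choices, not part of Condor. The negotiator's internal matching step has no entry because no message is exchanged, and the final syscall-library step is left out because it applies only to some jobs.

```python
# Messages transcribed from the numbered steps, in order.
CLAIM_STEPS = [
    ("startd", "collector", "send machine ClassAd"),
    ("condor_submit", "schedd", "hand off job"),
    ("schedd", "collector", "advertise waiting job"),
    ("negotiator", "collector", "ask for machines and schedd queues"),
    ("negotiator", "schedd", "learn about waiting job"),
    ("negotiator", "schedd", "announce match"),
    ("negotiator", "startd", "announce match"),
    ("schedd", "startd", "claim the match"),
    ("schedd", "shadow", "start shadow"),
    ("startd", "starter", "start starter"),
    ("starter", "shadow", "contact each other"),
    ("starter", "job", "start the job"),
]


def peers(party):
    """Return, sorted, every party that `party` exchanges a message with."""
    found = set()
    for sender, receiver, _purpose in CLAIM_STEPS:
        if sender == party:
            found.add(receiver)
        elif receiver == party:
            found.add(sender)
    return sorted(found)


if __name__ == "__main__":
    for party in ("schedd", "startd", "negotiator"):
        print(party, "talks to", ", ".join(peers(party)))
```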
50
Claim Activation
[Diagram: Central Manager (Negotiator, Collector), Submit Machine (Schedd, Shadow) and Execute Machine (Startd, Starter, Job); state: CLAIMED; label: Activate Claim]
51
Repeat until Claim released
[Diagram: Central Manager (Negotiator, Collector), Submit Machine (Schedd, Shadow) and Execute Machine (Startd, Starter, Job); state: CLAIMED; label: Activate Claim]
53
When is claim released?
When relinquished by one of the following:
lease on the claim is not renewed
  Why? Machine powered off, disappeared, etc.
schedd
  Why? Out of jobs, shutting down, schedd didn’t “like” the machine, etc.
startd
  Why? Policy re claim lifetime, prefers a different match (via Rank), non-dedicated desktop, etc.
negotiator
  Why? User priority inversion policy
explicitly via a command-line tool
  E.g. condor_vacate
54
Some items to notice
Machines (startds) or submitters (schedds) can dynamically appear and disappear
  A key for glidein
Scheduling policy can be very flexible (custom attributes) and very distributed
Lots of network arrows on previous slides
  Reflects the P2P nature of Condor
  But what about NATs, firewalls?
55
CCB: Condor Connection Broker
Condor wants two-way p2p connectivity
With CCB, one-way is good enough
Collector requests reversed connections for clients
CCB_ADDRESS=ccb.host.name
[Diagram: Execute Node and Job Submit Point; labels: "run this job", "I want to connect to the submit node", "transfer files", "reversed connection"]
56
Limitations of CCB
Collector (CCB Broker) needs to be accessible by everyone
Requires outgoing connectivity
Can’t have BOTH submit and execute points behind different firewalls
[Diagram: Execute Node (CCB_ADDRESS=ccb1.host) and Job Submit Point (CCB_ADDRESS=ccb2.host), each behind its own firewall; both directions labelled "no go!"]
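The CCB rule and its limitation can be stated as a tiny reachability check. This is a deliberately simplified model, not Condor's networking code: `Endpoint`, `accepts_inbound` and `connect` are invented names, and the model assumes the broker itself is reachable by everyone and that every endpoint has outgoing connectivity, as the slide requires.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    accepts_inbound: bool   # can peers open connections to it directly?
    ccb_address: str = ""   # non-empty means it is registered with a CCB broker


def connect(initiator, target):
    """Return how `initiator` reaches `target`: 'direct', 'reversed' or 'no go'."""
    if target.accepts_inbound:
        return "direct"
    # Target is behind a firewall/NAT. With CCB the broker asks the target to
    # connect back, which only works if the initiator accepts inbound traffic.
    if target.ccb_address and initiator.accepts_inbound:
        return "reversed"
    return "no go"


if __name__ == "__main__":
    submit = Endpoint("submit", accepts_inbound=True)
    execute = Endpoint("execute", accepts_inbound=False, ccb_address="ccb.host.name")
    print(connect(submit, execute))    # reversed: execute node connects back
    print(connect(execute, submit))    # direct: submit node is reachable

    hidden_submit = Endpoint("submit", accepts_inbound=False, ccb_address="ccb2.host")
    hidden_execute = Endpoint("execute", accepts_inbound=False, ccb_address="ccb1.host")
    print(connect(hidden_submit, hidden_execute))  # no go: both are firewalled
```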
57
Another Submit File Example
# Example submit file using file transfer
Universe = vanilla
Log = cosmos.log
Executable = cosmos
# Do each run in its own Subdirectory
Initialdir = Run_$(Process)
# Move files if no shared volume avail
ShouldTransferFiles = IF_NEEDED
Transfer_input_files = cosmos.dat
Transfer_output_files = results.dat
When_To_Transfer_Output = ON_EXIT
# Data dir is advertised by the machines
Requirements = Memory > 2000
Arguments = -datadir $$(CosmosData)
Rank = *KFlops + Memory
# Run 1000 different data sets
Queue 1000
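With "Queue 1000", the $(Process) macro takes the values 0 through 999, so "Initialdir = Run_$(Process)" gives each job its own directory. The snippet below only imitates that substitution to show which directory names result; `expand_process_macro` and `initial_dirs` are illustrative helpers, and the plain string replacement is a simplification of condor_submit's macro handling (it is case-sensitive here, for instance).

```python
def expand_process_macro(template, process):
    """Substitute one job's process number for $(Process) in a submit-file value."""
    return template.replace("$(Process)", str(process))


def initial_dirs(template, queue_count):
    """List the Initialdir of every job created by 'Queue <queue_count>'."""
    return [expand_process_macro(template, p) for p in range(queue_count)]


if __name__ == "__main__":
    dirs = initial_dirs("Run_$(Process)", 1000)
    print(dirs[0], dirs[1], "...", dirs[-1])
```

Because each job starts in its own Initialdir, the run directories (Run_0, Run_1, and so on) typically need to exist on the submit machine, each holding its own cosmos.dat, before condor_submit is run.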
58
Questions? Thank You!